# Welcome to Randal Douc's wiki

A collaborative site on maths but not only!

• Theatre
• Research
• Teaching

### Miscellanous

world:datascience:basics


2017/10/07 23:39 ·


2019/02/13 08:39 ·

# From machine learning basics to Feed Forward Neural Networks

Typical machine learning problems can be decomposed into two different classes.

(Classification). The problem is to learn whether an individual from a given state space $\mathbb{R}^p$ belongs to some class. The focus is usually set on learning with a known number $M$ of classes so that an individual is associated with a label in $\{1,\ldots,M\}$. The statistical model is then given by $(X,Y)\in\mathbb{R}^p\times \{1,\ldots,M\}$ and the objective is to define a function $f: \mathbb{R}^p \to \{1,\ldots,M\}$, called classifier, such that $f(X)$ is the best prediction of $Y$ in a given context.

(Regression). The observation associated with $X$ is assumed to be given by $$Y = f(X) + \varepsilon\eqsp,$$ where $\varepsilon$ is a centered noise independent of $X$. The statistical model is then given by $(X,Y)\in\mathbb{R}^p\times \rset^m$ and the objective is to define the best estimator of $f$ in a given context.

An element of $\mathbb{R}^p$ contains all the features the label prediction or the regression estimate has to be based on.

### Loss and risk functions

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. Assume that $(X,Y)$ is a couple of random variables defined on $(\Omega,\mathcal{F},\mathbb{P})$ and taking values in $\mathbb{R}^p\times\{1,\ldots,M\}$ or $\mathbb{R}^p\times \rset^m$ where $\mathbb{R}^p$ is a given state space. In the case of nonparametric models, it is not assumed that the joint law of $(X,Y)$ belongs to any parametric or semiparametric family of models. The best prediction is defined as $$h_{*} \in \underset{h\in\mathcal{H}}{\argmin}\;\left\{\mathcal{R}(h)\right\} \quad \mbox{where} \quad \mathcal{R}(h)= \mathbb{E}[\ell(Y,h(X))]\eqsp,$$ and $\ell$ is a loss function measuring the goodness of the prediction of $Y$ by $h(X)$ and $\mathcal{H}$ is a chosen set of possible candidates. Some widespread choices of loss function are:

(Classification):$\quad\ell(Y,h(X)) = \left|Y-h(X)\right|^2$.

(Regression): $\quad\ell(Y,h(X)) = \indi{Y\neq h(X)}$.

In most cases, the risk $\mathcal{R}$ cannot be computed nor minimized, it is instead estimated by the empirical classification risk defined as $$\mathcal{R}_n(h) = \frac{1}{n}\sum_{i=1}^n \ell(Y_i,h(X_i))\eqsp,$$ where $(X_i,Y_i)_{1\leqslant i\leqslant n}$ are independent observations with the same distribution as $(X,Y)$. The classification problem then boilds down to solving $$\widehat h_{n} \in \underset{h\in\mathcal{H}}{\argmin}\;\mathcal{R}_n(h) \eqsp.$$ In this context, several practical and theoretical challenges arise from the minimization of the empirical classification risk.

### Gentle start - classification with a mixture of two Gaussian distributions

In this first example, consider a parametric model, that is, the joint distribution of $(X,Y)$ is assumed to belong to a family of distributions parametrized by a vector $\theta$ with real components. For $k\in\{-1,1\}$, write $\pi_k = \mathbb{P}(Y = k)$. Assume that conditionally on the event $\{Y = k\}$, $X$ has a Gaussian distribution with mean $\mu_k \in\mathbb{R}^p$ and covariance matrix $\Sigma\in \mathbb{R}^{p\times p}$. The probability density function of $X$ given that $\{Y=k\}$ is $g_{k}:x\mapsto \sqrt{\mathrm{det}\left(2\pi\Sigma\right)}\mathrm{exp}\left\{-\frac{1}{2}\left(x-\mu_k\right)'\Sigma^{-1}\left(x-\mu_k\right)\right\}\eqsp.$ In this case, the parameter $\theta=(\pi_1, \mu_1,\mu_{-1}, \Sigma)$ belongs to the set $\Theta= [0,1] \times \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^{d \times d}$. The parameter $\pi_{-1}$ is not part of the components of $\theta$ since $\pi_{-1}=1-\pi_{1}$.

The explicit computation of $\mathbb{P}(Y=1 | X)$ writes \begin{align*} \mathbb{P}\left(Y=1\middle | X\right) = \frac{\pi_1g_1(X)}{\pi_1g_1(X) + \pi_{-1}g_{-1}(X)} = \frac{1}{1 + \frac{\pi_{-1}g_{-1}(X)}{\pi_1g_1(X)}} = \sigma\left(\log(\pi_1/\pi_{_1}) + \log(g_1(X)/g_{-1}(X)\right)\eqsp, \end{align*} where $\sigma: x\mapsto (1 + \rme^{-x})^{-1}$ is the sigmoid function. Then, $\mathbb{P}\left(Y=1\middle | X\right) = \sigma\left( x'\omega+ b\right)\eqsp,$ where \begin{align*} \omega &= \Sigma^{-1}\left(\mu_{1} - \mu_{-1}\right)\eqsp,\\ b &= \log(\pi_1/\pi_{_1}) + \frac{1}{2}\left(\mu_{1}+\mu_{-1}\right)'\Sigma^{-1}\left(\mu_{-1}-\mu_{1}\right)\eqsp. \end{align*} When $\Sigma$ and $\mu_1$ and $\mu_{-1}$ are unknown, this classifier cannot be computed explicitely. We will approximate it using the observations. Assume that $(X_i,Y_i)_{1\leqslant i\leqslant n}$ are independent observations with the same distribution as $(X,Y)$. The loglikelihood of these observations is given by \begin{align*} \log \mathbf{p}_{\theta}\left(X_{1:n},Y_{1:n}\right) &=\sum_{i=1}^n \log \mathbf{p}_{\theta} \left(X_{i},Y_{i}\right)\eqsp,\\ &= - \frac{nd}{2} \log(2\pi)+\sum_{i=1}^n\sum_{k\in\{-1,1\}}\indi{Y_i=k}\left(\log \pi_{k} -\frac{\log \det \Sigma}{2} - \frac{1}{2}\left(X_i - \mu_{k}\right)'\Sigma^{-1}\left(X_i - \mu_{k}\right)\right) \eqsp,\\ &= - \frac{nd}{2} \log(2\pi)-\frac{n}2 \log\det\Sigma + \left(\sum_{i=1}^n \indi{Y_i=1}\right)\log \pi_1 + \left(\sum_{i=1}^n \indi{Y_i=-1}\right)\log (1-\pi_{1})\\ & \hspace{1cm}- \frac{1}{2}\sum_{i=1}^n \indi{Y_i=1}\left(X_i - \mu_{1}\right)'\Sigma^{-1}\left(X_i - \mu_{1}\right) - \frac{1}{2}\sum_{i=1}^n \indi{Y_i=-1}\left(X_i - \mu_{-1}\right)'\Sigma^{-1}\left(X_i - \mu_{-1}\right)\eqsp. \end{align*} Maximizing the log likelihood function to estimate $\theta$ is equivalent to minimizing the empirical cross-entropy risk function: $$\theta \mapsto -\frac{1}{n}\sum_{i=1}^n\sum_{k\in\{-1,1\}}\indi{Y_i=k}\left(\log \pi_{k} -\frac{\log \det \Sigma}{2} - \frac{1}{2}\left(X_i - \mu_{k}\right)'\Sigma^{-1}\left(X_i - \mu_{k}\right)\right) \eqsp.$$ The gradient of $\log \mathbf{p}_{\theta}\left(X_{1:n},Y_{1:n}\right)$ with respect to $\theta$ is therefore given by \begin{align*} \frac{\partial \log \mathbf{p}_{\theta}\left(X_{1:n},Y_{1:n}\right)}{\partial \pi_1} &= \left(\sum_{i=1}^n\indi{Y_i=1}\right)\frac{1}{\pi_1} - \left(\sum_{i=1}^n \indi{Y_i=-1}\right)\frac{1}{1-\pi_{1}}\eqsp,\\ \frac{\partial \log \mathbf{p}_{\theta}\left(X_{1:n},Y_{1:n}\right)}{\partial \mu_1} &= \sum_{i=1}^n \indi{Y_i=1}\left(2\Sigma^{-1}X_i - 2\Sigma^{-1}\mu_{1}\right)\eqsp,\\ \frac{\partial \log \mathbf{p}_{\theta}\left(X_{1:n},Y_{1:n}\right)}{\partial \mu_{-1}} &= \sum_{i=1}^n \indi{Y_i=-1}\left(2\Sigma^{-1}X_i - 2\Sigma^{-1}\mu_{-1}\right)\eqsp,\\ \frac{\partial \log \mathbf{p}_{\theta}\left(X_{1:n},Y_{1:n}\right)}{\partial \Sigma^{-1}} &= \frac{n}{2}\Sigma - \frac{1}{2}\sum_{i=1}^n \indi{Y_i=1}\left(X_i - \mu_{1}\right)\left(X_i - \mu_{1}\right)' - \frac{1}{2}\sum_{i=1}^n \indi{Y_i=-1}\left(X_i - \mu_{-1}\right)\left(X_i - \mu_{-1}\right)'\eqsp. \end{align*} The maximum likelihood estimator (i.e. the cross-entropy based estimator) is defined as the only parameter such that all these equations are set to $0$. For $k\in\{-1,1\}$, it is given by $\widehat \pi_k^n = \frac{1}{n}\sum_{i=1}^n\indi{Y_i=k}\eqsp,\quad \widehat \mu_k^n = \frac{1}{\sum_{i=1}^n\indi{Y_i=k}}\sum_{i=1}^n \indi{Y_i=k}\,X_i\eqsp, \quad \widehat\Sigma^n = \frac{1}{n}\sum_{i=1}^n \left(X_i - \widehat \mu_{Y_i}^n\right)\left(X_i - \widehat \mu_{Y_i}^n\right)'\eqsp.$

### Relaxing the Gaussian assumption - Logistic regression

In some situations, it may be too restrictive to assume that the joint distribution of $(X,Y)$ belongs to a parametric family. One of the most widespread model is the logistic regression which is defined by $\mathbb{P}\left(Y=1\middle |X\right) = \sigma(X'\omega + b)\eqsp,$ where $b\in\rset$, $\omega\in\rset^d$ and for all $x\in\rset^d$, The parameter $\theta$ is thus $\theta=(b,\omega) \in \rset \times \mathbb{R}^p$.

When $b$ and $\omega$ are unknown, this quantity cannot be computed explicitely and is approximated using the observations. Assume that $(X_i,Y_i)_{1\leqslant i\leqslant n}$ are independent observations with the same distribution as $(X,Y)$. The conditional likelihood of the observations $Y_{1:n}$ given $X_{1:n}$ is: \begin{align*} \mathbf{p}_{\theta}\left(Y_{1:n}\middle |X_{1:n}\right) =\prod_{i=1}^n \mathbf{p}_{\theta}\left(Y_{i} \middle |X_{i}\right)&= \prod_{i=1}^n \left(\sigma_{}(X_i)\right)^{(1+Y_i)/2}\left(1-\sigma(X_i)\right)^{(1-Y_i)/2}\eqsp,\\ &= \prod_{i=1}^n \left(\frac{\rme^{b + \langle \omega;X_i\rangle}}{1+\rme^{b + \langle \omega;X_i\rangle}}\right)^{(1+Y_i)/2}\left( \frac{1}{1+\rme^{b + \langle \omega;X_i\rangle}}\right)^{(1-Y_i)/2}\eqsp. \end{align*} The associated conditional loglikelihood is therefore \begin{align*} \log \mathbf{p}_{\theta}\left(Y_{1:n}\middle |X_{1:n}\right) &=\sum_{i=1}^n \left\{\frac{1+Y_i}{2}\log\left(\frac{\rme^{b + \langle \omega;X_i\rangle}}{1+\rme^{b + \langle \omega;X_i\rangle}}\right) + \frac{1-Y_i}{2}\log\left(\frac{1}{1+\rme^{b + \langle \omega;X_i\rangle}}\right) \right\}\eqsp,\\ &= \sum_{i=1}^n \left\{ \frac{1+Y_i}{2}\left(b + \langle \omega;X_i\rangle\right) - \log\left(1+\rme^{b+ \langle \omega;X_i\rangle}\right)\right\}\eqsp. \end{align*} Note again that maximizing this loglikelihood is equivalent to minimizing the empirical cross-entropy. It cannot be done explictly yet numerous numerical optimization methods are available to maximize $(\omega,b)\mapsto \log \mathbf{p}_{\theta}\left(Y_{1:n}\middle |X_{1:n}\right)$ (see next session on a up-to-date overview of gradient based algorithms).

### The multilayer perceptron - Feed Forward Neural Networks

The first mathematical model for a neuron was the Threshold Logic Unit (McCulloch and Pitts, 1943), with Boolean inputs and outputs. The response associated with an input $x\in\{0,1\}^d$ is defined as $f: x\mapsto \indi{\omega\sum_{j=1}^dx_j + b \geqslant 0}$. This construction allows to build any boolean function from elementary units $x\vee y = \indi{x+y-1/2 \geqslant 0}\eqsp,\quad x\wedge y = \indi{x+y-3/2 \geqslant 0}\quad\mathrm{and}\quad 1-x = \indi{-x+1/2 \geqslant 0}$ This elementary model can be extended to the Perceptron with real valued inputs (Rosenblatt, 1957) by writing $f: x\mapsto \indi{\sum_{j=1}^d\omega_jx_j + b \geqslant 0}$. In this case, the nonlinear activation function is $\sigma: x \mapsto \indi{x\geqslant 0}$ and the ouput defined as: $f:x \mapsto \sigma(\omega'x + b)\eqsp.$ Linear discriminant analysis and logistic regression are other instances with the sigmoid activation function. The perceptron weakens the modeling assumptions of LDA or logistic regression and composed in parallel $q$ of these perceptron units to produce the output. Then, $x = z_0\in\mathbb{R}^p$, $b_1\in \mathbb{R}^q$, $\omega_1\in\mathbb{R}^{p\times q}$ and $z_1 = \sigma(\omega_1'x + b)\eqsp,$ with $\sigma$ the elementwise activation function. The multi-layer perceptron, also known as the fully connected feedforward network, connects these units in series. For a given number $L$ of layers, $z_1 = \sigma(\omega_1'x + b_1) \eqsp, \quad z_2 = \sigma(\omega_2'z_1 + b_2)\eqsp,\quad\ldots\eqsp,\quad z_L = \sigma(\omega_L'z_{L-1} + b_L)\eqsp.$ As there is no modelling assumptions anymore, virtually any activation function may be used. The relu activation $x \mapsto \mathrm{max}(0,x)$ and its extensions are the default recommendation in modern implementations (Jarrettet al., 2009), (Nair and Hinton, 2010), (Glorot et al., 2011), etc. One of the major motivation arises from the gradient based parameter optimization which is numerically more stable with relu, (see next session on a up-to-date overview of gradient based algorithms). Assume that the network contains $L$ layers, then the output layer is of the form: $z_L = \sigma(\omega_L'z_{L-1} + b_L)\eqsp.$ The choice of this last activation function greatly relies on the task the network is assumed to perform.

(Biclass classification). The output $z_L$ is the estimate of the probability that the class is $1$ given the input $x$. The common choice in this case is the sigmoid function, one of the main reason is due to the gradient descent algorithm used to optimize the parameters, cf. next sessions.

(Multiclass classification). The output $z_L$ is the estimate of the probability that the class is $k$ for all $1\leqslant k\leqslant M$, given the input $x$. The common choice in this case is the softmax function: for all $1\leqslant i\leqslant r$, $\sigma(z)_i = \frac{\rme^{z_i}}{\sum_{j=1}^r\rme^{z_j}}\eqsp.$

### Universal approximation properties

The universal approximation theorem sets a theoretical guarentee that feedforward networks with hidden layers provide a universal approximation framework. The first result of (Horniket al., 1989) and (Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer can approximate any Borel measurable function from one finite-dimensional space to another, provided that the network is given enough hidden units. For a given activation function $\sigma:\mathbb{R}\to\mathbb{R}$, the set of neural networks with one hidden layer and a linear output associated with $\sigma$ is given by: $\mathcal{N}_{\sigma} = \left\{f:\mathbb{R}^p\to \mathbb{R}\eqsp;\eqsp \mathrm{there\;exist\;} q\geqslant 1\eqsp, (c_1,\ldots,c_q, b_1,\ldots,b_q)\in\mathbb{R}^{2q}\eqsp,(\omega_1,\ldots,\omega_q)\in\mathbb{R}^{pq}\eqsp \mathrm{such\;that}\eqsp f: x\mapsto \sum_{j=1}^q c_j\sigma(\langle \omega_j;x\rangle + b_j)\right\}\eqsp.$

(Horniket al., 1989), (Cybenko, 1989). Let $\sigma:\mathbb{R}\to \mathbb{R}$ be a continuous activation function such that $\lim_{x\to \infty}\sigma(x) = 1$ and $\lim_{x\to -\infty}\sigma(x) = 0$. Then, $\mathcal{N}_{\sigma}$ is dense in $\mathcal{C}([0,1]^p,\rset)$ for the topology of the supremum norm.

While the original theorems were first stated in terms of units with activation functions that saturate for both very negative and very positive arguments, universal approximation theorems have also been proved for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).

(Leshno et al., 1993). Assume that $\sigma:\mathbb{R}\to \mathbb{R}$ is continuous and is not a polynomial activation function. Then, $\mathcal{N}_{\sigma}$ is dense in $\mathcal{C}(\rset^p,\rset)$ for the topology of the supremum norm on compact subsets.

According to the universal approximation theorem, there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be. In (Barron, 1993), the authors provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required.

(Barron, 1993). Let $r>0$ and $\mu$ be a probability distribution on the closed ball centered at 0 and with radius $r$ denoted by $\mathsf{B}_r$. Let $\sigma:\mathbb{R}\to \mathbb{R}$ be a bounded and measurable activation function such that $\lim_{x\to \infty}\sigma(x) = 1$ and $\lim_{x\to -\infty}\sigma(x) = 0$. Assume that $f:\mathbb{R}^p\to \mathbb{R}$ is such that its Fourier transform $\widetilde f$ satisties $\int_{\mathbb{R}^p} |\omega||\widetilde{f}(w)|\rmd \omega<\infty$. Then, for all $q\geqslant 1$, there exist a Feedforward neural network model with one layer of $q$ sigmoidal units: $f_q: x\mapsto \sum_{j=1}^q c_j\sigma(\langle \omega_j;x\rangle + b_j) + c_0$ and $C>0$ such that $\int_{\mathsf{B}_r}\left(f(x) - f_q(x)\right)^2\mu(\rmd x) \leqslant \frac{C}{q}\eqsp.$