
From machine learning basics to Feed Forward Neural Networks

Typical machine learning problems can be divided into two classes.

(Classification). The problem is to learn which class an individual from a given state space $\mathsf{X}$ belongs to. The focus is usually set on learning with a known number $M$ of classes, so that an individual $X\in\mathsf{X}$ is associated with a label $Y\in\{0,\ldots,M-1\}$. The statistical model is then given by the distribution of the couple $(X,Y)$ and the objective is to define a function $h:\mathsf{X}\to\{0,\ldots,M-1\}$, called a classifier, such that $h(X)$ is the best prediction of $Y$ in a given context.

(Regression). The observation $Y$ associated with $X$ is assumed to be given by $Y = f_*(X) + \varepsilon$, where $\varepsilon$ is a centered noise independent of $X$. The statistical model is then given by the distribution of the couple $(X,Y)$ and the objective is to define the best estimator of $f_*$ in a given context.

An element of $\mathsf{X}$ contains all the features on which the label prediction or the regression estimate has to be based.

Loss and risk functions

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. Assume that $(X,Y)$ is a couple of random variables defined on $(\Omega,\mathcal{F},\mathbb{P})$ and taking values in $\mathsf{X}\times\{0,\ldots,M-1\}$ or $\mathsf{X}\times\mathbb{R}$, where $\mathsf{X}$ is a given state space. In the case of nonparametric models, it is not assumed that the joint law of $(X,Y)$ belongs to any parametric or semiparametric family of models. The best prediction is defined as
$$ h_* \in \operatorname{argmin}_{h\in\mathcal{H}} \mathbb{E}\big[\ell(Y,h(X))\big], $$
where $\ell$ is a loss function measuring the goodness of the prediction of $Y$ by $h(X)$ and $\mathcal{H}$ is a chosen set of possible candidates. Some widespread choices of loss function are:

(Classification): the 0-1 loss $\ell(y,y') = \mathbf{1}_{\{y\neq y'\}}$.

(Regression): the quadratic loss $\ell(y,y') = (y-y')^2$.

In most cases, the risk $R(h) = \mathbb{E}[\ell(Y,h(X))]$ can be neither computed nor minimized; it is instead estimated by the empirical risk defined as
$$ \widehat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(Y_i,h(X_i)), $$
where $(X_i,Y_i)_{1\le i\le n}$ are independent observations with the same distribution as $(X,Y)$. The classification problem then boils down to solving
$$ \widehat{h}_n \in \operatorname{argmin}_{h\in\mathcal{H}} \widehat{R}_n(h). $$
In this context, several practical and theoretical challenges arise from the minimization of the empirical risk.
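As an illustration, the empirical risks can be evaluated directly from a sample, as in the following Python sketch (0-1 loss for classification, quadratic loss for regression); the toy data and function names are purely illustrative.

<code python>
import numpy as np

def empirical_01_risk(h, X, y):
    """Empirical classification risk: average of the 0-1 loss 1{h(X_i) != Y_i}."""
    return np.mean(h(X) != y)

def empirical_quadratic_risk(f, X, y):
    """Empirical regression risk: average of the squared loss (Y_i - f(X_i))^2."""
    return np.mean((y - f(X)) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: Y = 1{X_1 > 0} for classification, Y = 2*X_1 + noise for regression.
    X = rng.normal(size=(1000, 2))
    y_cls = (X[:, 0] > 0).astype(int)
    y_reg = 2.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

    h = lambda X: (X[:, 0] > 0).astype(int)       # candidate classifier
    f = lambda X: 2.0 * X[:, 0]                   # candidate regression function

    print(empirical_01_risk(h, X, y_cls))         # 0.0 for this perfect classifier
    print(empirical_quadratic_risk(f, X, y_reg))  # close to the noise variance 0.01
</code>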

Gentle start - classification with a mixture of two Gaussian distributions

In this first example, consider a parametric model, that is, the joint distribution of $(X,Y)$ is assumed to belong to a family of distributions parametrized by a vector $\theta$ with real components. For $k\in\{0,1\}$, write $\pi_k = \mathbb{P}(Y=k)$. Assume that, conditionally on the event $\{Y=k\}$, $X$ has a Gaussian distribution with mean $\mu_k\in\mathbb{R}^d$ and covariance matrix $\Sigma$. The probability density function of $X$ given that $Y=k$ is
$$ g_k(x) = \det(2\pi\Sigma)^{-1/2}\exp\!\Big(-\tfrac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k)\Big). $$
In this case, the parameter $\theta = (\pi_1,\mu_0,\mu_1,\Sigma)$ belongs to the set $\Theta = (0,1)\times\mathbb{R}^d\times\mathbb{R}^d\times\mathcal{S}_d^{+}$, where $\mathcal{S}_d^{+}$ is the set of $d\times d$ symmetric positive definite matrices. The probability $\pi_0$ is not part of the components of $\theta$ since $\pi_0 = 1-\pi_1$.

The explicit computation of $\mathbb{P}_\theta(Y=1\mid X)$ writes
$$ \mathbb{P}_\theta(Y=1\mid X) = \frac{\pi_1 g_1(X)}{\pi_0 g_0(X) + \pi_1 g_1(X)} = \sigma\!\Big(\log\frac{\pi_1}{\pi_0} + \log\frac{g_1(X)}{g_0(X)}\Big), $$
where $\sigma: z\mapsto 1/(1+e^{-z})$ is the sigmoid function. Then,
$$ h_*(X) = \mathbf{1}_{\{\mathbb{P}_\theta(Y=1\mid X)\ge 1/2\}} = \mathbf{1}_{\{\omega^\top X + b \ge 0\}}, \quad\text{where}\quad \omega = \Sigma^{-1}(\mu_1-\mu_0), \quad b = \log\frac{\pi_1}{\pi_0} - \tfrac{1}{2}\mu_1^\top\Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_0^\top\Sigma^{-1}\mu_0. $$
When $\pi_1$, $(\mu_0,\mu_1)$ and $\Sigma$ are unknown, this classifier cannot be computed explicitly. We will approximate it using the observations. Assume that $(X_i,Y_i)_{1\le i\le n}$ are independent observations with the same distribution as $(X,Y)$. The loglikelihood of these observations is given by
$$ \ell_n(\theta) = \sum_{i=1}^{n}\Big( Y_i\log\big(\pi_1 g_1(X_i)\big) + (1-Y_i)\log\big(\pi_0 g_0(X_i)\big)\Big). $$
Maximizing the loglikelihood function to estimate $\theta$ is equivalent to minimizing the empirical cross-entropy risk function $-n^{-1}\ell_n(\theta)$. Setting the gradient of $\ell_n$ with respect to each component of $\theta$ to $0$ defines the maximum likelihood estimator (i.e. the cross-entropy based estimator) as the only parameter at which all these equations vanish. For $\widehat{\theta}_n = (\widehat{\pi}_1,\widehat{\mu}_0,\widehat{\mu}_1,\widehat{\Sigma})$, it is given by
$$ \widehat{\pi}_1 = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad \widehat{\mu}_k = \frac{\sum_{i=1}^{n}\mathbf{1}_{\{Y_i=k\}}X_i}{\sum_{i=1}^{n}\mathbf{1}_{\{Y_i=k\}}}, \qquad \widehat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}\big(X_i-\widehat{\mu}_{Y_i}\big)\big(X_i-\widehat{\mu}_{Y_i}\big)^\top. $$
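As an illustration, the following numpy sketch simulates observations from the two-class Gaussian model with shared covariance, plugs the maximum likelihood estimates into the explicit classifier $\mathbf{1}_{\{\omega^\top x + b \ge 0\}}$, and reports the training accuracy; the sample size and the true parameter values are arbitrary.

<code python>
import numpy as np

def fit_gaussian_classifier(X, y):
    """Maximum likelihood estimates for the two-class Gaussian model with shared covariance."""
    n = len(y)
    pi1 = y.mean()
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / n
    # Plug the estimates into the explicit classifier h(x) = 1{w^T x + b >= 0}.
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = np.log(pi1 / (1 - pi1)) - 0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu0 @ Sigma_inv @ mu0
    return w, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 2000, 2
    y = rng.binomial(1, 0.4, size=n)
    means = np.array([[0.0, 0.0], [2.0, 1.0]])
    X = means[y] + rng.normal(size=(n, d))        # shared covariance = identity
    w, b = fit_gaussian_classifier(X, y)
    y_hat = (X @ w + b >= 0).astype(int)
    print("training accuracy:", np.mean(y_hat == y))
</code>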

Relaxing the Gaussian assumption - Logistic regression

In some situations, it may be too restrictive to assume that the joint distribution of $(X,Y)$ belongs to a parametric family. One of the most widespread models is logistic regression, which is defined by
$$ \mathbb{P}(Y=1\mid X) = \sigma(\omega^\top X + b), $$
where $\omega\in\mathbb{R}^d$, $b\in\mathbb{R}$ and, for all $z\in\mathbb{R}$, $\sigma(z) = 1/(1+e^{-z})$. The parameter is thus $\theta = (\omega,b)\in\mathbb{R}^d\times\mathbb{R}$.

When $\omega$ and $b$ are unknown, this quantity cannot be computed explicitly and is approximated using the observations. Assume that $(X_i,Y_i)_{1\le i\le n}$ are independent observations with the same distribution as $(X,Y)$. The conditional likelihood of the observations $Y_{1:n}$ given $X_{1:n}$ is:
$$ \prod_{i=1}^{n}\sigma(\omega^\top X_i + b)^{Y_i}\big(1-\sigma(\omega^\top X_i + b)\big)^{1-Y_i}. $$
The associated conditional loglikelihood is therefore
$$ \ell_n(\omega,b) = \sum_{i=1}^{n}\Big( Y_i\log\sigma(\omega^\top X_i + b) + (1-Y_i)\log\big(1-\sigma(\omega^\top X_i + b)\big)\Big). $$
Note again that maximizing this loglikelihood is equivalent to minimizing the empirical cross-entropy. It cannot be done explicitly, yet numerous numerical optimization methods are available to maximize $\ell_n$ (see the next session for an up-to-date overview of gradient based algorithms).
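For instance, the conditional loglikelihood can be maximized by plain full-batch gradient ascent, as in the following sketch; the step size and the number of iterations are arbitrary, and more elaborate gradient based methods are discussed in the next session.

<code python>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=500):
    """Maximize the conditional loglikelihood by full-batch gradient ascent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)           # P(Y=1 | X_i) under the current parameters
        grad_w = X.T @ (y - p) / n       # gradient of the averaged loglikelihood
        grad_b = np.mean(y - p)
        w += lr * grad_w
        b += lr * grad_b
    return w, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 2000, 2
    X = rng.normal(size=(n, d))
    true_w, true_b = np.array([1.5, -2.0]), 0.5
    y = rng.binomial(1, sigmoid(X @ true_w + true_b))
    w, b = fit_logistic_regression(X, y)
    print("estimated parameters:", w, b)
</code>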

The multilayer perceptron - Feed Forward Neural Networks

The first mathematical model for a neuron was the Threshold Logic Unit (McCulloch and Pitts, 1943), with Boolean inputs and outputs. The response associated with an input $x\in\{0,1\}^d$ is defined as
$$ f(x) = \mathbf{1}_{\big\{\sum_{j=1}^{d} w_j x_j + b \ge 0\big\}}. $$
This construction makes it possible to build any Boolean function from such elementary units (for instance AND, OR and NOT gates).

This elementary model can be extended to the Perceptron with real valued inputs (Rosenblatt, 1957) by writing $z = \omega^\top x + b$ with $\omega\in\mathbb{R}^d$ and $b\in\mathbb{R}$. In this case, the nonlinear activation function is the step function $\sigma: z\mapsto\mathbf{1}_{\{z\ge 0\}}$ and the output is defined as $f(x) = \sigma(\omega^\top x + b)$. Linear discriminant analysis and logistic regression are other instances, with the sigmoid activation function. The single-layer perceptron weakens the modeling assumptions of LDA or logistic regression and composes $p$ of these perceptron units in parallel to produce the output. Then, $z = Wx + b$ with $W\in\mathbb{R}^{p\times d}$ and $b\in\mathbb{R}^p$, and the layer output is $\sigma(Wx+b)$ with the activation function $\sigma$ applied elementwise.

The multi-layer perceptron, also known as the fully connected feedforward network, connects these units in series. For a given number $L$ of layers,
$$ h^{(0)} = x, \qquad h^{(k)} = \sigma\big(W^{(k)} h^{(k-1)} + b^{(k)}\big), \quad 1\le k\le L-1. $$
As there are no modelling assumptions anymore, virtually any activation function may be used. The ReLU activation $z\mapsto\max(z,0)$ and its extensions are the default recommendation in modern implementations (Jarrett et al., 2009), (Nair and Hinton, 2010), (Glorot et al., 2011), etc. One of the major motivations arises from the gradient based parameter optimization, which is numerically more stable with ReLU (see the next session for an up-to-date overview of gradient based algorithms). Assume that the network contains $L$ layers; then the output layer is of the form
$$ f_\theta(x) = \sigma_{\mathrm{out}}\big(W^{(L)} h^{(L-1)} + b^{(L)}\big). $$
The choice of this last activation function greatly relies on the task the network is assumed to perform.

(Biclass classification). The output $f_\theta(x)\in[0,1]$ is the estimate of the probability that the class is $1$ given the input $x$. The common choice in this case is the sigmoid function; one of the main reasons is the gradient descent algorithm used to optimize the parameters, cf. the next sessions.

(Multiclass classification). The output is the estimate of the probability that the class is $k$, for all $k\in\{0,\ldots,M-1\}$, given the input $x$. The common choice in this case is the softmax function: for all $k\in\{0,\ldots,M-1\}$,
$$ \mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=0}^{M-1} e^{z_j}}. $$
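To fix ideas, here is a minimal numpy sketch of the forward pass of a fully connected feedforward network with ReLU hidden layers and a softmax output layer; the layer sizes and the random weights are purely illustrative.

<code python>
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, weights, biases, output_activation=softmax):
    """Forward pass of a fully connected feedforward network with ReLU hidden layers."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                                      # hidden layers
    return output_activation(h @ weights[-1] + biases[-1])       # output layer

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sizes = [4, 16, 16, 3]      # input dim 4, two hidden layers, 3 output classes
    weights = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]
    x = rng.normal(size=(5, 4))                     # a batch of 5 inputs
    probs = mlp_forward(x, weights, biases)
    print(probs.shape, probs.sum(axis=1))           # (5, 3), each row sums to 1
</code>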

Universal approximation properties

The universal approximation theorem provides a theoretical guarantee that feedforward networks with hidden layers yield a universal approximation framework. The first result of (Hornik et al., 1989) and (Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer can approximate any Borel measurable function from one finite-dimensional space to another, provided that the network is given enough hidden units. For a given activation function $\sigma$, the set of neural networks with one hidden layer and a linear output associated with $\sigma$ is given by:
$$ \mathcal{F}_\sigma = \Big\{ x\mapsto \sum_{j=1}^{N} c_j\,\sigma(\omega_j^\top x + b_j) + c_0 \;:\; N\ge 1,\; c_0,\ldots,c_N,b_1,\ldots,b_N\in\mathbb{R},\; \omega_1,\ldots,\omega_N\in\mathbb{R}^d \Big\}. $$

(Hornik et al., 1989), (Cybenko, 1989). Let $\sigma$ be a continuous activation function such that $\lim_{z\to-\infty}\sigma(z) = 0$ and $\lim_{z\to+\infty}\sigma(z) = 1$. Then, $\mathcal{F}_\sigma$ is dense in $C([0,1]^d,\mathbb{R})$ for the topology of the supremum norm.

While the original theorems were first stated in terms of units with activation functions that saturate for both very negative and very positive arguments, universal approximation theorems have also been proved for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).

(Leshno et al., 1993). Assume that $\sigma$ is a continuous activation function which is not a polynomial. Then, $\mathcal{F}_\sigma$ is dense in $C(\mathbb{R}^d,\mathbb{R})$ for the topology of the supremum norm on compact subsets.
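To illustrate the density statement numerically, the following sketch approximates a continuous function on $[0,1]$ by elements of $\mathcal{F}_\sigma$ with $\sigma=\tanh$: the hidden weights are drawn at random and only the output coefficients are fitted by least squares. This is only an illustration (a random-features fit with arbitrary weight scales), not the constructive argument behind the theorems.

<code python>
import numpy as np

def fit_one_hidden_layer(x, f_x, n_units, rng, sigma=np.tanh):
    """Draw random hidden weights and fit the linear output by least squares,
    producing an element of the class F_sigma defined above."""
    w = rng.normal(scale=3.0, size=n_units)
    b = rng.uniform(-3.0, 3.0, size=n_units)
    H = sigma(np.outer(x, w) + b)              # hidden-unit outputs, shape (n_points, n_units)
    H = np.column_stack([H, np.ones_like(x)])  # add the constant term c_0
    c, *_ = np.linalg.lstsq(H, f_x, rcond=None)
    return H @ c                               # network values at the grid points

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 500)
    target = np.sin(2 * np.pi * x)             # a continuous function on [0, 1]
    for n_units in (2, 8, 32, 128):
        approx = fit_one_hidden_layer(x, target, n_units, rng)
        print(n_units, "units, sup error on the grid:", np.abs(approx - target).max())
</code>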

According to the universal approximation theorem, there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be. In (Barron, 1993), the author provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required.

(Barron, 1993). Let $r>0$ and $\mu$ be a probability distribution on the closed ball $B_r$ centered at $0$ and with radius $r$. Let $\sigma$ be a bounded and measurable activation function such that $\lim_{z\to-\infty}\sigma(z)=0$ and $\lim_{z\to+\infty}\sigma(z)=1$. Assume that $f$ is such that its Fourier transform $\widehat{f}$ satisfies $C_f = \int_{\mathbb{R}^d}|\omega|\,|\widehat{f}(\omega)|\,\mathrm{d}\omega < \infty$. Then, for all $n\ge 1$, there exists a feedforward neural network model with one layer of $n$ sigmoidal units:
$$ f_n(x) = \sum_{j=1}^{n} c_j\,\sigma(\omega_j^\top x + b_j) + c_0, $$
such that
$$ \int_{B_r}\big(f(x)-f_n(x)\big)^2\,\mu(\mathrm{d}x) \le \frac{(2 r C_f)^2}{n}. $$
