
On-line learning in neural networks with ReLU activations

Michiel Straat, University of Groningen
September 22, 2018

Supervisors: Michael Biehl, Kerstin Bunte

Contents

1 Introduction
2 On-line and off-line learning of a rule
3 The Rectified Linear Unit (ReLU)
   3.1 Advantages
      3.1.1 Efficient evaluation
      3.1.2 Non-saturation
      3.1.3 Better biological motivation and sparse representation
   3.2 Disadvantages and remedies
4 Theoretical learning dynamics in ReLU networks
   4.1 The Statistical Physics of (online) Learning
   4.2 Equations of motion: student-teacher overlaps
   4.3 Equations of motion: student-student overlaps
   4.4 Generalization error
   4.5 Equations under site-symmetry assumptions
      4.5.1 Generalization error
   4.6 Equations of motion in Erf neural networks
   4.7 ReLU student and Erf teacher
   4.8 Erf student and ReLU teacher
   4.9 Dropout regularization
   4.10 The learning rate in the small regime
   4.11 Simulating learning in ReLU networks
5 Experiments
   5.1 The ReLU perceptron
   5.2 ReLU network K = M = 2
      5.2.1 Fixed points and plateau length
      5.2.2 Comparison with erf
   5.3 ReLU network K = M = 3 and site symmetry
   5.4 Overrealizable case: K = 3, M = 2
   5.5 Unrealizable case: K = 2, M = 3
6 Discussion
7 Outlook
   7.1 Two-layer ReLU network learning dynamics for larger η and learning rate adaptation
   7.2 Modifications of learning and regularization
   7.3 Potentially extending the theory to more layers


1. Introduction

The Statistical Physics of Learning has been a successful framework for describing the learning dynamics of machine learning models, in particular neural networks. It applies techniques from statistical physics, originally developed to describe large physical systems, to machine learning models that typically involve many adaptable weights; in deep learning, for instance, the number of weights can be on the order of hundreds of millions. Instead of tracking each individual weight, statistical physics techniques aggregate the individual parameters into so-called order parameters.

These order parameters describe the state of the system more concisely and therefore make the large system more interpretable. Not only do the order parameters provide better insight into the state of a neural network, it is also possible to formulate updates of the order parameters directly, as done for example in [14] for the gradient descent algorithm and in [4] for the Hebbian learning algorithm. In the thermodynamic limit, in which the system size is taken to be infinitely large, these update rules become continuous differential equations for the evolution of the order parameters. The set of differential equations for the order parameters provides a useful tool for studying the behavior of the model and the algorithm theoretically, in order to gain a deeper understanding of the learning process. Such understanding can then be used to reduce unwanted effects in the learning dynamics by adapting the algorithm, thereby improving real-world learning scenarios.

Different learning scenarios have already been studied within the framework, such as on-line learning in neural networks [3, 14, 5] and off-line/batch learning in neural networks [2]. The framework also accommodates extensions of the learning scenario: in [4], for example, drifting concepts are formulated and studied, and regularization techniques such as dropout can be treated similarly. Besides learning behavior, one can also study the behavior of neural networks of different architectures, using various choices of activation function in the neurons. For example, [3] and [14] study two-layered neural networks that use the sigmoidal Erf function in the hidden units.

Quite recently, the rectifier activation function (Rectified Linear Unit, ReLU) has become popular in deep learning applications, mostly because it empirically yields better performance than sigmoidal activations; see [10] for example. Although the rectifier has known theoretical advantages, which will also be discussed in this thesis, there is still a lack of rigorous mathematical arguments as to why ReLU networks learn faster and show better performance.

In order to gain a deeper understanding of the learning behavior of ReLU neural networks, we will formulate the gradient descent dynamics of the order parameters in ReLU networks using The Statistical Physics of Learning framework. We study the dynamics in the on-line learning scenario and compare the learning behavior in ReLU neural networks to the behavior in neural networks which use the sigmoidal Erf activation function.

In Section 2, we give a summary of on-line learning in neural networks and the definition of the learning dynamics. Section 3 discusses advantages and disadvantages of the rectifier activation function. In Section 4, we derive the differential equations for the order parameters in ReLU neural networks and formulate the corresponding generalization error, which are the main results of the thesis. In Section 5, we show theoretical and simulation results for ReLU neural networks with various architectures and compare those results with results obtained from learning with similar Erf networks.

Figure 1: Two-layer network. The $N$-dimensional input layer $\xi_1, \ldots, \xi_N$ feeds a hidden layer with activations $h_i = g(w_i \cdot \xi)$ through input-to-hidden weights $w_{ji}$; the output layer computes the linear combination $v \cdot h$ with hidden-to-output weights $v_i$.

2. On-line and off-line learning of a rule

In this section, we discuss general on-line learning of an unknown rule defined by a soft-committee machine, in particular for the gradient descent algorithm.

The rule is defined by the architecture and the set of weights $B$ of a teacher network. Here we consider two-layered architectures as in Figure 1: the first layer is the $N$-dimensional input layer and the second layer consists of $M$ hidden units. The set of weights of the teacher is in this case given by the matrix $B \in \mathbb{R}^{N \times M}$. In this matrix, column $n$ contains the weights associated with the $n$th hidden unit and row $i$ contains the weights associated with the $i$th input dimension. For a particular instance of weights $B$ and input vector $\xi^\mu$, each hidden unit $n$ receives as input the scalar value $y_n = B_n \cdot \xi^\mu$. An activation function $g(\cdot)$ is used in the hidden neurons to map the input activation to an output activation; the output activation of neuron $n$ is therefore $g(y_n)$. The output activations are then combined into the final output. In general, they can be combined in a linear combination using weights $v \in \mathbb{R}^M$. In a soft-committee machine, the plain sum of the output activations is taken, corresponding to $v_n = 1$ for each neuron $n$. Summarizing, the output of the described teacher network with unit hidden-to-output weights for an input $\xi^\mu$ is:

$$\tau(\xi^\mu) = \sum_{n=1}^{M} g\big(y_n(\xi^\mu)\big), \qquad (1)$$

where $y_n(\xi^\mu)$ is the input activation of the $n$th hidden neuron in the teacher network:

$$y_n(\xi^\mu) = B_n \cdot \xi^\mu. \qquad (2)$$

In the rest of the thesis, we refer to teacher hidden neuron $n$ as "teacher $n$". Every input $\xi^\mu$ now has a rule value associated with it according to Equation (1), denoted by the tuple $(\xi^\mu, \tau(\xi^\mu))$. In effect, we have obtained an infinitely large dataset of possible input vectors with corresponding rule labels:

$$D = \{(\xi, \tau(\xi))\}_{\xi} \qquad (3)$$

We consider a second two-layer neural network called the student, which has $K$ hidden neurons and weights $J \in \mathbb{R}^{N \times K}$. To the student, the rule is unknown, just as in applications where a neural network learns a mapping that is genuinely unknown. We denote the student output for an input $\xi^\mu$ as $\sigma(J, \xi^\mu)$:

$$\sigma(J, \xi^\mu) = \sum_{i=1}^{K} g\big(x_i(\xi^\mu)\big), \qquad (4)$$

where $x_i(\xi^\mu)$ is the input activation of the $i$th hidden neuron in the student network:

$$x_i(\xi^\mu) = J_i \cdot \xi^\mu. \qquad (5)$$

The goal of a learning algorithm is to adapt the student weights $J \in \mathbb{R}^{N \times K}$ such that the student output stays as close as possible to the output of the rule for every possible input $\xi$. For an input $\xi^\mu$, the disagreement between student and teacher is expressed as the error:

$$\epsilon(J, \xi^\mu) = \frac{1}{2}\big[\tau(\xi^\mu) - \sigma(J, \xi^\mu)\big]^2. \qquad (6)$$

The learning algorithm adapts the weights $J$ in order to reduce $\epsilon(J, \xi)$ for the inputs $\xi$.

In off-line learning scenarios, a fixed-size dataset of $P$ examples is available, from which the student can learn:

$$D_{\mathrm{train}} = \{(\xi^\mu, \tau(\xi^\mu)),\ \mu = 1, 2, \ldots, P\}. \qquad (7)$$

The dataset of $P$ examples is called the training set. The total error associated with a set of weights $J$ is the mean over the training examples:

$$\epsilon_t(J) = \frac{1}{P} \sum_{\mu=1}^{P} \epsilon(J, \xi^\mu), \qquad (8)$$

which is minimized by presenting the examples from the dataset in a particular order. One example is batch learning, which presents the entire training set (or a random subset of it) and updates the student weights $J$ according to the selected learning algorithm. Another example is stochastic learning, which randomly selects a single example from the training set and updates the weights $J$ using that example and the selected learning algorithm. Given a set of examples which the learning algorithm has never seen:

$$D_{\mathrm{test}} = \{(\xi^t, \tau(\xi^t)),\ t = 1, 2, \ldots, T\}, \qquad (9)$$

the generalization error is defined as the mean error made by the student on $D_{\mathrm{test}}$:

$$\epsilon_g(J) = \frac{1}{T} \sum_{t=1}^{T} \epsilon(J, \xi^t). \qquad (10)$$

In on-line learning, new examples $\xi$ are presented in a stream, and each presentation leads to an adaptation of the student weights. There is no explicit distinction between training and generalization error. The average error made by the student is defined as the generalization error:

$$\epsilon_g(J) = \langle \epsilon(J, \xi) \rangle_\xi, \qquad (11)$$

where the average denoted by $\langle \cdot \rangle$ is taken over the distribution of inputs. It is the expected error that the student will make on an arbitrary example $\xi^\mu$. At learning step $\mu$ in the on-line learning scenario, the student weights $J$ are updated according to a specific learning algorithm $f$:

$$J^{\mu+1} = J^\mu + \frac{1}{N} f(J^\mu, \xi^\mu), \qquad (12)$$

where $J^\mu$ are the weights at time step $\mu$ and $J^{\mu+1}$ are the updated weights. Note that the update is scaled by the network size $N$. In this thesis, we concentrate on the gradient descent algorithm:

$$f(J^\mu, \xi^\mu) = -\eta \nabla_J \epsilon(J^\mu, \xi^\mu), \qquad (13)$$

where $\eta$ is the learning rate that controls the size of the step in the negative direction of the gradient, and $\nabla_J \epsilon(J^\mu, \xi^\mu)$ is the gradient of the error with respect to all student weights $J$, for the current example only. The full update can then be written as:

$$J^{\mu+1} = J^\mu - \frac{\eta}{N} \nabla_J \epsilon(J^\mu, \xi^\mu) = J^\mu + \frac{\eta}{N}\big[\tau(\xi^\mu) - \sigma(J^\mu, \xi^\mu)\big] \nabla_J \sigma(J^\mu, \xi^\mu). \qquad (14)$$

For a two-layered student with $K$ hidden neurons in the second layer and a differentiable activation function $g(x)$, the update of the weights of the $i$th hidden unit upon presentation of an example $\xi^\mu$ is then:

$$J_i^{\mu+1} = J_i^\mu + \frac{\eta}{N} \delta_i \xi^\mu, \qquad (15)$$

where $\delta_i$ is defined as:

$$\delta_i = g'(x_i)\big[\tau(\xi^\mu) - \sigma(J^\mu, \xi^\mu)\big]. \qquad (16)$$

The specific form of Equation (15) depends on the choice of activation function, as the activation function $g(x)$ and its first derivative $g'(x)$ appear explicitly on the right-hand side of the update equation. One possible choice of activation function is the sigmoidal Erf function:

$$g(x) = \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right), \qquad g'(x) = \sqrt{\frac{2}{\pi}}\, e^{-x^2/2}, \qquad (17)$$

of which an illustration can be seen in Figure 2; another is the rectifier activation function (ReLU):

$$g(x) = x\,\theta(x), \qquad g'(x) = \theta(x), \qquad (18)$$

where $\theta(x)$ is the step function defined as:

$$\theta(x) = \begin{cases} 1, & \text{for } x \geq 0 \\ 0, & \text{for } x < 0 \end{cases}$$

The ReLU function is shown in Figure 3. The two activation functions clearly have different properties, which will be discussed in Section 3. One property of the ReLU that is obvious from its definition is that its derivative is discontinuous at $x = 0$; in practice one can simply choose a value of 0 or 1 for $\mathrm{ReLU}'(0)$, i.e. choose whether or not to update there.
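To make Equations (15) and (16) concrete, the following is a minimal sketch (ours, not code from the thesis) of a single on-line gradient descent step for soft-committee machines with ReLU activation. It stores the weight vectors as matrix rows, i.e. the transpose of the $J \in \mathbb{R}^{N \times K}$ convention used above, and it chooses $g'(0) = 1$:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_prime(x):
    return (x >= 0).astype(float)  # here we choose ReLU'(0) = 1

def gd_step(J, B, xi, eta):
    """One on-line gradient descent step, Equations (15)-(16).
    J: K x N student weights, B: M x N teacher weights (one row per unit)."""
    x = J @ xi                      # student input activations x_i = J_i . xi
    y = B @ xi                      # teacher input activations y_n = B_n . xi
    sigma = relu(x).sum()           # student output, Equation (4)
    tau = relu(y).sum()             # teacher (rule) output, Equation (1)
    delta = relu_prime(x) * (tau - sigma)             # Equation (16)
    return J + (eta / xi.size) * np.outer(delta, xi)  # Equation (15)

# Usage sketch: matching K = M = 2 soft-committee machines, N = 1000.
rng = np.random.default_rng(0)
N, K, M, eta = 1000, 2, 2, 0.5
B = rng.normal(size=(M, N)) / np.sqrt(N)   # teacher rows of norm ~1
J = rng.normal(size=(K, N)) / np.sqrt(N)
for _ in range(10 * N):                    # 10 * N steps, i.e. alpha = 10
    J = gd_step(J, B, rng.normal(size=N), eta)
```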

When $K = M$, i.e. the number of hidden units in the student equals the number of hidden units in the teacher, the two architectures match and hence the student is able to fully represent the rule. If the learning algorithm reaches this desired solution for the student, then $\epsilon_g = 0$. It is not guaranteed that the learning algorithm finds the perfect solution in all learning scenarios, because the algorithm can get stuck in a local minimum of the cost function, for instance if batch learning is used on a training set. In our on-line learning scenario, since the stream of examples is infinite and the examples are drawn independently from the distribution, the perfect solution will be reached when student and teacher have matching architectures. When $K < M$ and $B_n \neq 0\ \forall n$, the student has less complexity than the teacher and therefore cannot represent the rule. When $K > M$, the student is more complex than the teacher. In practice, we often do not know the complexity of the rule that needs to be learned, hence studying these cases is of interest. In the rest of the thesis, these three learning scenarios will also be called realizable, unrealizable and overrealizable, respectively, terms that also appear in the literature.

Figure 2: The sigmoidal Erf activation function (a) and its Gaussian first derivative (b).

3. The Rectified Linear Unit (ReLU)

In this section, we discuss the properties of the rectifier function and its potential advantages and disadvantages. We discuss conjectures about the advantages of the function, empirical successes of its usage, and its mathematical properties.

There are several ways to express the rectifier function. One can be found in Equation (18); alternatives are $g(x) = x^+$ and $g(x) = \max(0, x)$. The latter is often used in implementations: it involves a single comparison, so computing the ReLU function for its argument is very efficient. Furthermore, the computation gives exact results. This is in contrast to popular methods for computing the erf function, which are not exact and increase in computational complexity with the desired accuracy [7].

The activation function $\mathrm{ReLU}: \mathbb{R} \to \mathbb{R}_{\geq 0}$ is a non-linear function that consists of two linear parts: on the $\mathbb{R}_{<0}$ part of its domain the slope of the function is zero, and on the $\mathbb{R}_{>0}$ part the slope is one. Hence, the rectifier's derivative has only two values in its image: $\mathrm{ReLU}': \mathbb{R} \to \{0, 1\}$. At $x = 0$ the derivative is discontinuous and therefore mathematically not defined; however, a value of 0 or 1 for $\mathrm{ReLU}'(0)$ can be chosen.

Comparing these properties with sigmoidal activation, the activation function $\mathrm{Erf}: \mathbb{R} \to [-1, 1]$ is more strongly non-linear and smooth over its entire domain. It has a smooth Gaussian derivative with a maximum at $x = 0$. From the definition of $\delta_i$ in Equation (16), the effect of the difference between the derivatives of ReLU and Erf can be seen. When rectifiers are used, upon presentation of an example $\xi^\mu$, the weights $J_i^\mu$ are updated by $(\eta/N)[\tau - \sigma]\xi^\mu$ if $x_i > 0$, or not updated at all if $x_i < 0$. When Erf units are used, the magnitude of the updates varies continuously, with a maximum at $x_i = 0$ and a continuous exponential decrease for $x_i < 0$ and $x_i > 0$.

3.1. Advantages

Here we will list some of the key potential advantages of the ReLU function compared to sigmoidal activation.

Figure 3: The ReLU activation function (a) and its first derivative, the step function $\theta(x)$ (b).

3.1.1. Efficient evaluation

One advantage of the ReLU function over sigmoidal activation is its computational efficiency. For learning in deep architectures, upon presentation of one example $\xi^\mu$, many neurons are evaluated to compute the output $\sigma(J^\mu, \xi^\mu)$, and in the backpropagation of the error the gradient is evaluated an equal number of times. Deep learning moreover requires many examples $\xi^\mu$ to avoid over-fitting, so this computation is executed many times. For this reason, the efficiency of evaluating the activation function is important, and using ReLU activation in deep architectures can reduce training time; the sketch below illustrates the difference.
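As a rough illustration (ours, not a benchmark from the thesis, and dependent on hardware and library versions), one can time both activations on the same batch of inputs, assuming SciPy's `erf` implementation:

```python
import timeit
import numpy as np
from scipy.special import erf

x = np.random.randn(1_000_000)
# 100 evaluations of each activation on 10^6 inputs
t_relu = timeit.timeit(lambda: np.maximum(x, 0.0), number=100)
t_erf = timeit.timeit(lambda: erf(x / np.sqrt(2.0)), number=100)
print(f"ReLU: {t_relu:.2f} s   erf: {t_erf:.2f} s")
```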

3.1.2. Non-saturation

As can be seen in Figure 3b, the gradient of the ReLU remains constant for all input activations $x_i \geq 0$. In contrast, the gradient of the sigmoidal function saturates quickly in both directions, as can be seen in Figure 2b. This causes the vanishing gradient problem, which can significantly slow down training. For instance, in [11] the authors observe that a four-layer convolutional neural network with ReLU activation reaches 25% training error on a large image classification task six times faster, in terms of stochastic gradient descent epochs, than the same architecture with tanh activation.

3.1.3. Better biological motivation and sparse representation

Several models exist for the activity of biological neurons and networks of biological neurons. These vary in the amount of detail modeled: some model the biophysical processes, while others are more abstract. A popular and simple model is the integrate-and-fire model, which does not model the biological processes that lead to the firing of a neuron [8]. In this model, an action potential occurs when the membrane potential reaches a threshold value under the influence of an input current. The membrane potential is then simply reset to its resting value, after which it builds up again. The firing frequency depends on the applied input current. Several extensions of the model have been proposed, including ones that model the refractory period, which limits the firing frequency of a neuron for constant input current, and ones that include the time-dependence of the membrane potential found in biological neurons, such as the leaky integrate-and-fire model.

Figure 4: A model for the activity of a biological neuron in terms of spikes per second (firing rate) under the influence of a constant input current $I$, shown both with and without the refractory period $\Delta_{\mathrm{abs}}$. The remaining parameters are threshold $\theta = 15$, $R = 1$ and $\Delta_{\mathrm{abs}} = 5$.

In [9], a definition of the firing rate of leaky integrate-and-fire neurons is given:

$$f(I_0) = \left[\Delta_{\mathrm{abs}} + \tau_m \ln\!\left(\frac{R I_0}{R I_0 - \theta}\right)\right]^{-1}, \qquad (19)$$

where $\Delta_{\mathrm{abs}}$ is the refractory period (ms), $\tau_m$ is the time constant, $R$ is the resistance of the linear resistor in the circuit, $I_0$ is the applied current and $\theta$ is the threshold potential at which the neuron fires. For further details, see for example [9]. The point of discussing this model is how it captures biological neuron activity, as shown in Figure 4. The model clearly shows a one-sided response: the neuron is silent for input currents below the threshold. For $\Delta_{\mathrm{abs}} = 0$, the activity of the neuron is more linear than for $\Delta_{\mathrm{abs}} > 0$. The one-sided ReLU activation for artificial neurons therefore models the biological neuron better than the sigmoidal activation, or the anti-symmetric and hence biologically implausible hyperbolic tangent activation. A small numerical sketch of Equation (19) follows below.
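The following sketch evaluates Equation (19); it is our illustration, with $\theta = 15$, $R = 1$ and $\Delta_{\mathrm{abs}} = 5$ taken from the caption of Figure 4, while the value $\tau_m = 10$ is an assumption (the caption does not specify it):

```python
import numpy as np

def firing_rate(I0, theta=15.0, R=1.0, tau_m=10.0, delta_abs=5.0):
    """Leaky integrate-and-fire firing rate, Equation (19).
    Returns 0 below threshold, where the neuron never fires."""
    I0 = np.asarray(I0, dtype=float)
    f = np.zeros_like(I0)
    above = R * I0 > theta          # one-sided, ReLU-like response
    f[above] = 1.0 / (delta_abs
                      + tau_m * np.log(R * I0[above] / (R * I0[above] - theta)))
    return f

print(firing_rate([10.0, 20.0, 30.0, 40.0, 50.0]))
```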

Brain studies have shown that on average at most 15% of the neurons are simultaneously active, which reduces energy consumption in the brain, as action potentials and other postsynaptic effects are predicted to consume 81% of the brain's energy [1]. Because of the one-sided property of the ReLU function, which is exactly zero on half of its domain, ReLU activation tends to introduce sparsity into the representation of the rule, conforming more closely to biological neural networks than sigmoidal activation. Upon random initialization of the student weights, it is expected that on average half of the hidden units in the network are not activated, because the input activation of unit $i$ will satisfy $x_i < 0$ half of the time.

3.2. Disadvantages and remedies

A potential disadvantage of the ReLU function is its zero gradient when the input activation $x_i$ upon presentation of an example $\xi^\mu$ is negative: such inactive neurons receive no update. If this occurs for all input vectors $\xi^\mu$ for some hidden neuron, that neuron will never be updated and therefore never activate. Such a unit is known in the literature as a dead ReLU. To circumvent the emergence of dead ReLUs, leaky ReLUs have been

Figure 5: Two popular variants of the default ReLU activation function: the Softplus function (a) and the leaky ReLU (b), each shown with its first derivative.

proposed [12], which have a small gradient for negative input:

$$\mathrm{ReLU}_{\mathrm{Leaky}}(x) = \begin{cases} x, & \text{for } x \geq 0 \\ ax, & \text{for } x < 0 \end{cases}$$

An example of the function for $a = 0.01$ is shown in Figure 5b. In the same paper, it is found that leaky ReLUs show more or less the same performance as standard ReLUs on a large word recognition task. The only benefit found was slightly faster convergence, probably due to the non-zero gradient for negative input. At the same time, there is evidence that leaky ReLUs can improve classification performance in other applications: in [16], improved performance is observed on image classification tasks when using leaky ReLUs with a larger gradient for negative inputs ($a = 0.2$). The authors do not report whether the sparsity of the representation suffers from using leaky ReLUs, but it can be concluded from their results that sparsity, although desirable, may not be the most important property explaining the good performance of rectifiers in neural networks.

The ReLU function is not differentiable at $x = 0$, because of the discontinuity of its derivative there. The Softplus function, shown in Figure 5a, is a smooth approximation of the ReLU that is differentiable at zero and also has a non-zero gradient for negative input. However, there is evidence that the Softplus function neither improves convergence in training nor reaches better performance; see for example [10]. Moreover, the Softplus function comes with a higher computational cost, as both the function and its derivative involve an exponential. Minimal definitions of both variants are sketched below.
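For reference, here are minimal definitions of the two variants and their derivatives (our sketch; $a = 0.01$ as in [12], while [16] uses $a = 0.2$):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def leaky_relu_prime(x, a=0.01):
    return np.where(x >= 0, 1.0, a)   # non-zero gradient for negative input

def softplus(x):
    # smooth ReLU approximation log(1 + e^x), written stably for large |x|
    return np.logaddexp(0.0, x)

def softplus_prime(x):
    # the derivative of Softplus is the logistic sigmoid; tanh form is stable
    return 0.5 * (1.0 + np.tanh(0.5 * x))
```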

Although the variants of the vanilla ReLU may yield better performance in some specific applications, the results do not seem to be consistently better. It also depends on the requirements: for example, if sparsity is desired in a specific model, then the default ReLU is probably the best choice.

Despite its potential disadvantages, the biological plausibility, the induced sparse representation and the non-vanishing gradient of the ReLU may explain the better performance of deep ReLU networks compared to deep networks with sigmoidal activation, as observed in numerous experiments. One example is found in [10], where reasonably deep architectures are trained on image classification tasks and ReLU, tanh and Softplus activations are compared. The authors find that deep ReLU architectures consistently outperform hyperbolic tangent activation on image classification tasks. At the same time, on the MNIST digit dataset, the average sparsity of the better-performing ReLU networks is 83.40%, meaning that only 16.6% of the 3000 hidden neurons in the architecture had a non-zero output. In [12], deep rectifier architectures show better performance than deep hyperbolic tangent architectures on word recognition tasks.

In the next section, we will formulate and study the behavior of learning in two-layer ReLU neural networks theoretically using The Statistical Physics of Learning framework.


4. Theoretical learning dynamics in ReLU networks

In this section, learning by online gradient descent in a two-layer ReLU neural network is formulated theoretically using The Statistical Physics of Learning framework.

4.1. The Statistical Physics of (online) Learning

The student-teacher formalism for learning a rule was already discussed in Section 2: the teacher defines the rule that is to be learned by the student. Although the student does not know anything about the rule, having a teacher that defines it allows us to study how the student learns the rule, among other benefits. In Section 2, we also discussed the difference between on-line and off-line learning. Here, we formulate the learning dynamics in the on-line learning scenario, where a stream of independent examples $(\xi^\mu, \tau(\xi^\mu)) \in \mathbb{R}^N \times \mathbb{R}$ is presented to the student and the system size $N \to \infty$, i.e. we study the system in the thermodynamic limit, which allows us to investigate the evolution of the system exactly by means of differential equations.

The weight vectors of the student and teacher will therefore also be $N$-dimensional vectors.

For the theoretical analysis, a distribution of the components of $\xi^\mu$ is assumed. Similar to [3], we take input components of zero mean and unit variance, i.e.:

$$\xi_i \sim N(0, 1), \quad i \in \{1, 2, \ldots, N\}, \qquad (20)$$

but other choices of distributions satisfying $\langle \xi_i \rangle = 0$ and $\langle \xi_i^2 \rangle = 1$ are possible as well, without any modifications to the framework. We repeat the definitions of the input activations of the hidden layer here:

$$x_i = J_i \cdot \xi, \quad y_n = B_n \cdot \xi, \quad i \in \{1, 2, \ldots, K\},\ n \in \{1, 2, \ldots, M\},$$

where $x_i$ is the input activation of student unit $i$ and $y_n$ is the input activation of teacher unit $n$. In the thermodynamic limit $N \to \infty$, these input activations are sums of many scaled Gaussian variables $N(0, 1)$. By the Central Limit Theorem (CLT), the input activations therefore become correlated Gaussian variables themselves. For an overview of the most influential proofs in the history of the CLT, see [6]. The joint Gaussian density $P(x_i, y_n)$ of the variables $(x_i, y_n)^T \in \mathbb{R}^{K+M}$ is determined by the covariance matrix consisting of all covariances between pairs of input activations. The covariance between a pair of student-student input activations $\langle x_i x_k \rangle$ is denoted $Q_{ik}$, the covariance between a pair of student-teacher activations $\langle x_i y_n \rangle$ is denoted $R_{in}$, and the covariance between a pair of teacher-teacher activations $\langle y_n y_m \rangle$ is denoted $T_{nm}$. The covariance matrix of all pairs of input activations is then the square matrix:

$$\Sigma = \begin{pmatrix} Q_{ik} & R_{in} \\ R_{ni} & T_{nm} \end{pmatrix} \in \mathbb{R}^{(K+M) \times (K+M)}. \qquad (21)$$

Note that the covariances $T_{nm}$ define properties of the rule. The covariance between student input activation $i$ and student input activation $k$, $Q_{ik}$, is then defined as:

$$Q_{ik} = \sum_{j=1}^{N} \sum_{l=1}^{N} J_{ij} J_{kl} \langle \xi_j \xi_l \rangle.$$

Note that the assumed i.i.d. components of the input vectors $\xi \in \mathbb{R}^N$ imply:

$$\langle \xi_j \xi_l \rangle = \begin{cases} 1, & \text{for } j = l \\ 0, & \text{for } j \neq l \end{cases}$$

Hence only the terms in the above double sum with $j = l$ survive, and we write:

$$Q_{ik} = \sum_{j=1}^{N} J_{ij} J_{kj} = J_i \cdot J_k. \qquad (22)$$

Similarly, the covariances between the student-teacher input activations, $R_{in}$, are defined as:

$$R_{in} = \sum_{j=1}^{N} \sum_{m=1}^{N} J_{ij} B_{nm} \langle \xi_j \xi_m \rangle = \sum_{j=1}^{N} J_{ij} B_{nj} = J_i \cdot B_n. \qquad (23)$$

Lastly, the covariances between the teacher input activations, $T_{nm}$, are defined as:

$$T_{nm} = \sum_{i=1}^{N} \sum_{j=1}^{N} B_{ni} B_{mj} \langle \xi_i \xi_j \rangle = \sum_{j=1}^{N} B_{nj} B_{mj} = B_n \cdot B_m. \qquad (24)$$

As these equations show, the covariances between the input activations reduce to scalar products between the corresponding weight vectors, as a result of the choice of i.i.d. input components.

The covariances $Q_{ik}$ and $R_{in}$ are the so-called order parameters of the system, since they summarize the state of a very large system. For example, a two-layer neural network with $K$ hidden units as considered here has $KN$ adaptable weights, which is a very large number for large $N$. The description of the system state in terms of order parameters is much more concise: there are $K(K+1)/2 + KM$ order parameters, where $M$ is the number of hidden units of the teacher. The sketch below computes them directly from the weights.
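Since Equations (22)-(24) are plain scalar products, the full set of order parameters reduces to three matrix products. This is our illustration (again storing weight vectors as rows, the transpose of the thesis convention):

```python
import numpy as np

def order_parameters(J, B):
    """Order parameters as scalar products, Equations (22)-(24).
    J: K x N student weights, B: M x N teacher weights (one row per unit)."""
    Q = J @ J.T   # Q_ik = J_i . J_k   (K x K)
    R = J @ B.T   # R_in = J_i . B_n   (K x M)
    T = B @ B.T   # T_nm = B_n . B_m   (M x M)
    return Q, R, T
```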

In Section 2, Equation (15) gives the weight updates of the student under gradient descent upon presentation of an example $\xi^\mu$ at training step $\mu$. Since the order parameters are functions of the weights $J$, the update of the order parameters upon presentation of an example can be given directly, using the definition of the updated weights in Equation (15):

$$R_{in}^{\mu+1} = J_i^{\mu+1} \cdot B_n = \left(J_i^\mu + \frac{\eta}{N} \delta_i^\mu \xi^\mu\right) \cdot B_n = R_{in}^\mu + \frac{\eta}{N} \delta_i^\mu\, \xi^\mu \cdot B_n = R_{in}^\mu + \frac{\eta}{N} \delta_i^\mu y_n^\mu,$$

and therefore the change in the order parameters $R_{in}$ upon presentation of the example $\xi^\mu$ is:

$$R_{in}^{\mu+1} - R_{in}^\mu = \frac{\eta}{N} \delta_i^\mu y_n^\mu. \qquad (25)$$

The update of the order parameters $Q_{ik}$ can likewise be written using the weight updates of Equation (15):

$$Q_{ik}^{\mu+1} = J_i^{\mu+1} \cdot J_k^{\mu+1} = \left(J_i^\mu + \frac{\eta}{N} \delta_i^\mu \xi^\mu\right) \cdot \left(J_k^\mu + \frac{\eta}{N} \delta_k^\mu \xi^\mu\right) = Q_{ik}^\mu + \frac{\eta}{N}\left(J_i^\mu \cdot \xi^\mu\, \delta_k^\mu + J_k^\mu \cdot \xi^\mu\, \delta_i^\mu\right) + \frac{\eta^2}{N^2} \delta_i^\mu \delta_k^\mu |\xi^\mu|^2 = Q_{ik}^\mu + \frac{\eta}{N}\left(x_i^\mu \delta_k^\mu + x_k^\mu \delta_i^\mu\right) + \frac{\eta^2}{N} \delta_i^\mu \delta_k^\mu,$$

where we used $|\xi^\mu|^2 = N$. This gives the change in the order parameters $Q_{ik}$ upon presentation of the example $\xi^\mu$:

$$Q_{ik}^{\mu+1} - Q_{ik}^\mu = \frac{\eta}{N}\left(x_i^\mu \delta_k^\mu + x_k^\mu \delta_i^\mu\right) + \frac{\eta^2}{N} \delta_i^\mu \delta_k^\mu. \qquad (26)$$

So far, networks of any system size $N \geq 1$ may be studied. In the thermodynamic limit $N \to \infty$, however, we can regard the discrete learning time $\mu$ as a continuous time $\alpha = \mu/N$ (with infinitesimal increments $d\alpha = 1/N$), and in this limit the order parameters are self-averaging [13]. Therefore, exact continuous learning dynamics in the limit $N \to \infty$ can be formulated as differential equations in terms of averages:

$$\frac{dR_{in}}{d\alpha} = \eta \langle \delta_i y_n \rangle, \qquad (27)$$

$$\frac{dQ_{ik}}{d\alpha} = \eta\left(\langle x_i \delta_k \rangle + \langle x_k \delta_i \rangle\right) + \eta^2 \langle \delta_i \delta_k \rangle, \qquad (28)$$

where the averages are taken over the $(K+M)$-dimensional joint Gaussian distribution of the input activation variables $(x_i, y_n) \in \mathbb{R}^{K+M}$; the equations are exact only for systems with $N \to \infty$.

A definition of the generalization error was given in Equation (11): the expected prediction error (Equation (6)) made by the student on an example $\xi^\mu$. This average can be evaluated as:

$$\begin{aligned}
\epsilon_g(J) &= \langle \epsilon(J, \xi) \rangle_\xi = \frac{1}{2}\left\langle [\sigma(J, \xi) - \tau(\xi)]^2 \right\rangle = \frac{1}{2}\left\langle \sigma(J, \xi)^2 - 2\,\sigma(J, \xi)\tau(\xi) + \tau(\xi)^2 \right\rangle \\
&= \frac{1}{2}\left[ \sum_{i=1}^{K}\sum_{j=1}^{K} \langle g(x_i)\, g(x_j) \rangle - 2 \sum_{i=1}^{K}\sum_{m=1}^{M} \langle g(x_i)\, g(y_m) \rangle + \sum_{m=1}^{M}\sum_{n=1}^{M} \langle g(y_m)\, g(y_n) \rangle \right].
\end{aligned} \qquad (29)$$

The above equation consists only of two-dimensional averages. We define the general two-dimensional average as:

$$I_2(u, v) = \langle g(u)\, g(v) \rangle = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(u)\, g(v)\, P(u, v)\, du\, dv, \qquad (30)$$

where $P(u, v)$ is the joint distribution with covariance matrix:

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix},$$

which is the relevant submatrix of the full covariance matrix of the input activation variables given in Equation (21).

For more details, see for example [3, 5, 14]. Note that in Equations (27), (28) and (29), the specific form of the averages is determined by the activation function used in the network. For some activation functions they result in analytical expressions, but this is not guaranteed for every possible activation function. In the next section we will see that it is possible to obtain exact analytical learning dynamics for the ReLU activation function in the limit $\eta \to 0$ and $N \to \infty$. In any case, the averages can always be estimated numerically, as in the sketch below.
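For activation functions without a closed-form $I_2$, Equation (30) can be estimated by sampling from the joint Gaussian. This is our sketch, not part of the thesis:

```python
import numpy as np

def I2_mc(Sigma, g, n=1_000_000, seed=0):
    """Monte Carlo estimate of I2 = <g(u) g(v)>, Equation (30),
    for zero-mean Gaussian (u, v) with 2 x 2 covariance matrix Sigma."""
    rng = np.random.default_rng(seed)
    u, v = rng.multivariate_normal([0.0, 0.0], Sigma, size=n).T
    return np.mean(g(u) * g(v))

relu = lambda x: np.maximum(x, 0.0)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(I2_mc(Sigma, relu))   # estimate of <ReLU(u) ReLU(v)>
```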


4.2. Equations of motion: student-teacher overlaps

In this section we derive the equations of motion for the student-teacher overlaps $R_{in}$ from Equation (27) for ReLU activation. The presentation closely follows the work of Saad and Solla [14].

Substituting the definition of $\delta_i$ into Equation (27) and rearranging, we obtain:

$$\begin{aligned}
\frac{\partial R_{in}}{\partial \alpha} &= \eta\, \langle \delta_i y_n \rangle &(31)\\
&= \eta \left\langle g'(x_i) \left[ \sum_{m=1}^{M} g(y_m) - \sum_{j=1}^{K} g(x_j) \right] y_n \right\rangle &(32)\\
&= \eta \left\langle g'(x_i)\, y_n \sum_{m=1}^{M} g(y_m) - g'(x_i)\, y_n \sum_{j=1}^{K} g(x_j) \right\rangle &(33)\\
&= \eta \left[ \left\langle g'(x_i)\, y_n \sum_{m=1}^{M} g(y_m) \right\rangle - \left\langle g'(x_i)\, y_n \sum_{j=1}^{K} g(x_j) \right\rangle \right] &(34)\\
&= \eta \left[ \left\langle \sum_{m=1}^{M} g'(x_i)\, y_n\, g(y_m) \right\rangle - \left\langle \sum_{j=1}^{K} g'(x_i)\, y_n\, g(x_j) \right\rangle \right] &(35)\\
&= \eta \left[ \sum_{m=1}^{M} \langle g'(x_i)\, y_n\, g(y_m) \rangle - \sum_{j=1}^{K} \langle g'(x_i)\, y_n\, g(x_j) \rangle \right]. &(36)
\end{aligned}$$

The above expression consists of sums containing two- and three-dimensional averages of Gaussian variables, of the form $I_3(u, v, w) = \langle g'(u)\, v\, g(w) \rangle$. For the activation function, our choice is $g(x) = \mathrm{ReLU}(x) = x\,\theta(x)$ and therefore correspondingly $g'(x) = \theta(x)$, as defined in Equation (18). For this choice, the $I_3$ terms take the form:

$$I_3(u, v, w) = \langle \theta(u)\, v\, w\, \theta(w) \rangle. \qquad (37)$$

The three-dimensional averages are evaluated using the three-dimensional Gaussian density $P_3(u, v, w)$. Since all the variables are zero-centered, these three-dimensional joint distributions depend only on the covariance matrix $\Sigma_3$:

$$\Sigma_3 = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix}, \qquad (38)$$

which is obtained as the relevant submatrix of the full covariance matrix of order parameters given in Equation (21). The general joint distribution is then formulated as:

$$P_3(u, v, w; \Sigma_3) = \frac{1}{\sqrt{(2\pi)^3 |\Sigma_3|}} \exp\left[-\frac{1}{2}\, (u, v, w)\, \Sigma_3^{-1}\, (u, v, w)^T\right]. \qquad (39)$$

The averages in Equation (37) are then evaluated by:

$$I_3(u, v, w; \Sigma_3) = \langle \theta(u)\, v\, w\, \theta(w) \rangle = \int_0^\infty \int_0^\infty \int_{-\infty}^{\infty} v\, w\, P_3(u, v, w; \Sigma_3)\, dv\, dw\, du. \qquad (40)$$

In [14], it is proven that the three-dimensional Gaussian $P_3(u, v, w)$ used to compute the three-dimensional averages can also be used to compute the two-dimensional averages. In this case, the three-dimensional Gaussian is formulated with a singular three-dimensional covariance matrix.

We find that the integral from Equation (37) can be expressed analytically in terms of the elements of the three-dimensional covariance matrix:

$$I_3(u, v, w) = \langle \theta(u)\, v\, w\, \theta(w) \rangle = \frac{\sigma_{12} \sqrt{\sigma_{11}\sigma_{33} - \sigma_{13}^2}}{2\pi \sigma_{11}} + \frac{\sigma_{23}}{2\pi} \sin^{-1}\!\left(\frac{\sigma_{13}}{\sqrt{\sigma_{11}\sigma_{33}}}\right) + \frac{\sigma_{23}}{4}, \qquad (41)$$

for $\sigma_{11}\sigma_{33} \geq \sigma_{13}^2$.
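Equation (41) is easy to sanity-check numerically. The following sketch (ours, not from the thesis) implements the analytic expression and compares it against a direct Monte Carlo estimate of the same average:

```python
import numpy as np

def I3_relu(S):
    """Analytic I3 = <theta(u) v w theta(w)>, Equation (41).
    S is the 3 x 3 covariance matrix of the zero-mean Gaussian (u, v, w)."""
    s11, s12, s13 = S[0, 0], S[0, 1], S[0, 2]
    s23, s33 = S[1, 2], S[2, 2]
    return (s12 * np.sqrt(s11 * s33 - s13 ** 2) / (2 * np.pi * s11)
            + s23 / (2 * np.pi) * np.arcsin(s13 / np.sqrt(s11 * s33))
            + s23 / 4.0)

def I3_mc(S, n=2_000_000, seed=0):
    """Monte Carlo estimate of the same average, for comparison."""
    rng = np.random.default_rng(seed)
    u, v, w = rng.multivariate_normal(np.zeros(3), S, size=n).T
    return np.mean((u >= 0) * v * w * (w >= 0))

S = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
print(I3_relu(S), I3_mc(S))   # the two values should agree to a few decimals
```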

The analytic expression for the averages given above can then be used to formulate the dynamics of the order parameters $R_{in}$, by substituting the appropriate expressions into the equation:

$$\frac{\partial R_{in}}{\partial \alpha} = \eta \left[ \sum_{m=1}^{M} \langle \theta(x_i)\, y_n\, y_m\, \theta(y_m) \rangle - \sum_{j=1}^{K} \langle \theta(x_i)\, y_n\, x_j\, \theta(x_j) \rangle \right]. \qquad (42)$$

We first treat the summation $\sum_{m=1}^{M} \langle \theta(x_i)\, y_n\, y_m\, \theta(y_m) \rangle$ from Equation (42). In the $M - 1$ averages in which the variables are distinct, the integration is done over the three-dimensional Gaussian distributions as formulated in Equation (39), with the corresponding covariance matrix:

$$\Sigma_3 = \begin{pmatrix} \langle x_i x_i \rangle & \langle x_i y_n \rangle & \langle x_i y_m \rangle \\ \langle x_i y_n \rangle & \langle y_n y_n \rangle & \langle y_n y_m \rangle \\ \langle x_i y_m \rangle & \langle y_n y_m \rangle & \langle y_m y_m \rangle \end{pmatrix} = \begin{pmatrix} Q_{ii} & R_{in} & R_{im} \\ R_{in} & T_{nn} & T_{nm} \\ R_{im} & T_{nm} & T_{mm} \end{pmatrix}.$$

In the remaining case $m = n$, the average to evaluate is $\langle \theta(x_i)\, y_n\, y_n\, \theta(y_n) \rangle = I_3(i, n, n)$, with the corresponding singular covariance matrix:

$$\tilde{\Sigma}_3 = \begin{pmatrix} Q_{ii} & R_{in} & R_{in} \\ R_{in} & T_{nn} & T_{nn} \\ R_{in} & T_{nn} & T_{nn} \end{pmatrix}$$

By substituting the covariance matrices for each term into Equation (41), we write:

$$\sum_{m=1}^{M} \left[ \frac{R_{in} \sqrt{Q_{ii} T_{mm} - R_{im}^2}}{2\pi Q_{ii}} + \frac{T_{nm}}{2\pi} \sin^{-1}\!\left(\frac{R_{im}}{\sqrt{Q_{ii} T_{mm}}}\right) + \frac{T_{nm}}{4} \right] \qquad (43)$$

Note that the above equation requires the order parameters to satisfy $Q_{ii} T_{mm} \geq R_{im}^2$. For all physically possible choices of the order parameters this always holds, by the following geometric argument:

$$Q_{ii} T_{mm} \geq R_{im}^2 \;\Leftrightarrow\; (J_i \cdot J_i)(B_m \cdot B_m) \geq (J_i \cdot B_m)^2 \;\Leftrightarrow\; \|J_i\|^2 \|B_m\|^2 \geq \|J_i\|^2 \|B_m\|^2 \cos^2[\angle(J_i, B_m)] \;\Leftrightarrow\; 1 \geq \cos^2[\angle(J_i, B_m)].$$

Because $\cos^2(\cdot) \in [0, 1]$, the inequality $Q_{ii} T_{mm} \geq R_{im}^2$ is always true for all physically possible choices of the order parameters.

We now treat the summation $\sum_{j=1}^{K} \langle \theta(x_i)\, y_n\, x_j\, \theta(x_j) \rangle$ in a similar fashion. In $K - 1$ averages there are three distinct Gaussian variables $x_i$, $y_n$ and $x_j$, with corresponding covariance matrix:

$$\Sigma_3 = \begin{pmatrix} \langle x_i x_i \rangle & \langle x_i y_n \rangle & \langle x_i x_j \rangle \\ \langle x_i y_n \rangle & \langle y_n y_n \rangle & \langle y_n x_j \rangle \\ \langle x_i x_j \rangle & \langle y_n x_j \rangle & \langle x_j x_j \rangle \end{pmatrix} = \begin{pmatrix} Q_{ii} & R_{in} & Q_{ij} \\ R_{in} & T_{nn} & R_{jn} \\ Q_{ij} & R_{jn} & Q_{jj} \end{pmatrix}.$$

In the remaining case $j = i$, the average to evaluate is $\langle \theta(x_i)\, y_n\, x_i\, \theta(x_i) \rangle = I_3(i, n, i)$, with corresponding singular covariance matrix:

$$\tilde{\Sigma}_3 = \begin{pmatrix} Q_{ii} & R_{in} & Q_{ii} \\ R_{in} & T_{nn} & R_{in} \\ Q_{ii} & R_{in} & Q_{ii} \end{pmatrix}$$

By substituting the covariance matrices for each term into Equation (41), we write:

$$\sum_{j=1}^{K} \left[ \frac{R_{in} \sqrt{Q_{ii} Q_{jj} - Q_{ij}^2}}{2\pi Q_{ii}} + \frac{R_{jn}}{2\pi} \sin^{-1}\!\left(\frac{Q_{ij}}{\sqrt{Q_{ii} Q_{jj}}}\right) + \frac{R_{jn}}{4} \right] \qquad (44)$$

By the same geometric argument, $Q_{ii} Q_{jj} \geq Q_{ij}^2$ must always hold for all physically possible choices of the order parameters:

$$Q_{ii} Q_{jj} \geq Q_{ij}^2 \;\Leftrightarrow\; (J_i \cdot J_i)(J_j \cdot J_j) \geq (J_i \cdot J_j)^2 \;\Leftrightarrow\; \|J_i\|^2 \|J_j\|^2 \geq \|J_i\|^2 \|J_j\|^2 \cos^2[\angle(J_i, J_j)] \;\Leftrightarrow\; 1 \geq \cos^2[\angle(J_i, J_j)].$$

By substituting the expressions for the summations from Equations (43) and (44) into Equation (42) and simplifying the terms, we obtain the equations of motion for the student-teacher overlaps $R_{in}$ when ReLU functions are used in the hidden units:

$$\frac{\partial R_{in}}{\partial \alpha} = \eta \left[ \sum_{m=1}^{M} \left( \frac{R_{in} \sqrt{Q_{ii} T_{mm} - R_{im}^2}}{2\pi Q_{ii}} + \frac{T_{nm}}{2\pi} \sin^{-1}\!\left(\frac{R_{im}}{\sqrt{Q_{ii} T_{mm}}}\right) + \frac{T_{nm}}{4} \right) - \sum_{j=1}^{K} \left( \frac{R_{in} \sqrt{Q_{ii} Q_{jj} - Q_{ij}^2}}{2\pi Q_{ii}} + \frac{R_{jn}}{2\pi} \sin^{-1}\!\left(\frac{Q_{ij}}{\sqrt{Q_{ii} Q_{jj}}}\right) + \frac{R_{jn}}{4} \right) \right] \qquad (45)$$
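As a closing illustration, Equation (45) can be transcribed almost literally into code. This is our sketch, not code from the thesis; the function name and matrix layout are our choices:

```python
import numpy as np

def dR_dalpha(Q, R, T, eta):
    """Right-hand side of Equation (45) for all student-teacher overlaps R_in.
    Q: K x K, R: K x M, T: M x M order parameter matrices."""
    K, M = R.shape
    dR = np.zeros_like(R)
    for i in range(K):
        for n in range(M):
            teacher_terms = sum(       # first sum in Equation (45)
                R[i, n] * np.sqrt(Q[i, i] * T[m, m] - R[i, m] ** 2) / (2 * np.pi * Q[i, i])
                + T[n, m] / (2 * np.pi) * np.arcsin(R[i, m] / np.sqrt(Q[i, i] * T[m, m]))
                + T[n, m] / 4.0
                for m in range(M))
            student_terms = sum(       # second sum in Equation (45)
                R[i, n] * np.sqrt(Q[i, i] * Q[j, j] - Q[i, j] ** 2) / (2 * np.pi * Q[i, i])
                + R[j, n] / (2 * np.pi) * np.arcsin(Q[i, j] / np.sqrt(Q[i, i] * Q[j, j]))
                + R[j, n] / 4.0
                for j in range(K))
            dR[i, n] = eta * (teacher_terms - student_terms)
    return dR
```

Together with the corresponding equations of motion for the student-student overlaps $Q_{ik}$, derived in Section 4.3, the system can then be integrated numerically, for example with simple Euler steps $R \leftarrow R + d\alpha \cdot \mathtt{dR\_dalpha}(Q, R, T, \eta)$.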
