
On-line learning in neural networks with ReLU activations

Michiel Straat, University of Groningen
September 22, 2018

Supervisors: Michael Biehl, Kerstin Bunte

Contents

1 Introduction
2 On-line and off-line learning of a rule
3 The Rectified Linear Unit (ReLU)
   3.1 Advantages
      3.1.1 Efficient evaluation
      3.1.2 Non-saturation
      3.1.3 Better biological motivation and sparse representation
   3.2 Disadvantages and remedies
4 Theoretical learning dynamics in ReLU networks
   4.1 The Statistical Physics of (online) Learning
   4.2 Equations of motion: student-teacher overlaps
   4.3 Equations of motion: student-student overlaps
   4.4 Generalization error
   4.5 Equations under site-symmetry assumptions
      4.5.1 Generalization error
   4.6 Equations of motion in Erf neural networks
   4.7 ReLU student and Erf teacher
   4.8 Erf student and ReLU teacher
   4.9 Dropout regularization
   4.10 The learning rate in the small regime
   4.11 Simulating learning in ReLU networks
5 Experiments
   5.1 The ReLU perceptron
   5.2 ReLU network K = M = 2
      5.2.1 Fixed points and plateau length
      5.2.2 Comparison with erf
   5.3 ReLU network K = M = 3 and site symmetry
   5.4 Overrealizable case: K = 3, M = 2
   5.5 Unrealizable case: K = 2, M = 3
6 Discussion
7 Outlook
   7.1 Two-layer ReLU network learning dynamics for larger η and learning rate adaptation
   7.2 Modifications of learning and regularization
   7.3 Potentially extending the theory to more layers


1. Introduction

The Statistical Physics of Learning has been a successful framework for describing the learning dynamics of machine learning models, in particular neural networks. It applies techniques from statistical physics, originally developed to describe large physical systems, to machine learning models that typically involve many adaptable weights; in deep learning, for instance, the number of weights can be on the order of hundreds of millions. Instead of tracking each individual weight, statistical physics techniques aggregate the individual parameters into so-called order parameters.

These order parameters describe the state of the system more concisely and therefore make the large system more interpretable. Not only do the order parameters provide better insight into the state of a neural network, it is also possible to formulate updates of the order parameters directly, as done for example in [14] for the gradient descent algorithm and in [4] for the Hebbian learning algorithm. In the thermodynamic limit, in which the system size is taken to be infinitely large, these update rules become continuous differential equations for the evolution of the order parameters. The set of differential equations for the order parameters provides a useful tool for studying the behavior of the model and the algorithm theoretically, in order to gain a deeper understanding of the learning process. Such understanding can then be used to reduce unwanted effects in the learning dynamics by adapting the algorithm, thereby improving real-world learning scenarios.

Different learning scenarios have already been studied within the framework, such as on-line learning in neural networks [3, 14, 5] and off-line/batch learning in neural networks [2]. The framework also accommodates extensions of the learning scenario: in [4], for example, drifting concepts are formulated and studied, and regularization techniques such as dropout can be treated similarly. Besides learning behavior, one can also study the behavior of neural networks of different architectures, using various choices of activation function in the neurons. For example, [3] and [14] study two-layered neural networks that use the sigmoidal Erf function in the hidden units.

Quite recently, the rectifier activation function (Rectified Linear Unit, ReLU) has become popular in deep learning applications, mostly because it empirically yields better performance than sigmoidal activations; see [10] for example. Although the rectifier has known theoretical advantages, which will also be discussed in this thesis, there is still a lack of rigorous mathematical arguments as to why ReLU networks learn faster and show better performance.

In order to gain a deeper understanding of the learning behavior of ReLU neural networks, we will formulate the gradient descent dynamics of the order parameters in ReLU networks using The Statistical Physics of Learning framework. We study the dynamics in the on-line learning scenario and compare the learning behavior in ReLU neural networks to the behavior in neural networks which use the sigmoidal Erf activation function.

In Section 2, we give a summary of on-line learning in neural networks and the definition of the learning dynamics. Section 3 discusses advantages and disadvantages of the rectifier activation function. In Section 4, we derive the differential equations for the order parameters in ReLU neural networks and formulate the corresponding generalization error, which are the main results of the thesis. In Section 5, we show theoretical and simulation results for ReLU neural networks with various architectures and compare those results with results obtained from learning with similar Erf networks.

Figure 1: Two-layer network. The $N$-dimensional input layer $\xi_1, \ldots, \xi_N$ feeds a hidden layer with activations $h_i = g(w_i \cdot \xi)$ through input-to-hidden weights $w_{ji}$; the output layer computes the linear combination $v \cdot h$ with hidden-to-output weights $v_i$.

2. On-line and off-line learning of a rule

In this section, we discuss general on-line learning of an unknown rule defined by a soft-committee machine, in particular for the gradient descent algorithm.

The rule is defined by the architecture and the set of weights $B$ of a teacher network. Here we consider two-layered architectures as in Figure 1: the first layer is the $N$-dimensional input layer and the second layer consists of $M$ hidden units. The set of weights of the teacher is in this case given by the matrix $B \in \mathbb{R}^{N \times M}$. In this matrix, column $n$ contains the weights associated with the $n$th hidden unit and row $i$ contains the weights associated with the $i$th input dimension. For a particular instance of weights $B$ and input vector $\xi^\mu$, each hidden unit $n$ receives as input the scalar value $y_n = B_n \cdot \xi^\mu$. An activation function $g(\cdot)$ is used in the hidden neurons to map the input activation to an output activation; the output activation of neuron $n$ is therefore $g(y_n)$. The output activations are then combined into the final output. In general, they can be combined in a linear combination using weights $v \in \mathbb{R}^M$. In a soft-committee machine, the plain sum of the output activations is taken, corresponding to $v_n = 1$ for each neuron $n$. Summarizing, the output of the described teacher network with unit hidden-to-output weights for an input $\xi^\mu$ is:

$$\tau(\xi^\mu) = \sum_{n=1}^{M} g\big(y_n(\xi^\mu)\big), \qquad (1)$$

where $y_n(\xi^\mu)$ is the input activation of the $n$th hidden neuron in the teacher network:

$$y_n(\xi^\mu) = B_n \cdot \xi^\mu. \qquad (2)$$

In the rest of the thesis, we refer to teacher hidden neuron $n$ as "teacher $n$". Every input $\xi^\mu$ now has a rule value associated with it according to Equation (1), denoted by the tuple $(\xi^\mu, \tau(\xi^\mu))$. In effect, we have obtained an infinitely large dataset of possible input vectors with corresponding rule labels:

$$D = \{(\xi, \tau(\xi))\}_{\xi} \qquad (3)$$

We consider a second two-layer neural network called the student, which has $K$ hidden neurons and weights $J \in \mathbb{R}^{N \times K}$. To the student, the rule is unknown, just as in applications where a neural network learns a mapping that is genuinely unknown. We denote the student output for an input $\xi^\mu$ as $\sigma(J, \xi^\mu)$:

$$\sigma(J, \xi^\mu) = \sum_{i=1}^{K} g\big(x_i(\xi^\mu)\big), \qquad (4)$$

where $x_i(\xi^\mu)$ is the input activation of the $i$th hidden neuron in the student network:

$$x_i(\xi^\mu) = J_i \cdot \xi^\mu. \qquad (5)$$

The goal of a learning algorithm is to adapt the student weights $J \in \mathbb{R}^{N \times K}$ such that the student output stays as close as possible to the output of the rule for every possible input $\xi$. For an input $\xi^\mu$, the disagreement between student and teacher is expressed as the error:

$$\epsilon(J, \xi^\mu) = \frac{1}{2}\big[\tau(\xi^\mu) - \sigma(J, \xi^\mu)\big]^2. \qquad (6)$$

The learning algorithm adapts the weights $J$ in order to reduce $\epsilon(J, \xi)$ for the inputs $\xi$.

In off-line learning scenarios, a fixed-size dataset of $P$ examples is available, from which the student can learn:

$$D_{\mathrm{train}} = \{(\xi^\mu, \tau(\xi^\mu)),\ \mu = 1, 2, \ldots, P\}. \qquad (7)$$

The dataset of $P$ examples is called the training set. The total error associated with a set of weights $J$ is the mean over the training examples:

$$\epsilon_t(J) = \frac{1}{P} \sum_{\mu=1}^{P} \epsilon(J, \xi^\mu), \qquad (8)$$

which is minimized by presenting the examples from the dataset in a particular order. One example is batch learning, which presents the entire training set (or a random subset of it) and updates the student weights $J$ according to the selected learning algorithm. Another example is stochastic learning, which randomly selects a single example from the training set and updates the weights $J$ using that example and the selected learning algorithm. Given a set of examples which the learning algorithm has never seen:

$$D_{\mathrm{test}} = \{(\xi^t, \tau(\xi^t)),\ t = 1, 2, \ldots, T\}, \qquad (9)$$

the generalization error is defined as the mean error made by the student on $D_{\mathrm{test}}$:

$$\epsilon_g(J) = \frac{1}{T} \sum_{t=1}^{T} \epsilon(J, \xi^t). \qquad (10)$$

In on-line learning, new examples $\xi$ are presented in a stream, and each presentation leads to an adaptation of the student weights. There is no explicit distinction between training and generalization error. The average error made by the student is defined as the generalization error:

$$\epsilon_g(J) = \langle \epsilon(J, \xi) \rangle_\xi, \qquad (11)$$

where the average denoted by $\langle \cdot \rangle$ is taken over the distribution of inputs. It is the expected error that the student will make on an arbitrary example $\xi^\mu$. At learning step $\mu$ in the on-line learning scenario, the student weights $J$ are updated according to a specific learning algorithm $f$:

$$J^{\mu+1} = J^\mu + \frac{1}{N} f(J^\mu, \xi^\mu), \qquad (12)$$

where $J^\mu$ are the weights at time step $\mu$ and $J^{\mu+1}$ are the updated weights. Note that the update is scaled by the network size $N$. In this thesis, we concentrate on the gradient descent algorithm:

$$f(J^\mu, \xi^\mu) = -\eta \nabla_J \epsilon(J^\mu, \xi^\mu), \qquad (13)$$

where $\eta$ is the learning rate that controls the size of the step in the negative direction of the gradient, and $\nabla_J \epsilon(J^\mu, \xi^\mu)$ is the gradient of the error with respect to all student weights $J$, for the current example only. The full update can then be written as:

$$J^{\mu+1} = J^\mu - \frac{\eta}{N} \nabla_J \epsilon(J^\mu, \xi^\mu) = J^\mu + \frac{\eta}{N}\big[\tau(\xi^\mu) - \sigma(J^\mu, \xi^\mu)\big] \nabla_J \sigma(J^\mu, \xi^\mu). \qquad (14)$$

For a two-layered student with $K$ hidden neurons in the second layer and a differentiable activation function $g(x)$, the update of the weights of the $i$th hidden unit upon presentation of an example $\xi^\mu$ is then:

$$J_i^{\mu+1} = J_i^\mu + \frac{\eta}{N} \delta_i \xi^\mu, \qquad (15)$$

where $\delta_i$ is defined as:

$$\delta_i = g'(x_i)\big[\tau(\xi^\mu) - \sigma(J^\mu, \xi^\mu)\big]. \qquad (16)$$

The specific form of Equation (15) depends on the choice of activation function, as the activation function $g(x)$ and its first derivative $g'(x)$ appear explicitly on the right-hand side of the update equation. One possible choice of activation function is the sigmoidal Erf function:

$$g(x) = \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right), \qquad g'(x) = \sqrt{\frac{2}{\pi}}\, e^{-x^2/2}, \qquad (17)$$

of which an illustration can be seen in Figure 2; another is the rectifier activation function (ReLU):

$$g(x) = x\,\theta(x), \qquad g'(x) = \theta(x), \qquad (18)$$

where $\theta(x)$ is the step function defined as:

$$\theta(x) = \begin{cases} 1, & \text{for } x \geq 0 \\ 0, & \text{for } x < 0 \end{cases}$$

The ReLU function is shown in Figure 3. The two activation functions clearly have different properties, which will be discussed in Section 3. One property of the ReLU that is obvious from its definition is that its derivative is discontinuous at $x = 0$; in practice one can simply choose a value of 0 or 1 for $\mathrm{ReLU}'(0)$, i.e. choose whether or not to update there.
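To make Equations (15) and (16) concrete, the following is a minimal sketch (ours, not code from the thesis) of a single on-line gradient descent step for soft-committee machines with ReLU activation. It stores the weight vectors as matrix rows, i.e. the transpose of the $J \in \mathbb{R}^{N \times K}$ convention used above, and it chooses $g'(0) = 1$:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_prime(x):
    return (x >= 0).astype(float)  # here we choose ReLU'(0) = 1

def gd_step(J, B, xi, eta):
    """One on-line gradient descent step, Equations (15)-(16).
    J: K x N student weights, B: M x N teacher weights (one row per unit)."""
    x = J @ xi                      # student input activations x_i = J_i . xi
    y = B @ xi                      # teacher input activations y_n = B_n . xi
    sigma = relu(x).sum()           # student output, Equation (4)
    tau = relu(y).sum()             # teacher (rule) output, Equation (1)
    delta = relu_prime(x) * (tau - sigma)             # Equation (16)
    return J + (eta / xi.size) * np.outer(delta, xi)  # Equation (15)

# Usage sketch: matching K = M = 2 soft-committee machines, N = 1000.
rng = np.random.default_rng(0)
N, K, M, eta = 1000, 2, 2, 0.5
B = rng.normal(size=(M, N)) / np.sqrt(N)   # teacher rows of norm ~1
J = rng.normal(size=(K, N)) / np.sqrt(N)
for _ in range(10 * N):                    # 10 * N steps, i.e. alpha = 10
    J = gd_step(J, B, rng.normal(size=N), eta)
```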

When $K = M$, i.e. the number of hidden units in the student equals the number of hidden units in the teacher, the two architectures match and hence the student is able to fully represent the rule. If the learning algorithm reaches this desired solution for the student, then $\epsilon_g = 0$. It is not guaranteed that the learning algorithm finds the perfect solution in all learning scenarios, because the algorithm can get stuck in a local minimum of the cost function, for instance if batch learning is used on a training set. In our on-line learning scenario, since the stream of examples is infinite and the examples are drawn independently from the distribution, the perfect solution will be reached when student and teacher have matching architectures. When $K < M$ and $B_n \neq 0\ \forall n$, the student has less complexity than the teacher and therefore cannot represent the rule. When $K > M$, the student is more complex than the teacher. In practice, we often do not know the complexity of the rule that needs to be learned, hence studying these cases is of interest. In the rest of the thesis, these three learning scenarios will also be called realizable, unrealizable and overrealizable, respectively, terms that also appear in the literature.

Figure 2: The sigmoidal Erf activation function (a) and its Gaussian first derivative (b).

3. The Rectified Linear Unit (ReLU)

In this section, we discuss the properties of the rectifier function and its potential advantages and disadvantages. We discuss conjectures about the advantages of the function, empirical successes of its usage, and its mathematical properties.

There are several ways to express the rectifier function. One can be found in Equation (18); alternatives are $g(x) = x^+$ and $g(x) = \max(0, x)$. The latter is often used in implementations: it involves a single comparison, so computing the ReLU function for its argument is very efficient. Furthermore, the computation gives exact results. This is in contrast to popular methods for computing the erf function, which are not exact and increase in computational complexity with the desired accuracy [7].

The activation function $\mathrm{ReLU}: \mathbb{R} \to \mathbb{R}_{\geq 0}$ is a non-linear function that consists of two linear parts: on the $\mathbb{R}_{<0}$ part of its domain the slope of the function is zero, and on the $\mathbb{R}_{>0}$ part the slope is one. Hence, the rectifier's derivative has only two values in its image: $\mathrm{ReLU}': \mathbb{R} \to \{0, 1\}$. At $x = 0$ the derivative is discontinuous and therefore mathematically not defined; however, a value of 0 or 1 for $\mathrm{ReLU}'(0)$ can be chosen.

Comparing these properties with sigmoidal activation, the activation function $\mathrm{Erf}: \mathbb{R} \to [-1, 1]$ is more strongly non-linear and smooth over its entire domain. It has a smooth Gaussian derivative with a maximum at $x = 0$. From the definition of $\delta_i$ in Equation (16), the effect of the difference between the derivatives of ReLU and Erf can be seen. When rectifiers are used, upon presentation of an example $\xi^\mu$, the weights $J_i^\mu$ are updated by $(\eta/N)[\tau - \sigma]\xi^\mu$ if $x_i > 0$, or not updated at all if $x_i < 0$. When Erf units are used, the magnitude of the updates varies continuously, with a maximum at $x_i = 0$ and a continuous exponential decrease for $x_i < 0$ and $x_i > 0$.

3.1. Advantages

Here we will list some of the key potential advantages of the ReLU function compared to sigmoidal activation.

Figure 3: The ReLU activation function (a) and its first derivative, the step function $\theta(x)$ (b).

3.1.1. Efficient evaluation

One advantage of the ReLU function over sigmoidal activation is its computational efficiency. For learning in deep architectures, upon presentation of one example $\xi^\mu$, many neurons are evaluated to compute the output $\sigma(J^\mu, \xi^\mu)$, and in the backpropagation of the error the gradient is evaluated an equal number of times. Deep learning moreover requires many examples $\xi^\mu$ to avoid over-fitting, so this computation is executed many times. For this reason, the efficiency of evaluating the activation function is important, and using ReLU activation in deep architectures can reduce training time; the sketch below illustrates the difference.
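As a rough illustration (ours, not a benchmark from the thesis, and dependent on hardware and library versions), one can time both activations on the same batch of inputs, assuming SciPy's `erf` implementation:

```python
import timeit
import numpy as np
from scipy.special import erf

x = np.random.randn(1_000_000)
# 100 evaluations of each activation on 10^6 inputs
t_relu = timeit.timeit(lambda: np.maximum(x, 0.0), number=100)
t_erf = timeit.timeit(lambda: erf(x / np.sqrt(2.0)), number=100)
print(f"ReLU: {t_relu:.2f} s   erf: {t_erf:.2f} s")
```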

3.1.2. Non-saturation

As can be seen in Figure 3b, the gradient of the ReLU remains constant for all input activations $x_i \geq 0$. In contrast, the gradient of the sigmoidal function saturates quickly in both directions, as can be seen in Figure 2b. This causes the vanishing gradient problem, which can significantly slow down training. For instance, in [11] the authors observe that a four-layer convolutional neural network with ReLU activation reaches 25% training error on a large image classification task six times faster, in terms of stochastic gradient descent epochs, than the same architecture with tanh activation.

3.1.3. Better biological motivation and sparse representation

Several models exist for the activity of biological neurons and networks of biological neurons. These vary in the amount of detail modeled: some model the biophysical processes, while others are more abstract. A popular and simple model is the integrate-and-fire model, which does not model the biological processes that lead to the firing of a neuron [8]. In this model, an action potential occurs when the membrane potential reaches a threshold value under the influence of an input current. The membrane potential is then simply reset to its resting value, after which it builds up again. The firing frequency depends on the applied input current. Several extensions of the model have been proposed, including ones that model the refractory period, which limits the firing frequency of a neuron for constant input current, and ones that include the time-dependence of the membrane potential found in biological neurons, such as the leaky integrate-and-fire model.

Figure 4: A model for the activity of a biological neuron in terms of spikes per second (firing rate) under the influence of a constant input current $I$, shown both with and without the refractory period $\Delta_{\mathrm{abs}}$. The remaining parameters are threshold $\theta = 15$, $R = 1$ and $\Delta_{\mathrm{abs}} = 5$.

In [9], a definition of the firing rate of leaky integrate-and-fire neurons is given:

$$f(I_0) = \left[\Delta_{\mathrm{abs}} + \tau_m \ln\!\left(\frac{R I_0}{R I_0 - \theta}\right)\right]^{-1}, \qquad (19)$$

where $\Delta_{\mathrm{abs}}$ is the refractory period (ms), $\tau_m$ is the time constant, $R$ is the resistance of the linear resistor in the circuit, $I_0$ is the applied current and $\theta$ is the threshold potential at which the neuron fires. For further details, see for example [9]. The point of discussing this model is how it captures biological neuron activity, as shown in Figure 4. The model clearly shows a one-sided response: the neuron is silent for input currents below the threshold. For $\Delta_{\mathrm{abs}} = 0$, the activity of the neuron is more linear than for $\Delta_{\mathrm{abs}} > 0$. The one-sided ReLU activation for artificial neurons therefore models the biological neuron better than the sigmoidal activation, or the anti-symmetric and hence biologically implausible hyperbolic tangent activation. A small numerical sketch of Equation (19) follows below.
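The following sketch evaluates Equation (19); it is our illustration, with $\theta = 15$, $R = 1$ and $\Delta_{\mathrm{abs}} = 5$ taken from the caption of Figure 4, while the value $\tau_m = 10$ is an assumption (the caption does not specify it):

```python
import numpy as np

def firing_rate(I0, theta=15.0, R=1.0, tau_m=10.0, delta_abs=5.0):
    """Leaky integrate-and-fire firing rate, Equation (19).
    Returns 0 below threshold, where the neuron never fires."""
    I0 = np.asarray(I0, dtype=float)
    f = np.zeros_like(I0)
    above = R * I0 > theta          # one-sided, ReLU-like response
    f[above] = 1.0 / (delta_abs
                      + tau_m * np.log(R * I0[above] / (R * I0[above] - theta)))
    return f

print(firing_rate([10.0, 20.0, 30.0, 40.0, 50.0]))
```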

Brain studies have shown that on average at most 15% of the neurons are simultaneously active, which reduces energy consumption in the brain, as action potentials and other postsynaptic effects are predicted to consume 81% of the brain's energy [1]. Because of the one-sided property of the ReLU function, which is exactly zero on half of its domain, ReLU activation tends to introduce sparsity into the representation of the rule, conforming more closely to biological neural networks than sigmoidal activation. Upon random initialization of the student weights, it is expected that on average half of the hidden units in the network are not activated, because the input activation of unit $i$ will satisfy $x_i < 0$ half of the time.

3.2. Disadvantages and remedies

A potential disadvantage of the ReLU function is its zero gradient when the input activation $x_i$ upon presentation of an example $\xi^\mu$ is negative: such inactive neurons receive no update. If this occurs for all input vectors $\xi^\mu$ for some hidden neuron, that neuron will never be updated and therefore never activate. Such a unit is known in the literature as a dead ReLU. To circumvent the emergence of dead ReLUs, leaky ReLUs have been

Figure 5: Two popular variants of the default ReLU activation function: the Softplus function (a) and the leaky ReLU (b), each shown with its first derivative.

proposed [12], which have a small gradient for negative input:

$$\mathrm{ReLU}_{\mathrm{Leaky}}(x) = \begin{cases} x, & \text{for } x \geq 0 \\ ax, & \text{for } x < 0 \end{cases}$$

An example of the function for $a = 0.01$ is shown in Figure 5b. In the same paper, it is found that leaky ReLUs show more or less the same performance as standard ReLUs on a large word recognition task. The only benefit found was slightly faster convergence, probably due to the non-zero gradient for negative input. At the same time, there is evidence that leaky ReLUs can improve classification performance in other applications: in [16], improved performance is observed on image classification tasks when using leaky ReLUs with a larger gradient for negative inputs ($a = 0.2$). The authors do not report whether the sparsity of the representation suffers from using leaky ReLUs, but it can be concluded from their results that sparsity, although desirable, may not be the most important property explaining the good performance of rectifiers in neural networks.

The ReLU function is not differentiable at $x = 0$, because of the discontinuity of its derivative there. The Softplus function, shown in Figure 5a, is a smooth approximation of the ReLU that is differentiable at zero and also has a non-zero gradient for negative input. However, there is evidence that the Softplus function neither improves convergence in training nor reaches better performance; see for example [10]. Moreover, the Softplus function comes with a higher computational cost, as both the function and its derivative involve an exponential. Minimal definitions of both variants are sketched below.
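For reference, here are minimal definitions of the two variants and their derivatives (our sketch; $a = 0.01$ as in [12], while [16] uses $a = 0.2$):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def leaky_relu_prime(x, a=0.01):
    return np.where(x >= 0, 1.0, a)   # non-zero gradient for negative input

def softplus(x):
    # smooth ReLU approximation log(1 + e^x), written stably for large |x|
    return np.logaddexp(0.0, x)

def softplus_prime(x):
    # the derivative of Softplus is the logistic sigmoid; tanh form is stable
    return 0.5 * (1.0 + np.tanh(0.5 * x))
```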

Although the variants of the vanilla ReLU may yield better performance in some specific applications, the results do not seem to be consistently better. It also depends on the requirements: for example, if sparsity is desired in a specific model, then the default ReLU is probably the best choice.

Despite its potential disadvantages, the biological plausibility, the induced sparse representation and the non-vanishing gradient of the ReLU may explain the better performance of deep ReLU networks compared to deep networks with sigmoidal activation, as observed in numerous experiments. One example is found in [10], where reasonably deep architectures are trained on image classification tasks and ReLU, tanh and Softplus activations are compared. The authors find that deep ReLU architectures consistently outperform hyperbolic tangent activation on image classification tasks. At the same time, on the MNIST digit dataset, the average sparsity of the better-performing ReLU networks is 83.40%, meaning that only 16.6% of the 3000 hidden neurons in the architecture had a non-zero output. In [12], deep rectifier architectures show better performance than deep hyperbolic tangent architectures on word recognition tasks.

In the next section, we will formulate and study the behavior of learning in two-layer ReLU neural networks theoretically using The Statistical Physics of Learning framework.


4. Theoretical learning dynamics in ReLU networks

In this section, learning by online gradient descent in a two-layer ReLU neural network is formulated theoretically using The Statistical Physics of Learning framework.

4.1. The Statistical Physics of (online) Learning

The student-teacher formalism for learning a rule was already discussed in Section 2: the teacher defines the rule that is to be learned by the student. Although the student does not know anything about the rule, having a teacher that defines it allows us to study how the student learns the rule, among other benefits. In Section 2, we also discussed the difference between on-line and off-line learning. Here, we formulate the learning dynamics in the on-line learning scenario, where a stream of independent examples $(\xi^\mu, \tau(\xi^\mu)) \in \mathbb{R}^N \times \mathbb{R}$ is presented to the student and the system size $N \to \infty$, i.e. we study the system in the thermodynamic limit, which allows us to investigate the evolution of the system exactly by means of differential equations.

The weight vectors of the student and teacher will therefore also be $N$-dimensional vectors.

For the theoretical analysis, a distribution of the components of $\xi^\mu$ is assumed. Similar to [3], we take input components of zero mean and unit variance, i.e.:

$$\xi_i \sim N(0, 1), \quad i \in \{1, 2, \ldots, N\}, \qquad (20)$$

but other choices of distributions satisfying $\langle \xi_i \rangle = 0$ and $\langle \xi_i^2 \rangle = 1$ are possible as well, without any modifications to the framework. We repeat the definitions of the input activations of the hidden layer here:

$$x_i = J_i \cdot \xi, \quad y_n = B_n \cdot \xi, \quad i \in \{1, 2, \ldots, K\},\ n \in \{1, 2, \ldots, M\},$$

where $x_i$ is the input activation of student unit $i$ and $y_n$ is the input activation of teacher unit $n$. In the thermodynamic limit $N \to \infty$, these input activations are sums of many scaled Gaussian variables $N(0, 1)$. By the Central Limit Theorem (CLT), the input activations therefore become correlated Gaussian variables themselves. For an overview of the most influential proofs in the history of the CLT, see [6]. The joint Gaussian density $P(x_i, y_n)$ of the variables $(x_i, y_n)^T \in \mathbb{R}^{K+M}$ is determined by the covariance matrix consisting of all covariances between pairs of input activations. The covariance between a pair of student-student input activations $\langle x_i x_k \rangle$ is denoted $Q_{ik}$, the covariance between a pair of student-teacher activations $\langle x_i y_n \rangle$ is denoted $R_{in}$, and the covariance between a pair of teacher-teacher activations $\langle y_n y_m \rangle$ is denoted $T_{nm}$. The covariance matrix of all pairs of input activations is then the square matrix:

$$\Sigma = \begin{pmatrix} Q_{ik} & R_{in} \\ R_{ni} & T_{nm} \end{pmatrix} \in \mathbb{R}^{(K+M) \times (K+M)}. \qquad (21)$$

Note that the covariances $T_{nm}$ define properties of the rule. The covariance between student input activation $i$ and student input activation $k$, $Q_{ik}$, is then defined as:

$$Q_{ik} = \sum_{j=1}^{N} \sum_{l=1}^{N} J_{ij} J_{kl} \langle \xi_j \xi_l \rangle.$$

Note that the assumed i.i.d. components of the input vectors $\xi \in \mathbb{R}^N$ imply:

$$\langle \xi_j \xi_l \rangle = \begin{cases} 1, & \text{for } j = l \\ 0, & \text{for } j \neq l \end{cases}$$

Hence only the terms in the above double sum with $j = l$ survive, and we write:

$$Q_{ik} = \sum_{j=1}^{N} J_{ij} J_{kj} = J_i \cdot J_k. \qquad (22)$$

Similarly, the covariances between the student-teacher input activations, $R_{in}$, are defined as:

$$R_{in} = \sum_{j=1}^{N} \sum_{m=1}^{N} J_{ij} B_{nm} \langle \xi_j \xi_m \rangle = \sum_{j=1}^{N} J_{ij} B_{nj} = J_i \cdot B_n. \qquad (23)$$

Lastly, the covariances between the teacher input activations, $T_{nm}$, are defined as:

$$T_{nm} = \sum_{i=1}^{N} \sum_{j=1}^{N} B_{ni} B_{mj} \langle \xi_i \xi_j \rangle = \sum_{j=1}^{N} B_{nj} B_{mj} = B_n \cdot B_m. \qquad (24)$$

As these equations show, the covariances between the input activations reduce to scalar products between the corresponding weight vectors, as a result of the choice of i.i.d. input components.

The covariances $Q_{ik}$ and $R_{in}$ are the so-called order parameters of the system, since they summarize the state of a very large system. For example, a two-layer neural network with $K$ hidden units as considered here has $KN$ adaptable weights, which is a very large number for large $N$. The description of the system state in terms of order parameters is much more concise: there are $K(K+1)/2 + KM$ order parameters, where $M$ is the number of hidden units of the teacher. The sketch below computes them directly from the weights.
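Since Equations (22)-(24) are plain scalar products, the full set of order parameters reduces to three matrix products. This is our illustration (again storing weight vectors as rows, the transpose of the thesis convention):

```python
import numpy as np

def order_parameters(J, B):
    """Order parameters as scalar products, Equations (22)-(24).
    J: K x N student weights, B: M x N teacher weights (one row per unit)."""
    Q = J @ J.T   # Q_ik = J_i . J_k   (K x K)
    R = J @ B.T   # R_in = J_i . B_n   (K x M)
    T = B @ B.T   # T_nm = B_n . B_m   (M x M)
    return Q, R, T
```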

In Section 2, Equation (15) gives the weight updates of the student under gradient descent upon presentation of an example $\xi^\mu$ at training step $\mu$. Since the order parameters are functions of the weights $J$, the update of the order parameters upon presentation of an example can be given directly, using the definition of the updated weights in Equation (15):

$$R_{in}^{\mu+1} = J_i^{\mu+1} \cdot B_n = \left(J_i^\mu + \frac{\eta}{N} \delta_i^\mu \xi^\mu\right) \cdot B_n = R_{in}^\mu + \frac{\eta}{N} \delta_i^\mu\, \xi^\mu \cdot B_n = R_{in}^\mu + \frac{\eta}{N} \delta_i^\mu y_n^\mu,$$

and therefore the change in the order parameters $R_{in}$ upon presentation of the example $\xi^\mu$ is:

$$R_{in}^{\mu+1} - R_{in}^\mu = \frac{\eta}{N} \delta_i^\mu y_n^\mu. \qquad (25)$$

The update of the order parameters $Q_{ik}$ can likewise be written using the weight updates of Equation (15):

$$Q_{ik}^{\mu+1} = J_i^{\mu+1} \cdot J_k^{\mu+1} = \left(J_i^\mu + \frac{\eta}{N} \delta_i^\mu \xi^\mu\right) \cdot \left(J_k^\mu + \frac{\eta}{N} \delta_k^\mu \xi^\mu\right) = Q_{ik}^\mu + \frac{\eta}{N}\left(J_i^\mu \cdot \xi^\mu\, \delta_k^\mu + J_k^\mu \cdot \xi^\mu\, \delta_i^\mu\right) + \frac{\eta^2}{N^2} \delta_i^\mu \delta_k^\mu |\xi^\mu|^2 = Q_{ik}^\mu + \frac{\eta}{N}\left(x_i^\mu \delta_k^\mu + x_k^\mu \delta_i^\mu\right) + \frac{\eta^2}{N} \delta_i^\mu \delta_k^\mu,$$

where we used $|\xi^\mu|^2 = N$. This gives the change in the order parameters $Q_{ik}$ upon presentation of the example $\xi^\mu$:

$$Q_{ik}^{\mu+1} - Q_{ik}^\mu = \frac{\eta}{N}\left(x_i^\mu \delta_k^\mu + x_k^\mu \delta_i^\mu\right) + \frac{\eta^2}{N} \delta_i^\mu \delta_k^\mu. \qquad (26)$$

So far, networks of any system size $N \geq 1$ may be studied. In the thermodynamic limit $N \to \infty$, however, we can regard the discrete learning time $\mu$ as a continuous time $\alpha = \mu/N$ (with infinitesimal increments $d\alpha = 1/N$), and in this limit the order parameters are self-averaging [13]. Therefore, exact continuous learning dynamics in the limit $N \to \infty$ can be formulated as differential equations in terms of averages:

$$\frac{dR_{in}}{d\alpha} = \eta \langle \delta_i y_n \rangle, \qquad (27)$$

$$\frac{dQ_{ik}}{d\alpha} = \eta\left(\langle x_i \delta_k \rangle + \langle x_k \delta_i \rangle\right) + \eta^2 \langle \delta_i \delta_k \rangle, \qquad (28)$$

where the averages are taken over the $(K+M)$-dimensional joint Gaussian distribution of the input activation variables $(x_i, y_n) \in \mathbb{R}^{K+M}$; the equations are exact only for systems with $N \to \infty$.

A definition of the generalization error was given in Equation (11): the expected prediction error (Equation (6)) made by the student on an example $\xi^\mu$. This average can be evaluated as:

$$\begin{aligned}
\epsilon_g(J) &= \langle \epsilon(J, \xi) \rangle_\xi = \frac{1}{2}\left\langle [\sigma(J, \xi) - \tau(\xi)]^2 \right\rangle = \frac{1}{2}\left\langle \sigma(J, \xi)^2 - 2\,\sigma(J, \xi)\tau(\xi) + \tau(\xi)^2 \right\rangle \\
&= \frac{1}{2}\left[ \sum_{i=1}^{K}\sum_{j=1}^{K} \langle g(x_i)\, g(x_j) \rangle - 2 \sum_{i=1}^{K}\sum_{m=1}^{M} \langle g(x_i)\, g(y_m) \rangle + \sum_{m=1}^{M}\sum_{n=1}^{M} \langle g(y_m)\, g(y_n) \rangle \right].
\end{aligned} \qquad (29)$$

The above equation consists only of two-dimensional averages. We define the general two-dimensional average as:

$$I_2(u, v) = \langle g(u)\, g(v) \rangle = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(u)\, g(v)\, P(u, v)\, du\, dv, \qquad (30)$$

where $P(u, v)$ is the joint distribution with covariance matrix:

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix},$$

which is the relevant submatrix of the full covariance matrix of the input activation variables given in Equation (21).

For more details, see for example [3, 5, 14]. Note that in Equations (27), (28) and (29), the specific form of the averages is determined by the activation function used in the network. For some activation functions they result in analytical expressions, but this is not guaranteed for every possible activation function. In the next section we will see that it is possible to obtain exact analytical learning dynamics for the ReLU activation function in the limit $\eta \to 0$ and $N \to \infty$. In any case, the averages can always be estimated numerically, as in the sketch below.
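For activation functions without a closed-form $I_2$, Equation (30) can be estimated by sampling from the joint Gaussian. This is our sketch, not part of the thesis:

```python
import numpy as np

def I2_mc(Sigma, g, n=1_000_000, seed=0):
    """Monte Carlo estimate of I2 = <g(u) g(v)>, Equation (30),
    for zero-mean Gaussian (u, v) with 2 x 2 covariance matrix Sigma."""
    rng = np.random.default_rng(seed)
    u, v = rng.multivariate_normal([0.0, 0.0], Sigma, size=n).T
    return np.mean(g(u) * g(v))

relu = lambda x: np.maximum(x, 0.0)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(I2_mc(Sigma, relu))   # estimate of <ReLU(u) ReLU(v)>
```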


4.2. Equations of motion: student-teacher overlaps

In this section we derive the equations of motion for the student-teacher overlaps $R_{in}$ from Equation (27) for ReLU activation. The presentation closely follows the work of Saad and Solla [14].

Substituting the definition of $\delta_i$ into Equation (27) and rearranging, we obtain:

$$\begin{aligned}
\frac{\partial R_{in}}{\partial \alpha} &= \eta\, \langle \delta_i y_n \rangle &(31)\\
&= \eta \left\langle g'(x_i) \left[ \sum_{m=1}^{M} g(y_m) - \sum_{j=1}^{K} g(x_j) \right] y_n \right\rangle &(32)\\
&= \eta \left\langle g'(x_i)\, y_n \sum_{m=1}^{M} g(y_m) - g'(x_i)\, y_n \sum_{j=1}^{K} g(x_j) \right\rangle &(33)\\
&= \eta \left[ \left\langle g'(x_i)\, y_n \sum_{m=1}^{M} g(y_m) \right\rangle - \left\langle g'(x_i)\, y_n \sum_{j=1}^{K} g(x_j) \right\rangle \right] &(34)\\
&= \eta \left[ \left\langle \sum_{m=1}^{M} g'(x_i)\, y_n\, g(y_m) \right\rangle - \left\langle \sum_{j=1}^{K} g'(x_i)\, y_n\, g(x_j) \right\rangle \right] &(35)\\
&= \eta \left[ \sum_{m=1}^{M} \langle g'(x_i)\, y_n\, g(y_m) \rangle - \sum_{j=1}^{K} \langle g'(x_i)\, y_n\, g(x_j) \rangle \right]. &(36)
\end{aligned}$$

The above expression consists of sums containing two- and three-dimensional averages of Gaussian variables, of the form $I_3(u, v, w) = \langle g'(u)\, v\, g(w) \rangle$. For the activation function, our choice is $g(x) = \mathrm{ReLU}(x) = x\,\theta(x)$ and therefore correspondingly $g'(x) = \theta(x)$, as defined in Equation (18). For this choice, the $I_3$ terms take the form:

$$I_3(u, v, w) = \langle \theta(u)\, v\, w\, \theta(w) \rangle. \qquad (37)$$

The three-dimensional averages are evaluated using the three-dimensional Gaussian density $P_3(u, v, w)$. Since all the variables are zero-centered, these three-dimensional joint distributions depend only on the covariance matrix $\Sigma_3$:

$$\Sigma_3 = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix}, \qquad (38)$$

which is obtained as the relevant submatrix of the full covariance matrix of order parameters given in Equation (21). The general joint distribution is then formulated as:

$$P_3(u, v, w; \Sigma_3) = \frac{1}{\sqrt{(2\pi)^3 |\Sigma_3|}} \exp\left[-\frac{1}{2}\, (u, v, w)\, \Sigma_3^{-1}\, (u, v, w)^T\right]. \qquad (39)$$

The averages in Equation (37) are then evaluated by:

$$I_3(u, v, w; \Sigma_3) = \langle \theta(u)\, v\, w\, \theta(w) \rangle = \int_0^\infty \int_0^\infty \int_{-\infty}^{\infty} v\, w\, P_3(u, v, w; \Sigma_3)\, dv\, dw\, du. \qquad (40)$$

In [14], it is proven that the three-dimensional Gaussian $P_3(u, v, w)$ used to compute the three-dimensional averages can also be used to compute the two-dimensional averages. In this case, the three-dimensional Gaussian is formulated with a singular three-dimensional covariance matrix.

We find that the integral from Equation (37) can be expressed analytically in terms of the elements of the three-dimensional covariance matrix:

$$I_3(u, v, w) = \langle \theta(u)\, v\, w\, \theta(w) \rangle = \frac{\sigma_{12} \sqrt{\sigma_{11}\sigma_{33} - \sigma_{13}^2}}{2\pi \sigma_{11}} + \frac{\sigma_{23}}{2\pi} \sin^{-1}\!\left(\frac{\sigma_{13}}{\sqrt{\sigma_{11}\sigma_{33}}}\right) + \frac{\sigma_{23}}{4}, \qquad (41)$$

for $\sigma_{11}\sigma_{33} \geq \sigma_{13}^2$.
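Equation (41) is easy to sanity-check numerically. The following sketch (ours, not from the thesis) implements the analytic expression and compares it against a direct Monte Carlo estimate of the same average:

```python
import numpy as np

def I3_relu(S):
    """Analytic I3 = <theta(u) v w theta(w)>, Equation (41).
    S is the 3 x 3 covariance matrix of the zero-mean Gaussian (u, v, w)."""
    s11, s12, s13 = S[0, 0], S[0, 1], S[0, 2]
    s23, s33 = S[1, 2], S[2, 2]
    return (s12 * np.sqrt(s11 * s33 - s13 ** 2) / (2 * np.pi * s11)
            + s23 / (2 * np.pi) * np.arcsin(s13 / np.sqrt(s11 * s33))
            + s23 / 4.0)

def I3_mc(S, n=2_000_000, seed=0):
    """Monte Carlo estimate of the same average, for comparison."""
    rng = np.random.default_rng(seed)
    u, v, w = rng.multivariate_normal(np.zeros(3), S, size=n).T
    return np.mean((u >= 0) * v * w * (w >= 0))

S = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
print(I3_relu(S), I3_mc(S))   # the two values should agree to a few decimals
```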

The analytic expression for the averages given above can then be used to formulate the dynamics of the order parameters $R_{in}$, by substituting the appropriate expressions into the equation:

$$\frac{\partial R_{in}}{\partial \alpha} = \eta \left[ \sum_{m=1}^{M} \langle \theta(x_i)\, y_n\, y_m\, \theta(y_m) \rangle - \sum_{j=1}^{K} \langle \theta(x_i)\, y_n\, x_j\, \theta(x_j) \rangle \right]. \qquad (42)$$

We first treat the summation $\sum_{m=1}^{M} \langle \theta(x_i)\, y_n\, y_m\, \theta(y_m) \rangle$ from Equation (42). In the $M - 1$ averages in which the variables are distinct, the integration is done over the three-dimensional Gaussian distributions as formulated in Equation (39), with the corresponding covariance matrix:

$$\Sigma_3 = \begin{pmatrix} \langle x_i x_i \rangle & \langle x_i y_n \rangle & \langle x_i y_m \rangle \\ \langle x_i y_n \rangle & \langle y_n y_n \rangle & \langle y_n y_m \rangle \\ \langle x_i y_m \rangle & \langle y_n y_m \rangle & \langle y_m y_m \rangle \end{pmatrix} = \begin{pmatrix} Q_{ii} & R_{in} & R_{im} \\ R_{in} & T_{nn} & T_{nm} \\ R_{im} & T_{nm} & T_{mm} \end{pmatrix}.$$

In the remaining case $m = n$, the average to evaluate is $\langle \theta(x_i)\, y_n\, y_n\, \theta(y_n) \rangle = I_3(i, n, n)$, with the corresponding singular covariance matrix:

$$\tilde{\Sigma}_3 = \begin{pmatrix} Q_{ii} & R_{in} & R_{in} \\ R_{in} & T_{nn} & T_{nn} \\ R_{in} & T_{nn} & T_{nn} \end{pmatrix}$$

By substituting the covariance matrices for each term into Equation (41), we write:

$$\sum_{m=1}^{M} \left[ \frac{R_{in} \sqrt{Q_{ii} T_{mm} - R_{im}^2}}{2\pi Q_{ii}} + \frac{T_{nm}}{2\pi} \sin^{-1}\!\left(\frac{R_{im}}{\sqrt{Q_{ii} T_{mm}}}\right) + \frac{T_{nm}}{4} \right] \qquad (43)$$

Note that the above equation requires the order parameters to satisfy $Q_{ii} T_{mm} \geq R_{im}^2$. For all physically possible choices of the order parameters this always holds, by the following geometric argument:

$$Q_{ii} T_{mm} \geq R_{im}^2 \;\Leftrightarrow\; (J_i \cdot J_i)(B_m \cdot B_m) \geq (J_i \cdot B_m)^2 \;\Leftrightarrow\; \|J_i\|^2 \|B_m\|^2 \geq \|J_i\|^2 \|B_m\|^2 \cos^2[\angle(J_i, B_m)] \;\Leftrightarrow\; 1 \geq \cos^2[\angle(J_i, B_m)].$$

Because $\cos^2(\cdot) \in [0, 1]$, the inequality $Q_{ii} T_{mm} \geq R_{im}^2$ is always true for all physically possible choices of the order parameters.

We now treat the summation $\sum_{j=1}^{K} \langle \theta(x_i)\, y_n\, x_j\, \theta(x_j) \rangle$ in a similar fashion. In $K - 1$ averages there are three distinct Gaussian variables $x_i$, $y_n$ and $x_j$, with corresponding covariance matrix:

$$\Sigma_3 = \begin{pmatrix} \langle x_i x_i \rangle & \langle x_i y_n \rangle & \langle x_i x_j \rangle \\ \langle x_i y_n \rangle & \langle y_n y_n \rangle & \langle y_n x_j \rangle \\ \langle x_i x_j \rangle & \langle y_n x_j \rangle & \langle x_j x_j \rangle \end{pmatrix} = \begin{pmatrix} Q_{ii} & R_{in} & Q_{ij} \\ R_{in} & T_{nn} & R_{jn} \\ Q_{ij} & R_{jn} & Q_{jj} \end{pmatrix}.$$

In the remaining case $j = i$, the average to evaluate is $\langle \theta(x_i)\, y_n\, x_i\, \theta(x_i) \rangle = I_3(i, n, i)$, with corresponding singular covariance matrix:

$$\tilde{\Sigma}_3 = \begin{pmatrix} Q_{ii} & R_{in} & Q_{ii} \\ R_{in} & T_{nn} & R_{in} \\ Q_{ii} & R_{in} & Q_{ii} \end{pmatrix}$$

By substituting the covariance matrices for each term into Equation (41), we write:

$$\sum_{j=1}^{K} \left[ \frac{R_{in} \sqrt{Q_{ii} Q_{jj} - Q_{ij}^2}}{2\pi Q_{ii}} + \frac{R_{jn}}{2\pi} \sin^{-1}\!\left(\frac{Q_{ij}}{\sqrt{Q_{ii} Q_{jj}}}\right) + \frac{R_{jn}}{4} \right] \qquad (44)$$

By the same geometric argument, $Q_{ii} Q_{jj} \geq Q_{ij}^2$ must always hold for all physically possible choices of the order parameters:

$$Q_{ii} Q_{jj} \geq Q_{ij}^2 \;\Leftrightarrow\; (J_i \cdot J_i)(J_j \cdot J_j) \geq (J_i \cdot J_j)^2 \;\Leftrightarrow\; \|J_i\|^2 \|J_j\|^2 \geq \|J_i\|^2 \|J_j\|^2 \cos^2[\angle(J_i, J_j)] \;\Leftrightarrow\; 1 \geq \cos^2[\angle(J_i, J_j)].$$

By substituting the expressions for the summations from Equations (43) and (44) into Equation (42) and simplifying the terms, we obtain the equations of motion for the student-teacher overlaps $R_{in}$ when ReLU functions are used in the hidden units:

$$\frac{\partial R_{in}}{\partial \alpha} = \eta \left[ \sum_{m=1}^{M} \left( \frac{R_{in} \sqrt{Q_{ii} T_{mm} - R_{im}^2}}{2\pi Q_{ii}} + \frac{T_{nm}}{2\pi} \sin^{-1}\!\left(\frac{R_{im}}{\sqrt{Q_{ii} T_{mm}}}\right) + \frac{T_{nm}}{4} \right) - \sum_{j=1}^{K} \left( \frac{R_{in} \sqrt{Q_{ii} Q_{jj} - Q_{ij}^2}}{2\pi Q_{ii}} + \frac{R_{jn}}{2\pi} \sin^{-1}\!\left(\frac{Q_{ij}}{\sqrt{Q_{ii} Q_{jj}}}\right) + \frac{R_{jn}}{4} \right) \right] \qquad (45)$$
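As a closing illustration, Equation (45) can be transcribed almost literally into code. This is our sketch, not code from the thesis; the function name and matrix layout are our choices:

```python
import numpy as np

def dR_dalpha(Q, R, T, eta):
    """Right-hand side of Equation (45) for all student-teacher overlaps R_in.
    Q: K x K, R: K x M, T: M x M order parameter matrices."""
    K, M = R.shape
    dR = np.zeros_like(R)
    for i in range(K):
        for n in range(M):
            teacher_terms = sum(       # first sum in Equation (45)
                R[i, n] * np.sqrt(Q[i, i] * T[m, m] - R[i, m] ** 2) / (2 * np.pi * Q[i, i])
                + T[n, m] / (2 * np.pi) * np.arcsin(R[i, m] / np.sqrt(Q[i, i] * T[m, m]))
                + T[n, m] / 4.0
                for m in range(M))
            student_terms = sum(       # second sum in Equation (45)
                R[i, n] * np.sqrt(Q[i, i] * Q[j, j] - Q[i, j] ** 2) / (2 * np.pi * Q[i, i])
                + R[j, n] / (2 * np.pi) * np.arcsin(Q[i, j] / np.sqrt(Q[i, i] * Q[j, j]))
                + R[j, n] / 4.0
                for j in range(K))
            dR[i, n] = eta * (teacher_terms - student_terms)
    return dR
```

Together with the corresponding equations of motion for the student-student overlaps $Q_{ik}$, derived in Section 4.3, the system can then be integrated numerically, for example with simple Euler steps $R \leftarrow R + d\alpha \cdot \mathtt{dR\_dalpha}(Q, R, T, \eta)$.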
