
MSc Artificial Intelligence

Master Thesis

Simple Saddle-Free Newton Descent

with a Pseudo-Hessian

by

Eli de Smet

10385657

July 18, 2020

48 ECTS November 2019 - July 2020

Supervisor:

Dr. Daniel Worrall

Assessor:

Dr. Patrick Forré


Abstract

Most second-order gradient methods are barely used in practice in the context of high-dimensional machine-learning problems because of their computational costs. However, curvature information could be used to actively move away from saddle points, which could help with minimizing the objective function. In this thesis, we introduce the pseudo-Hessian, an analogue to the Hessian, which is a diagonal matrix by design and computationally inexpensive to work with. We then demonstrate that it can be used in practice in combination with the Saddle-Free Newton method—a method for escaping saddle points. This new tool could result in faster convergence and lower function values, and offers a promising new research direction for gradient-descent methods.


Contents

1 Introduction
2 Background
  2.1 Gradient-descent methods
    2.1.1 Stochasticity and mini-batches
  2.2 Newton's Method
    2.2.1 Quadratic convergence
    2.2.2 Modifications
  2.3 Saddle points in high-dimensional spaces
  2.4 Saddle-Free Newton method
    2.4.1 Newton's method, eigenvalues and eigenvectors
    2.4.2 Saddle-Free Newton method and the absolute value of eigenvalues
  2.5 Armijo line search
    2.5.1 Heuristic line search
  2.6 Hessian estimation
3 Method
  3.1 Estimating the pseudo-Hessian
    3.1.1 Univariate Hessian estimation
    3.1.2 Multivariate pseudo-Hessian estimation
  3.2 A Saddle-Free Newton direction based on the estimated pseudo-Hessian
  3.3 Pseudo Saddle-Free Newton method
    3.3.1 Extending PSFN with heuristic Armijo line search
4 Experiments & Discussion
  4.1 Powell singular function
    4.1.1 Description of the problem
    4.1.2 Relevance of the problem
    4.1.3 Description of the experimental setup
    4.1.4 Results and discussion
  4.2 Stochastic Rosenbrock function
    4.2.1 Description of the problem
    4.2.2 Relevance of the problem
    4.2.3 Description of the experimental setup
    4.2.4 Results and discussion
  4.3 MNIST classification with a fully connected neural network
    4.3.1 Description of the problem
    4.3.2 Relevance of the problem
    4.3.3 Description of the experimental setup
    4.3.4 Results and discussion
  4.4 CIFAR-100 classification with Resnet-18
    4.4.1 Description of the problem
    4.4.2 Relevance of the problem
    4.4.3 Description of the experimental setup
    4.4.4 Results and discussion
  4.5 Closing remarks
5 Future Work
  5.1 Understanding the pseudo-Hessian
  5.2 Bayesian linear regression
  5.3 A better saddle-repellent descent direction
6 Conclusion
7 Acknowledgements
A Appendix
  A.1 Saddle-Free Newton derivation


1—Introduction

Minimizing objective functions using gradient-descent methods has become very relevant in the last few decades, specifically in the field of artificial intelligence. Given a data set and a neural-network architecture, training a neural network is simply minimizing a real-valued function f in its parameters θ:

θ* = arg min_θ f(θ).  (1.1)

The difficulties in optimizing neural networks lie in the abundant use of non-linearities, also known as activation functions, the dimensionality of the problem, which is often very high, and the cost of a single gradient evaluation. Analytical solutions usually do not exist, but even numerical optimization is difficult. Often gradient-descent algorithms are one of the few options available.

Current gradient-descent methods do not address one important characteristic of high-dimensional functions: saddle points. First-order methods like SGD, Momentum [Rumelhart et al., 1986], RMSprop [Tieleman and Hinton, 2012] and Adam [Kingma and Ba, 2014] have difficulty escaping them, because saddle points are characterized by large plateaus. Second-order methods such as L-BFGS [Liu and Nocedal, 1989] and K-FAC [Martens and Grosse, 2015] can escape large plateaus easily, but are attracted by saddle points. This is especially relevant in the context of high-dimensional problems like neural networks, where the ratio of saddle points to local minima tends to increase exponentially with the dimensionality of the problem.

The Saddle-Free Newton method addresses this problem by using the Hessian to determine a direction that actively moves away from saddle points. Experimentally, Dauphin et al. [2014b] have shown that this can allow it to find minima that are multiple orders of magnitude smaller than those found with other gradient-descent methods. Despite this success, it is still not commonly used, because estimating and working with the Hessian is computationally expensive. As a result, the Saddle-Free Newton method and other second-order methods are hampered from becoming a standard way to train neural networks. We try to address this problem by first noting the following:

• Second-order behavior is linearly dependent on first-order behavior—the gradients— under certain regularity conditions.

• There exists a closed-form method for extracting linear trends from data with noise— linear regression—which can be adapted for an online setting.

• In practice it is physically impossible to work with a full Hessian, even if it were easy to calculate. A low-dimensional classification problem like MNIST, solved with LeNet, requires approximately 3.3 × 10^6 parameters. The corresponding Hessian at any point at float32 precision requires approximately 20 terabytes [Alain et al., 2019].

These points lead to the main contributions of this thesis:


1. We introduce a new tool in the context of gradient-optimisation methods: pseudo-Hessian estimation by linear regression of the gradients. The pseudo-Hessian is an analogue to the Hessian. It is a diagonal matrix by design and inexpensive to estimate. Note that although it is a diagonal matrix, it is not estimating the diagonal of the true Hessian.

2. We demonstrate that the pseudo-Hessian can be used to amend the Saddle-Free Newton method, resulting in two algorithms: Pseudo Saddle-Free Newton method (PSFN) and Pseudo Saddle-Free Newton method stabilized using Armijo line search (PSFN+Armijo).

We first introduce all the necessary background needed to understand our method in Chapter 2, before outlining the method in Chapter 3. In Chapter 4 we motivate and outline our experiments. We show that PSFN performs well on the Powell Singular function [Powell, 1962], Stochastic Rosenbrock function [Yang and Deb, 2010] and MNIST classification using neural networks. In Chapter 5 we suggest some potential future research into gradient-descent methods based on the pseudo-Hessian. We conclude that PSFN looks to be a promising method for improving gradient-descent algorithms.


2—Background

In order to understand the methods proposed in this thesis, we must first introduce three concepts:

1. Saddle-Free Newton method by Dauphin et al. [2014b],

2. Armijo line search adapted by Vaswani et al. [2019] for a stochastic setting,

3. The hassle of Hessian estimation of high-dimensional objective functions like neural networks.

In this chapter we will build up an intuitive understanding of all three topics as a foundation for our method.

2.1 Gradient-descent methods

A gradient-descent method is an algorithm that solves the problem

θ* = arg min_θ f(θ),  (2.1)

where f : R^D → R is the objective function that we try to minimize in the parameter θ ∈ R^D. Note that in order to work with a gradient-descent method we assume the differentiability of the function f. Moreover, in this thesis we will assume that f is at least twice continuously differentiable.

The descent method starts with a best guess θ_0 and from there updates the parameters iteratively to find the minimum:

θ_t = θ_{t−1} − α_t d_t,  for all t ∈ {1, 2, 3, . . .},  (2.2)

where α_t > 0 is the step size and d_t ∈ R^D is the descent direction based on the gradient at the current point during step t. There are many gradient methods; the most popular for training neural networks are SGD, SGD with Momentum [Rumelhart et al., 1986], RMSprop [Tieleman and Hinton, 2012] and Adam [Kingma and Ba, 2014]. We will only be discussing Newton's Method in detail, because it is the core of our method.

2.1.1 Stochasticity and mini-batches

In the case of neural networks the objective function is typically

f(θ) = (1/n) Σ_{i=1}^n f_i(θ),  (2.3)

where f_i is the objective function for the i-th object in the data set. Data sets and the number of parameters are typically large, which makes evaluating the full objective function—and its gradient with respect to the parameters—very costly. This forces us to use mini-batches (a random subset of the data set) to find a descent direction, with

g_t = ∇_θ f_t(θ_t),  (2.4)

where f_t(θ_t) = (1/n_b) Σ_{i∈b} f_i(θ_t) is the average over a random mini-batch b at time t. This makes the descent stochastic by nature, where

E[f_t(θ_t)] = f(θ_t).  (2.5)

In the case of gradient descent we can guarantee that f(θ_{t+1}) < f(θ_t); this guarantee is lost in the case of stochastic gradient descent, but still holds in expectation. In practice we find that the stochasticity actually helps with the initial convergence speed and prevents getting stuck in local minima [Masters and Luschi, 2018].

2.2 Newton's Method

Newton's method uses a local quadratic approximation to find the best descent direction. The quadratic approximation is done using a truncated second-order Taylor expansion:

f(θ) ≈ f(θ_t) + g_t^⊤(θ − θ_t) + ½ (θ − θ_t)^⊤ H_t (θ − θ_t)  (2.6)

for θ close to θ_t, where g_t = g_t(θ_t) is the gradient and H_t = H_t(θ_t) is the Hessian of the objective function f at θ_t. We can find the optimal update step in the following way:

∇_θ f(θ) = 0  ⟹  g_t + H_t(θ − θ_t) ≈ 0  (2.7)
          ⟹  θ_{t+1} = θ_t − H_t^{-1} g_t.  (2.8)

The resulting descent direction, d_t = H_t^{-1} g_t, assumes that the quadratic approximation holds reasonably well for the given objective function. If it does not, we might end up with a poor update; we address this problem in Section 2.2.2.
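To make the update in Equation (2.8) concrete, the following sketch (our own; the quadratic objective is a hypothetical example) performs a single Newton step:

```python
import numpy as np

# Example quadratic f(x, y) = 2*x^2 + y^2 + x*y with gradient [4x + y, x + 2y].
def grad(theta):
    x, y = theta
    return np.array([4 * x + y, x + 2 * y])

H = np.array([[4.0, 1.0], [1.0, 2.0]])            # constant Hessian of the quadratic

theta = np.array([3.0, -1.0])
theta = theta - np.linalg.solve(H, grad(theta))   # one Newton step, Equation (2.8)
print(theta)   # [0, 0]: for a quadratic, a single Newton step reaches the minimizer
```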

2.2.1 Quadratic convergence

In order to discuss how quickly Newton's method can converge, we first introduce some definitions. Suppose we have a convergent sequence {θ_t} and we define θ_∞ = lim_{t→∞} θ_t. This sequence converges linearly if

lim_{t→∞} ||θ_{t+1} − θ_∞|| / ||θ_t − θ_∞|| ≤ r,  r ∈ (0, 1).  (2.9)

The sequence converges superlinearly if

lim_{t→∞} ||θ_{t+1} − θ_∞|| / ||θ_t − θ_∞|| = 0.  (2.10)

Quadratic convergence occurs when

lim_{t→∞} ||θ_{t+1} − θ_∞|| / ||θ_t − θ_∞||² ≤ r,  r ∈ (0, ∞).  (2.11)

In Figure 2.1 we compare three sequences, one for each convergence rate, to give an impression of the differences between them. Under certain conditions Newton's method converges quadratically, in contrast with gradient descent, which can only converge at a linear rate under the best conditions [Overton, 2017]. This motivates our interest in Newton's method within the class of gradient-descent methods.


Figure 2.1: A visual comparison of the different convergence rates. The sequence given by θ_t = (1/2)^t converges linearly. Superlinear convergence occurs with θ_t = (1/t)^t. The sequence with θ_t = (1/t)^{2^t} converges quadratically. All three sequences have θ_∞ = 0.
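The gap between the rates can be made tangible with a few lines (our own sketch, reproducing the sequences of Figure 2.1):

```python
# The three example sequences from Figure 2.1, all converging to theta_inf = 0.
for t in range(1, 8):
    linear = (1 / 2) ** t
    superlinear = (1 / t) ** t
    quadratic = (1 / t) ** (2 ** t)
    print(f"t={t}  linear={linear:.1e}  superlinear={superlinear:.1e}  quadratic={quadratic:.1e}")
```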

2.2.2 Modifications

A significant problem with Newton’s method is that it does not always move in the direction of a minimum for every objective function. How do we guarantee movement towards a minimum?

We start by introducing a step size α_t, resulting in

θ_{t+1} = θ_t − α_t H_t^{-1} g_t.  (2.12)

Then we combine a first-order Taylor approximation with the update step above:

f(θ_{t+1}) = f(θ_t − α_t H_t^{-1} g_t)  (2.13)
           = f(θ_t) + g_t^⊤(θ_{t+1} − θ_t) + O(||θ_{t+1} − θ_t||²)  (2.14)
           = f(θ_t) − α_t g_t^⊤ H_t^{-1} g_t + O(α_t²).  (2.15)

Two simple conditions that together guarantee that f(θ_{t+1}) < f(θ_t) are:

1. g_t^⊤ H_t^{-1} g_t > 0, which holds when H_t is positive definite. This clearly does not hold for every objective function at every θ_t. In those cases we have to replace H_t^{-1} by a matrix M_t that is positive definite. Some possible solutions are:

   (a) M_t = I, which gives us gradient descent.

   (b) If H_t is a diagonal matrix, then M_t(i, i) = max(ε, ∂²f/∂θ_i²)^{-1}, with ε > 0.

   (c) Damping H_t by choosing M_t = (H_t + τI)^{-1}, where τ > 0 is chosen such that M_t is positive definite.

   (d) M_t = |H_t|^{-1}, where |H_t| is the matrix found by taking the absolute value of the eigenvalues of H_t. We are most interested in this solution; the reason for this is described in detail in Section 2.3. A small sketch of (c) and (d) is given after this list.

2. We want α_t small enough that the second term dominates the third term in Equation (2.15). How do we choose an α_t that is small enough to guarantee a move towards a minimum, while being big enough to be computationally relevant? A solution for this is Armijo line search, described in detail in Section 2.5.
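The sketch below (ours; NumPy, symmetric Hessian assumed) makes modifications (c) and (d) concrete by building the damped inverse and the absolute-eigenvalue inverse from an eigendecomposition:

```python
import numpy as np

H = np.array([[2.0, 0.0], [0.0, -0.5]])     # indefinite Hessian (a saddle point)
g = np.array([1.0, 1.0])

# (c) Damping: M = (H + tau*I)^{-1}, with tau chosen so that H + tau*I is positive definite.
tau = max(0.0, 1e-3 - np.linalg.eigvalsh(H).min())
M_damped = np.linalg.inv(H + tau * np.eye(2))

# (d) Absolute eigenvalues: |H| = V |Sigma| V^T, then M = |H|^{-1}.
eigvals, V = np.linalg.eigh(H)
M_abs = V @ np.diag(1.0 / np.abs(eigvals)) @ V.T

print(M_damped @ g)   # damped Newton-like direction
print(M_abs @ g)      # direction of the kind used by the Saddle-Free Newton method
```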


2.3 Saddle points in high-dimensional spaces

Saddle points in high-dimensional spaces are an often-overlooked obstacle to minimization. When we think about and visualize optimization problems, we tend to do so with one- or two-dimensional examples. Although useful, these give us a wrong intuition about the presence of saddle points in optimization problems. In the case of high-dimensional problems like neural networks, this has resulted in the belief that, when an algorithm is unable to make further progress, it is because it has gotten stuck in a local minimum.

The authors of the Saddle-Free Newton method, Dauphin et al. [2014a], argue that what are commonly perceived to be local minima are actually saddle points, because the ratio of saddle points to local minima tends to increase exponentially with the dimensionality of a problem. The authors also verify this claim by looking at the landscape of a small multilayer perceptron.

The confusion arises not only from our wrong intuitions, but also empirically, because of our descent methods. Saddle points are often surrounded by plateaus of small curvature [Dauphin et al., 2014a], and gradient-descent methods, although moving in the right direction, are slow in overcoming plateaus. On the other hand, a second-order method like Newton's method escapes plateaus easily but is attracted by saddle points, since it seeks out stationary points, not minima.

There are quite a few solutions that have been proposed for overcoming this tendency of Newton's method, some of which we mentioned in Section 2.2.2. However, these solutions have significant drawbacks. For example, adding a constant to the diagonal until all eigenvalues of the Hessian are positive often leads to very small step sizes. Another solution is to ignore negative curvature, as in Hessian-Free optimization [Martens, 2010] and L-BFGS [Liu and Nocedal, 1989]. Although doing this makes it less likely to end up in a saddle point, it also makes it impossible to escape one. Furthermore, K-FAC [Martens and Grosse, 2015] pretends that a non-convex objective function is locally convex. TONGA [Roux et al., 2008], which uses the covariance matrix of the gradients, also ignores negative curvature.

Dauphin et al. [2014b] state that there is much to be gained in being able to escape saddle points more effectively. They claim that this is because of the relative prevalence of saddle points as opposed to minima in high-dimensional problems like neural networks. They give the following intuition as to why this is the case. For any critical point, the chance that all the directions around it have positive curvature becomes increasingly small with the number of dimensions. This also means that, when all directions around a critical point have positive curvature, the chance of it being a global minimum becomes increasingly large with the number of dimensions.

2.4 Saddle-Free Newton method

The Saddle-Free Newton method uses the following descent direction:

d_t = |H_t|^{-1} g_t,  (2.16)

where |H_t| is the matrix obtained by taking the absolute value of the eigenvalues of H_t.

Although this idea was suggested earlier by at least Nocedal and Wright [2006] and Murray [2010], it was not explored empirically and exploited before the introduction of the Saddle-Free Newton method.

To make sense of the Saddle-Free Newton method, we need to have an understanding of two things:

• How Newton's method behaves in terms of the eigenvalues and eigenvectors of the Hessian.

• Why using |H_t| instead of H_t might help escape plateaus.

2.4.1 Newton's method, eigenvalues and eigenvectors

This section has been inspired by the work of Goh [2017]. Suppose we have a quadratic objective function

f(θ) = ½ (θ − θ*)^⊤ H (θ − θ*) + constant,  (2.17)

where θ ∈ R^D, H ∈ R^{D×D} and θ* is the stationary point. In this case

g_t = ∇_θ f(θ_t) = H(θ_t − θ*).  (2.18)

Suppose we want to minimize our objective function using gradient descent as follows:

θ_{t+1} = θ_t − α g_t  (2.19)
        = θ_t − α H(θ_t − θ*).  (2.20)

Remember that we can decompose the Hessian as

H = V Σ V^⊤,  (2.21)

where V is the orthogonal matrix composed of the eigenvectors and Σ is a diagonal matrix composed of the eigenvalues λ_i. Using the eigenvectors V, we can do a change of basis w_t = V^⊤(θ_t − θ*), which results in:

θ_{t+1} = θ_t − α H(θ_t − θ*)  (2.22)
⟺ θ_{t+1} − θ* = θ_t − θ* − α H(θ_t − θ*)  (2.23)
⟺ V^⊤(θ_{t+1} − θ*) = V^⊤(θ_t − θ*) − α V^⊤ H(θ_t − θ*)  (2.24)
⟺ w_{t+1} = w_t − α Σ w_t.  (2.25)

There are two interesting things to note. When w_t = 0, we have found the stationary point, and since Σ is a diagonal matrix we can look at all the elements individually. This gives us

w^i_{t+1} = w^i_t − α λ_i w^i_t  (2.26)
          = (1 − α λ_i) w^i_t  (2.27)
          = (1 − α λ_i)^{t+1} w^i_0,  (2.28)

where w^i_t is the i-th element of w_t.

From this we can learn two things:

• Gradient descent converges in the direction of the i-th eigenvector if |1 − αλ_i| < 1. This is the case when λ_i > 0 and 0 < α < 2/λ_i. If any λ_i < 0, then gradient descent will not converge to the stationary point; this is the case with saddle points and a maximum.

• The convergence speed of gradient descent directly depends on the smallest and largest eigenvalues. We can try to speed up convergence by increasing α, but we are restricted by the largest λ_i. This is also why the condition number of the Hessian, κ(H) = λ_D/λ_1 (where λ_D is the largest and λ_1 the smallest eigenvalue of H), is a good indicator of how well gradient descent will perform.

We can reuse the above derivation to understand Newton's method in the context of quadratic functions. With Newton's method we update with

θ_{t+1} = θ_t − α H^{-1} g_t  (2.29)
        = θ_t − α (θ_t − θ*).  (2.30)

If we perform the change of basis again, this results in

w^i_{t+1} = w^i_t − α w^i_t  (2.31)
          = (1 − α) w^i_t  (2.32)
          = (1 − α)^{t+1} w^i_0.  (2.33)

We have multiplied the descent direction in the eigenvector basis by 1/λ_i. This does two things:

• It makes the convergence rate independent of the eigenvalues, allowing for an increase in the rate of convergence, since we can converge uniformly along all eigenvectors.

• It always converges to the stationary point, no matter what the signs of the λ_i are.

In the case of a quadratic function this works perfectly: we simply take α = 1.
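The derivation can be checked numerically. The sketch below (our own, with a hypothetical two-dimensional quadratic whose Hessian has one negative eigenvalue) shows gradient descent being repelled from the saddle along the negative-curvature direction, while a single Newton step with α = 1 jumps straight to the stationary point:

```python
import numpy as np

H = np.diag([2.0, -0.5])               # f(theta) = 0.5 * theta^T H theta has a saddle at 0
theta_gd = np.array([1.0, 1.0])
alpha = 0.1

for _ in range(50):                    # gradient descent
    theta_gd = theta_gd - alpha * (H @ theta_gd)

theta_nt = np.array([1.0, 1.0])        # one Newton step with alpha = 1
theta_nt = theta_nt - np.linalg.solve(H, H @ theta_nt)

print(theta_gd)   # roughly [1e-5, 11.5]: diverges along the negative-curvature eigenvector
print(theta_nt)   # [0, 0]: Newton converges straight to the stationary (saddle) point
```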

2.4.2 Saddle-Free Newton method and the absolute value of eigenvalues

We have learned in the previous section that gradient descent performs badly when the condition number κ(H) is large. A high value of κ(H) indicates that there is a large difference between the largest and the smallest eigenvalues, indicating a plateau in at least one direction. Newton's method ameliorates this issue by rescaling the gradient by a factor 1/λ_i in the direction of the i-th eigenvector. This compensates for the fact that the gradient was expected to increase very little in a particular direction, by making it take a larger step in that direction.

The problem with Newton's method is that if λ_i < 0, this operation also flips the sign of the gradient component along that eigenvector. This means that the algorithm will ascend instead of descend in that direction.

The creators of the Saddle-Free Newton method [Dauphin et al., 2014b] saw a quick fix to this problem: rescaling with 1/|λ_i| instead of 1/λ_i. The result is a direction that minimizes the objective function and takes bigger steps in the presence of a plateau, but does not seek out saddle points.

2.5 Armijo line search

In the gradient-descent framework, we update θ_t in the following way:

θ_{t+1} = θ_t − α_t d_t,

where α_t is the step size and d_t is the search direction. Line search is a standard technique for finding a good value of α_t.

A line search tries a sequence of possible values of α_t until a stopping condition is met. The line-search literature has a wide range of methods, but a popular variant is backtracking line search using the Armijo condition:

f(θ_t − α_t d_t) ≤ f(θ_t) − σ α_t g_t^⊤ d_t.  (2.34)


In practice it is very simple: given β ∈ (0, 1) and σ ∈ (0, 1), choose as α_t the largest element of {1, β, β², . . .} that satisfies the Armijo condition in Equation (2.34).

The Armijo condition limits the step size to a region which approximately satisfies a first-order Taylor approximation. How tightly the first-order Taylor approximation has to hold is regulated by σ. For σ = 1 the approximation has to hold perfectly, while for σ = 0 the requirement becomes that f(θ_t − α d_t) ≤ f(θ_t).
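A minimal backtracking search over {1, β, β², . . .} using the Armijo condition (2.34) could look as follows (our own sketch; the example objective at the bottom is purely illustrative):

```python
import numpy as np

def armijo_step_size(f, grad_f, theta, d, beta=0.5, sigma=0.1, max_backtracks=50):
    """Largest alpha in {1, beta, beta^2, ...} that satisfies the Armijo condition (2.34)."""
    slope = float(grad_f(theta) @ d)            # g_t^T d_t
    alpha = 1.0
    for _ in range(max_backtracks):
        if f(theta - alpha * d) <= f(theta) - sigma * alpha * slope:
            return alpha
        alpha *= beta                           # backtrack
    return alpha

# Example on f(theta) = ||theta||^2 with the gradient as descent direction.
f = lambda th: float(th @ th)
grad_f = lambda th: 2.0 * th
theta = np.array([1.0, -2.0])
print(armijo_step_size(f, grad_f, theta, grad_f(theta)))   # 0.5 for this example
```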

Although line-search methods were originally meant for deterministic settings, Vaswani et al. [2019] demonstrate that they can be adapted for a stochastic setting. They do this by adapting the Armijo rule, setting:

f_{i_t}(θ_t − α d_t) ≤ f_{i_t}(θ_t) − σ α g_{i_t}^⊤ d_t,  (2.35)

where f_{i_t} and g_{i_t} are the average function evaluation and average gradient for mini-batch i at time t. This adaptation does not require any extra mini-batches or gradient evaluations during the line search.

2.5.1 Heuristic line search

In practice α_t is often very close to α_{t−1}. We can take advantage of this by using the following heuristic. We choose α_t as the largest element of {γα_{t−1}, βγα_{t−1}, β²γα_{t−1}, . . .} for some γ > 1. We do not start backtracking from α_{t−1}, because this would lead to α always decreasing; instead we start with something larger than α_{t−1}. This allows α_t to grow, while still limiting the search space to something near α_{t−1}.

2.6 Hessian estimation

Calculating the full Hessian is infeasible for all high-dimensional objective functions, if only because of its memory requirement of O(D²), where D is the dimensionality of the problem. Even a relatively low-dimensional classification problem like MNIST, solved with LeNet, requires approximately 3.3 × 10^6 parameters. The corresponding Hessian at any point at float32 precision requires approximately 20 terabytes. In contrast, state-of-the-art algorithms like GPT-3 [Brown et al., 2020] have 175 billion parameters.

This forces us to use an estimate of the Hessian, instead of calculating the full Hessian. Many second-order methods try to avoid estimating the Hessian at all, because working with the Hessian itself is also costly.

Methods like L-BFGS [Liu and Nocedal, 1989] directly approximate the inverse of the Hessian using the past m updates of (θ_t, g_t). This circumvents the costly Hessian inversion.

The Hessian-Free method [Martens, 2010] goes one step further, never even constructing the Hessian, by using the Pearlmutter trick in combination with conjugate gradient descent [Shewchuk et al., 1994]. Pearlmutter [1994] showed that it is relatively inexpensive—about the cost of calculating the gradient—to estimate the Hessian multiplied by a vector v of choice.

In the case of the Saddle-Free Newton method, neither of these ideas is an option, because we need an estimate of the Hessian such that we can:

• perform an eigenvalue decomposition of H_t: H_t = V Σ V^{-1}, where V is the matrix that spans all the eigenvectors of H_t and Σ is the diagonal matrix composed of the corresponding eigenvalues,

• calculate |H_t| = V |Σ| V^{-1},

• calculate |H_t|^{-1}.

Dauphin et al. [2014b] use a method similar to Krylov subspace descent [Vinyals and Povey, 2012]. They claim to do the following. Instead of minimizing f(θ) they choose to minimize a lower-dimensional variant f̂(α) = f(θ + α Ṽ), where Ṽ is composed of the k Krylov subspace vectors. The Krylov subspace vectors are found using Lanczos iteration of the Hessian and span the k biggest eigenvectors of the Hessian with high probability. The authors set k = 500 for their experiments, resulting in a subspace small enough that Saddle-Free Newton descent is feasible within it.

However, their explanation leaves much to be desired. There appear to be many steps missing both in their explanation and in their pseudo-code, making it ambiguous and hard to analyze. What we do know is that the memory requirement of storing the Krylov vectors is O(kD), where D is the dimensionality of the problem. Calculating the Krylov vectors with Lanczos requires at least O(kD) arithmetical operations. Doing the exact Saddle-Free Newton method within the subspace requires an eigenvector decomposition with computational cost O(k³), where it is plausible that k³ > D. In the case of Dauphin et al. [2014b], k = 500, so k³ = 125 × 10^6, larger than the number of parameters of many neural networks. Overall, we conclude from what we do understand of the algorithm that it is not inexpensive.

We would have liked to give a more extensive review of how second-order methods solve the problem of requiring an estimate of the Hessian, but because of the complexity of the subject this would require an entire thesis of its own. We hope that this brief review has communicated clearly that working with the Hessian is complex and computationally expensive. If not, we highly recommend reading any of the citations.

Duda [2019] notes that second-order behavior is linearly dependent on first-order behavior and that linear trends can be estimated using linear regression. In Chapter 3, our method section, we will first introduce this solution in a univariate setting and then extend it to the multivariate case by introducing the pseudo-Hessian, an analogue to the Hessian. The result is a method that is simple to implement and inexpensive to calculate.


3—Method

In this section we will introduce our gradient-descent algorithm: the Pseudo Saddle-Free Newton method (PSFN). The algorithm consists of two major parts:

1. Estimating the pseudo-Hessian.

2. A Saddle-Free Newton direction based on the estimated pseudo-Hessian.

We will first discuss both parts, before bringing them together in our method.

3.1 Estimating the pseudo-Hessian

3.1.1 Univariate Hessian estimation

This section has been inspired by the work of Duda [2019]. We assume that the objective function f can be estimated locally (in a point θ_t) by a quadratic function f̂. That is to say,

f̂(θ_t) = ½ h_t (θ_t − c_t)² + constant,  (3.1)

with h_t, c_t ∈ R, with gradient

∇_θ f̂(θ_t) = h_t (θ_t − c_t).  (3.2)

The task is to find h_t and c_t such that f̂ makes for a good local approximation of f. We do this by minimizing the weighted squared error:

(h*_t, c*_t) = arg min_{(h_t, c_t)} Σ_{i=1}^t w_i (g_i − h_t(θ_i − c_t))²,  (3.3)

where g_i = ∇_θ f(θ_i) and the w_i are weights chosen such that they reflect the relevance of {θ_i, g_i} at time t. Many options are possible for w_i: for instance, uniform weights w_i = 1/t and exponentially decaying weights w_i = β^{t−i}, β ∈ (0, 1). We solve for (h*_t, c*_t) in Equation (3.3) by first introducing some convenient notation. We define

s_t := Σ_{i=1}^t w_i,   θ̄_t := Σ_{i=1}^t w_i θ_i,   ḡ_t := Σ_{i=1}^t w_i g_i,  (3.4)

\overline{θ²}_t := Σ_{i=1}^t w_i θ_i²,   \overline{θg}_t := Σ_{i=1}^t w_i θ_i g_i.  (3.5)

This leads to:

h*_t = (s_t \overline{θg}_t − ḡ_t θ̄_t) / (s_t \overline{θ²}_t − θ̄_t²),   c*_t = (\overline{θ²}_t ḡ_t − θ̄_t \overline{θg}_t) / (θ̄_t ḡ_t − s_t \overline{θg}_t).  (3.6)


In the case where we use exponentially decaying weights, we can update our running sums iteratively as exponential moving averages:

s_t = β s_{t−1} + 1,  (3.7)
θ̄_t = β θ̄_{t−1} + θ_t,  (3.8)
ḡ_t = β ḡ_{t−1} + g_t,  (3.9)
\overline{θ²}_t = β \overline{θ²}_{t−1} + θ_t²,  (3.10)
\overline{θg}_t = β \overline{θg}_{t−1} + θ_t g_t.  (3.11)

This allows us to estimate the Hessian by keeping track of five exponential moving averages and performing a simple calculation.

3.1.2 Multivariate pseudo-Hessian estimation

Hessians have one major undesirable quality for high-dimensional problems: they are computationally expensive to calculate and to work with. For more on this see Section 2.6. Furthermore, if we want to use the Hessian to do Saddle-Free Newton descent as described in Section 2.4, this requires us to

• perform an eigenvalue decomposition of H_t,

• calculate |H_t| = V|λ|V^{-1}, where V is the matrix that spans all the eigenvectors and |λ| is a diagonal matrix with the corresponding absolute eigenvalues,

• calculate |H_t|^{-1}.

To avoid this computational overhead, we will not be using the Hessian, but what we call the pseudo-Hessian H̃_t. We will assume the objective function f can locally be approximated by

f̂(θ_t) = ½ (θ_t − c_t)^⊤ H̃_t (θ_t − c_t) + constant,  (3.12)

where H̃_t is a diagonal matrix.

In the case of H̃_t, calculating the Saddle-Free Newton direction becomes inexpensive:

• Since H̃_t is a diagonal matrix, an eigenvalue decomposition is not needed; the diagonal values are the eigenvalues.

• We can calculate |H̃_t| by taking the absolute value of the diagonal elements.

• We can calculate |H̃_t|^{-1} by taking the inverse of the diagonal elements.

We recognize that the pseudo-Hessian cannot be as expressive as the true Hessian, since we restrict H̃_t to be a diagonal matrix. Still, it is important to note that the pseudo-Hessian H̃_t is not the diagonal of the Hessian H_t. In fact, it should be more informative than the diagonal of the Hessian, since the diagonal of the Hessian is not meant to be used on its own. Although we cannot produce any theoretical guarantees as to the value of H̃_t, we will explore its value experimentally in Chapter 4.

Estimating the pseudo-Hessian

Analogous to the univariate case, we assume that the objective function f can be estimated locally by

f̂(θ_t) = ½ (θ_t − c_t)^⊤ H̃_t (θ_t − c_t) + constant,  (3.13)

with H̃_t ∈ R^{D×D} a diagonal matrix, and c_t ∈ R^D. Here, D is the dimensionality of the problem, i.e. θ_t ∈ R^D. The gradient is

∇_θ f̂(θ_t) = H̃_t (θ_t − c_t).  (3.14)

The task is to find H̃_t and c_t such that this makes for a good local approximation. We do this by minimizing the weighted squared error:

(H̃*_t, c*_t) = arg min_{(H̃_t, c_t)} Σ_{i=1}^t w_i Σ_{j=1}^D ( g_{i,j} − (H̃_t(θ_i − c_t))_j )²  (3.15)
             = arg min_{(H̃_t, c_t)} Σ_{i=1}^t w_i Σ_{j=1}^D ( g_{i,j} − H̃_{t,j}(θ_i − c_t) )²,  (3.16)

where H̃_{t,j} is row j of H̃_t. Remembering that H̃_t is a diagonal matrix, we define h_{t,j} as the j-th element on the diagonal of H̃_t and rewrite Equation (3.16) as

(H̃*_t, c*_t) = arg min_{(h_{t,j}, c_{t,j})} Σ_{i=1}^t w_i Σ_{j=1}^D ( g_{i,j} − h_{t,j}(θ_{i,j} − c_{t,j}) )²  (3.17)
             = arg min_{(h_{t,j}, c_{t,j})} Σ_{j=1}^D Σ_{i=1}^t w_i ( g_{i,j} − h_{t,j}(θ_{i,j} − c_{t,j}) )²,  (3.18)

which gives us, for each j separately,

(h*_{t,j}, c*_{t,j}) = arg min_{(h_{t,j}, c_{t,j})} Σ_{i=1}^t w_i ( g_{i,j} − h_{t,j}(θ_{i,j} − c_{t,j}) )².  (3.19)

We are allowed to move from Equation (3.18) to Equation (3.19) because the terms for different j are independent of each other, and w_i ≥ 0 and (g_{i,j} − h_{t,j}(θ_{i,j} − c_{t,j}))² ≥ 0 for all (i, j).

This results in the same problem as in Equation (3.3), for which we know the solution—see Equation (3.6). We will re-use this solution, with a few adaptations such that it can be easily implemented:

• For the exponential moving averages in Equations (3.7) to (3.11), we do element-wise multiplication and addition.

• We cast the diagonal of H̃_t into a vector h_t with h_t = (s_t \overline{θg}_t − ḡ_t θ̄_t) / (s_t \overline{θ²}_t − θ̄_t²), where all calculations are done element-wise.

We recognize that what we have found is a very convenient solution to our problem; we can do all our calculations element-wise, which is computationally inexpensive. This is a lucky happenstance, since it follows solely from requiring H̃_t to be a diagonal matrix.
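Because everything is element-wise, the multivariate estimator is just the univariate one applied per coordinate to vectors. A NumPy sketch of our own (the class name and the eps guard in the denominator are our illustrative choices):

```python
import numpy as np

class PseudoHessianEstimator:
    """Element-wise running estimate of the diagonal pseudo-Hessian h_t (Section 3.1.2)."""

    def __init__(self, dim, beta=0.9):
        self.beta = beta
        self.s = 0.0
        self.th = np.zeros(dim)
        self.g = np.zeros(dim)
        self.th2 = np.zeros(dim)
        self.thg = np.zeros(dim)

    def update(self, theta, grad):
        b = self.beta
        self.s = b * self.s + 1.0
        self.th = b * self.th + theta
        self.g = b * self.g + grad
        self.th2 = b * self.th2 + theta ** 2
        self.thg = b * self.thg + theta * grad

    def h(self, eps=1e-16):
        # Element-wise version of Equation (3.6); eps guards against a zero denominator.
        return (self.s * self.thg - self.g * self.th) / (self.s * self.th2 - self.th ** 2 + eps)
```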

3.2 A Saddle-Free Newton direction based on the estimated pseudo-Hessian

The descent direction of our algorithm is

d_t = max(ε, |h_t|)^{-1} g_t

for some ε > 0. Here, |h_t| is the element-wise absolute value of the vector h_t, and the max operator takes the maximum between ε and |h_t|, also element-wise. This choice combines two ingredients:

1. Saddle-Free Newton direction: We use this update step because it actively repels saddle points. This is important, because a saddle point is easily confused with a local minimum, which prevents us from finding a lower point. For more information refer to Section 2.3.

Saddle-Free Newton has the following update step: d_t = |H_t|^{-1} g_t, where |H_t| is the matrix found by taking the absolute value of the eigenvalues of H_t. Since we estimate the local curvature with the pseudo-Hessian, which is a diagonal matrix, we can replace this by an element-wise operation:

d_t = g_t / |h_t|,

where |h_t| is just the element-wise absolute value of h_t.

Again, it is worth noting how computationally inexpensive calculating this descent direction is because of the pseudo-Hessian—especially in the context of high-dimensional problems like neural networks. We do not need to do an expensive eigenvalue decomposition or a matrix inversion.

2. Setting a maximum step size: When the curvature is very flat—h_t close to 0—the algorithm starts taking very large steps. Although this can be a positive quality, step sizes can increase too much, so they should be restricted for stability. This is done by replacing |h_t| with max(ε, |h_t|).

3.3 Pseudo Saddle-Free Newton method

Our algorithm, the Pseudo Saddle-Free Newton method (PSFN), is the combination of the pseudo-Hessian with SFN descent as described in Sections 3.1 and 3.2. See Algorithm 1 for the pseudo-code of PSFN. The whole algorithm is comparable in memory requirement and computational cost to Adam [Kingma and Ba, 2014].


Algorithm 1 Pseudo Saddle-Free Newton method (PSFN)

require: α: learning rate
require: β ∈ (0, 1): exponential decay factor
require: ε_h > 0: small number for numerical stability (default: 10e−16)
require: ε_d > 0: limits the maximum step size
require: f(·): stochastic objective function with parameter θ
require: θ_0: initial parameter vector

t ← 0  (initialize timestep)
s_0 ← 0, θ̄_0 ← 0, ḡ_0 ← 0, \overline{θ²}_0 ← 0, \overline{θg}_0 ← 0  (initialize running averages)
while θ_t not converged do
    t ← t + 1
    g_t ← ∇_θ f_t(θ_{t−1})  (get gradients w.r.t. stochastic objective at timestep t)
    s_t ← β s_{t−1} + 1
    θ̄_t ← β θ̄_{t−1} + θ_{t−1}
    ḡ_t ← β ḡ_{t−1} + g_t
    \overline{θ²}_t ← β \overline{θ²}_{t−1} + θ_{t−1}²
    \overline{θg}_t ← β \overline{θg}_{t−1} + θ_{t−1} g_t
    if t > 2 then
        h_t^{-1} ← (s_t \overline{θ²}_t − θ̄_t²) / (s_t \overline{θg}_t − ḡ_t θ̄_t + ε_h)
        d_t ← min(1/ε_d, |h_t^{-1}|) g_t
        θ_t ← θ_{t−1} − α d_t
    else
        update with SGD with a very small learning rate
    end
end
return: θ_t  (resulting parameter)

The algorithm keeps track of five exponential moving averages to estimate the pseudo-Hessian. The moving averages are all initialized at zero and the hyperparameter β ∈ (0, 1) controls the decay rate.

We do not update in the Saddle-Free Newton direction during the first two steps. Since we use linear regression to find the pseudo-Hessian, we need at least two data points (θ_1, g_1) and (θ_2, g_2). These data points are collected during the "burn-in period", which can be done using SGD with a very small learning rate.

We directly calculate the elements of h_t^{-1} instead of inverting those of h_t, and have added an optional ε_h ≥ 0 to the denominator; both choices are made for numerical stability. Since we directly calculate the inverted pseudo-Hessian, we have to update the descent direction to match this change:

d_t = max(ε_d, |h_t|)^{-1} g_t = min(1/ε_d, |h_t^{-1}|) g_t.  (3.20)

3.3.1 Extending PSFN with heuristic Armijo line search

We extend PSFN with heuristic Armijo line search as described in Section 2.5 to find a good step size. See Algorithm 2 for the pseudo-code of heuristic Armijo line search.


Algorithm 2 Heuristic Armijo line search

require: γ ∈ (0, 1): step-size decay rate
require: ρ > 1: step-size growth rate
require: σ ∈ (0, 1): Armijo constant
require: α_min > 0: minimum step size
require: f_t: objective function for the mini-batch at step t
require: d_t: descent direction at step t

α ← ρ α_{t−1}
repeat
    α ← γ α
    θ̃_t ← θ_{t−1} − α d_t
until f_t(θ̃_t) ≤ f_t(θ_{t−1}) − σ α g_t^⊤ d_t or α < α_min
if α < α_min then
    α_t ← 1
else
    α_t ← α
end
θ_t ← θ̃_t

Heuristic Armijo line search is a natural choice because it is not directed at maximizing improvement, but rather at guaranteeing that we move within a region where a first-order Taylor approximation roughly holds. Our estimation of the pseudo-Hessian requires us not to move too fast: if the objective function changes too quickly from θ_t to θ_{t+1}, it becomes less likely that a local quadratic approximation holds.

Heuristic Armijo line search requires extra function evaluations to find a good step size. Although this is an expensive operation, it is not uncommon for second-order methods that base their descent direction on an estimate of the Hessian. An estimate is inherently noisy, which can result in poor descent directions and instability of the algorithm.

In the original SFN [Dauphin et al., 2014b], they do not use a learning rate; instead they use damping for stability. In their case d_t = (|H_t| + λI)^{-1} g_t, where the damping coefficient λ is chosen from {10^0, 10^1, . . . , 10^6} such that it maximizes the improvement. This requires six extra function evaluations for every mini-batch.

A big drawback of heuristic Armijo line search is that it introduces three extra hyperparameters. To prevent extensive parameter grid searches we use the parameters found by Vaswani et al. [2019]:

• γ = 0.9,

• ρ = 2^{b/n}, where b is the batch size and n is the size of the training set (in the case of a deterministic function, b = n = 1),

• σ = 0.1.

We have also added a minimum step size α_min, which is set to α_min = 10^{-5} in all our experiments. If the Armijo line search does not find a viable step size larger than the minimum step size, a step is enforced and α_t is set to 1 to restart the search for the optimal step size at the next t. This prevents endless search evaluations in the case of a bad descent direction.


4—Experiments & Discussion

We will test and investigate the properties of the two algorithms introduced in Section 3 using four different problems:

• Powell singular function [Powell, 1962]

• Stochastic Rosenbrock function [Yang and Deb, 2010]

• MNIST classification with a fully connected neural network [LeCun et al., 1998]

• CIFAR-100 classification with modern neural network architectures [Krizhevsky et al., 2009]

Every section in this chapter consists of four parts:

• Description of the problem,

• Relevance of the problem,

• Description of the experimental setup,

• Results and discussion.

We choose to discuss the results directly after showing them; otherwise a lot of backtracking would be required while reading. We end with some closing remarks after presenting all the results.

4.1 Powell singular function

4.1.1 Description of the problem

The Powell singular function was introduced by Powell [1962]. It is a convex, unconstrained optimization problem defined by:

f(x) = (x_1 + 10x_2)² + 5(x_3 − x_4)² + (x_2 − 2x_3)⁴ + 10(x_1 − x_4)⁴.  (4.1)

The problem has as a starting point x_0 = [3, −1, 0, 1] and a global minimum at x* = [0, 0, 0, 0], with f(x*) = 0.

4.1.2 Relevance of the problem

The Powell singular function is a standard reference problem for global optimization algorithms [Steihaug and Suleiman, 2013]. Since this problem is convex, it allows us to demonstrate the usefulness of the pseudo-Hessian without ever relying on its saddle-point-repulsion properties.


4.1.3 Description of the experimental setup

We will perform two experiments on the Powell singular function:

1. We will compare the performance of SGD, SGD+Momentum, SGD+Armijo, Adam, PSFN and PSFN+Armijo on the Powell singular function from starting point x_0 = [3, −1, 0, 1].

To make a fair comparison we search for the optimal step size for each algorithm. The search is done equidistantly on a log scale from α = 1 to α = 1e−6 in 30 steps. All other parameters are set to their defaults: Momentum (β = 0.9), Adam (β_1 = 0.9, β_2 = 0.999), PSFN (β = 0.9, ε_d = 10e−4).

2. We will compare the descent direction found by the pseudo-Hessian, d_t = H̃_t^{-1} g_t, with the descent direction found by Newton's method, d_t = H_t^{-1} g_t. We can do this because Newton's method converges within 30 iterations. The comparison is done using the angle between the descent directions,

⟨d_1, d_2⟩ = cos^{-1}( (d_1 · d_2) / (||d_1|| ||d_2||) ).  (4.2)

To do this, we first store the pseudo-Hessian H̃_t, the gradient g_t and the parameters θ_t at each step for PSFN and PSFN+Armijo. Then we compare multiple descent directions with Newton's descent direction d_t = H_t^{-1} g_t. Specifically, we compare it with d_t = g_t, which is our baseline, with d_t = H̃_t^{-1} g_t, which is the Newton-like descent direction found by the pseudo-Hessian, and with d_t = H_diag^{-1} g_t, where H_diag is the diagonal of the true Hessian. (A small sketch of the angle computation is given after this list.)

We hypothesize that the estimate of the pseudo-Hessian is more informative than the diagonal of the Hessian, because it is designed to be used in this manner; the effect of the off-diagonal elements of the Hessian is accounted for in the diagonal of the pseudo-Hessian, while this is obviously not the case for the diagonal of the Hessian. Thus we would expect the descent direction using the pseudo-Hessian to look more like that of Newton's method than the one using only the diagonal of the Hessian.
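The angle comparison of Equation (4.2) is a one-liner; a sketch of our own:

```python
import numpy as np

def angle_degrees(d1, d2):
    # Angle between two descent directions, Equation (4.2).
    cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(angle_degrees(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # 45.0
```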

4.1.4 Results and discussion

Performance of PSFN and PSFN+Armijo

Figure 4.1 gives a comparison of the performance of the different algorithms. We depict the first 2000 iterations because by then all algorithms have mostly converged.

We notice the following things:

• SGD+Armijo outperforms SGD as expected. Heuristic Armijo line search finds a good step size at every step.

• Momentum and Adam both outperform SGD+Armijo, but Adam becomes unstable towards the end.

• PSFN and PSFN+Armijo are both the clear winners; they both find lower function values.

• PSFN without Armijo line search finds a lower function value, but is erratic in its behavior, which is undesirable. It seems that the Armijo line search restricts the step size when the pseudo-Hessian estimation is poor, smoothing the descent.

• PSFN does not only find a lower function value; its speed in the beginning is comparable with that of SGD+Momentum and Adam.


Figure 4.1: Comparison in performance of SGD (α = 0.005), SGD+Momentum (α = 0.005), SGD+Armijo, Adam (α = 0.6), PSFN (α = 0.1) and PSFN+Armijo on the Powell singular function. The step sizes are found by doing an equidistant search on a log scale from α = 1 to α = 10^{-6} in 30 steps.

Performance of the pseudo-Hessian

Figures 4.2 and 4.3 compare the angles found between the descent direction of Newton's method on the one hand, and the gradient, the descent direction given by PSFN, and the descent direction using the diagonal of the Hessian on the other hand.

Figure 4.2: Histograms showing the angle between Newton's descent direction d_t = H^{-1} g_t and the descent directions used by gradient descent (d_t = g_t), PSFN (d_t = H̃^{-1} g_t) and d_t = H_diag^{-1} g_t, where H_diag is the diagonal of the Hessian.

We notice the following things. The angle between the descent direction when using the gradient and the descent direction of Newton’s method is always smaller than 90 degrees. This is expected because it is a convex landscape and so the findings of Wang et al. [2013] apply.

Both the descent directions found by the pseudo-Hessian and by the diagonal of the Hessian have a lot more mass close to zero. This means that both find descent directions closer to that of Newton's method. The mass distribution of the pseudo-Hessian is closer to zero than that of the diagonal of the Hessian. This suggests that our estimated pseudo-Hessian could outperform the true diagonal of the Hessian.


Figure 4.3: Histograms showing the angle between Newton's descent direction d_t = H^{-1} g_t and the descent directions used by gradient descent (d_t = g_t), PSFN+Armijo (d_t = H̃^{-1} g_t) and d_t = H_diag^{-1} g_t, where H_diag is the diagonal of the Hessian.

We also see that the pseudo-Hessian sometimes produces angles larger than 90 degrees, meaning that the descent direction is sometimes moving away from the right direction. The fact that PSFN finds angles larger than 90 degrees also explains why the descent is erratic in behavior. Since the step size is constant, and the descent direction sometimes moves away from the right direction, the result is an increase in the function value. This also explains why PSFN+Armijo does not suffer from this problem and is very smooth. During the steps where PSFN finds a descent direction that moves away from the minimum, the step size becomes very small.

4.2 Stochastic Rosenbrock function

4.2.1 Description of the problem

The Rosenbrock function was introduced by Rosenbrock [1960]. It is also known as the Rosenbrock valley, because the global minimum lies inside a long and narrow valley. It is a non-convex, unconstrained optimization problem defined as:

f(x) = (a − x_1)² + b(x_2 − x_1²)².  (4.3)

The problem has a global minimum at x* = (a, a²), where f(x*) = 0.

Yang and Deb [2010] extend the Rosenbrock function by introducing a stochastic version:

f(x) = (a − x_1)² + bε(x_2 − x_1²)²,  (4.4)

where ε ∼ Uniform(0, λ), λ > 0, which has the same global minimum.
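A sketch of the stochastic objective in Equation (4.4) (our own code; default parameter values are illustrative):

```python
import numpy as np

def stochastic_rosenbrock(x, a=1.0, b=100.0, lam=1.0, rng=np.random.default_rng()):
    # Equation (4.4): the b-term is scaled by a fresh Uniform(0, lam) sample per evaluation.
    eps = rng.uniform(0.0, lam)
    return (a - x[0]) ** 2 + b * eps * (x[1] - x[0] ** 2) ** 2

x_star = np.array([1.0, 1.0])          # global minimum at (a, a^2)
print(stochastic_rosenbrock(x_star))   # 0.0 regardless of the noise
```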

4.2.2 Relevance of the problem

Rosenbrock is a standard reference problem for global optimization algorithms. The Stochastic Rosenbrock function maintains all of the difficulties of the Rosenbrock function, but adds a stochastic layer to the problem. This is another problem where our algorithm cannot benefit from its saddle-point-repulsion qualities. However, it is useful in demonstrating the quality of our pseudo-Hessian estimation in the presence of stochasticity.

4.2.3 Description of the experimental setup

We want to test how robust PSFN is in the context of stochasticity. We do this by comparing the performance of SGD, SGD+Momentum, SGD+Armijo, Adam, PSFN and PSFN+Armijo on the Stochastic Rosenbrock function in the presence of varying levels of stochasticity, taking λ ∈ {1, 3, 5}. For our experiments we choose (a, b) = (1, 100), start from x_0 = [0, 0] and run for 10,000 iterations, in line with Henriques et al. [2019].


Figure 4.4: Comparison of the performance of SGD (α = 0.002), SGD+Momentum (α = 0.001), SGD+Armijo, Adam (α = 0.002), PSFN (α = 0.56) and PSFN+Armijo on the Stochastic Rosenbrock function with λ ∈ {1, 3, 5}. SGD and SGD+Momentum for λ = 3 and λ = 5 are left out because of their explosive behavior.

We first tune the algorithms to find the optimal step size for each of them, averaged over five runs on the Stochastic Rosenbrock function with λ = 1. For every run we use a different random seed. The search is done equidistantly on a log scale from α = 1 to α = 1e−6 in 30 steps. All other parameters are set to their defaults: Momentum (β = 0.9), Adam (β_1 = 0.9, β_2 = 0.999), PSFN (β = 0.9, ε_d = 10e−4).

Then we repeat the experiment, but increase λ to 3 and 5, while keeping the parameters of the different optimizers the same. This will give an indication of how sensitive the different optimizers are to stochasticity.

4.2.4 Results and discussion

Figure 4.4 compares the performance of the different algorithms. We have included neither SGD with λ = 5 nor SGD+Momentum with λ = 3 and λ = 5 in the plots, because their function values exploded.

We notice a few things. SGD+Armijo outperforms SGD, even though SGD has been tuned. The adaptive learning rate found with heuristic Armijo line search still adds value, as predicted by Vaswani et al. [2019]. Momentum performs very well with λ = 1, but does not generalize to λ = 3 and λ = 5. The parameter choice is very sensitive to the stochasticity. SGD suffers from the same problem.

PSFN and PSFN+Armijo both perform very well, systematically outperforming the competition—both in their initial speed and in their final function value. The linear regression seems to find a good estimate of the pseudo-Hessian even in the presence of stochasticity. PSFN does seem to get stuck around a function value of 10^{-10}, probably because of the constant step size. In the experiments with the Powell function we showed that the estimated pseudo-Hessian sometimes gives a poor descent direction; if Armijo line search is not present to choose a small step size, this increases the function value. At a function value of 10^{-10}, PSFN seems to have reached an equilibrium between taking a few steps in the direction of the minimum and taking a few steps away from it.


4.3 MNIST classification with a fully connected neural network

4.3.1 Description of the problem

The MNIST database was introduced by LeCun et al. [1998]. It contains 60,000 training examples and 10,000 test examples of hand-written digits. All images are centered, size-normalized to fit 28 × 28 pixels and in greyscale. We will try to classify the images by using a fully connected neural network with one hidden layer of size 1000.

4.3.2 Relevance of the problem

MNIST is commonly used to test the performance of new machine-learning methods, making it a good benchmark for comparison. Baldi and Hornik [1989] have shown that neural networks with one hidden layer only have one minimum, the global minimum. All other stationary points are saddle points. Dauphin et al. [2014b] have shown that the local landscapes around saddle points are plateaus. Our algorithms are designed to address both problems, which should show in the performance of our optimization algorithms. PSFN+Armijo searches for the ideal step size using the work of Vaswani et al. [2019]. They also use this problem as a benchmark, making it a fair comparison.

4.3.3 Description of the experimental setup

We will use a fully connected neural network with one hidden layer of size 1000. The hidden layer has ReLU activation units and there is a Softmax on the output layer. The loss is calculated using the cross entropy between the output and the target.
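Written out in PyTorch, the described network could look as follows (a sketch of our own, not the exact training script used for the experiments; nn.CrossEntropyLoss applies the softmax internally):

```python
import torch.nn as nn

# Fully connected network for MNIST: one hidden layer of size 1000 with ReLU.
model = nn.Sequential(
    nn.Flatten(),                 # 28 x 28 greyscale image -> 784-dimensional vector
    nn.Linear(28 * 28, 1000),
    nn.ReLU(),
    nn.Linear(1000, 10),
)
criterion = nn.CrossEntropyLoss()
```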

We will compare SGD, SGD+Momentum, SGD+Armijo, Adam, PSFN and PSFN+Armijo. Each optimizer is run on five different seeds for 100 epochs. We will use the default settings in the case of SGD (α = 0.01), SGD+Momentum (α = 0.01, β = 0.9) and Adam (α = 0.01, β_1 = 0.9, β_2 = 0.999). For PSFN and PSFN+Armijo we will do a grid search over β ∈ {0.7, 0.9, 0.99, 0.999} and ε_d ∈ {1, 1e−1, . . . , 1e−5}. In the case of PSFN we will also search {1, 10e−1, . . . , 10e−5} for the best step size α.

During training we will use a batch size of 128, in line with Vaswani et al. [2019], for a fair comparison with SGD+Armijo. We will use wall time, instead of epochs, on the x-axis, to give an impression of how the different algorithms compare in computational cost.

We will only be looking at the train loss and not the validation loss, because that is what our optimizer tries to minimize. Although good performance on the validation loss is a desirable trait, it lies outside the scope of this thesis.

We expect two things:

• We expect SGD+Armijo to outperform SGD, SGD+Momentum and Adam, since we are partly reproducing the work of Vaswani et al. [2019], who found SGD+Armijo outperforms all standard optimizers.

• We expect PSFN and PSFN+Armijo to outperform all other methods since optimizing this network should be an ideal problem for PSFN. This is because all stationary points are saddle points or the global minimum, and all other methods do not have a good way to escape saddle points.

4.3.4 Results and discussion

Figure 4.5 gives a comparison of the performance of the different algorithms. We only show the first 100 minutes, because it is clear that everything interesting has happened within that range.

Figure 4.5: Comparison of the performance of different optimizers on MNIST with a neural network with one hidden layer of size 1000. All optimizers use their default parameters, except for PSFN+Armijo, which uses β = 0.999 and ε_d = 1e−4. After 67 minutes the line plot stops for PSFN+Armijo, because the train loss has become zero, which cannot be plotted on a log scale.

Let us note three things before discussing the results of the figure:

1. There is no line plot of PSFN, because every combination of parameters we tried performed very poorly. In earlier experiments we found that the pseudo-Hessian can give very poor descent directions. It seems that Armijo line search is needed to keep the step sizes small in those cases.

2. The line plot of PSFN+Armijo suddenly stops after 67 minutes; this is because the loss has become perfectly zero, which cannot be plotted on a log scale.

3. During our experiments on MNIST we decided to make a small adaptation to PSFN: if the loss of a mini-batch equals zero, we skip all further steps for this mini-batch. This was because we noticed that some mini-batches had loss zero, resulting in zero gradient on every parameter. These zero-loss mini-batches slowed convergence significantly, because they distorted our estimate of the pseudo-Hessian. When we have multiple batches with zero gradient, the linear regression will start estimating that the curvature of the landscape is zero, which is useless to update with. This poor estimate of the Hessian meant that the training time until convergence was almost twice as long.

Looking at the figure, we notice the following things:

• SGD+Armijo outperformed SGD, SGD+Momentum and Adam, as we expected. We have reproduced the results of Vaswani et al. [2019].

• The loss of PSFN+Armijo became perfectly zero after 67 minutes. This was very surprising, because we had never come across a neural network trained on MNIST minimized to zero within machine precision. This supports our intuition that PSFN+Armijo can escape the saddle points of a neural network with one hidden layer.

• The exponential decay rate of PSFN+Armijo was 0.999, which seems very high. However, we use linear regression to estimate h_t, which is susceptible to noise, especially close to zero. Luckily, because there is a plateau, the landscape does not change much in that direction, meaning that previous data points remain relevant.

Figure 4.6: Train loss on MNIST classification. For the first 100 epochs the network was trained with SGD+Armijo; from there it was trained with PSFN+Armijo (β = 0.999, ε_d = 1e−4). After 164 epochs the line plot stops, because the train loss has become zero, which cannot be plotted on a log scale.

Training from a saddle point

Because PSFN+Armijo outperformed the competition, it seems that it does what it claims. It seems to effectively escape saddle points and their surrounding plateaus. An alternative explanation would be that PSFN+Armijo never ends up in a saddle point or plateau to begin with. To test this hypothesis we opted to do an extra experiment.

We train SGD+Armijo for 100 epochs, because SGD+Armijo seems to be converged within this period, and then continue training it with PSFN + Armijo. If PSFN + Armijo can escape plateaus, then it should be able to decrease the loss further. The results of this experiment can be found in Figure 4.6.

The experiment seems to demonstrate an ability to escape plateaus. SGD+Armijo made only very small progress after 60 epochs; when the optimizer was switched, the loss continued to decrease to zero, as in the other experiments.

Training on a different architecture

In our experiments we have used a very specific architecture, a fully connected neural network with only one hidden layer, that has been chosen in favor of our method. The performance immediately made us curious as to how it would perform on a more standard architecture used for MNIST classification, i.e. more than one hidden layer and convolutional layers. To satisfy our curiosity we repeated the experiments above on a more standard architecture:

• We did not do a grid search for PSFN+Armijo, but instead used the parameters found in the previous experiment.


Figure 4.7: Comparison of the performance of different optimizers on MNIST with a neural network with two convolutional layers and a fully connected hidden layer. All optimizers use their default parameters, except for PSFN+Armijo, which uses β = 0.999 and ε_d = 1e−4.

• We choose the architecture from "Image classification (MNIST) using Convnets", which is a showcase example of PyTorch. Their showcase neural network consists of two convolutional layers with a kernel of size three, a maxpool with a kernel of size two, followed by a fully connected layer of size 128 and an output layer. All three hidden layers have ReLU activation units and there is a Softmax on the output layer.

See Figure 4.7 for the results of this experiment. Although the result is less impressive than with the previous architecture, PSFN+Armijo still performs well. PSFN+Armijo is fast and reaches a lower loss than the competition. This experiment seems to demonstrate that our method also works when the number of hidden layers increases and in the presence of convolutional layers. Although it seems less stable than we would prefer, it is hard to quantify this instability because of the log scale of the figure.

4.4 CIFAR-100 classification with Resnet-18

4.4.1 Description of the problem

The CIFAR-100 database was introduced by Krizhevsky et al. [2009]. It contains 100 classes with 500 examples per class in the train set and 100 examples per class in the test set. All images are 32 × 32 pixels and in color. We will try to classify the images using Resnet-18 [He et al., 2016], which is a standard neural network for image classification.

4.4.2 Relevance of the problem

We have added this problem to check whether the results generalize to standard architectures on more difficult problems. All previous problems were chosen to demonstrate certain qualities of the algorithm. In contrast, Resnet-18 [He et al., 2016] is a standard architecture that uses both convolutional layers and skip connections. Resnet-18 has 11.69 × 10^6 parameters, making it a very high-dimensional problem to solve.


Figure 4.8: Comparison of the performance of different optimizers on CIFAR-100 with Resnet-18. All optimizers use their default parameters, except for PSFN+Armijo, which uses β = 0.99 and ε_d = 1e−3.

4.4.3 Description of the experimental setup

Training Resnet-18 on CIFAR-100 requires significantly more computational resources. This prevents us from doing an extensive grid search and running each optimizer multiple times on different seeds.

We choose to compare SGD, SGD+Momentum, SGD+Armijo, Adam, and PSFN+Armijo. Each optimizer is run for 400 epochs. We will use the default settings for SGD (α = 0.01), SGD+Momentum (α = 0.01, β = 0.9) and Adam (α = 0.01, β1 = 0.9, β2 = 0.999). For PSFN+Armijo we will do a small grid search over β ∈ {0.99, 0.999} and ε_d ∈ {1e−2, 1e−3, 1e−4}. During training we will use a batch size of 128.
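A rough sketch of this setup, using only standard torchvision and torch.optim components, could look as follows; SGD+Armijo and PSFN+Armijo appear only as a placeholder comment, since neither is publicly available as a standard optimizer.

```python
import torch
import torchvision
import torchvision.transforms as T

# CIFAR-100 training data with the batch size used in the experiments.
train_set = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True, transform=T.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# One factory per optimizer, so every run starts from a fresh Resnet-18.
optimizer_factories = {
    "SGD":          lambda p: torch.optim.SGD(p, lr=0.01),
    "SGD+Momentum": lambda p: torch.optim.SGD(p, lr=0.01, momentum=0.9),
    "Adam":         lambda p: torch.optim.Adam(p, lr=0.01, betas=(0.9, 0.999)),
    # SGD+Armijo and PSFN+Armijo (grid over beta in {0.99, 0.999} and
    # eps_d in {1e-2, 1e-3, 1e-4}) are not part of torch.optim.
}

for name, make_opt in optimizer_factories.items():
    model = torchvision.models.resnet18(num_classes=100)  # fresh model per run
    optimizer = make_opt(model.parameters())
    # ... standard 400-epoch training loop tracking the train loss goes here ...
```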

We will only be looking at the train loss and not the validation loss, because that is what our optimizer tries to minimize. Although good performance on the validation loss is a desirable trait, it lies outside the scope of this thesis.

4.4.4 Results and discussion

Figure 4.8 compares the performance of the different algorithms.

We notice that SGD outperforms SGD+Armijo. This is very surprising and in conflict with the work of Vaswani et al. [2019]. Apparently, heuristic Armijo line search does not manage to find an optimal step size here. Throughout all our experiments we have kept the parameters of Armijo line search constant, as described in Section 2.5. These hyper-parameters might not be ideal for Resnet-18. Since we use the same hyper-parameters in PSFN+Armijo, this might be an indicator that they are not ideal in that case either. PSFN+Armijo does not outperform the competition. At this point it is hard to say why this is the case, but we have a few possible explanations:

• The parameters for heuristic Armijo might not have been optimal. This would result in step sizes that are too large, leading to a poor estimate of the pseudo-Hessian and therefore a poor descent direction.


• Resnet-18 has 11.69 × 10^6 parameters. It might be that the pseudo-Hessian becomes less effective as the dimensionality of the problem increases. The pseudo-Hessian attempts to compress the information of the Hessian—which has D^2 elements—into a matrix with only D elements.

• The architecture of the network, and particularly skip connections, might prevent the presence of plateaus. In that case we would not expect our method to outperform the competition.

4.5 Closing remarks

Overall, we consider PSFN and PSFN+Armijo promising algorithms, because they performed well in all cases we tested them on, except for CIFAR-100 image classification with Resnet-18. There are large knowledge gaps at this point, since we have only tested them on a limited number of problems, in which we vary only a limited number of hyper-parameters. To get conclusive results they must be tested in a more rigorous manner, but this remark does not apply uniquely to our gradient-descent method.

There is one other obvious limitation to the experiments presented here: none of them are compared with the original Saddle-Free Newton method. Although we wanted to do this, it was not possible in the available time. The authors have not released an official version of their algorithm, and their algorithm is difficult to implement, even with a perfect description. It does not help that their pseudo-code contains many parts that are ill-defined.

The pseudo-Hessian is not a true substitute for the full Hessian. Ideal cases excluded, it is simply impossible to fully capture the Hessian in a diagonal matrix. Even with these limitations, we think it is fair to claim from these experiments that the pseudo-Hessian can be valuable in the context of gradient descent. It can clearly be used to find a good descent direction in the cases of the Powell singular function, the Stochastic Rosenbrock function and even in neural networks during MNIST classification.


5—Future Work

We think we have convincingly demonstrated that the pseudo-Hessian estimated by gradient linear regression can be of value. However, at this point we cannot claim that we fully understand its properties or possible uses. We would like to use this opportunity to outline some interesting directions for further research. There are many, but we will limit ourselves to three that we consider important.

5.1 Understanding the pseudo-Hessian

We have demonstrated that we can estimate the pseudo-Hessian in an inexpensive way and that it is inexpensive to work with. We have shown empirically that it could be valuable during gradient descent, but other than that we have no guarantees as to its performance. If we want to make the pseudo-Hessian a part of gradient-descent methods, then a better understanding of its properties will be necessary.

Some questions that we recommend pursuing are:

• In what function spaces is the pseudo-Hessian a good estimate of the Hessian?

• How does the descent path influence the estimate of the pseudo-Hessian? We can imagine that the estimated pseudo-Hessian only gives an accurate representation of the function in particular directions, specifically those close to previous descent directions.

• There is no reason to assume that it is optimal to have the same exponential decay rate β for every step t. Is there an optimal β_t for every step? We think it plausible that there is a relationship between the distances traveled along the descent path—and particularly between θ_{t−1} and θ_t—and the optimal exponential decay rate β_t.

• What is the relationship between the dimensionality of the problem and the usefulness of the pseudo-Hessian? We hypothesize that the pseudo-Hessian becomes less useful as the dimensionality increases, since the pseudo-Hessian tries to do the job of a matrix with D^2 nonzero elements, using only D parameters.

5.2 Bayesian linear regression

Using Bayesian linear regression instead of linear regression would be a natural extension of our estimate of the pseudo-Hessian.

Currently, when h_{t,i} (the i-th element of h_t) is small, this could have two causes:

• The curvature in this direction is small, which would indicate a plateau, requiring a big step in this direction.

• The curvature in this direction is not small but there is high variance in the gradients in this direction, requiring a small step in this direction.


This difficulty is currently solved by using Armijo line search, which limits the step size when needed. However, this problem could also be solved by using the variance of the estimate, by restricting the step size when there is a lot of uncertainty. For this we could use Bayesian linear regression.

Estimating the pseudo-Hessian using Bayesian linear regression is fairly simple. In fact, we originally derived the univariate Hessian estimate in Equation (3.6) from a Bayesian perspective. We chose not to include it in our method section, because we have not found an effective way of incorporating uncertainty into our algorithm.

Although this is not the place to go into too much detail, we can give a short summary of how we approached this derivation. We start by assuming that the gradients $g_t \sim \mathcal{N}(\beta_t^{\top}\bar{\theta}_t, \sigma_t^2)$, where $\beta_t = [\beta_{1,t}, \beta_{2,t}]^{\top}$ for some $\bar{\theta}_t = [\theta_t, 1]^{\top}$, with $\mathbb{E}(\beta_{1,t}) = h_t$. The assumption that the gradients are normally distributed is not unusual in the context of neural networks; this is also done by Roux et al. [2008]. In the case of neural networks, we work with the mean gradient of the mini-batch, which is approximately normally distributed for large samples according to the Central Limit Theorem.

We choose a Normal-Inverse-Gamma prior,
\begin{align}
p(\beta, \sigma^2) &= \mathcal{N}\Gamma^{-1}(\mu, \Lambda^{-1}, a, b) \tag{5.1}\\
&= \mathcal{N}(\beta \mid \mu, \sigma^2 \Lambda^{-1})\,\Gamma^{-1}(\sigma^2 \mid a, b), \tag{5.2}
\end{align}
to set up a system of equations such that the prior at time t can be expressed in terms of the posterior at time t−1. We cannot simply use the posterior at time t−1 as the prior at time t, because the properties of the gradient will vary as a function of the location θ_t. To solve this problem we use a power prior [Ibrahim et al., 2015] to move from the posterior at time t−1 to the prior at time t. It introduces a diffusion parameter 0 < γ < 1 that attempts to quantify the heterogeneity of the function.

This results in
\begin{align}
p(\beta_t, \sigma_t^2) &\propto p(\beta_{t-1}, \sigma_{t-1}^2)^{\gamma}\, p(\beta_0, \sigma_0^2) \tag{5.3}\\
&\propto p(\beta_{t-1} \mid \mu, \sigma^2 \Lambda^{-1})^{\gamma}\, p(\sigma^2 \mid a, b)^{\gamma} \tag{5.4}\\
&\propto p(\beta_{t-1} \mid \mu, \gamma\sigma^2 \Lambda^{-1})\, p(\sigma^2 \mid \gamma a, \gamma b), \tag{5.5}
\end{align}

where $p(\beta_t, \sigma_t^2)$ is our prior at time t, $p(\beta_{t-1}, \sigma_{t-1}^2)$ is our posterior at time t−1 and $p(\beta_0, \sigma_0^2)$ is our initial prior. For $p(\beta_0, \sigma_0^2)$ we choose the improper prior $p(\beta_0, \sigma_0^2) \propto 1$, because we have no prior information about our optimization problem.

We end up with something that can be interpreted as Bayesian learning, where the parameters are updated using the following equations:
\begin{align}
\mu_t &= \left(\bar{\theta}_{t-1}^{\top}\bar{\theta}_{t-1} + \gamma\Lambda_{t-1}\right)^{-1}\left(\gamma\Lambda_{t-1}\mu_{t-1} + \bar{\theta}_{t-1}^{\top} g_t\right), \tag{5.6}\\
\Lambda_t &= \bar{\theta}_{t-1}^{\top}\bar{\theta}_{t-1} + \gamma\Lambda_{t-1}, \tag{5.7}\\
a_t &= \gamma a_{t-1} + \tfrac{1}{2}, \tag{5.8}\\
b_t &= \gamma b_{t-1} + \tfrac{1}{2}\left(g_t^2 + \gamma\mu_{t-1}^{\top}\Lambda_{t-1}\mu_{t-1} - \mu_t^{\top}\Lambda_t\mu_t\right). \tag{5.9}
\end{align}
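As a small numerical sketch of these updates for a single parameter (reading the product of $\bar{\theta}_{t-1}$ with itself as the outer product, so that the precision stays a 2 × 2 matrix; this interpretation is ours, and the code is illustrative rather than taken from our implementation):

```python
import numpy as np

def nig_power_update(mu, Lam, a, b, theta, g, gamma):
    """One per-parameter update of the Normal-Inverse-Gamma posterior with a
    power prior (Eqs. 5.6-5.9). mu is the 2-vector [slope, intercept] whose
    first entry estimates h_t, Lam the 2x2 precision, theta the previous
    parameter value, g the newly observed gradient, gamma the diffusion."""
    theta_bar = np.array([theta, 1.0])                        # design vector [theta, 1]
    Lam_new = np.outer(theta_bar, theta_bar) + gamma * Lam    # Eq. (5.7)
    mu_new = np.linalg.solve(Lam_new, gamma * Lam @ mu + theta_bar * g)  # Eq. (5.6)
    a_new = gamma * a + 0.5                                   # Eq. (5.8)
    b_new = gamma * b + 0.5 * (g ** 2                         # Eq. (5.9)
                               + gamma * mu @ Lam @ mu
                               - mu_new @ Lam_new @ mu_new)
    return mu_new, Lam_new, a_new, b_new
```

The first entry of the returned mean is the new estimate of h_t; its marginal uncertainty, available from Λ_t, a_t and b_t, is what a variance-aware step-size rule could use.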

5.3 A better saddle-repellent descent direction

Saddle-Free Newton descent actively repels saddle points, but there is no evidence to believe that it does this in an optimal way. We have theoretical guarantees, in a convex landscape, that $d_t = H_t^{-1}g_t$ is a good descent direction, by effectively scaling the gradient in the direction of the eigenvectors of the Hessian by the inverse of their respective eigenvalues. For more on this, see Section 2.4.2.

Figure 5.1: Image is from Alain et al. [2019]. It shows the relationship between directions with different Hessian eigenvalues and how much these directions can reduce the loss when moving along them with the optimal update step.

The derivation of the Saddle-Free Newton method in Appendix A.1 only shows that we are minimizing a first-order Taylor approximation within a trust region, but there is no reason to believe this trust region is optimal. Furthermore, the trust region is obtained by finding an upper bound on the difference between a first- and second-order Taylor approximation. It is possible that this upper bound holds very loosely; in that case, we could be unnecessarily restricting our possible descent directions.

The work of Alain et al. [2019] looked at the loss landscape of a LeNet architecture. They calculated the eigenvectors with the smallest and largest eigenvalues of the Hessian and looked at the optimal step size in those directions, by doing a line search along each eigenvector. When the eigenvalues were positive, they found that the optimal step size was in line with Newton's descent, approximately $\frac{1}{\lambda}$. When the eigenvalues were negative, the optimal step size seemed to be decorrelated with the eigenvalues; the optimal step size was not $\frac{1}{|\lambda|}$. At the same time, they did confirm that the descent directions with negative eigenvalues were valuable: on average, moving along directions with negative eigenvalues yielded a bigger drop in the loss function. See Figure 5.1 for a visual representation of their results.
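A toy version of this measurement, on a two-dimensional function where the full Hessian is available, illustrates the procedure; this is only an illustration of the idea, not a reproduction of their LeNet experiments.

```python
import torch

def loss_fn(w):
    # Small toy objective with one positive and one negative curvature direction.
    return w[0] ** 2 - w[1] ** 2 + 0.1 * (w[0] * w[1]) ** 2

w = torch.tensor([1.0, 0.5])
H = torch.autograd.functional.hessian(loss_fn, w)
eigvals, eigvecs = torch.linalg.eigh(H)          # eigenvectors are the columns

for lam, v in zip(eigvals.tolist(), eigvecs.T):
    # Brute-force line search along the eigenvector and compare with 1/|lambda|.
    alphas = torch.linspace(-3.0, 3.0, 601)
    losses = torch.stack([loss_fn(w + a * v) for a in alphas])
    best = alphas[losses.argmin()].item()
    print(f"eigenvalue {lam:+.3f}: best step {best:+.3f}, 1/|lambda| = {1.0 / abs(lam):.3f}")
```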

This critique exposes a significant flaw in the Saddle-Free Newton method. Nevertheless, it does seem able to escape saddle plateaus more effectively than the competition. Finding a better descent direction that also actively escapes saddle plateaus, with stronger theoretical support, seems very promising.
