
MSc Artificial Intelligence

Master Thesis

FlipOut: Uncovering Redundant Weights via Sign Flipping

by

Andrei Constantin Apostol

12273805

July 16, 2020

48 ECTS 11/2019-07/2020

Supervisors:

M.Sc. Maarten C. Stol, Dr. Patrick D. Forré

Assessor:

Matthias Reisser

Universiteit van Amsterdam


Contents

1 Introduction
2 Background
   2.1 Pruning literature
      2.1.1 Why prune?
      2.1.2 Saliency criteria
      2.1.3 Regularization techniques
      2.1.4 Sparse learning
      2.1.5 Lottery tickets
   2.2 Baselines
      2.2.1 Magnitude pruning
      2.2.2 DeepHoyer
      2.2.3 SNIP
   2.3 Networks used
      2.3.1 VGG19
      2.3.2 ResNet18
      2.3.3 DenseNet121
3 Method
   3.1 Motivation
   3.2 FlipOut: applying the aim test for pruning
      3.2.1 Determining which weights to prune
      3.2.2 Perturbation through gradient noise
4 Related Work
   4.1 Deep-R
   4.2 Magnitude and uncertainty pruning
   4.3 Zeros, signs and the supermask
5 Experiments
   5.1 General Setup
   5.2 Dataset preprocessing and models
   5.3 Choosing the hyperparameters for FlipOut
   5.4 Comparison to baselines
   5.5 Is it just the noise?
6 Discussion
7 Limitations and future work
Appendices


Acknowledgments

I would like to thank my supervisors, Dr. Patrick D. Forré and M.Sc. Maarten C. Stol, for their sagely advice and guidance throughout the development of this thesis, for which I am deeply grateful.

I would like to thank the professors, teaching assistants and other academic staff at the University of Amsterdam for providing high quality courses and education, which have been of immeasurable value to me.

I thank Stijn Verdenius, Ioannis Gatopoulous, Bjarne de Jong and the rest of the research & development team at BrainCreators for creating an environment for learning & growth.

I thank my friends from Romania as well as all the great people that I met during my studies in Amsterdam for keeping things fun and enjoyable, even during these difficult times.

Finally, I would like to thank my parents, Daniela and Titi, for their invaluable support and love, which has helped me persevere and reach my goals.


Abstract

Modern neural networks, although achieving state-of-the-art results on many tasks, tend to have a large number of parameters in the form of weights, which increases training time and resource usage. This problem can be alleviated by pruning, i.e. selectively removing a subset of weights from the network. Existing methods, however, often require extensive parameter tuning or multiple cycles of pruning and retraining until convergence in order to obtain a favorable trade-off between sparsity and performance. To address these issues, we propose a novel pruning method which uses the oscillations around 0 that a weight has undergone during training (i.e. sign flips) in order to determine its saliency. Our method can perform pruning before the network has converged, requires little tuning effort due to having good default values for its hyperparameters, and can directly target the level of sparsity desired by the user. We perform experiments on a variety of object classification architectures and show that it is competitive with existing methods and achieves state-of-the-art performance for levels of sparsity of 99.6% and above for most of the architectures tested.

1 Introduction

The success of deep learning is motivated by competitive results on a wide range of tasks (Vaswani et al. (2017), Brock et al. (2019), Huang et al. (2017)). However, well-performing neural networks often come with the drawback of a large number of parameters, which increases the computational and memory requirements for training and inference. This poses a challenge for deployment on embedded devices, which are often resource-constrained, as well as for use in time-sensitive applications, such as autonomous driving or crowd monitoring. Moreover, costs and carbon dioxide emissions associated with training these large networks have reached alarming rates (Strubell et al. (2019)). To this end, pruning has been proven to be an effective way of making neural networks run more efficiently (LeCun et al. (1990), Hassibi and Stork (1993), Li et al. (2017), Han et al. (2015), Molchanov et al. (2017)).

By pruning, we refer to setting a subset of a neural network’s weights to exactly 0 and not updating them during backpropagation, thus effectively removing any influence those weights may have on the model’s output. Early works (Hassibi and Stork (1993), LeCun et al. (1990)) have focused on using the second-order derivative to detect which weights to remove with minimal impact on performance. However, these methods either require strong assumptions about the properties of the Hessian, which are typically violated in practice, or are intractable to run on modern neural networks due to the computation of the second-order derivative.


One could instead prune the weights whose optimum lies at or close to 0 anyway. Building on this idea, Han et al. (2015) propose training a network until convergence, pruning the weights whose magnitudes are below a set threshold, and allowing the network to re-train. This process can then be repeated iteratively. Frankle and Carbin (2019) show that higher levels of sparsity can be achieved without incurring a performance degradation with this method by simply resetting the remaining weights to their values at initialization after a pruning step. Yet, these methods require re-training the network until convergence multiple times, which can be a time consuming process.

A common alternative is to incorporate pruning as part of the optimization process. For instance, one can add regularization terms that have minima at 0 to the loss function. Notably, Louizos et al. (2018) propose a differentiable approximation of the L0 regularizer and Yang et al. (2020) use a slightly modified version of the Hoyer measure (Hoyer (2004)). Other successful approaches have also emerged; under the Sparse Variational Dropout framework proposed by Molchanov et al. (2017), dropout rates (Hinton et al. (2012), Srivastava et al. (2014)) are treated as unbounded learnable parameters. Optimizing the network with this method results in many of these rates getting updated to exactly 1, thus creating sparsity. Finally, Liu et al. (2020) use a learnable threshold below which all parameters are masked. All these methods, however, suffer from two problems, namely that they require extensive manual tuning of the hyperparameters in order to obtain a favorable accuracy-sparsity trade-off, and that the final sparsity of the resulting network cannot be predicted given a particular choice of these hyperparameters, which often means that the practitioner has to run these methods multiple times when applying them to novel tasks or datasets.

To summarize, we have seen that the pruning methods presented so far suffer from one or more of the following problems:

1. Computational intractability

2. Having to train the network to convergence multiple times

3. Requiring extensive hyperparameter tuning for optimal performance

4. Inability to directly target a final sparsity

A pruning method that has none of the above issues would be easily applicable in a diverse range of scenarios with little effort and cost on the practitioner's part. We note that by using a heuristic in order to determine during training whether a weight has a locally optimal value of low magnitude, pruning can be performed before the network reaches convergence, unlike the method proposed by Han et al. (2015). As such, we explore the following research question in this work:

Can a heuristic for determining whether a point represents a local optimum for a weight during training successfully be applied to pruning? And if so, is it possible to avoid all the aforementioned issues?

We propose one such heuristic, coined the aim test, which determines whether a value represents a local optimum for a weight by monitoring the number of times that weight oscillates around it during training, while also taking into account the distance between the two. We then show that this can be applied to network pruning by applying this test at the value of 0 for all weights simultaneously, and framing it as a saliency criterion. By design, our method is tractable, allows the user to select a specific level of sparsity and can be applied during training.

Our experiments, conducted on a variety of object classification architectures, indicate that it is competitive with respect to other pruning methods found in the literature, and can outperform them for sparsity levels of 99.6% and above. Moreover, we empirically show that our method has good default values, easing the burden of hyperparameter tuning.

Concluding the introduction, we present the reader with a roadmap for the rest of this thesis. Section 2 includes an overview of notable works from the pruning literature as well as details about the baselines and network architectures used in our experiments. Section 3 is dedicated to our methodology: a presentation of our pruning algorithm and the reasoning behind some particular design choices. In Section 4 we discuss similarities and differences with other related methods. The experimental results used to validate the utility of our proposed method are found in Section 5. In Section 6 we present a discussion in which we compile everything presented so far and answer our research question. We conclude the thesis with Section 7, which deals with limitations of our method and experiments as well as potential avenues for future work.

Part of this work has been submitted to the NeurIPS 2020 Conference and is currently under review.

2 Background

This section introduces the reader to the background necessary for the rest of this work. We begin by presenting a sample of previous research in the field of neural network pruning (Section 2.1); note that this listing is by no means exhaustive and only serves as a general overview in order to provide more context for the reader. We continue this section by detailing the baseline methods (Section 2.2) as well as the network architectures (Section 2.3) used for the experiments in Section 5. We assume that the reader is familiar with the fundamentals of deep learning, such as the backpropagation algorithm (Rumelhart et al. (1985)), convolutional neural networks (Fukushima (1988), LeCun et al. (1998)) and stochastic gradient descent (Bottou (1998)).

Before discussing the background, however, we introduce here the notation and definitions used throughout this work. All non-scalar values (vectors, matrices, tensors, etc.) are denoted in boldface, e.g. θ; no special notation is used for scalars, e.g. j. If a non-scalar is indexed such that its indexed value is a scalar, we will use scalar notation, e.g. θ_j. For a given neural network parameterised by θ of dimensionality d_θ, with || · ||_0 denoting the L0-norm (i.e. the number of non-zero elements), we define the following terms:

\text{sparsity} = 1 - \frac{\|\theta\|_0}{d_\theta}, \qquad \text{compression ratio} = \frac{d_\theta}{\|\theta\|_0}

In other words, the sparsity represents the ratio of total parameters that have been pruned, and the compression ratio is the total number of parameters divided by the number of remaining parameters.
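To make these two definitions concrete, the small sketch below (ours, not part of the original text) computes them for the parameters of a PyTorch model; the helper name is an illustrative choice.

```python
import torch

def sparsity_stats(parameters):
    """Compute sparsity and compression ratio as defined above.

    `parameters` is any iterable of tensors, e.g. model.parameters().
    """
    total = 0      # d_theta: total number of parameters
    nonzero = 0    # ||theta||_0: number of non-zero parameters
    for p in parameters:
        total += p.numel()
        nonzero += (p != 0).sum().item()
    sparsity = 1.0 - nonzero / total
    compression_ratio = total / nonzero
    return sparsity, compression_ratio

# Example usage on a toy model:
model = torch.nn.Linear(100, 10)
with torch.no_grad():
    model.weight[torch.rand_like(model.weight) < 0.9] = 0.0  # "prune" roughly 90% of the weights
print(sparsity_stats(model.parameters()))
```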

2.1 Pruning literature

2.1.1 Why prune?

We begin by addressing the practical benefits that can be achieved by pruning. Firstly, overfitting, a well-known challenge in training neural networks, can occur for overparameterised models. That is, the network memorizes all the samples in the dataset without separating the signal (patterns) from the noise (random variations). In practice, an indication of this occurring is a low training loss but a large test loss (the network is said to not be able to generalize). Common solutions include regularization techniques or simply using a smaller model. Pruning, however, can also help. Zhang et al. (2017) perform a series of experiments for the task of object classification in which they show that a sufficiently large network can obtain perfect accuracy on a training dataset, even when the labels or the pixels are randomly shuffled. The accuracy on the test set in these cases is, of course, no better than random guessing. Lee et al. (2019) repeat the experiment of training with random labels for a pruned network, showing that, indeed, overfitting does not occur in this case, and conjecture that this is due to the fact that removing parameters reduces the capacity of the network. Therefore, pruning can act as an effective method of regularization.

The other two benefits of pruning are reducing storage sizes and computational speedup. The way in which this manifests itself in practice depends on the pruning method used. For element-wise pruning (removing individual weights), a dense weight matrix θ can be converted to a sparse representation using, for instance, the compressed sparse row (CSR) or compressed sparse column (CSC) formats. Broadly speaking, these representations only store the indices of the nonzero elements, rather than the entire matrix, and allow for more efficient basic linear algebra subprograms (BLAS) to be performed. In the literature, it is common practice to use the resulting sparsity of the network as a proxy for measuring these gains (Han et al. (2015), Lee et al. (2019), Yang et al. (2020), Ding et al. (2019)). It is worth mentioning, however, that while there exists a strong correlation between sparsity and speedup, it is not a perfect metric; that is, one could construct two neural networks of the same architecture with equal levels of sparsity, yet obtain different measurements in terms of inference time, depending on how the active weights are distributed throughout the network. This is due to the fact that a weight in a filter from a convolutional layer gets reused multiple times when processing an input volume; as such, pruning a weight from a layer will reduce computation depending on the size of the volume used as input to that layer. In other words, pruning a weight from a layer that receives a larger input volume will generate a higher speedup.
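As a brief illustration of the sparse formats mentioned above (our own example, not part of the original text), a pruned weight matrix can be stored in CSR form with SciPy, which keeps only the non-zero values together with their index structure:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A dense weight matrix in which most entries have been pruned to exactly 0.
dense = np.array([[0.0, 0.3, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [1.2, 0.0, 0.0, -0.7]])

sparse = csr_matrix(dense)            # stores only the 3 non-zero entries
print(sparse.data)                    # [ 0.3  1.2 -0.7]
print(sparse.indices, sparse.indptr)  # column indices and row pointers
print(sparse.nnz / dense.size)        # fraction of weights remaining
```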

Structural pruning, on the other hand, removes entire rows or columns from the parameter matrix θ, e.g. the input connections of a neuron or the parameters that form a filter, with the goal of reducing dimensionality. This guarantees that storage size reduction and computational speedup can be achieved without any need for post-processing (converting to a sparse matrix format). Instead of sparsity, the number of remaining units (i.e. neurons or filters) is used as a metric (Louizos et al. (2018), Molchanov et al. (2017)). Alternatively, speedup can be directly measured in terms of the number of floating point operations (FLOPs) performed in a forward pass of the pruned network using a single sample (Li et al. (2017)).

Note that there exist other methods which achieve the same goals (i.e. acceleration and compression) and which are compatible with pruning, such as quantization (Gupta et al. (2015)), which is the practice of using limited numerical precision for training and inference, and knowledge distillation (Hinton et al. (2015)), whereby a small model is trained to mimic the output activations of a larger model. These methods are, however, beyond the scope of this thesis and are not discussed here.

2.1.2 Saliency criteria

A common approach to pruning is to estimate the effect of removing a weight on the value of the loss function. Specifically, the weights are ranked according to a saliency criterion, and those that fall at the bottom of this ranking are removed, in an attempt to minimize the performance degradation.

LeCun et al. (1990) note that the naive solution, which is to remove one parameter at a time from the network and re-compute the value of the loss function, is prohibitively expensive. Therefore, they use information from the second-order derivative in order to form an analytical solution. For computational feasibility, the Hessian of the loss function is approximated by a Taylor expansion and assumed to be diagonal. A saliency criterion based on it is then constructed, by which the weights are ranked. Those with the lowest saliency are removed, followed by a period of retraining, in an iterative process.

Hassibi and Stork (1993) have noticed that the Hessian is strongly non-diagonal for all problems that they have tested on, which violates the assumption of LeCun et al. (1990). With this observation, they propose a method which does not assume this property. They derive a formula which can be used to simultaneously compute the inverse Hessian and update the remaining parameters, eliminating the need for retraining after a pruning step. It is shown that the recursive calculation of the inverse Hessian is of the same computational complexity as computing the Hessian. This complexity, however, still proves to be impractical for large modern neural networks. Thus, research focus has shifted towards constructing heuristics which are more feasible to compute.

Han et al. (2015) use the magnitude of the weights as a pruning signal, adopting the pipeline of training, pruning and re-training from LeCun et al. (1990). By determining the pruning sensitivity of each layer (and setting their thresholds accordingly), the authors achieve sparsity rates of over 90% on common benchmarks. Han et al. (2016) extend this work by adding two more steps, namely trained quantization and Huffman coding, in order to reach higher compression rates.


Li et al. (2017) note that for commonly used CNN architectures (Simonyan and Zisserman (2015), Krizhevsky et al. (2012)) the convolutional layers contribute two orders of magnitude more FLOPs than the linear layers even though they only contain a small portion of the network's weights, due to reusing the same set of kernels across the input volume when performing the convolution operation. As such, they propose removing the filters with the lowest L1-norm.

As a final example, Liu et al. (2020) highlight several problems with methods that utilize the iterative pruning and finetuning pipeline. Specifically, weights which are regarded as unimportant according to the saliency criterion at a certain time step might become important later in training. Pruning, however, does not allow for this to happen, as those weights are frozen at 0. Moreover, manually determining the number of parameters to prune tends to be time consuming, and pruning at predetermined intervals (e.g. after a certain number of epochs, but not between two epochs) can be suboptimal. To address these issues, they propose masking all weights which fall outside of a threshold, which is separate for each layer and is optimized jointly with the parameters. More specifically, for the parameters θ^l ∈ R^{c_o × c_i} of a layer l and a threshold vector t^l ∈ R^{c_o}, the binary mask M^l determining which parameters are kept is computed as:

M^l_{ij} = S(Q^l_{ij}), \qquad Q^l_{ij} = F(\theta^l_{ij}, t^l_i) = |\theta^l_{ij}| - t^l_i

Here S(·) is the unit step function. The threshold vector is of the same dimensionality as the number of units (i.e. neurons or filters) in that layer. To make the threshold learnable, the derivative of the step function is approximated by the authors using the long-tailed estimator:

\frac{d}{dx} S(x) \approx H(x) = \begin{cases} 2 - 4|x| & \text{if } -0.4 \le x \le 0.4 \\ 0.4 & \text{if } 0.4 < |x| \le 1 \\ 0 & \text{otherwise} \end{cases}

A regularization term is imposed on the threshold such that it becomes larger, encouraging a higher level of sparsity. Note that with this formulation, the weights that are masked at one iteration can be updated such that they exceed the threshold and become active again. This method is, in effect, using a magnitude-based saliency criterion, where the number of masked weights is automatically determined.
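To make the mechanics of this learnable threshold concrete, the sketch below implements the step-function mask with the long-tailed gradient estimator H(x) as a custom autograd function; it is our own simplified reconstruction of the idea, not the authors' implementation.

```python
import torch

class LongTailedStep(torch.autograd.Function):
    """Unit step in the forward pass; long-tailed estimator H(x) in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        h = torch.zeros_like(x)
        inner = x.abs() <= 0.4
        outer = (x.abs() > 0.4) & (x.abs() <= 1.0)
        h[inner] = 2.0 - 4.0 * x[inner].abs()
        h[outer] = 0.4
        return grad_output * h

def masked_weights(theta, t):
    """theta: (c_out, c_in) weight matrix; t: (c_out,) per-unit learnable thresholds."""
    q = theta.abs() - t.unsqueeze(1)   # Q^l_ij = |theta_ij| - t_i
    mask = LongTailedStep.apply(q)     # M^l_ij = S(Q^l_ij), gradients flow through H(x)
    return theta * mask
```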


2.1.3 Regularization techniques

The L1 regularizer can be applied in order to encourage weights to be updated to lower values and help obtain sparse solutions, due to the form of its derivative, which is simply the sign of the weight (with the derivative at 0 typically estimated as 0). However, as Yang et al. (2020) note, this causes all weights to shrink by a constant factor irrespective of their magnitudes, which does not guarantee a high level of sparsity and could also affect performance. As such, there exists a body of work dedicated to alternative solutions.

Ideally, one could use the L0 regularizer, which penalizes the network for each weight that is not exactly at 0. Since it is not differentiable, however, it cannot be used directly in an end-to-end learning system.

Louizos et al. (2018) solve this by deriving a differentiable form of this regularizer through the reparameterisation trick proposed by Kingma and Welling (2014) and a modified variant of the concrete distribution (Maddison et al. (2017)). This term is then added to the loss function in order to impose sparsity on the network as part of the optimization process. Yang et al. (2020) introduce a set of regularizers based on the Hoyer measure, which estimates the sparsity of a vector by the ratio between its L1 and L2 norms. They show that this term has several desirable properties, such as differentiability, scale invariance and having minima along the axes (similar to the L0 norm). We use this method as a baseline and discuss it at length in Section 2.2.2.

Other regularization techniques have also been adapted in order to obtain sparsity. Gomez et al. (2019) posit that applying dropout to the subset of weights which are considered least useful according to some criterion (e.g. magnitude) at every iteration can make the network more tolerant to sparsification. Pruning is then performed post-training. Experiments show that, indeed, this can lead to better performing sparse networks. The same holds true when using DropConnect (Wan et al. (2013)) instead, which removes individual weights rather than units.

Under the framework of Variational Dropout (Kingma et al. (2015)), the dropout rates are treated as learnable parameters and optimized in tandem with the weights of the neural network. In the original work, however, the dropout rates are restricted to be upper bounded by 0.5 due to difficulties in training caused by large variance in the gradients. Molchanov et al. (2017) address this issue by proposing a method to reduce this variance, allowing the aforementioned restriction to be lifted. The authors show experimentally that this can lead to many dropout rates being updated to a value of exactly 1, thus effectively removing those weights from the network and achieving high levels of sparsity with little compromise in terms of performance.

Scardapane et al. (2017) show that by applying group Lasso (Yuan and Lin (2006)) and sparse group Lasso (Friedman et al. (2010)) on a neural network, one can simultaneously prune and apply feature selection, given an appropriate partitioning of the parameters. More specifically, for a neural network parameterised by θ whose elements are partitioned into a set of groups G, group Lasso is defined as:

R_{GL}(\theta) = \sum_{g \in G} \sqrt{d_g} \, \|g\|_2

Here, d_g denotes the number of parameters in group g and is used to normalize the regularization penalty with respect to the group's size. While this can achieve group-level sparsity (that is, either all elements in a group are 0 or none of them are), element-wise sparsity can also be desirable in addition to this. Thus, an L1 penalty can be added for each group, forming the sparse group Lasso regularizer:

R_{SGL}(\theta) = R_{GL}(\theta) + R_{L_1}(\theta)

The collection G is defined as the union of the neuron groups (weights which are outbound from each neuron) and bias groups (comprised of single-element bias weights). Under this formulation, different effects can be achieved depending on which group of weights is regularized. For instance, feature selection can be achieved via regularizing the neurons of the first layer, which receive the input features, while bias selection is a result of penalizing the bias groups. The authors demonstrate experimentally that using this group definition, both group Lasso and sparse group Lasso achieve performance similar to the classical Lasso or weight decay regularizers, while at the same time resulting in a smaller network.
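The sketch below shows one way these two penalties could be computed for a simple fully-connected PyTorch model, grouping the weights outbound from each input neuron together with single-element bias groups; the grouping, function names and default coefficient are our own illustrative choices, not the authors' code.

```python
import math
import torch

def group_lasso(model):
    """R_GL: sum over groups of sqrt(group size) * L2 norm of the group."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            # Neuron groups: the weights outbound from each input neuron (columns of the matrix).
            for g in module.weight.t():
                penalty = penalty + math.sqrt(g.numel()) * g.norm(p=2)
            # Bias groups: each bias is its own single-element group, so sqrt(1)*|b| = |b|.
            if module.bias is not None:
                penalty = penalty + module.bias.abs().sum()
    return penalty

def sparse_group_lasso(model, lam_l1=1.0):
    """R_SGL = R_GL + an element-wise L1 penalty on all parameters."""
    l1 = sum(p.abs().sum() for p in model.parameters())
    return group_lasso(model) + lam_l1 * l1

# Example usage: this penalty would be added to the task loss during training.
model = torch.nn.Sequential(torch.nn.Linear(20, 50), torch.nn.ReLU(), torch.nn.Linear(50, 10))
loss = sparse_group_lasso(model)
loss.backward()
```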

Wen et al. (2016) also apply group Lasso regularization in order to obtain sparsity and experiment with several ways of partitioning the weights into groups. A convolutional layer is parameterised by a tensor θ^l ∈ R^{N_l × C_l × M_l × K_l}, with each dimension representing, in order, the number of filters, the number of channels, and the height and width of each filter. With this in mind, we list below the partitioning methods tried by the authors and include examples of how those groups might be indexed:

• All weights that form a filter or a channel → θ^l_{n,:,:,:}, θ^l_{:,c,:,:}

• Weights at a spatial position across the c-th channel in all filters → θ^l_{:,c,m,k}

• An entire layer → θ^l

Ordinarily, filters are restricted to cubic shapes. However, pruning groups of weights at a specific spatial position removes this restriction and allows filters to be shaped arbitrarily. Note that regularizing entire layers is only possible when shortcut connections, such as those in residual networks (He et al. (2016)), are present; otherwise, all signal would be blocked from passing beyond that layer and the network would collapse. The utility of these partitioning schemes is demonstrated experimentally. Interestingly, multiple layers from a residual network can be pruned with little degradation in performance, demonstrating that, indeed, it can be optimal for a subset of layers to perform the identity mapping, as theorized in the original work of He et al. (2016), validating the utility of this architecture.

Ding et al. (2019) rely on weight decay to prune nonsalient weights. A desired number of non-zero entries Q is first picked; at each update, the parameters are ranked according to the impact caused by their removal, as estimated using a first order Taylor expansion:

L(\theta_{w \to 0}, x, y) = L(\theta, x, y) - \frac{\partial L(\theta, x, y)}{\partial w}(0 - w) + O(w^2)

|L(\theta, x, y) - L(\theta_{w \to 0}, x, y)| = \left| \frac{\partial L(\theta, x, y)}{\partial w} \, w \right| \qquad (1)

The top-Q weights as computed using Eq. 1 receive regular SGD updates, while the rest are only passively updated, i.e. only the gradient of the weight decay term is applied, driving them closer to 0. With this method, weights which are deemed unimportant at one time step are not permanently removed, allowing them to get updated and potentially become important later in training, similar to the method of Liu et al. (2020). Once training is done, the weights with the lowest magnitudes are pruned, such that exactly Q connections remain. Experiments suggest that this method can achieve up to 90% sparsity with no loss in accuracy for networks trained on the CIFAR10 benchmark (Krizhevsky (2009b), Krizhevsky (2009a)).
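As an illustration of the first-order saliency in Eq. 1, the sketch below computes |∂L/∂w · w| for every weight of a PyTorch model after a backward pass and returns a global top-Q mask; this is a simplified reconstruction for exposition, not the authors' code.

```python
import torch

def taylor_saliency_topq(model, loss, q):
    """Rank weights by |dL/dw * w| (Eq. 1) and return a flattened global top-Q mask."""
    loss.backward()  # populate .grad for every parameter
    saliencies = torch.cat([
        (p.grad * p).abs().flatten() for p in model.parameters() if p.grad is not None
    ])
    top_q = torch.topk(saliencies, q).indices          # the Q most important weights network-wide
    keep = torch.zeros_like(saliencies, dtype=torch.bool)
    keep[top_q] = True
    return keep                                        # True = receives a regular SGD update

# Example usage with a toy model and mini-batch:
model = torch.nn.Linear(10, 2)
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
mask = taylor_saliency_topq(model, loss, q=10)
```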

2.1.4 Sparse learning

Some research effort has been dedicated to sparse learning; that is, methods that maintain a set level of sparsity from initialization, so as to also reap the benefits of pruning in the training phase.

Dettmers and Zettlemoyer (2019) propose the following method:

1. Initialize a network parameterised by θ and randomly remove p% of its parameters

2. Compute gradients and momentum using a mini-batch of data

3. Perform an SGD update

4. Prune 50% of the weights with the smallest magnitudes in each layer

5. Regrow a number of weights in each layer according to the layer's momentum contribution

Steps 2-5 are then repeated. By regrow, we refer to allowing previously pruned weights to be updated via SGD. The authors argue that a large momentum term indicates that a weight is consistent in lowering the value of the loss function. Extrapolating this idea, the aggregated momentum of all weights of a layer can serve as an indication of that layer's efficiency. Thus, the number of weights that are regrown in each layer is proportional to its momentum contribution, normalized with respect to the entire network. In other words, weights from layers deemed less efficient are removed in favor of weights from more efficient ones. Since at each update the number of pruned weights is equal to the number of regrown weights, this method guarantees that a fixed level of sparsity is maintained throughout training.

This, however, requires computing the momentum terms for all weights at every iteration, regardless of the level of sparsity, thus limiting its benefit. To address this issue, Evci et al. (2019) propose updating the connectivity less frequently. From a network initialized such that it has a sparsity of s, every ∆T iterations the weights with the lowest magnitudes at each layer are removed, and an equal number of connections with the highest gradients are regrown, ensuring that the layerwise sparsity is preserved. The number of weights which are pruned and regrown each time is lowered according to a cosine annealing schedule. Indeed, the results confirm that using this approach reduces the number of FLOPs performed during training and achieves better performance as compared to other baselines. Moreover, the authors test three different strategies for sparse initialization: (1) uniform, where the sparsity of each layer is equal to the global sparsity, (2) Erdős-Rényi (Mocanu et al. (2018)), which scales the sparsity by the number of neurons at each layer, and (3) Erdős-Rényi-Kernel (ERK), which also takes into account the size of the filters. Importantly, the two latter strategies allocate more active weights to smaller layers, with ERK offering the best performance, even when used with other sparse learning methods.
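A minimal sketch of the prune/regrow bookkeeping in the momentum-based method described above is given below, assuming per-parameter masks and SGD momentum buffers are available; it only illustrates the allocation logic (names are ours) and omits details such as the decaying prune rate.

```python
import torch

def regrowth_allocation(momentum_buffers, masks, total_regrow):
    """Split `total_regrow` new connections across layers in proportion to each
    layer's share of the network's total absolute momentum over active weights."""
    contributions = torch.stack([
        (m.abs() * mask).sum() for m, mask in zip(momentum_buffers, masks)
    ])
    shares = contributions / contributions.sum()
    return [int(total_regrow * s) for s in shares]

def prune_smallest(weight, mask, fraction=0.5):
    """Zero out the `fraction` of currently active weights with the smallest magnitudes."""
    active = weight[mask.bool()].abs()
    k = int(fraction * active.numel())
    if k == 0:
        return mask
    threshold = torch.kthvalue(active, k).values
    new_mask = mask.clone()
    new_mask[(weight.abs() <= threshold) & mask.bool()] = 0
    return new_mask
```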

Other notable works from this category include those of Lee et al. (2019) and Bellec et al. (2018). However, we defer discussing these methods to Sections 2.2.3 and 4.1, respectively, in a more appropriate context.


2.1.5 Lottery tickets

The Lottery Ticket Hypothesis, proposed by Frankle and Carbin (2019), states as follows:

“A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.”

These subnetworks, coined winning lottery tickets, are found via iterative magnitude pruning (IMP), a repeated process of train-prune-reset-finetune, where the weights are rewound back to their original values after each pruning step. An explanation for this, as conjectured by the authors, is that in over-parameterised networks, the optimization algorithm searches for subnetworks with favorable initializations. Many follow-up studies have emerged in order to better understand this phenomenon. One such work is presented by Zhou et al. (2019), who run a series of ablation tests, which we discuss at greater length in Section 4.3.

Frankle et al. (2019) introduce instability analysis, which can be summarised as follows:

1. A network is trained for k iterations

2. Two copies of that network are created and further trained in parallel, using two different SGD data orders (i.e. the mini-batches are sampled differently)

3. After training, linear interpolation is performed between the weights of the two copies

4. The test error is measured for a number of points along the line generated by linear interpolation

5. The instability of the network is computed as the maximum increase in error on that line compared to the averaged error of the two networks

A network is said to be stable at iteration k if its instability at that point is close to 0. In other words, stability indicates that the network will reach minima which are linearly connected irrespective of the data order. The authors find that stability does not occur at initialization for large-scale networks and datasets, but at some early point in training (after 3% or less of a pre-determined number of iterations in their experiments). Moreover, they find that the subnetworks generated by IMP can be trained to match the accuracy of the original, unpruned network only if they are stable. As such, they revise IMP to do late resetting; that is, weights are rewound back to their values at an iteration where stability is met, rather than at initialization.

Morcos et al. (2019) explore the generality of lottery tickets. They do so by applying transfer learning to a dataset (target) using a ticket generated by training on a different, but related, dataset (source). The experiments demonstrate that winning tickets are indeed able to generalize, especially when the source dataset is large. This suggests that winning tickets found via IMP capture a general connectivity pattern in the network, rather than adapting it to a particular dataset.

The process of finding winning tickets, however, suffers from the same problem as the pruning method of Han et al. (2015), namely that it requires multiple rounds of pruning and retraining. You et al. (2020) discover that it is possible to draw winning tickets (i.e. prune and reset) early and maintain, or even surpass, the accuracy of tickets drawn after a full training cycle. This holds true even when using 8-bit quantization (Wu et al. (2018)) or large learning rates. As such, You et al. (2020) propose the Early Bird Train algorithm: at every epoch, the mask that would be generated by pruning, using the method of Liu et al. (2017), is computed. When the difference between the current mask and those from the previous epochs (in terms of their Hamming distance) is below a threshold, the ticket is extracted. In other words, this algorithm identifies the point at which further training does not change the pruning decision. The reliability of this method is then validated experimentally.

2.2 Baselines

We present the baseline methods that we use throughout our experiments. Note that all methods discussed here perform element-wise pruning, as opposed to structural pruning. This section only serves to familiarise the reader with how these methods function; for more details, such as the experimental setup or the particular hyperparameters used, please refer to Section 5.

2.2.1 Magnitude pruning

Han et al. (2015) propose a three-stage pruning pipeline in order to facilitate the usage of deep neural networks on mobile or embedded devices, while at the same time preserving accuracy. Specifically, this pipeline involves:

1. Initial pre-training of the network

2. Pruning weights whose magnitudes are below a user-defined threshold

3. Fine-tuning the network

The first phase is said to learn which connections are important. In other words, weights which have large magnitudes are considered to have a higher contribution to the loss function than weights whose magnitudes are small. Re-training is performed after pruning such that the surviving weights (i.e. those that did not get selected for pruning) can compensate for the performance degradation caused by pruning. Steps 2 and 3 can then be repeated iteratively in order to achieve higher levels of sparsity. The pruning threshold is chosen as a multiple of each layer's standard deviation, and is separate for every layer. Regularizers, such as the L1 or L2 norms, can also be added in order to encourage weights to be updated towards 0. Additionally, if a neuron's entire set of either input or output connections is pruned, that neuron will block all learning signal. Thus, one could safely prune that neuron, i.e. prune both its input and output weights entirely.

To ensure that performance degradation is minimal, an additional measure is taken: dropout ratio adjustment. Dropout and pruning have similar effects, in that both methods set a subset of the weights to 0. Dropout randomly samples the weights which are to be removed at every step of SGD, while pruning sets them to 0 permanently. Therefore, dropping out too many weights from an already pruned matrix can reduce the predictive variance of the model and impact accuracy. The authors propose lowering the dropout ratio proportionally to the number of weights that are pruned. In our experiments, however, this was not necessary, as we did not use dropout for any of the networks that we test.

The authors demonstrate that high levels of sparsity (over 90%) can be achieved without any impact on accuracy. These results, however, can only be achieved when retraining the weights, performing pruning iteratively and using regularization, indicating that these are crucial components of the pipeline. Indeed, performing no re-training, or pruning a larger number of weights only once, has yielded comparatively worse results.

For our experiments, we use two modifications proposed by Frankle and Carbin (2019). Firstly, for simplicity, the second step of the pipeline is modified to instead sort the weights by their magnitude and prune a percentage of those at the bottom of the ranking. The percentage of weights which get pruned at every iteration is a user-defined value, and allows for a desired amount of sparsity to be achieved with no parameter tuning.


Secondly, the authors have observed that for deep networks, where the number of weights per layer can vary significantly, pruning layers at the same rate can cause bottlenecks; that is, the smaller layers can block most of the signal from passing when pruned. To address this problem, the authors suggest global pruning; instead of making a pruning decision on a layer-by-layer basis, weights are ranked across the entire network. While this strategy is not a guarantee (small layers can still be pruned excessively and become bottlenecks), experiments show that it improves performance for large networks. This has the added benefit of not having to set a separate pruning rate or threshold for every layer; instead, one can simply specify how many weights from the entire network get pruned. We follow this suggestion and also employ global pruning.
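A minimal sketch of global magnitude pruning as described above is shown below: all weights are ranked together by magnitude and the lowest fraction is masked to zero. Function and variable names are ours, and this is an illustration rather than the exact implementation used in our experiments.

```python
import torch

def global_magnitude_prune(model, fraction):
    """Set the `fraction` of smallest-magnitude weights (network-wide) to zero
    and return the resulting binary masks, one per weight tensor."""
    weights = [p for p in model.parameters() if p.dim() > 1]  # skip biases
    all_magnitudes = torch.cat([p.detach().abs().flatten() for p in weights])
    k = int(fraction * all_magnitudes.numel())
    threshold = torch.kthvalue(all_magnitudes, k).values if k > 0 else -1.0
    masks = []
    with torch.no_grad():
        for p in weights:
            mask = (p.abs() > threshold).float()
            p.mul_(mask)           # prune in place
            masks.append(mask)     # reuse the mask to keep pruned weights at zero later
    return masks

# Example usage:
model = torch.nn.Sequential(torch.nn.Linear(100, 50), torch.nn.ReLU(), torch.nn.Linear(50, 10))
masks = global_magnitude_prune(model, fraction=0.9)
```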

Unlike Frankle and Carbin (2019), however, we do not reset the re-maining weights to their values at initialization after a pruning step. Our experiments are focused on comparing different pruning criteria, and thus this additional step would introduce a confounder and make direct com-parisons more difficult.

2.2.2 DeepHoyer

The Hoyer measure (Hoyer (2004)), used in compressed sensing as a sparsity measure, is defined as:

S(X) = \frac{\sqrt{n} - \left( \sum_{i=1}^{n} |x_i| \right) / \sqrt{\sum_{i=1}^{n} x_i^2}}{\sqrt{n} - 1}

Here, X represents a vector of dimensionality n. The terms √n in the numerator and √n − 1 in the denominator are used to normalize S(X) such that it lies in the range [0, 1]. A value of S(X) = 1 corresponds to a vector with a single nonzero element and S(X) = 0 to a vector whose elements all have equal values. Previous works use the Hoyer measure as a sparsity-inducing regularizer for neural networks and forego normalization entirely (Repetti et al. (2014), Krishnan et al. (2011)). Instead, only the ratio between the L1 and L2 norms of the vector is used:

R(X) = \frac{\sum_{i=1}^{n} |x_i|}{\sqrt{\sum_{i=1}^{n} x_i^2}} \qquad (2)

Yang et al. (2020) note several similarities between Eq. 2 and the L0 regularizer, which can be used to directly impose sparsity on the network. Both quantities have minima along the axes (where the weights are 0) and are scale invariant (i.e. R(αX) = R(X)). The Hoyer regularizer has the added benefit of being differentiable almost everywhere, meaning it can be directly optimized through SGD. However, the value of the Hoyer regularizer lies in the interval [1, √n], while the L0 norm is in [1, N]. In order to have them both operate in the same range, the authors propose the Hoyer-Square regularizer, which is simply:

HS(X) = R(X)^2 = \frac{\left( \sum_{i=1}^{n} |x_i| \right)^2}{\sum_{i=1}^{n} x_i^2}

For a network of L layers, the modified loss function takes the form:

L(\theta) + \lambda_{HS} \sum_{l=1}^{L} HS(\theta^l) \qquad (3)

Here, θ^l represents the weights at layer l, and λ_HS is a hyperparameter that controls the amount of Hoyer-Square regularization imposed. Other regularizers (such as L1 or L2) are compatible with this method and can be added to the loss function in tandem. The authors show that, due to the structure of its gradient, the Hoyer-Square regularizer induces a trimming effect: weights which are above a certain threshold get updated away from 0, while those below it get updated towards 0. We provide a proof of this effect in Appendix A. Pruning with the Hoyer-Square regularizer can then be performed via the following pipeline:

1. Train the network with the Hoyer-Square regularizer

2. Prune weights whose magnitudes are below a user-defined threshold

3. Fine-tune the network without the Hoyer-Square regularizer

The authors confirm experimentally that using this regularizer can obtain high levels of sparsity with little to no degradation in performance. Moreover, the weight distribution after training with the Hoyer-Square regularizer (but before pruning) is shown to have a sharp peak around 0 with long tails, as a consequence of the trimming effect. This is desirable since pruning will remove only the weights which are close to 0 (i.e. at the peak), and allows for pruning to be performed only once, unlike the method of Han et al. (2015) (Section 2.2.1).
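For concreteness, a minimal PyTorch sketch of adding the Hoyer-Square term of Eq. 3 to a task loss might look as follows; the value of λ_HS and the layer selection are illustrative choices, not the settings used in our experiments.

```python
import torch

def hoyer_square(x, eps=1e-8):
    """HS(X) = (sum |x_i|)^2 / sum x_i^2, computed over one weight tensor."""
    return x.abs().sum() ** 2 / (x.pow(2).sum() + eps)

def regularized_loss(model, task_loss, lam_hs=1e-4):
    """Eq. 3: task loss plus the Hoyer-Square penalty summed over all weight matrices."""
    penalty = sum(hoyer_square(p) for p in model.parameters() if p.dim() > 1)
    return task_loss + lam_hs * penalty

# Example usage inside a training step:
model = torch.nn.Linear(20, 5)
x, y = torch.randn(16, 20), torch.randint(0, 5, (16,))
loss = regularized_loss(model, torch.nn.functional.cross_entropy(model(x), y))
loss.backward()
```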

For our experiments, we integrate the implementation provided by the authors at DeepHoyer GitHub Repository, Huanrui Yang (2020) into our codebase and use the pipeline described above. Note that Yang et al. (2020) also propose the Group Hoyer-Square regularizer, which is used for structural pruning. However, our experiments are only performed for element-wise pruning, and we therefore do not discuss this variant in our work.


2.2.3 SNIP

Lee et al. (2019) note that sparsity-inducing regularization techniques tend to require extensive parameter tuning. At the same time, Hessian-based methods (LeCun et al. (1990), Hassibi and Stork (1993)) make unrealistic assumptions (the Hessian being diagonal or positive definite) while also being computationally infeasible for larger networks. Another problem, which also affects the magnitude pruning criterion proposed by Han et al. (2015), is that these saliency criteria are not scale invariant and, as such, modifications to the architecture or the optimization method used can drastically affect the pruning decisions. Moreover, they require multiple iterations of pruning and finetuning, which can be cost prohibitive.

To address these issues, the authors propose a method for performing pruning at initialization that is independent of the magnitudes of the weights, coined SNIP. To this end, pruning is framed as a constrained optimization problem in which a separate connection parameter is introduced; for a dataset D of pairs of features and labels (x_i, y_i), a neural network parameterised by θ, a loss objective L, a desired number of nonzero weights κ and with || · ||_0 representing the L0 norm, we have:

\min_{c, \theta} \; L(c \odot \theta, D) = \min_{c, \theta} \; \frac{1}{n} \sum_{i=1}^{n} L(c \odot \theta, x_i, y_i) \quad \text{s.t.} \quad \|\theta\|_0 \le \kappa, \; \theta \in \mathbb{R}^m, \; c \in \{0, 1\}^m

The binary tensor c, of the same dimensionality as the weights θ, controls which parameters are used in the network, with c_j = 1 indicating that the weight θ_j is active and c_j = 0 indicating that it is pruned. One could then determine the importance of a weight by simply setting its corresponding value c_j to 0 and measuring the effect of this action on the loss function:

\Delta L_j(\theta, D) = L(\mathbf{1} \odot \theta, D) - L((\mathbf{1} - e_j) \odot \theta, D)

Here, e_j represents a vector that has the value of 1 at location j and zeros elsewhere. However, calculating ΔL_j(θ, D) requires performing a forward pass for all samples in the dataset; in order to select which weights to prune, one would have to repeat this process for all weights in the network, which can be computationally infeasible. Therefore, the authors propose a continuous relaxation of the binary constraint on c. With this formulation, one could estimate the value of ΔL_j by the derivative of L with respect to c_j, i.e. the effect of perturbing c_j by an infinitesimally small value δ. Formally, we have:

\Delta L_j(\theta, D) \approx \left. \frac{\partial L(c \odot \theta, D)}{\partial c_j} \right|_{c=\mathbf{1}} = \left. \lim_{\delta \to 0} \frac{L(c \odot \theta, D) - L((c - \delta e_j) \odot \theta, D)}{\delta} \right|_{c=\mathbf{1}} \qquad (4)

The value ∂L/∂c_j can be computed for all weights j simultaneously; therefore, a single forward pass is required under this continuous relaxation. With this in mind, the SNIP pruning criterion performs the following steps:

1. A desired number of nonzero weights κ is chosen

2. Initialize the network with weights θ

3. Compute ∂L/∂c_j for all weights j and rank them according to this quantity

4. Keep the top κ weights and prune the rest

5. Train the network regularly

Note that in this description we omit normalizing the saliencies of the weights, as done in the original work, since it does not affect the ranking. The authors suggest using a variance scaling initialization, such as the one proposed by Glorot and Bengio (2010). This ensures that the gradients ∂L/∂c_j do not saturate and maintain the same variance as they pass from layer to layer, such that SNIP can reliably compute the saliency irrespective of the network architecture.

The authors’ experiments show that SNIP performs comparably to other methods found in literature, being able to achieve high levels of sparsity with negligible deterioration in performance and can be applied to various architectures without modifications. Moreover, it is shown that a single mini-batch of samples is sufficient to estimate the saliencies of the weights in step 3 of the pipeline.
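A minimal sketch of this saliency computation on a single mini-batch is given below; it attaches a multiplicative mask variable c (initialized to 1) to each weight matrix of a simple fully-connected model and ranks connections by |∂L/∂c_j|. Names and structure are ours, and the sketch is simplified (biases are ignored) relative to the implementation used in our experiments.

```python
import torch

def snip_masks(model, x, y, kappa):
    """Compute SNIP saliencies |dL/dc_j| at c = 1 on one mini-batch and keep the top kappa weights."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    cs = [torch.ones_like(w, requires_grad=True) for w in weights]

    # Forward pass with masked weights c * theta (functional form, plain MLP, biases omitted).
    h = x
    for i, (w, c) in enumerate(zip(weights, cs)):
        h = torch.nn.functional.linear(h, w * c)
        if i < len(weights) - 1:
            h = torch.relu(h)
    loss = torch.nn.functional.cross_entropy(h, y)

    grads = torch.autograd.grad(loss, cs)               # dL/dc for every connection
    saliency = torch.cat([g.abs().flatten() for g in grads])
    threshold = torch.topk(saliency, kappa).values[-1]  # kappa-th largest saliency
    return [(g.abs() >= threshold).float() for g in grads]

# Example usage on a toy mini-batch:
model = torch.nn.Sequential(torch.nn.Linear(784, 300), torch.nn.ReLU(), torch.nn.Linear(300, 10))
x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
masks = snip_masks(model, x, y, kappa=5000)
```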

In our experiments, we use an unofficial implementation of SNIP from SNIP Unofficial GitHub Repository, Milad Alizadeh (2019) and directly apply it to the networks that we test on prior to training, using a single mini-batch of samples. Note that we use the initialization schema proposed by He et al. (2015) for all the networks that we test, which conforms to the authors' suggestion regarding weight initialization.

2.3 Networks used

All architectures discussed here are convolutional neural networks (Fukushima (1988), LeCun et al. (1998)).

Table 1: The number of total parameters and biases for each model. The percentage of biases relative to the total number of weights in the network is displayed in parentheses. The last column represents the number of floating point operations required for a forward pass on a single sample, assuming an input size of 32 × 32 × 3 for VGG19 and ResNet18, and 224 × 224 × 3 for DenseNet121.

Model       | Total params. | Num. biases    | FLOPs
VGG19       | 20M           | 11k (0.0005%)  | 555M
ResNet18    | 11.1M         | 4.8k (0.0004%) | 398M
DenseNet121 | 6.9M          | 41.8k (0.06%)  | 2.83B

In each network, the final classification layer has an output dimension equal to the number of classes in the dataset (10 in our case) and uses a softmax activation function in order to turn its output vector into a probability distribution over the predicted classes. The size and number of floating point operations performed on a single-instance batch for each model are presented in Table 1. To estimate the number of FLOPs, we use the utilities provided in Shrinkbench (GitHub Repository), Jose Javier Gonzalez Ortiz (2020), which contains the official implementation of the experiments performed by Blalock et al. (2020), and adapt them to accommodate residual and dense connections.

2.3.1 VGG19

Simonyan and Zisserman (2015) perform a series of experiments in order to determine how the depth of a network affects its performance. Among other things, the authors observe that the error rate goes down with network depth and saturates at 19 layers when trained on the ImageNet (Deng et al. (2009)) dataset, suggesting that the additional nonlinearities introduced by the activation functions in between layers benefit performance. Indeed, subsequent experiments confirm this conjecture; replacing pairs of convolutional layers of 3×3 filters with a single layer of 5×5 filters, thus maintaining the same receptive field, has shown worse performance.

The network employed by us has 19 total layers, with the first 16 performing convolution, as described by Simonyan and Zisserman (2015). For the convolutional layers, the authors use a filter size of 3 × 3, which is the smallest filter size capable of capturing spatial information, with a stride of 1. Max-pooling is also performed, with a window size of 2 × 2 and a stride of 2. The number of channels per layer increases with depth, starting at 64 and doubling after the 2nd, 4th, 8th and 12th convolutional layers. The last three layers of the network are two fully-connected layers of 4096 neurons each and, finally, the classification layer with softmax activation. The activation function used throughout the network is the rectified linear unit (Nair and Hinton (2010)). For computational efficiency, however, we use the implementation from Train CIFAR10 with PyTorch (GitHub Repository), Unknown Author (2017), which replaces the two fully connected layers with channel-wise average pooling, with a window size and stride of 1. The size of the classification layer is also modified to match the new dimensionality. Another modification in the implementation used in this work is the addition of batch normalization (Ioffe and Szegedy (2015)) after every convolution operation in order to preserve gradient flow.

2.3.2 ResNet18

He et al. (2016) explore whether gaining better performance is simply a matter of increasing network depth. Up to a certain point, this is proven to be true; beyond that, however, the authors have observed a sharp degradation in performance. While one may attribute this observation to the vanishing/exploding gradients problem (Hochreiter et al. (2001)), this is not the case, as the networks are equipped with methods known to alleviate this issue, such as batch normalization (Ioffe and Szegedy (2015)) or Kaiming weight initialization (He et al. (2015)). The authors coin this the degradation problem, and explain it via the following example: a larger network could be constructed from a smaller network by simply adding layers on top which are made to learn the identity mapping, indicating that, at least in theory, a larger model should perform just as well as a smaller one; however, for large systems, even learning an identity mapping is difficult. To address this, they propose the residual learning framework, in which sequences of layers are modified to fit a residual mapping. Specifically, the output x_l = H_l(x_{l−1}) at layer l is modified to x_l = H_l(x_{l−1}) + x_{l−1}, i.e. it references the identity mapping. This is achieved in practice through shortcut connections (Bishop et al. (1995)), whereby the output of a layer is added to the input of a subsequent layer (other than the one immediately after it). Here, H_l(·) can be a composition of multiple layers and/or intermediate operations, such as pooling, activation functions and batch normalization. The authors hypothesise that this residual mapping is easier to optimise. For instance, if it were optimal for a sequence of layers to perform the identity mapping, their weights could simply be updated to 0, making the network rely purely on the shortcut connection. While this is an extreme example, referencing the identity mapping can still ease optimization when there exists an optimal solution close to it. The authors experiment with networks of varying depths, confirming that for regular networks, with no residual connections, the test accuracy is worse when the depth is too large. Repeating the experiments with residual networks yields opposite results: deeper networks outperform their shallow counterparts on both training and testing data, indicating that residual learning alleviates the degradation problem.

We use an 18-layer convolutional network with residual connections for our experiments, detailed in the original authors' work and implemented in Train CIFAR10 with PyTorch (GitHub Repository), Unknown Author (2017). It is structured in residual blocks of 2 convolutional layers each, with a skip connection from the input of the block to its output, and ReLU (Nair and Hinton (2010)) activations after each layer. There are 8 such blocks in total, with the number of channels doubling every 2 blocks, starting from 64. A single convolutional layer of 7 × 7 filters followed by a max-pooling operation of window size 3 × 3 is placed before the first block, with channel-wise average pooling followed by a linear layer and softmax placed after the last block for classification. Max-pooling is performed in the 3rd, 5th and 7th blocks after the first convolution operation. As with VGG19, all filters in the residual blocks are of size 3 × 3 and down-sampling via max-pooling uses a window of 2 × 2. Batch normalization is performed after every convolutional layer.
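To make the residual formulation x_l = H_l(x_{l−1}) + x_{l−1} concrete, below is a minimal PyTorch sketch of one such basic block (3 × 3 convolutions with batch normalization and a skip connection); it is a simplified illustration that omits down-sampling and any projection on the shortcut.

```python
import torch
from torch import nn

class BasicResidualBlock(nn.Module):
    """x -> H(x) + x, where H is conv-BN-ReLU-conv-BN with the same number of channels."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(h + x)   # the shortcut connection references the identity mapping

block = BasicResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))   # output shape matches the input: (1, 64, 32, 32)
```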

2.3.3 DenseNet121

Building on the observations of He et al. (2016), namely that deeper networks are easier to train with residual connections, Huang et al. (2017) propose the dense convolutional network. That is, they introduce shortcut connections from each layer to all subsequent layers. A key difference from residual networks arises in the way that the shortcut connections are treated. The authors argue that in residual networks, there is no explicit distinction between the feature maps received from a shortcut connection and the ones from the previous layer, since they are merged through summation. Therefore, they propose concatenating them instead. Formally, the output x_l of layer l becomes x_l = H_l([x_0, x_1, . . . , x_{l−1}]), where [·, ·] represents the concatenation operation. To ensure that the concatenated feature maps share the same dimension, the layers of dense convolutional networks are grouped in blocks; layers that form a block output feature maps of the same dimensionality, and shortcut connections are applied only within the same block. This framework of dense connectivity allows for feature reuse across the network; a layer receives as input the feature maps generated by all previous layers from its block. Due to this, the layers can be built to be narrow (i.e. having a small number of channels), making the network more parameter efficient and less susceptible to overfitting. This is verified experimentally by the authors, who demonstrate that densely connected networks achieve comparable or greater performance than networks with up to 10 times as many parameters.

For our experiments, we use a densely connected convolutional network of 121 layers, as presented in the original paper and implemented in the PyTorch Torchvision framework (Paszke et al. (2019), Torch Contributors (2019)). There are in total 4 dense blocks in the network, consisting of multiple pairs of convolutional layers with filter sizes of 1 × 1 and 3 × 3, respectively, with batch normalization and ReLU activations inserted before every convolution. Every such pair of convolutional layers outputs a feature map of 32 channels, and receives as input a feature map of 32 · (l − 1) channels, where l − 1 is the number of pairs that precede it; this is an effect of the dense connectivity pattern. The number of pairs per block is, in order, 6, 12, 24 and 16. A transition layer is inserted after every dense block, consisting of a 1 × 1 convolution followed by downsampling via 2 × 2 max-pooling with a stride of 2. Preceding the dense blocks are a 7 × 7 convolution and a 3 × 3 max-pooling, both of stride 2. Finally, classification is performed on the output feature map of the last dense block via channel-wise average pooling and a fully connected layer.

3 Method

We are interested in detecting during training whether 0 is a point of local optimum for a weight. Doing so would allow us to simultaneously prune that weight and set it at a point where the loss is minimized, without having to train until convergence multiple times. Additionally, we would like to construct our pruning method such that it avoids the other issues discussed in Section 1, i.e. it is computationally tractable, requires little parameter tuning and allows for an exact level of sparsity to be specified.

In Section 3.1 we present a general method used to determine points of optimality by leveraging the behavior of weights when near such points. In Section 3.2 we show how a specific instance of this test can be applied to pruning, forming the basis of our proposed method.

3.1 Motivation

Mini-batch stochastic gradient descent (Bottou (1998)) is the most com-monly used optimization method in machine learning. Given a mini-batch


of B randomly sampled training examples consisting of pairs of features and labels {(x_b, y_b)}_{b=1}^{B}, a neural network parameterised by a weight vector θ, a loss objective L(θ, x, y) and a learning rate η, the update rule of stochastic gradient descent is defined as:

$$g^t = \frac{1}{B}\sum_{b=1}^{B} \nabla_{\theta^t} L(\theta^t, x_b, y_b)$$

$$\theta^{t+1} \leftarrow \theta^t - \eta g^t$$

In other words, a subset of samples is used to estimate the empirical gradient over the entire dataset.
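As an illustrative sketch of the update rule above (the loss function and tensor shapes are made up for the example):

```python
import torch

theta = torch.randn(100, requires_grad=True)                 # weight vector θ
eta = 0.1                                                    # learning rate η
x_batch, y_batch = torch.randn(32, 100), torch.randn(32)     # mini-batch of B = 32 samples

loss = ((x_batch @ theta - y_batch) ** 2).mean()             # some loss L(θ, x, y)
loss.backward()                                              # theta.grad now holds g^t

with torch.no_grad():
    theta -= eta * theta.grad                                # θ^{t+1} ← θ^t − η g^t
    theta.grad.zero_()
```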

Given a weight θ_j^t, one could consider its possible values as being split into two regions, with a locally optimal value θ_j^* as the separation point. Depending on the value of the gradient and the learning rate, the updated weight θ_j^{t+1} will lie in one of the two regions. That is, it will either get closer to its optimal value while remaining in the same region as before, or it will be updated past it and land in the opposite region. We term these two phenomena under- and over-shooting, and provide an illustration in Fig. 1. Mathematically, they correspond to:

$$\text{under-shooting: } \eta|g_j^t| < |\theta_j^t - \theta_j^*| \qquad \text{over-shooting: } \eta|g_j^t| > |\theta_j^t - \theta_j^*|$$

Indeed, it is also possible that a weight gets updated exactly to a local optimum, i.e. η|g_j^t| = |θ_j^t − θ_j^*|. However, it is highly unlikely that this equality holds for the majority of weights, since the learning rate is often set empirically and is shared across all weights. Therefore, we do not take this case into account in our analysis.

With the behavior of under- and over-shooting in mind, and under the assumption that mini-batches can reliably estimate the empirical gradient, one could construct a heuristic-based test in order to evaluate whether a weight has a local optimum at a specific point, without needing the network to have reached convergence:

1. For a weight θ_j, choose a value φ_j for which the test is conducted.

2. Train the model regularly and record the occurrences of under- and over-shooting around φ_j after each step of SGD.

3. If the number of such occurrences exceeds a threshold κ, conclude that θ_j has a local optimum at φ_j, i.e. θ_j^* = φ_j.


Figure 1: Over- and under-shooting illustrated. The vertical line splits the x-axis into two regions relative to the (locally) optimal value θ_j^*. Over-shooting corresponds to a weight being updated such that its new value lies in the opposite region (blue dot), while under-shooting occurs when the updated value is closer to the optimal value but stays in the same region (green dot).

We coin this method the aim test.
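A rough sketch of this heuristic for a single weight is given below; the bookkeeping structure and the default threshold are our own illustrative choices, not a prescribed implementation.

```python
def aim_test_step(theta_old, theta_new, phi, counter, kappa=50):
    """Record under-/over-shooting of one weight around the tested value phi and
    report whether phi can (so far) be accepted as a local optimum."""
    # Over-shooting: the update carried the weight past phi into the opposite region.
    if (theta_old - phi) * (theta_new - phi) < 0:
        counter["shots"] += 1
    # Under-shooting: the weight moved closer to phi but stayed in the same region.
    elif abs(theta_new - phi) < abs(theta_old - phi):
        counter["shots"] += 1
    return counter["shots"] >= kappa
```

Here, counter is a per-weight dictionary such as {"shots": 0}, maintained across SGD steps and reset whenever a new value φ_j is tested.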

Previous works on neural network pruning have demonstrated that neural networks can tolerate high levels of sparsity with negligible deterioration in performance (Han et al. (2015), Molchanov et al. (2017), Frankle and Carbin (2019), Liu et al. (2020)). It is therefore not unreasonable to assume that for a large number of weights, there exist local optima at exactly 0, i.e. θ_j^* = 0. One could then use the aim test to detect these weights and prune them. Importantly, when using the aim test for φ_j = 0, the two regions around the tested value are the set of negative and the set of positive real numbers, respectively. Checking for over-shooting then becomes equivalent to testing whether the sign of θ_j has changed after a step of SGD, while under-shooting can be detected when a weight has been updated to a smaller absolute value and has retained its sign, i.e. (|θ_j^{t+1}| < |θ_j^t|) ∧ (sgn(θ_j^t) = sgn(θ_j^{t+1})).
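In tensor form (a small sketch of these two checks at φ_j = 0, not code from the thesis):

```python
import torch

theta_old, theta_new = torch.randn(1000), torch.randn(1000)   # weights before/after one SGD step

over_shoot = torch.sign(theta_old) != torch.sign(theta_new)    # sign flip
under_shoot = (theta_new.abs() < theta_old.abs()) & (torch.sign(theta_old) == torch.sign(theta_new))
```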

However, under-shooting can be problematic; for instance, a weight could be updated to a lower magnitude while at the same time being far from 0. This can happen when a weight is approaching a non-zero local optimum, an occurrence which should not contribute towards a positive outcome of the aim test. By positive outcome, we refer to determining that φ_j = 0 is indeed a local optimum of θ_j. A similar problem can occur for over-shooting, where a weight receives a large update that causes it to change its sign but not lie in the vicinity of 0. These scenarios, which we will refer to as deceitful shots going forward, are illustrated in the general case, where φ_j can take any value, in Fig. 2a and Fig. 2b.


(a) Deceitful observations of under-shooting. (b) Deceitful observations of over-shooting.

Figure 2: In the plots above, the dotted vertical line represents the value at which the aim test is conducted (i.e. a value we would like to determine as a local optimum or not), while the red dot represents the value of a true local optimum. When testing a value which is not a locally optimal value, φ_j ≠ θ_j^*, over- or under-shooting around φ_j can be merely a side-effect of that weight getting updated towards its true optimum θ_j^*. These observations would then contribute towards the aim test returning a false positive outcome, i.e. φ_j = θ_j^*, and are therefore deceitful shots. Whether we observe an over-shoot or an under-shoot in this case depends on the relationship between φ_j and θ_j^*. In (a), we have φ_j > θ_j^*, where, if the hypothesised and true optimum are sufficiently far apart, we always observe an under-shoot, regardless of how the weight gets updated with respect to its true optimum. Conversely, in (b), we have φ_j < θ_j^* and always observe over-shooting.

Following, we make two observations which help circumvent this problem.

Reducing the impact of deceitful shots. Firstly, one could reduce the impact of deceitful shots by also taking into account the distance of the weight to the hypothesised local optimum, i.e. |θ_j − φ_j|, when conducting the aim test. In other words, the occurrences of under- and over-shooting should be weighted inversely proportionally to this quantity, even if their number would otherwise exceed κ.

Reducing the number of deceitful shots. Our second observation is that by ignoring updates which are not in the vicinity of φ_j, the number of deceitful shots is reduced. In doing so, one could also simplify the aim test; with a sufficiently large perturbation to θ_j, an update that might


otherwise cause under-shooting can be made to cause over-shooting. Adding a perturbation of ±ε is, in effect, inducing a boundary around the tested value, [φ_j − ε, φ_j + ε]; all weights that get updated such that they fall into that boundary will be said to over-shoot around φ_j. With this framework, checking for over-shooting is sufficient; updates that under-shoot and are within ε of the tested value are made to over-shoot (Fig. 3a), and updates which under-shoot but are not in the vicinity of φ_j, i.e. deceitful shots, are now not recorded at all (Fig. 3b). This can also be seen as restricting the aim test to only operate within a vicinity around φ_j. Following, we piece together the ideas discussed so far in order to create a criterion for identifying and pruning weights that have locally optimal values at 0.

3.2 FlipOut: applying the aim test for pruning

We present the two components necessary for applying the aim test to pruning. Specifically, in Section 3.2.1 we propose a saliency criterion that takes into account the number of times over-shooting has occurred as well as the distance of a weight from its hypothesised local optimum, following the observations made in Section 3.1 regarding deceitful shots. Then, in Section 3.2.2 we present a scheme for adding perturbation to the weight tensor θ.

3.2.1 Determining which weights to prune

Pruning weights that have local optima at or around 0 can yield a high level of sparsity with minimal degradation in accuracy. Han et al. (2015) use the magnitude of the weights once the network has converged as a criterion; that is, the weights with the lowest absolute value (i.e. closest to 0) get pruned. The aim test can be used to detect whether a point represents a local optimum for a weight and can be applied before the network reaches convergence, during training. For pruning, one could then apply the aim test simultaneously for all weights with φ_j = 0. We propose framing this as a saliency score; at time step t, the saliency τ_j^t of a weight θ_j^t is:

$$\tau_j^t = \frac{|\theta_j^t|^p}{\text{flips}_j^t} \tag{5a}$$

$$\text{flips}_j^t = \sum_{i=0}^{t-1} \big[\operatorname{sgn}(\theta_j^i) \neq \operatorname{sgn}(\theta_j^{i+1})\big] \tag{5b}$$

Here, [·] represents the Iverson bracket, and the values θ^0 are the neural network's weights at initialization. With perturbation added into the weight vector, it is enough to check for over-shooting, which, for φ_j = 0, is equivalent to counting the number of sign flips a weight has undergone during the training process (Eq. 5b); a scheme for adding such perturbation is described in Section 3.2.2. In Equation 5a, the numerator |θ_j^t|^p represents the proximity of the weight to the hypothesised local optimum, |θ_j^t − φ_j|^p. However, since we have φ_j = 0 for all weights, this becomes equivalent to the weight's magnitude. The hyperparameter p controls how much this quantity is weighted relative to the number of sign flips. When p = 0, only the number of sign flips is taken into account, while as p → ∞, this criterion becomes equivalent to magnitude pruning.
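A minimal sketch of how the flip counts of Eq. 5b and the saliency of Eq. 5a could be tracked for a weight tensor (the class interface and the guard against division by zero are our own assumptions, not the thesis implementation):

```python
import torch

class FlipCounter:
    """Track sign flips per weight and compute the saliency score of Eq. 5."""

    def __init__(self, theta_init, p=1.0):
        self.prev_sign = torch.sign(theta_init)
        self.flips = torch.zeros_like(theta_init)
        self.p = p

    def update(self, theta):
        sign = torch.sign(theta)
        self.flips += (sign != self.prev_sign).float()   # Eq. 5b: accumulate sign flips
        self.prev_sign = sign

    def saliency(self, theta):
        # Eq. 5a: |θ|^p divided by the flip count; weights that flip often and sit
        # close to 0 receive a low score and are pruned first. The clamp avoids
        # division by zero for weights that never flipped (an implementation choice).
        return theta.abs().pow(self.p) / self.flips.clamp(min=1.0)
```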

There are two common ways of determining how many weights to prune when using a saliency score. For instance, one could remove those weights whose saliency falls below a user-defined value κ, similar to the method of Han et al. (2015). With this approach, however, it is not possible to target a final sparsity directly. As such, we adopt the strategy of Frankle and Carbin (2019), i.e. pruning a percentage of the remaining weights each time. Given m, the number of times pruning is performed, r, the percentage of remaining weights which are removed at each pruning step, k, the total number of training steps, d_θ, the dimensionality of the weights, and ||·||_0, the L_0-norm, the resulting sparsity s of the weight tensor after training the network is simply:

$$s = 1 - \frac{\|\theta^k\|_0}{d_\theta} = 1 - (1 - r)^m \tag{6}$$
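For example (with illustrative numbers, not values taken from our experiments), Eq. 6 can be inverted to find the per-step pruning rate that reaches a target sparsity in a fixed number of pruning steps:

```python
target_sparsity = 0.996   # fraction of weights to remove (illustrative)
m = 8                     # number of pruning steps (illustrative)

# Invert s = 1 - (1 - r)^m to obtain the per-step pruning rate r.
r = 1.0 - (1.0 - target_sparsity) ** (1.0 / m)
print(round(r, 3))        # ~0.499: prune roughly half of the remaining weights each time
```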

The parameters m and r can then be selected such that a specific sparsity is achieved, allowing the user to have more fine-grained control. This final sparsity can be chosen depending on the desired level of speedup and/or how much degradation in performance is tolerable. Moreover, the combination of parameters m and r that generates a specific sparsity level is not unique and can be tweaked according to circumstances (i.e. one could choose to prune more often, but fewer parameters at a time, or vice-versa).

3.2.2 Perturbation through gradient noise

Adding gradient noise has been shown to be effective for optimization (Neelakantan et al. (2015), Welling and Teh (2011)). Specifically, it has been found to lower the training loss and reduce overfitting by encouraging exploration of the parameter space, thus effectively acting as a regularizer. While these benefits are helpful, our motivation for its usage stems from allowing the aim test to be performed in a simpler manner; weights that under-shoot and are in the vicinity of the value tested as an optimum will pass over the axis due to the injected noise,


(a) Under-shooting can become over-shooting by adding perturbation. (b) Ignoring deceitful shots.

Figure 3: (a) All weights that under-shoot but are within ε of φ_j will be made to over-shoot. (b) When testing at a value which is not a local optimum for θ_j, i.e. φ_j ≠ θ_j^*, and adding a perturbation ε to θ_j, not taking under-shooting into account means that if the weight gets updated such that it does not lie in the boundary around φ_j induced by the perturbation, an event that would otherwise contribute to a false positive outcome for the aim test will not be recorded, so the likelihood of rejecting φ_j as an optimum increases.

thus making the check for over-shooting sufficient.

We have seen in Section 3.1 that an update of SGD at time step t will cause a weight θ_j^t to over-shoot around an optimum θ_j^* if the inequality η|g_j^t| > |θ_j^t − θ_j^*| holds. As such, methods that affect the magnitudes of the parameters or change the mechanics of the gradient computation can have an effect on the number of sign flips. Examples of these include weight decay, the L_1 regularizer, the choice of initialization strategy, or using alternative optimization techniques such as momentum, Adam (Kingma and Ba (2015)) or RMSprop (Tieleman and Hinton (2012)). This also holds true for the choice of learning rate, in that a large learning rate can cause a higher number of sign flips, and vice-versa. To make FlipOut more robust to methods that modify the magnitudes of the parameters, the variance of the noise distribution is scaled dynamically by the L_2 norm of the parameters of each layer, θ^l. We also normalize it by the number of weights and introduce a hyperparameter λ which acts as a multiplier. For a layer l with dimensionality d_l, the gradient for the weights in that layer used


by SGD for updates is modified to:

$$\hat{g}^{t,l} \leftarrow g^{t,l} + \lambda \epsilon^{t,l} \tag{7a}$$

$$\epsilon^{t,l} \sim \mathcal{N}(0, \sigma_{t,l}^2) \tag{7b}$$

$$\sigma_{t,l}^2 = \frac{\|\theta^{t,l}\|_2^2}{d_l} \tag{7c}$$

As training is performed, it is desirable to reduce the amount of added noise so that the network can successfully converge. Previous works do this through annealing schedules, decaying the variance of the Gaussian distribution proportionally to the current time step. Under our proposed formulation, however, explicitly using an annealing schedule is not necessary. By pruning weights, the numerator in Eq. 7c decreases, while the denominator remains constant. This ensures that annealing is induced automatically through the pruning process, and there is no need for manually constructing a schedule.

Note that we have also experimented with other recipes for the variance of the Gaussian distribution (Eq. 7c) which also fulfill the property of automatic annealing, such as using the L_1 norm instead of the L_2 norm, or raising the numerator to the power of 1 and normalizing by √d_l instead; its current form, however, has given us the best results overall.
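A sketch of this noise injection for a single layer is given below (the function name and its standalone form are our own; in practice it would be applied to every layer's gradient before the SGD update):

```python
import torch

def add_flipout_noise(grad, theta, lam):
    """Perturb a layer's gradient as in Eq. 7: zero-mean Gaussian noise whose variance
    is the squared L2 norm of the layer's weights divided by the number of weights."""
    d_l = theta.numel()
    sigma2 = theta.pow(2).sum() / d_l               # Eq. 7c; shrinks as weights are pruned to 0
    eps = torch.randn_like(grad) * sigma2.sqrt()    # Eq. 7b
    return grad + lam * eps                         # Eq. 7a
```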

Pruning periodically throughout training according to the saliency score in Eq. 5, in conjunction with adding gradient noise according to Eq. 7, forms the FlipOut pruning method, which is summarised in Algorithm 1.

Algorithm 1 FlipOut

1: procedure FlipOut(k, r, m)                                  ▷ num. updates, pruning rate & pruning frequency
2:     Initialize θ^0
3:     for t = 0 to k − 1 do
4:         g^t ← (1/B) Σ_{b=1}^{B} ∇_{θ^t} L(θ^t, x_b, y_b)    ▷ compute gradient over mini-batch
5:         ĝ^{t,l} ← g^{t,l} + λ ε^{t,l} for all l              ▷ add gradient noise to each layer (Eq. 7)
6:         θ^{t+1} ← θ^t − η ĝ^t                                ▷ update using the noisy gradients
7:         if t mod m = 0 then                                  ▷ check if it is time to prune
8:             compute τ_j^{t+1} for all j                      ▷ compute saliency scores (Eq. 5)
9:             rank weights θ^{t+1} by saliency
10:            prune the bottom r% of the remaining weights
11:        end if
12:    end for
13:    return θ^k                                               ▷ return the resulting sparse weight tensor
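To tie the pieces together, one possible PyTorch realisation of Algorithm 1 is sketched below; it reuses the FlipCounter and add_flipout_noise sketches from earlier, prunes by zeroing weights with a binary mask, handles only weight matrices and filters for brevity, and uses made-up default hyperparameters.

```python
import torch

def flipout_train(model, loader, loss_fn, k, r, m, eta=0.1, lam=1e-3, p=1.0):
    """Sketch of Algorithm 1: plain SGD with gradient noise and periodic pruning."""
    params = [w for w in model.parameters() if w.dim() > 1]          # prunable weight tensors
    counters = [FlipCounter(w.detach().clone(), p) for w in params]
    masks = [torch.ones_like(w) for w in params]
    data = iter(loader)

    for t in range(k):
        try:
            x, y = next(data)
        except StopIteration:
            data = iter(loader)
            x, y = next(data)

        model.zero_grad()
        loss_fn(model(x), y).backward()                              # mini-batch gradient g^t

        with torch.no_grad():
            for w, c, mask in zip(params, counters, masks):
                noisy_grad = add_flipout_noise(w.grad, w, lam)       # Eq. 7
                w -= eta * noisy_grad * mask                         # SGD step on surviving weights
                w *= mask                                            # keep pruned weights at 0
                c.update(w)                                          # record sign flips (Eq. 5b)

            if t > 0 and t % m == 0:                                 # time to prune
                scores = torch.cat([
                    torch.where(mask.bool(), c.saliency(w),
                                torch.full_like(w, float("inf"))).flatten()
                    for w, c, mask in zip(params, counters, masks)
                ])
                n_remaining = sum(int(mask.sum()) for mask in masks)
                n_prune = max(1, int(r * n_remaining))
                threshold = scores.kthvalue(n_prune).values          # saliency cutoff for the bottom r%
                for w, c, mask in zip(params, counters, masks):
                    mask.mul_((c.saliency(w) > threshold).float())
                    w.mul_(mask)
    return model
```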
