
MSc Artificial Intelligence
Master Thesis

Compression for Deep Learning:
Pruning via Iterative Ranking of Sensitivity Statistics

by
Stijn Lucas Verdenius
10470654

July 3, 2020
48 EC, November 2019 - June 2020

Supervisors:
Dr. Patrick Forré
MSc. Maarten Stol

Assessors:
Dr. Patrick Forré
Prof. Dr. Max Welling

University of Amsterdam

Amsterdam Machine Learning Lab

In cooperation with:
BrainCreators B.V.

Special Acknowledgements:

Andrei Apostol, Bjarne de Jong, Ioannis Gatopoulous, Jaap Verdenius, Konstantin Todorov, Maarten Stol, Patrick Forré, Philipp Ollendorff,


Abstract

With the introduction of SNIP [Lee et al., 2019], it has been demonstrated that modern neural networks can effectively be pruned before training. Yet, its Sensitivity Criterion has since been criticised for not propagating training signal properly, or even disconnecting layers. As a remedy, GraSP [Wang et al., 2020] was introduced, compromising on simplicity. However, in this work it is shown empirically that by applying the Sensitivity Criterion iteratively in smaller steps - still before training - we can improve its performance without a difficult implementation. As such, 'SNIP-it' is introduced. In addition, it is demonstrated how the algorithm can be applied to both structured and unstructured pruning, before and/or during training, therewith achieving state-of-the-art sparsity-performance trade-offs, while already providing the computational benefits of pruning in the training process from the start. Furthermore, the discussed methods are rigorously evaluated on more than one dimension, by looking at robustness to overfitting, disconnection and adversarial attacks, as well as traditional metrics.


Contents

1 Introduction
  1.1 Contributions & Scope
  1.2 Outline
2 Background
  2.1 An Overview on Neural Network Pruning
    2.1.1 Definitions
    2.1.2 Motivations
    2.1.3 Approaches from a Bird's-eye View
  2.2 Literature Review
    2.2.1 Magnitude as Saliency Criterion
    2.2.2 Dropout and Sparsity
    2.2.3 Sparsity Inducing Priors
    2.2.4 Training Sparsely Initialised Networks
    2.2.5 Concurrent Advances in Structured Pruning
    2.2.6 Criticism on Pruning
  2.3 Related Work
3 Methodology
  3.1 Revisiting the Sensitivity Criterion
    3.1.1 Connection to Function Elasticity
    3.1.2 Elasticity Signal Under the Microscope
  3.2 Pruning via Iterative Ranking
    3.2.1 Hypothesis: Iterative Ranking as Remedy to Disconnection
    3.2.2 Structured Pruning with Sensitivity Statistics
    3.2.3 Combining Structured and Unstructured Pruning
4 Experiments
  4.1 Experimental Setup
  4.2 Baselines & Tuning
  4.3 Evaluation
5 Results & Analysis
  5.1 Structured Pruning Before Training
    5.1.1 Sparsity-performance Trade-off
    5.1.2 Reduction of Computational Effort
    5.1.3 RAM Footprint
  5.2 Unstructured Pruning & Robustness
    5.2.1 Sparsity-performance Trade-off
    5.2.2 Disk Footprint
    5.2.3 Connectivity
    5.2.4 Adversarial Robustness
    5.2.5 Overfitting
  5.3 Combining Structured and Unstructured Pruning
6 Discussion
  6.1 Reflection
    6.1.1 The Case for Structured Pruning Before Training
    6.1.2 Iterative Sensitivity Pruning as Competitive Pruning Algorithm
    6.1.3 Iterative Ranking as Remedy to Disconnection
  6.2 Conclusion
  6.3 Future Work
  6.4 Broader Impact
References
A Appendix: Deep Learning in Short
B Appendix: Derivations
C Appendix: Abandoned Approaches
D Appendix: Infrastructure
E Appendix: Network Architectures


1 Introduction

Wouldn’t it be great if you could considerably speed-up the training of your neural network? Or what about, storing it on the smallest devices? In the last decade, deep learning has undergone a period of rapid development. Since AlexNet’s revolution [Krizhevsky et al.,2012], advances have been noticeable on many fronts; for instance in generative image modeling [Kingma and Welling,2013,Goodfellow et al.,2014] and translation [Vaswani et al.,2017]. Be that as it may, deep learning comes with an array of practical drawbacks. Amongst others, its training process is notably time-consuming, computationally expensive and data-hungry.

It doesn’t stop there. That is, its models operate like black boxes and are particularly sensitive to hyperparame-ters during training, making the tuning-process a necessity. This in turn multiplies the required computation considerably. Besides being rather inconvenient for the developer of these networks, this need for compute greatly harms the environment [Strubell et al.,2019]. It has been estimated that training a large network alike Transformer [Vaswani et al.,2017] is equivalent to approximately 55 times the yearly emissions of an average person, provided Neural Architecture Search (NAS) and hyper-parameter tuning are employed on top. At the same time, shallow-learning NLP only requires a fraction of that [Strubell et al.,2019]. Granted, only resourceful companies like Google and Facebook really train in this order of magnitude [Strubell et al., 2019]. Nonetheless, with the ever-increasing complexity of deep learning models, this magnitude of energy consumption may very well be more common in the future.

Moreover, under hardware limitations, the high-dimensional parameter space can be a real obstacle. Think for example of making computer vision (CV) applications that use deep learning, such as augmented reality or 'Face-swap', publicly available on mobile devices, which inherently have limited storage space, processing power and battery life. Another example is the real-time processing of data in the field of radio astronomy, where data streams are often too abundant to ever be stored on disk and analysed in full after the moment of observation.

All these problems, even if never fully solved, could be relieved significantly if we find a way to reduce the size of neural networks. Literature has since shown that modern neural networks are heavily overparametrised [Denil et al., 2013, Dauphin and Bengio, 2013, Ba and Caruana, 2014, Han et al., 2015], hypothesising that a great deal of the parameter space that these networks are equipped with is actually redundant. In response, the field of model compression suggests networks can be pruned of their redundant model components, with promises of reduced storage and computational effort.

Modern pruning techniques are quite successful at finding these sparse solutions and can prune over 95% of the parameters in a network whilst leaving raw performance intact - e.g. [Han et al., 2015, Wang et al., 2020, Frankle et al., 2019b, Evci et al., 2019, You et al., 2019]. They often operate by the 'train-prune-finetune' pipeline [Han et al., 2015, Li et al., 2016, Liu et al., 2019], which proceeds as follows: (a) take a pre-trained network and (b) prune some model components, after which (c) some fine-tuning is performed [Han et al., 2015, Li et al., 2016]. Although these remarkable sparsity rates may give the impression that the problem is already solved, the following three observations exhibit why they actually provide a skewed perspective:

1. The ‘train-prune-finetune’ routine still requires training a dense network first, which accounts for a great deal of computation. On the upside, the weights are already close to their final values this way. However, pruning efforts have no contribution to the training phase. Some methods fight this by pruning during training [Louizos et al.,2018,Frankle et al.,2019b,Evci et al.,2019,You et al.,2019], which reduces the problem somewhat. Preferably, we would even prune before training, so that we can exploit the sparse setting from the start. Unfortunately, this is a harder problem to solve. Recently, [Lee et al.,2019,Wang et al.,2020] started tackling this problem, yet nevertheless only aimed at pruning individual weights, which brings us to the next point;

2. Pruning methods frequently only target individual weights - i.e. 'unstructured pruning' - which is more flexible and results in higher sparsity rates. However, down the line this leads to sparse weight matrices, which require a format such as 'Compressed Sparse Row' (CSR) to be stored efficiently [Buluç et al., 2009, Saad, 2003] and which only speed up computation with dedicated libraries [Han et al., 2016, Liu et al., 2019]. Fortunately, research concerning 'structured pruning' exists as well - i.e. pruning entire rows, columns or filters of a weight-tensor. Be that as it may, those methods are limited to pruning during or after training in the literature, making the benefits available for inference only.

3. Evaluation of these compressed sub-networks is usually performed along the dimensions of classifier accuracy, computational cost and sparsity. However, this only validates a pruned solution shallowly. One can imagine that more aspects get affected in the process - for example, robustness to adversarial attacks, overfitting and data-imbalance in the training data. Of course, some works do actually discuss these themes in relation to compression [Cosentino et al., 2019, Sehwag et al., 2020, Lee et al., 2019, Arora et al., 2018]. Even so, such evaluation is often overlooked in the literature.

In the recently proposed 'Lottery Ticket Hypothesis' [Frankle and Carbin, 2019], it is established that networks contain sub-networks that are trainable in isolation to the same performance as the parent network - thereby demonstrating that it is possible to compress networks early on. Subsequently, SNIP [Lee et al., 2019] was the first to endeavour to actually prune at initialisation, reaching notable sparsity rates using a simple 'Sensitivity Criterion' - without resorting to difficult training schemes. Practically, SNIP employs the gradients obtained from a single batch to rank weights by their Sensitivity Criterion, by means of a single forward and backward pass, before training commences. They then cut off at a desired sparsity κ and start training without further disturbance. Their criterion was thereafter criticised for its unfaithful gradient propagation and occasional disconnection of layers [Lee et al., 2020, Wang et al., 2020, Hayou et al., 2020]. To mitigate this, orthogonal initialisation [Lee et al., 2020] and 'GraSP' [Wang et al., 2020] were suggested, where the latter replaces the gradient-magnitude product of SNIP with a Hessian-gradient-magnitude product as a better approximation of the sensitivity signal [Wang et al., 2020], yet compromises on simplicity.

This leaves us with the current state of affairs, where through the Lottery Ticket Hypothesis [Frankle and Carbin, 2019], SNIP [Lee et al., 2019] and GraSP [Wang et al., 2020], the first steps have been made towards pruning before training. However, these methods still suffer from teething problems [Lee et al., 2020, Wang et al., 2020, Hayou et al., 2020]. Moreover, they operate in the unstructured domain only, which at the time of writing cannot provide sufficient speedups [Han et al., 2016, Liu et al., 2019]. Yet, in the domain of structured pruning, no methods endeavour pruning before training - to the best of the writer's knowledge at the time of writing.

1.1 Contributions & Scope

This thesis builds upon the aforementioned research by examining both structured and unstructured pruning, before and during training, therewith taking the next steps for (un)structured pruning before training and also improving the Sensitivity Criterion. This thesis' contributions are five-fold and can be summarised as follows:

1. Sections 3.2.1 & 5.2 will introduce 'SNIP-it', a method that uses the same criterion as SNIP [Lee et al., 2019] but applies its sensitivity statistics iteratively in smaller steps, therewith greatly improving performance and robustness in the high-sparsity regime, yet without compromising on simplicity.

2. Correspondingly, it will be hypothesised in Sections 3.1 & 3.2.1 that it is specifically because of its iterative nature that SNIP-it yields different architectures, which perform better and are less likely to disconnect. This hypothesis will further be tested by means of experiments in Sections 5.2.1 & 5.2.3.

3. Then, Sections 3.2.2 & 5.1 will demonstrate how SNIP-it can be applied to structured pruning as well, yielding a method - 'SNAP-it' - that doesn't require a pre-trained network or a complex training schedule to obtain a competitive structured pruning algorithm. As such, to the best of the writer's knowledge at the time of writing, this thesis will introduce the first structured pruning method that operates before training.

4. Additionally, Section 3.1 will show how the Sensitivity Criterion is proportional to function elasticity [Sydsaeter and Hammond, 1995, Zelenyuk, 2013] and discuss the ranking and pruning of weights and nodes in a combined fashion by means of elasticity in Sections 3.2.3 and 5.3.

5. Finally, evaluation of sparse solutions will therefore be performed on traditional measures for model compression - such as accuracy and sparsity - as well as on progressive ones - e.g. robustness and connectivity.

To limit the scope, this thesis will only evaluate established computer vision classifiers on well-known benchmarks. Further, this thesis will only consider structured and unstructured pruning - two of many compression techniques; others, such as Quantisation and Factorisation, fall outside the scope. It will also only introduce methods that prune before or during training, in order to take the next step in solving the introduced problems - although they will be evaluated against baselines that prune after training as well.

1.2 Outline

In Section 2, Neural Network Pruning will be defined and discussed in a general fashion and some contemporary techniques will be reviewed. Thereafter, in Section 3, the Sensitivity Criterion will be linked to function elasticity [Sydsaeter and Hammond, 1995, Zelenyuk, 2013], the hypothesis will be discussed and new methods will be introduced. The experimental setup will be discussed next, in Section 4, after which the results will be displayed and analysed in Section 5, validating these new methods as well as baselines on traditional metrics and also on robustness. Finally, the work's contributions will be critically discussed in Section 6, after which the greater picture and the possibilities for future work will be revisited.


2 Background

In this section, neural network pruning will briefly be discussed and formalised in Section 2.1. Thereafter, in Section 2.2 ('Literature Review'), state-of-the-art methods will be discussed in detail. Finally, in Section 2.3 ('Related Work'), this thesis will be positioned in relation to the then-defined framework and the introduced literature. In addition, for the unfamiliar reader, a short introduction to deep learning is provided in Appendix A.

2.1 An Overview on Neural Network Pruning

2.1.1 Definitions

Network Pruning Network Pruning is a major sub-field of model compression in deep learning. It concerns a family of algorithms a ∈ A, parametrised by hyperparameters λ, that are tasked with modifying neural networks to reduce their storage size, computational complexity, connectivity and/or energy consumption. Assume the existence of an uncompressed network f(x; θ) that has a compressed counterpart f(x; θ'), which is produced by some compression algorithm a(·|λ), such that:

$$ f(x;\theta') \approx f(x;\theta) \;\; \forall x, \qquad \|\theta'\|_0 \ll \|\theta\|_0 \qquad (1) $$

Its sparsity κ is then the fraction of zeroes in θ'. Practically, this is often approached by masking out parameters through a mask M. Algorithm a then produces said mask at the pruning time t_k:

$$ a(\cdot\,|\,\lambda) \mapsto M, \qquad \theta' = M \odot \theta, \qquad M \in \{0,1\}^{|\theta|} $$
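To make the masking formulation concrete, the snippet below is a minimal sketch - not an implementation from this thesis - of applying a binary mask M element-wise to the weights of a small PyTorch model. A real compression algorithm a(·|λ) would derive M from a saliency criterion; here the mask is random and the architecture and sparsity level κ = 0.9 are arbitrary illustrations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small dense network f(x; theta); the architecture is an arbitrary choice.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

kappa = 0.9  # desired fraction of zeroed parameters

# Build a binary mask M in {0,1}^|theta| per weight tensor (random here, for
# illustration only) and apply theta' = M (element-wise *) theta.
masks = {}
for name, param in model.named_parameters():
    if "weight" in name:
        masks[name] = (torch.rand_like(param) > kappa).float()
        param.data.mul_(masks[name])

# The achieved sparsity is the fraction of zeros in theta'.
total = sum(p.numel() for n, p in model.named_parameters() if "weight" in n)
zeros = sum(int((p == 0).sum()) for n, p in model.named_parameters() if "weight" in n)
print(f"sparsity kappa ~ {zeros / total:.2f}")
```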

The pair of networks also shares some set of (un)desirable properties p ∈ P (e.g. accuracy on a test set) with P = {p_0, ..., p_n}, in which they differ in the values ρ of some of these properties. These values are typically assumed to be more favourable for a successfully compressed network. That is, with some evaluation criterion function e_i(·), related to a specific compression motivation with index i, as well as some corresponding value p_i = ρ_i, one would require the following for an optimal compression algorithm:

$$ e_i(p_i = \rho_i') \;\ge\; e_i(p_i = \rho_i) \quad \forall i \qquad (2) $$

Meaning: we have a more favourable, or at least as good, configuration of that property in the compressed network, for all identifiable properties. These properties minimally constitute some trade-off between a performance metric α ∈ P and a sparsity metric κ ∈ P, but can include many more. The number of properties considered differs per approach. A pruning algorithm may also not strictly hold to these criteria; it can get close on most, or even just some, of the criteria in Function 2 and yet still be a decent compression algorithm - it would just not be an optimal one. Optimality becomes harder to achieve as the number of evaluated properties increases. As a matter of fact, even when considering only a couple of properties, truly optimal compression algorithms have not been found yet, nor made plausible in theory. Generally, we are always dealing with a trade-off between these metrics - i.e. with a Pareto front.

Structure-type The function a(·|λ) may target specific aspects of a network. In this thesis, two different types are distinguished: unstructured pruning and structured pruning.

• Unstructured Pruning removes individual weights by means of the mask M, which makes it more flexible and allows for a higher sparsity than its structured counterpart. Down the line, this approach leads to sparse weight matrices that can be stored more efficiently in CSR-format, as well as used for quicker matrix arithmetic [Buluç et al.,2009,Saad,2003]. However, to utilise these benefits, specialised libraries and/or hardware are a necessity [Han et al.,2016,Liu et al.,2019] and even with those the speedups are found to be no competition for structured pruning [Liu et al.,2019].


• Structured Pruning aims at removing entire nodes of a layer - i.e. entire rows and columns from θ. More generally, it removes parameters in a grouped way, so that it allows for removing certain structures of a network altogether. Therewith, it reduces weight matrices' dimensions, resulting in speed-ups without dedicated libraries and/or hardware. A toy comparison of the two structure-types is sketched after this list.
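To illustrate the contrast, the toy sketch below (an illustrative example under assumed layer sizes and kept indices, not code from the thesis) masks individual weights for unstructured pruning but physically shrinks the weight tensor for structured pruning.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 16)  # weight shape: (16, 8)

# Unstructured: zero out individual weights; the tensor keeps its shape.
mask = (torch.rand_like(layer.weight) > 0.5).float()
layer.weight.data.mul_(mask)
print(layer.weight.shape)    # torch.Size([16, 8]) -- same dimensions, only sparser

# Structured: remove entire output nodes (rows), actually shrinking the tensor.
keep = torch.tensor([0, 2, 3, 5, 8, 13])   # surviving nodes (arbitrary choice)
smaller = nn.Linear(8, len(keep))          # physically smaller replacement layer
smaller.weight.data = layer.weight.data[keep].clone()
smaller.bias.data = layer.bias.data[keep].clone()
print(smaller.weight.shape)  # torch.Size([6, 8]) -- dimensions actually reduced
```

Note that in a real network, the layer following the pruned one would also need its input dimension reduced accordingly.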

Additionally, there are plenty of other approaches to model compression, which include, amongst others, 'Factorisation' and 'Quantisation'. The former uses the concepts of matrix factorisation to perform dimensionality reduction on weight matrices. The latter is more fundamental: elements of a network are encoded in lower precision, or sometimes even share a value, so that they require less storage and arithmetical operations can be performed more efficiently. Although promising, these types of compression will not be in the scope of this thesis, as they don't encompass pruning.

Timing of Compression The next degree of freedom is when one should apply said compression. This work defines the time of possible compression to be t_k, with t_k ≤ T, where T ∈ P is the training time, expressed in the number of updates until convergence of the network under a fixed learning rate, dataset and optimiser. Each compression algorithm aims at creating a compressed network at some time t_k that is, from there on, trainable to convergence (or converged already). This can also be performed iteratively. What is more important though, is when in the training process it takes place. The possible timings that will be distinguished in this thesis are:

• After Training compressing a pre-trained network at t_k = T. The majority of methods operate in this domain. The benefit is that the model is already converged and therefore weights are close to their final value. The drawback is that pruning efforts have no contribution to the training phase.

• During Training aims to compress dynamically in the course of training, or in other words: at any time t_0 < t_k < T the algorithm sees fit. Some algorithms start with a dense network and prune away bit-by-bit during training (e.g. [Louizos et al., 2018, Molchanov et al., 2017]), whilst others start sparsely and periodically interchange which parameters are considered pruned and which are 'grown back' (e.g. [Bellec et al., 2018, Evci et al., 2019]).

• Before Training compressing at t_k = t_0. This effectively means pruning at initialisation, which is more challenging because the parameters are far away from their final value at pruning time. Yet, it has been demonstrated that this is feasible [Frankle and Carbin, 2019, Lee et al., 2019, Wang et al., 2020].

Pruning Domain Together, the endeavoured structure-type and timing constitute the 'domain' of the pruning algorithm in this thesis. An overview of these domains with some methods is displayed in Table 1 of Section 2.3. The domain says something about the benefits a pruning method can give, as well as how it compares to others. That is, newly introduced methods are best compared within their domain of operation for a fair comparison. For example, unstructured pruning after training is easier than structured pruning before training. Yet, sometimes comparison to methods in other domains may lead to a more holistic understanding of qualities.

Saliency Criteria One prominent concept within pruning is to rank all weights and/or nodes according to some 'saliency criterion'. This is a criterion that should indicate how important a certain parameter or group is. Subsequently, we prune/threshold a predetermined fraction κ at pruning time by means of this criterion.

2.1.2 Motivations

Compression algorithms can have different motivations and, as such, different algorithms are aimed at compressing different elements of a network, which in turn affects their design choices and hence their domain. It is therefore important to analyse what a compression algorithm is trying to achieve. Here, five different motivations for compression will be considered and shortly discussed.

Speedup In some real-time applications, it is essential to have fast network inference. Think, for example, of real-time video manipulation or applications that operate on stock markets. With this ambition, one would not be interested in compressing before or during training. The priority is to achieve fast forward passes through the network at inference time. Alternatively, one could be interested in speeding up training on top of that, in which case we need to compress as early as possible. Speedup is easily achieved by reducing the dimensions of parameter tensors. Therefore, it is most straightforward to compress structurally, if one wishes to refrain from using specialised libraries and/or hardware [Han et al., 2016, Liu et al., 2019]. Corresponding velocity metrics are 'Floating-point Operations Per Second' (FLOPS), as well as process and wall time.

Memory Footprint Perhaps most related to its name, model compression is an effective way to reduce the storage size of a network. This was imaginably the first motivation for compression. In order to fit large models in Random Access Memory (RAM) on mobile devices, or to reduce storage on hard disks, compression can be applied to shrink the network. All structure-types are valid approaches to this end. However, structured pruning has more impact in this regard [Han et al., 2015]. Unstructured pruning only achieves a disk-storage reduction if a specialised format - such as CSR - is employed [Buluç et al., 2009, Saad, 2003]; otherwise it will still store zero entries. Moreover, structured pruning is the only one that can reduce the RAM footprint before or during training, since it allows for parameter shrinking and doesn't operate by masking. To elaborate on why this is interesting: when we have more memory available during inference, we can process larger sets of input. Correspondingly, when we have more memory during training, we can also fit larger batches in memory, which in turn allows us to proportionally increase the learning rate [Goyal et al., 2017].

Generalisation As mentioned in the previous sections, modern architectures are heavily overparametrised [Denil et al., 2013, Dauphin and Bengio, 2013, Ba and Caruana, 2014]. Traditionally, the view is that if this is left unchecked, it will lead to overfitting [Bishop, 2006]. It has since been demonstrated that this is not completely true [Zhang et al., 2016], yet nonetheless modern architectures come with a battery of methods to mitigate this phenomenon by means of model regularisation. In fact, it is speculated that regularisation via SGD is why neural networks do fairly well regardless of overparametrisation [Zhang et al., 2016]. It has further been argued that model compression is a form of regularisation itself [Molchanov et al., 2017]. Additionally, most compression algorithms reduce overfitting by reducing complexity, and this effect from pruning has theoretically been linked to better generalisation [Arora et al., 2018]. Meanwhile, some works have empirically been shown to reduce overfitting specifically [Ullrich et al., 2017, Molchanov et al., 2017, Lee et al., 2019], for instance on the 'Random Label Test' [Zhang et al., 2016], where classification labels are randomly shuffled to see if the network will simply memorise the dataset instead of generalising. These effects are sometimes attributed to the 'Minimum Description Length Principle', which states that the best model is the one that can describe the data in as little information as possible, thus getting the 'big picture' and not overfitting on the data [Wallace, 1990, Rissanen, 1978, Ullrich et al., 2017]. With this in mind, one could counter overfitting by compression. More formally, we would find θ' such that, for a loss function L(·):

$$ \mathbb{E}_{D_{test}}\big[L(D|\theta')\big] - \mathbb{E}_{D_{train}}\big[L(D|\theta')\big] \;\ll\; \mathbb{E}_{D_{test}}\big[L(D|\theta)\big] - \mathbb{E}_{D_{train}}\big[L(D|\theta)\big] \qquad (3) $$

Appropriately, one should compress at a time somewhat before training is finished, so that the parameters can adapt to their compressed setting in a final round of fine-tuning and 'forget' their overfitting. Both structured and unstructured pruning would be valid approaches, as well as some extreme scenarios in quantisation such as 'binary weight representation' [Courbariaux et al., 2015, 2016].

Epochs Until Convergence As a neural network becomes more complex, the optimisation algorithm used to train it has to navigate a more and more complex optimisation landscape. This complexity is due to the highly non-convex landscape that arises with more codependent parameters. In practice, it has been demonstrated that the more parameters a network has, the longer it takes to reach convergence - and that pruning can help in this effort [Frankle and Carbin, 2019]. As modern architectures are heavily overparametrised [Denil et al., 2013, Dauphin and Bengio, 2013, Ba and Caruana, 2014], it is reasonable to assume that some training processes could be shortened by reducing the number of trainable parameters in θ prior to training. In Section 2.1, T ∈ P was defined to be the training time until convergence of the original network. Now consider T' to be the equivalent for the compressed network; then we would say that, from this motivation, the goal is:

$$ T' \leq T \qquad (4) $$

Note that this depends heavily on external parameters as well, such as learning rate, data distribution and so forth. In the literature, Frankle and Carbin [2019] demonstrate empirically that compressed sub-networks exist for which Formula 4 holds - which will be further discussed in Section 2.2.1. However, to the best of knowledge at the time of writing, this motivation has not been proven theoretically (yet). Still, if an algorithm can be designed to find a compressed network that holds to this property, training time and computational budget can be reduced. Correspondingly, when designing with this goal in mind, one would try to compress before or early on in training. Both structured and unstructured pruning are valid approaches to reduce the complexity of the optimisation landscape.

Reducing Energy Use Related to the aforementioned topics, almost all of them influence energy usage. For example, a larger RAM footprint comes with more energy usage [Han et al., 2015]. Additionally, more FLOPS means more energy usage, as more GPU/CPU operations need to be performed. Also, with a longer training time until convergence, one will ultimately need more energy. Despite this being an important motivation, energy usage has a lot of external dependencies (e.g. hardware, temperature), and as such it will not be considered much in this thesis. Instead, the other metrics will be considered its proxies.

2.1.3 Approaches from a Bird’s-eye View

Here, common approaches to pruning will be mentioned at a high level, in short. As a reminder: next, in Section 2.2 ('Literature Review'), they will be elaborated upon. Thereafter, in Section 2.3 ('Related Work'), this thesis will be put into perspective with respect to (w.r.t.) the aforementioned definitions and the by-then introduced literature.

• Pruning started with (a) the derivative-based criterion 'Skeletonization' [Mozer and Smolensky, 1989] for structured pruning and (b) the Hessian-based criterion 'Optimal Brain Damage' [LeCun et al., 1990] for unstructured pruning. Later works also devised criteria based on magnitude [Han et al., 2015, Li et al., 2016, Frankle and Carbin, 2019], sign flips [Bellec et al., 2018], gradient-magnitude products [Lee et al., 2019, You et al., 2019], gradient-Hessian-magnitude products [Wang et al., 2020] and many more.

• In addition, there are methods which train sparsely initialised networks. One way is to initialise with a sparse distribution and then train with the periodical interchanging of which parameters are considered pruned and which are ‘grown back’ [Bellec et al.,2018,Dettmers and Zettlemoyer,2019,Evci et al., 2019].

• Another way, is to instead train the network in a sparse setting from the start, which is thereafter not changed anymore [Lee et al.,2019,2020,Wang et al.,2020,Hayou et al.,2020,Tanaka et al.,2020].

• Following the publication of the Lottery Ticket Hypothesis [Frankle and Carbin,2019], a body of research was created examining the behaviour and obtainment of ‘winning tickets’ [Frankle et al.,2019b,Yu et al., 2020,Morcos et al.,2019,Desai et al.,2019,Evci et al.,2019,Zhou et al.,2019,Frankle et al.,2019a, Malach et al.,2020], as well as some criticism [Liu et al.,2019,Gale et al.,2019].

• Bayesian approaches also had their fair share of publications [Ullrich et al., 2017, Louizos et al., 2018, Molchanov et al., 2017]. The Bayesian interpretation of weight pruning is not recent [Weigend et al., 1991], nor is it strictly related to deep learning (see for example Tipping [2001]).

• Additionally, 'sparsity-inducing priors' are often applied in the previous settings. Here, a distribution over model components with a specific prior that promotes sparsity is used, which is added to the training objective as a regulariser [Carvalho et al., 2009, Louizos et al., 2018, Liu et al., 2017, Yang et al., 2020, Wen et al., 2016, Alvarez and Salzmann, 2016, Lebedev and Lempitsky, 2016, Ye et al., 2018].

• In a different branch, we see the use of dropout [Srivastava et al., 2014] as a means of pruning [Kingma et al., 2015, Molchanov et al., 2017, Gomez et al., 2019].

• Some works minimise the reconstruction error of feature maps in structured pruning, with the assumption that this implies maintaining performance [He et al., 2017, Luo et al., 2017].

• Some works perform single-shot structural pruning with pre-trained networks [Li et al.,2020,Yu and Huang,2019,Li et al.,2019a]

• Finally, there have been multiple papers where the Alternating Direction Method of Multipliers (ADMM) is used for structured pruning, in which different layers iteratively get optimised in isolation [Liu et al., 2019, Li et al., 2020].

2.2 Literature Review

Next, the approaches mentioned previously are discussed in more detail and the specific algorithms that are later employed as baselines are taken under the microscope. This takes the form of a general literature review. What follows thereafter, in Section 2.3, is an exposition of where this thesis is positioned in relation to this literature review. The subset of the discussed methods that are also baselines is additionally documented with their domain in Table 1 in Section 2.3.

2.2.1 Magnitude as Saliency Criterion

As mentioned before, a paradigm that emerged soon after the appearance of model compression was using weight magnitude (i.e. absolute value) as a proxy for saliency. The Magnitude Criterion is thus defined by:

$$ sc_m(\theta_{ij}) = \|\theta_{ij}\|_1 $$

The algorithms that use it assume that low magnitude entails low influence on the final loss [Han et al., 2015, Li et al., 2016, Frankle and Carbin, 2019], and that a magnitude decrease over time would mean weights are 'going to zero anyway' [Zhou et al., 2019]. This is coined here as the 'Magnitude Assumption'. Although the idea was deemed sub-optimal in the early 'Optimal Brain Damage' [LeCun et al., 1990] and more recently by Ye et al. [2018], it regained traction after the work of Han et al. [2015], which reintroduced magnitude pruning and reached very high sparsity levels with a relatively low drop in performance - albeit with an extensive training routine. The results seemed to suggest that one could take the Magnitude Assumption for granted. The assumption became a common standard that also inspired the Lottery Ticket Hypothesis [Frankle and Carbin, 2019].
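As a concrete illustration of this criterion, the sketch below (a simplified example, not the reference implementation of any cited method) ranks all weights globally by absolute value and masks the lowest fraction κ; the architecture and κ = 0.9 are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
kappa = 0.9  # fraction of weights to prune

# Global ranking: gather |theta| over all weight tensors and find the cut-off.
scores = torch.cat([p.detach().abs().flatten()
                    for n, p in model.named_parameters() if "weight" in n])
threshold = torch.quantile(scores, kappa)

# Keep only the weights whose magnitude exceeds the global threshold.
for name, param in model.named_parameters():
    if "weight" in name:
        param.data.mul_((param.detach().abs() > threshold).float())
```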

The Lottery Ticket Hypothesis In Frankle and Carbin [2019], a new view on compression was offered by the authors, summarised as: 'every network contains a sub-network that does the same job, with fewer parameters'. This sub-network is defined as one that can achieve commensurate performance with fewer parameters, in commensurate time [Frankle and Carbin, 2019] - thereby corresponding to Equations 1 & 4 from this thesis. Stochastic Gradient Descent (SGD) is now re-purposed by the authors as a Neural Architecture Search (NAS) method, instead of an optimisation algorithm, that can find this sub-network from the overparametrised parent network - thereby rendering the latter a proxy for the possible network-space. Moreover, they argued that these networks can be discovered early in the training process and demonstrated this to be the case for small networks with an iterative pruning algorithm, 'Iterative Magnitude Pruning' (IMP) [Frankle and Carbin, 2019]. In this algorithm, they train the network shortly, after which they prune and reset the surviving parameters to their initial values. This process is repeated until the desired sparsity κ is reached, after which the network is trained to convergence. They argue that these initial weight values are essential for the strength of the surviving sub-network, and therefore brand them 'winning lottery tickets'. They demonstrate high sparsity rates without much loss of accuracy, with sub-networks obtained early on in training.
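The following sketch outlines the IMP loop described above on a toy problem; the data, architecture, number of rounds and per-round prune fraction are illustrative assumptions, not the settings of Frankle and Carbin [2019].

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup; data, architecture and schedule are illustrative only.
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
initial_state = copy.deepcopy(model.state_dict())   # the 'winning ticket' initial values
masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if "weight" in n}

def train(model, steps=100):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(X), y).backward()
        opt.step()
        # keep pruned weights at zero during training
        for n, p in model.named_parameters():
            if n in masks:
                p.data.mul_(masks[n])

prune_per_round = 0.2  # prune 20% of the remaining weights each round
for round_ in range(5):
    train(model)
    # rank the remaining (trained) weights by magnitude and extend the mask
    scores = torch.cat([(p.detach().abs() * masks[n]).flatten()
                        for n, p in model.named_parameters() if n in masks])
    threshold = torch.quantile(scores[scores > 0], prune_per_round)
    for n, p in model.named_parameters():
        if n in masks:
            masks[n] = ((p.detach().abs() > threshold) & (masks[n] > 0)).float()
    # rewind surviving weights to their initial values (the lottery-ticket step)
    model.load_state_dict(initial_state)
    for n, p in model.named_parameters():
        if n in masks:
            p.data.mul_(masks[n])

train(model)  # final training of the sparse 'winning ticket'
```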

Criticism Be that as it may, they do require multiple iterations of training and pruning to get to that sub-network. Additionally, their original algorithm produced disconnections of layers when applied to larger networks - a major setback [Frankle and Carbin, 2019, Liu et al., 2019, Frankle et al., 2019b, Gale et al., 2019]. Lottery Ticket algorithms have since received criticism. Besides the initial hurdle of not working for large networks [Liu et al., 2019, Frankle et al., 2019b], the work of Gale et al. [2019] finds that lottery tickets are usually outperformed by methods that include compression as part of training, such as Sparse Variational Dropout [Molchanov et al., 2017] and ℓ0-regularisation [Louizos et al., 2018].


Works on Winning Tickets and Iterative Magnitude Pruning In the follow-up paper [Frankle et al., 2019b], the premise of returning to the initial weight values is disputed. They argue that the exact iteration to rewind to, e_k, is correlated with the 'stability' of the network. For this purpose, they define two stability measures, for which the reader is deferred to the original paper [Frankle et al., 2019b]. Subsequently, they go on to demonstrate that this 'instability' in the initial stages of training is higher in deeper networks and that the perfect pruning time is highly correlated with these stability measures. They hypothesise that this is because 'winning tickets' are more stable at this point, meaning they would end up in a specific local optimum. Therefore, at this point the tickets are more robust against noise from SGD/data-order, because SGD is very sensitive to randomness early on in training. When the rewinding epoch e_k is too late, however, one gets back to random pruning performance, which is not explained very well. They aim to exploit the temporary 'Stability Gap', which is a range of (early) epochs where the difference between these stabilities is larger. In pursuit of that, they alter IMP to not return to the values exactly at e_k = 0. Rather, they rewind to some epoch where stability is at the best place in the stability gap, which adds another degree of freedom to the list of tuneable hyperparameters - but does make the algorithm applicable to larger networks [Frankle et al., 2019b]. Next, they further develop this idea in their paper on Linear Mode Connectivity [Frankle et al., 2019a], where they introduce 'instability analysis' as a formal way to find the optimal epoch to rewind to.

Simultaneously, Zhou et al. [2019] researched different pruning conditions and find that, on top of low weight magnitude, sign flipping is also an important saliency criterion - which was already observed in the work of Bellec et al. [2018]. They further demonstrate that some networks can even be trimmed by magnitude before training with sub-optimal but decent performance [Zhou et al., 2019]. Additionally, in Yu et al. [2020] it is shown how the IMP algorithm is easily applied to networks from different areas of application - such as Natural Language Processing (NLP) and Reinforcement Learning (RL). Similarly, Morcos et al. [2019] and Desai et al. [2019] show that winning tickets can be used in transfer learning across datasets and optimisers as well. This demonstrates that the strength of these tickets is not specific to the experimental setup of the original paper of Frankle and Carbin [2019].

Subsequently, Evci et al. [2019] show that the costly iterative nature of the IMP algorithm - especially with rewinding - is not necessary, and that pruning can be done during training. Inspired by Bellec et al. [2018] and Dettmers and Zettlemoyer [2019], where connections are not only cut but also re-established, they combine this with IMP to create a new lottery algorithm called 'RigL' [Evci et al., 2019], where they prune during regular training, every so many iterations. Each time, they prune a number of weights based on the same weight-magnitude criterion; on top of that, they grow back the disconnected weights that receive the highest gradient signal [Evci et al., 2019]. They achieve state-of-the-art results. However, this comes at the cost of a large number of new tuneable hyperparameters, as well as complex annealing schemes.

In recent work, Malach et al. [2020] demonstrate how to prove the Lottery Ticket Hypothesis and also establish an even stronger hypothesis, 'showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training' [Malach et al., 2020].

2.2.2 Dropout and Sparsity

Dropout [Srivastava et al., 2014] is a regularisation method used to reduce overfitting and, from a Bayesian perspective, to perform efficient model averaging. From a NAS perspective, dropout operates as a method that, at every forward pass, samples a different model architecture within the subspace defined by the parent-model architecture. It is therefore not surprising that dropout is related to model compression, as it trains with sparse sub-networks.

Sparse Variational Dropout A Bayesian approach that is frequently employed is Sparse Variational Dropout (SVD) [Molchanov et al., 2017]. Here, the dropout [Srivastava et al., 2014, Kingma et al., 2015] framework is employed to supplement the training objective with a sparsity objective. The method is built upon Variational Dropout [Kingma et al., 2015], which extended Gaussian dropout through Variational Inference to obtain a framework where dropout rates per parameter are learned and sparsity can be controlled through the objective function. Variational Dropout had one problem though: it became unstable in the high-sparsity regime [Kingma et al., 2015, Molchanov et al., 2017]. Molchanov et al. [2017] solve this problem with a variance-reducing trick and yield SVD, which allows for jointly training a network and obtaining high weight- or node-sparsity in the process, through a framework of dropout. A short derivation is recorded in Appendix B; it was moved there because SVD ended up not being used as a baseline, unlike was intended initially. The reason is that SVD may achieve very high sparsity rates, but has also been reported to have a high variance in results between different runs [Gale et al., 2019, Gomez et al., 2019] - a finding that was reconfirmed in early explorative research for this thesis.

Targeted Dropout More recently, after a spike in research on magnitude-based saliency criteria following the Lottery Ticket Hypothesis [Frankle and Carbin, 2019], it was illustrated that 'sparse' dropout can be practiced in a more direct frequentist setting too. In Gomez et al. [2019], it is argued that dropout is - by design - meant to make networks more robust under changes in architecture. Therefore they argue that, if dropout is applied with the same criterion as the final saliency criterion for pruning, then the network will learn to be robust under that saliency criterion. They coin this 'Targeted Dropout' (TD) [Gomez et al., 2019], and train with dropout layers that drop out the lowest-magnitude weights and nodes. After training, they establish that the network is a great deal more stable under the strain of pruning by the saliency criterion and therefore reaches a better sparsity before collapsing [Gomez et al., 2019]. In this effort, they also do not introduce new hyperparameters other than the ones used in the saliency criterion itself. However, their method can also be considered a regularisation effort on top of any other pruning algorithm that already uses a saliency criterion, instead of being a pruning algorithm in itself.

2.2.3 Sparsity Inducing Priors

Ordinary weight regularisation (or 'weight decay') is usually employed to ensure parameters don't take on extreme values, as this is often correlated with overfitting [Bishop, 2006]. This is usually achieved by adding an ℓ2-error term to the loss function as a regulariser. Similarly, ℓ1-regularisation terms have been used to encourage compression of individual weights [Han et al., 2015] or groups of weights [Li et al., 2016]. Often, training with such regularisers is followed by subsequent pruning with a saliency criterion to finish the job, since the regulariser only pushes a parameter towards zero but cannot reach it exactly.

However, using a norm such as the ℓ1 still has collateral damage, as it penalises the magnitude too, and not only the sought-after 'non-zero-ness'. This is acceptable when pursuing actual regularisation, yet if one considers just pruning - which is usually only obtained with a high regularisation hyperparameter - it would be required that sparsity is induced without further consequences. In pursuit of that goal, ℓ0-norm regularisation would be the ideal solution, since this specific norm encourages the network to set weights to exactly zero and does not influence the weights' actual values while they are non-zero. Its shape is accordingly: 0 when a weight is exactly zero, and 1 everywhere else. This is in a way a natural description of what one wants a pruning algorithm to do: kill specific weights, and as many as possible, yet in balance with the original training objective - as described by Formula 1. Unfortunately, exactly this (binary) property makes the ℓ0-norm non-differentiable, and therefore it cannot be added to the training objective as a regulariser.

Still, one can do better than ℓ1, and notable research has focused on finding the very steep and long-tailed distributions that form a good 'sparsity-inducing prior', or put differently: a distribution that approximates ℓ0. To name a few: Carvalho et al. [2009] use a 'Horseshoe prior', Bai and Ghosh [2017] use an 'Inverse Gamma-Gamma prior' and Schmidt and Makalic [2018] use 'Log-Scale Shrinkage Priors'.

Relaxed ℓ0-Regularisation Another solution is to not discard the ℓ0-norm just yet, but instead to relax the discrete character of the distribution underlying it. In the work of Louizos et al. [2018], this is solved using a relaxed Hard-Concrete distribution, which is obtained by taking a binary random variable, stretching it [Maddison et al., 2016, Jang et al., 2016] and then passing the sample through a hard-sigmoid. The former is a trick to get a relaxed version of a binary random variable; the latter is a computationally lighter approximation of the sigmoid, composed of a couple of linear segments. A short derivation is found in Appendix B for the interested reader, which yields the following optimisation goal:

$$ \arg\min_{\tilde{\theta},\,\phi}\; \mathbb{E}_{q(s|\phi)}\left[ \frac{1}{N}\sum_{i=1}^{N} L\Big( f\big(x_i;\, \tilde{\theta} \odot \min(1, \max(0, s))\big),\, y_i \Big) \right] \;+\; \lambda \sum_{j=1}^{|\theta|} \Big( 1 - Q(s_j \le 0 \,|\, \phi) \Big) $$

Here, on top of a masked version of the ordinary optimisation goal, the ℓ0-norm regularisation is approximated by the CDF of the distribution Q(·); see Appendix B for more detail. The authors choose Q(·) to be the binary concrete distribution, because it is related to the original Bernoulli distribution of the mask. After this specification and a few others, for which the reader is deferred to the original paper of Louizos et al. [2018], the objective is further updated. In conclusion, on top of a globally shared parameter β, a location α_ij is learned for each parameter θ_ij, all of which have to be kept in memory during the entire training process. Moreover, a lot of computational overhead is added to the passes. Finally, the method still relies on a form of thresholding.
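As a rough sketch of the mechanics described above - assuming commonly used constants for the temperature and stretch limits, which are not taken from the thesis - the snippet below samples a stretched hard-concrete gate and computes the differentiable per-parameter approximation of the ℓ0 penalty, 1 − Q(s_j ≤ 0 | φ).

```python
import torch

def hard_concrete_gate(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Sample a relaxed binary gate in [0, 1]: draw a binary-concrete sample,
    stretch it to (gamma, zeta) and clamp it with a hard-sigmoid."""
    u = torch.rand_like(log_alpha)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)  # concrete sample
    s_bar = s * (zeta - gamma) + gamma                                       # stretch
    return torch.clamp(s_bar, 0.0, 1.0)                                      # hard-sigmoid

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Probability that a gate is non-zero, i.e. 1 - Q(s <= 0 | phi); summing
    this over all parameters gives the differentiable L0 penalty."""
    return torch.sigmoid(log_alpha - beta * torch.log(torch.tensor(-gamma / zeta)))

log_alpha = torch.zeros(300, 784, requires_grad=True)  # one location parameter per weight
gates = hard_concrete_gate(log_alpha)                   # to be multiplied element-wise with theta
penalty = expected_l0(log_alpha).sum()                  # add lambda * penalty to the training loss
```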

DeepHoyer-Regularisation Following the difficulties experienced with the previously discussed ℓ0-regularisation [Louizos et al., 2018], recently a different approach was introduced: DeepHoyer [Yang et al., 2020]. They avoid much of the sampling difficulties by instead approximating the ℓ0-norm with the Hoyer measure H [Hoyer, 2004], which is defined as the ratio between the ℓ1- and ℓ2-norm:

$$ H(\theta) = \frac{\|\theta\|_1}{\|\theta\|_2} $$

They then show that the range of this measure, for any vector of length N, is [1, √N], whereas for the real ℓ0-norm it is [1, N] [Yang et al., 2020]. Hence, they square it to arrive at their regulariser H_S:

$$ \|\theta\|_0 \;\approx\; H_S(\theta) = \frac{\left(\sum_{i,j} |\theta_{ij}|\right)^2}{\sum_{i,j} \theta_{ij}^2} $$

This is then added to the training objective, scaled by a hyperparameter λ, so that weights are pushed towards zero, like a real ℓ0-norm would do. After training, they threshold the remaining weights, for which they introduce an additional thresholding hyperparameter [Yang et al., 2020]. Additionally, they provide a rewritten form of H_S that is applicable to structured pruning, where each group of weights corresponding to a node gets pushed to zero jointly [Yang et al., 2020].
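A minimal sketch of adding the H_S regulariser to a training loss in PyTorch is given below; the model, data and value of λ are placeholders for illustration, and the thresholding step after training is omitted.

```python
import torch
import torch.nn as nn

def hoyer_square(weight: torch.Tensor) -> torch.Tensor:
    """Differentiable approximation of ||theta||_0: (sum |theta|)^2 / sum theta^2."""
    return weight.abs().sum() ** 2 / (weight.pow(2).sum() + 1e-12)

torch.manual_seed(0)
model = nn.Linear(20, 2)
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
lam = 1e-3  # regularisation strength (illustrative value)

task_loss = nn.functional.cross_entropy(model(x), y)
reg = sum(hoyer_square(p) for n, p in model.named_parameters() if "weight" in n)
loss = task_loss + lam * reg
loss.backward()  # the weights now receive a gradient pushing them towards exact zero
```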

2.2.4 Training Sparsely Initialised Networks

A major drawback of the discussed approaches is that they achieve their sparse settings only late in, or posterior to, the training process. In contrast, an appealing strategy is to instead train already-sparse networks from scratch, to an equivalent performance. This way, one can directly benefit from the sparse setting.

Regrowing Weights Early adopters were Bellec et al. [2018], with their method 'Deep Rewiring', which allows pruned connections to grow back where needed, whilst keeping the total number of active ones below a maximum bound. In turn, this led to Sparsity From Scratch (SFS), where they similarly train with sparsity built into training, but allow weights to grow back proportional to a layer's relevance, which is evaluated by the layer's respective momentum of gradients [Dettmers and Zettlemoyer, 2019]. Subsequently, this gave inspiration to the RigL paper [Evci et al., 2019], which was already discussed in Section 2.2.1. Although the goal of training sparsely initialised networks was achieved by these papers, they still use the entire network during training for growing back connections and thus do not achieve their final sparsity exactly at initialisation. Therefore, they are still considered in the during-training domain in this thesis and not in the before-training domain. The first method to actually pursue the latter will be discussed next.

SNIP: Single Shot Network Pruning In the field of model compression, new papers often appear with proclaimed state-of-the-art sparsity-performance trade-offs. However, these methods often come at the cost of difficult implementations that require multiple hyperparameters to be tuned and sometimes even very specific training schemes. Think, for example, of ℓ0-regularisation [Louizos et al., 2018], where one needs an extra set of parameters in memory during training and a difficult implementation, or of RigL [Evci et al., 2019], where to achieve state-of-the-art results one needs a complex annealing scheme that - at the very least - depends on the network, task and data. One could argue that the real innovation is actually in a new method that is a simplification compared to its baselines, while results remain reasonable - rather than achieving a small improvement in sparsity with the drawback of introducing even more complexity.

To this end, Lee et al. [2019] introduced 'SNIP', which uses a derivative-based saliency criterion. Practically, they employ the gradients obtained from a single batch to rank weights by their Sensitivity Criterion, cut off at a desired sparsity κ and then train without further disturbance. Granted, they don't get the best scores. However, they introduce a great deal of simplicity that can be leveraged before training and works out of the box. The approach goes as follows. First, they introduce an auxiliary variable c_ij = 1 for each weight θ_ij, such that the optimisation objective is substituted as indicated:

$$ \min_{\theta}\; \frac{1}{N}\sum_{i=1}^{N} L(D|\theta) \;=\; \min_{\theta}\; \frac{1}{N}\sum_{i=1}^{N} L(D|\theta \odot c)\,\Big|_{c=1} $$

Thereafter, they take a single batch, do a forward pass, and calculate the first-order gradient w.r.t. the auxiliary variable c_ij, rather than the weight. They choose not to use the Hessian like in LeCun et al. [1990] and Hassibi et al. [1994], because this would not scale to larger networks. Moreover, if they would use the weight's derivative directly (∂L/∂θ_ij), they would measure the change in loss under a change in weight value rather than its mere presence [Lee et al., 2019]. By using the auxiliary variable instead, they measure the weight's importance for the loss function directly - according to the authors [Lee et al., 2019]. This approach was previously already introduced in Skeletonization [Mozer and Smolensky, 1989] for structured pruning. The authors seem to have missed that and re-coin it as the 'Sensitivity Criterion' sc(·) [Lee et al., 2019]:

$$ g(D;\theta_{ij}) = \frac{\partial L(D|\theta \odot c)}{\partial c_{ij}}\bigg|_{c_{ij}=1}, \qquad sc(\theta_{ij}) = \frac{\big|g(D;\theta_{ij})\big|}{\sum_{k,l} \big|g(D;\theta_{kl})\big|} $$

Then, they rank the weights of the network globally based on this sensitivity signal, and prune away the least important weights by setting the corresponding variables c_ij = 0, up to a certain desired sparsity level κ. Following that, they omit the auxiliary variables, leaving the way free to train a sparse network from scratch without further interference. They demonstrate in their paper that a single batch, given it is of sufficient size, is enough to estimate the function sc(·), and that the models train to comparable performance on a variety of architectures and don't overfit [Lee et al., 2019]. More on SNIP's method is discussed in Section 3.1.
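The sketch below illustrates how SNIP-style sensitivity scores could be computed from a single batch; it uses the chain-rule form |∂L/∂θ ⊙ θ| (discussed next) instead of explicit auxiliary variables, and the network, dummy data and κ = 0.9 are illustrative assumptions rather than the authors' setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
x, y = torch.randn(128, 784), torch.randint(0, 10, (128,))  # one (dummy) batch
kappa = 0.9

# Single forward-backward pass to obtain the gradients.
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# Sensitivity of each weight: |grad * theta|, normalised over the whole network.
weights = [(n, p) for n, p in model.named_parameters() if "weight" in n]
scores = {n: (p.grad * p.detach()).abs() for n, p in weights}
total = sum(s.sum() for s in scores.values())
scores = {n: s / total for n, s in scores.items()}

# Rank globally and keep only the top (1 - kappa) fraction of weights.
flat = torch.cat([s.flatten() for s in scores.values()])
threshold = torch.quantile(flat, kappa)
masks = {n: (s > threshold).float() for n, s in scores.items()}
for n, p in weights:
    p.data.mul_(masks[n])
```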

A Signal Propagation Perspective on SNIP In one of the follow-up papers, Lee et al. [2020] show that SNIP's auxiliary variables are actually redundant, since they are initialised as c = 1 and one can expand g(·) with the chain rule as follows (details are discussed in Section 3.1):

$$ \frac{\partial L(D|\theta \odot c)}{\partial c}\bigg|_{c=1} = \frac{\partial L(D|\theta)}{\partial \theta} \odot \theta $$

Besides ridding ourselves of this auxiliary variable c, this also explains another issue with the sensitivity scores that, according to the authors, explains the occasional failings of SNIP [Lee et al., 2020]. Namely, because of the multiplication with θ^(l), each layer scales the signal to a degree. Hence, when a global threshold is chosen, the scale of a layer can disproportionally affect a weight's saliency just because of which layer it is in [Lee et al., 2020]. This is easily solved by ensuring that the singular values of a new weight matrix are close to one, or put differently, that it is approximately orthogonally initialised. The singular values are namely geometrically interpreted as the lengths of the eigenvectors of a weight tensor [Poole, 2014]. As such, they scale the transformation that this tensor applies by their values [Poole, 2014]. Following that, if they are all close to 1, it ensures that the signal is 'faithful' and that 'dynamical isometry' is maintained [Lee et al., 2020]. It is shown that, with better initialisation, networks reach better sparsity-performance trade-offs after SNIP, are more stable and avoid previous fallacies such as the disconnection of layers [Lee et al., 2020, Hayou et al., 2020, Wang et al., 2020].
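For completeness, the snippet below sketches this kind of (approximately) orthogonal initialisation with PyTorch's built-in initialiser; printing the singular values confirms they are all close to one. The layer size is arbitrary.

```python
import torch
import torch.nn as nn

layer = nn.Linear(300, 300)
nn.init.orthogonal_(layer.weight)           # (approximately) orthogonal initialisation
print(torch.linalg.svdvals(layer.weight))   # singular values are all ~1.0
```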

GraSP: Preserving Gradient Flow for Pruning Before Training Concurrently, a different perspective on the same problem is explored by Wang et al. [2020]. They find that the work of Lee et al. [2020] gives empirical support to the theories of the Neural Tangent Kernel (NTK) [Jacot et al., 2018, Arora et al., 2019, Wang et al., 2020]. Just like Lee et al. [2020], they argue that SNIP [Lee et al., 2019], although powerful in finding pruneable weights before training, is not optimal. It evaluates weights in isolation, which can influence the gradient flow through the network by reducing the norm of the gradient. The premise of SNIP is that it aims at preserving the loss under the strain of pruning, but they argue that when evaluated before training, this loss is random anyway. Thus, they argue it is better to preserve training dynamics, rather than the loss value. For this, they introduce GraSP [Wang et al., 2020], which applies a Taylor expansion to the directional derivative ∆L(θ). Contrary to other approaches that use this derivation [Molchanov et al., 2016, You et al., 2019], they extend it to the second-order derivative by using a Hessian-derivative product that doesn't require computing the explicit Hessian matrix [Wang et al., 2020], ending up with the following saliency criterion:

$$ sc_{GraSP}(\theta) = \theta \odot \left( H\, \frac{\partial L(D|\theta)}{\partial \theta} \right) $$

They show in their derivation that if the saliency score of a weight is positive, gradient flow increases after pruning that weight, whereas if it is negative, gradient flow decreases. They further show that the Hessian captures the interactions between multiple weights, where SNIP assumes this to be the identity matrix, thereby suffering from disconnection and underfitting [Wang et al., 2020]. More on GraSP's method is discussed in Section 3.1.
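The sketch below illustrates the kind of Hessian-gradient product such a criterion requires, computed with a double backward pass instead of an explicit Hessian; it follows the formula above on a toy network with dummy data and is not the reference GraSP implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
x, y = torch.randn(128, 784), torch.randint(0, 10, (128,))
params = [p for n, p in model.named_parameters() if "weight" in n]

# First backward pass: gradient g = dL/dtheta (keep the graph for a second pass).
loss = nn.functional.cross_entropy(model(x), y)
grads = torch.autograd.grad(loss, params, create_graph=True)

# Hessian-gradient product H*g without forming H: differentiate g . stop_grad(g).
g_dot = sum((g * g.detach()).sum() for g in grads)
hg = torch.autograd.grad(g_dot, params)

# Saliency per weight: theta (element-wise *) (H g), as in the criterion above.
scores = [p.detach() * h for p, h in zip(params, hg)]
```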

Very Recent Progress: SynFlow and WoodFisher Since this is also the niche that the work of this thesis is in, very recent simultaneous research is discussed here as well. These are works that were made public around the same time as the paper that was produced from this thesis [Verdenius et al., 2020], which is in itself a summary of this thesis. Specifically, the papers referred to are the works that introduced 'SynFlow' [Tanaka et al., 2020] and 'WoodFisher' [Singh and Alistarh, 2020].

Very recently, and concurrently with the research of this thesis, Tanaka et al. [2020] realised that there is a critical compression rate at which disconnection takes place [Lee et al., 2020, Hayou et al., 2020, Wang et al., 2020, Tanaka et al., 2020]. They argue that no algorithm should hit said rate before reaching the maximum possible compression - something the previous methods that prune before training all fail to achieve. They then analyse why this happens and find that the cause differs per algorithm: in the case of random pruning the smallest layer is disconnected first, in the case of magnitude pruning the widest layer is the victim, and for SNIP [Lee et al., 2019] and GraSP [Wang et al., 2020] it is the layers with the most parameters. They proceed to explain this via synaptic-saliency conservation laws, and show that all mentioned algorithms - except random pruning - respect these laws, though some more strongly than others. This is inherently a positive property. However, they further show that when conservation is coupled with single-shot pruning, it leads to disconnection, due to the method's tendency to prune a layer proportionally to its size. Yet, when applied iteratively - still before training - one can avoid this. Namely, after the first iteration the largest layer has become smaller, effectively lowering the bias towards that layer with each pruning event [Tanaka et al., 2020]. Moreover, they show that training in between these pruning events makes the network follow the conservation laws even better - which is also what happens in IMP [Frankle and Carbin, 2019, Frankle et al., 2019b, Tanaka et al., 2020].

Further, they introduce their own method, 'SynFlow': an iterative, data-free extension of magnitude pruning, which they claim solves the disconnection problem by attaining a very high critical compression rate.
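The data-free scoring step can be sketched roughly as follows (a simplification for illustration, not the official SynFlow code; in practice double precision and care with batch-norm layers may be needed). The weights are temporarily replaced by their absolute values, a single all-ones input is propagated forward, and each weight is scored by |θ ⋅ ∂R/∂θ|, where R is the summed output:

```python
import torch

@torch.no_grad()
def _abs_weights(model):
    # remember the signs, then make every weight non-negative
    signs = [p.sign() for p in model.parameters()]
    for p in model.parameters():
        p.abs_()
    return signs

def synflow_like_scores(model, input_shape):
    """Data-free saliency |theta * dR/dtheta| with R the output of an
    all-ones forward pass through the absolute-valued network."""
    signs = _abs_weights(model)
    model.zero_grad()
    R = model(torch.ones(1, *input_shape)).sum()
    R.backward()
    scores = [(p.grad * p).abs().detach() for p in model.parameters()]
    # restore the original signs of the weights
    with torch.no_grad():
        for p, s in zip(model.parameters(), signs):
            p.mul_(s)
    return scores
```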

Almost all of these points directly correspond to the work introduced in this thesis, which was published in roughly the same weeks in the aforementioned paper [Verdenius et al., 2020]. More on how Tanaka et al. [2020] relates to the research in this thesis is discussed in Sections 2.3 and 6.


Finally, also very recently, a work by Singh and Alistarh [2020] analyses and generalises the pruning-before-training literature, specifically addressing the quality of second-order approximations. To that end, they introduce 'WoodFisher', which applies a gradual pruning strategy with efficient second-order approximations to obtain state-of-the-art sparsity-performance trade-offs [Singh and Alistarh, 2020]. Again, this work relates to the work in this thesis; how it does so is also discussed in Section 2.3.

2.2.5 Concurrent Advances in Structured Pruning

Structured pruning is different in the sense that, by introducing extra structural constraints into the pruning algorithm, one can enforce certain pruning behaviour - i.e. the actual reduction of parameter-tensor sizes, as opposed to only making them sparser. A couple of approaches that were mentioned before have a straightforward extension into the structured domain (e.g. SVD [Molchanov et al., 2017], HoyerSquare [Hoyer, 2004] and ℓ0-regularisation [Louizos et al., 2018]). Yet, there has been substantial additional research into this topic specifically. The branch has somewhat diverged over time, as it is a more constrained problem setting with its own approaches. Therefore, a short overview is provided in this section.

Efficient ConvNets It has been demonstrated that, even though the majority of weights of a large CNN are situated in the fully-connected part of the network, the convolutional layers still contribute the majority of the FLOPs. This is due to the nature of convolutional operations: each kernel is reused at every spatial position of the output. For example, in VGG16 the fully-connected layers are responsible for roughly 90% of the parameters, but only 10% of the FLOPs [Li et al., 2016]. To this end, Li et al. [2016] introduced 'Efficient ConvNets', where they employ a magnitude criterion for structured pruning to great success. Just as was the case in unstructured pruning, there had been work on structured pruning for some time before this [Mozer and Smolensky, 1989]. However, structured pruning only started gaining traction after this more modern work from Li et al. [2016].
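The imbalance is easy to verify with a back-of-the-envelope count (a sketch with illustrative layer sizes, not the exact VGG16 configuration):

```python
def conv_counts(c_in, c_out, k, h_out, w_out):
    # one multiply-accumulate per weight, repeated at every output position
    params = c_out * c_in * k * k
    flops = params * h_out * w_out
    return params, flops

def fc_counts(n_in, n_out):
    # one multiply-accumulate per weight, each weight used exactly once
    params = n_in * n_out
    return params, params

print(conv_counts(64, 128, 3, 112, 112))  # ~74k params, ~0.9 GFLOPs
print(fc_counts(512 * 7 * 7, 4096))       # ~103M params, ~0.1 GFLOPs
```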

Subsequent Structured Pruning Research Next, two notable works improved on this. In He et al. [2017], input channels are pruned based on a two-step algorithm, where they first select the channels they deem most representative and then try to reconstruct the output channels with the lowest reconstruction loss. Simultaneously, in Luo et al. [2017] the authors adopt the same view, yet take the opposite approach by pruning away input channels of the next layer and thereby also the output channels of the current layer, with additional optimisation for channel selection and fine-tuning [Luo et al., 2017].

Furthermore, 'group-sparsity' is used as a sparsity-inducing prior in several works, where groupings of weights are penalised, with subsequent channel pruning [Liu et al., 2017, Wen et al., 2016, Alvarez and Salzmann, 2016, Lebedev and Lempitsky, 2016, Ye et al., 2018]. Most notable among them is 'Network Slimming' by Liu et al. [2017], where an ℓ1-penalty is imposed on the scaling factors in batch-norm layers, thereby pushing certain neurons to zero during training. The unimportant neurons are then identified by the magnitude of their batch-norm scaling factors [Liu et al., 2017] - an idea later adopted in the GateDecorator [You et al., 2019].

Next, Yu et al. [2018] argue with their method 'NISP' that one should consider the statistics of more than one layer and instead prune globally based on the reconstruction error of the final response layer. In more recent work, Li et al. [2019b] extend the idea of using statistics from more than one layer by concatenating the corresponding rows, columns and/or filters of adjacent layers, and evaluating the energy of a neuron as saliency criterion [Li et al., 2019b]. Moreover, in work by He et al. [2019], pruning based on magnitude or energy is replaced by a distance metric from each individual filter to the geometric median over all filters in that layer, shifting the focus from pruning low-contribution filters towards pruning redundant filters [He et al., 2019].
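That last criterion can be sketched as follows. This is a simplification of the geometric-median idea of He et al. [2019], approximating closeness to the geometric median by the summed pairwise distance to all other filters; the function name is illustrative:

```python
import torch

def median_distance_scores(conv_weight):
    """conv_weight: (num_filters, c_in, k, k). Filters with a small summed
    distance to all other filters lie near the geometric median and are
    considered redundant (low score = prune first)."""
    flat = conv_weight.flatten(start_dim=1)
    pairwise = torch.cdist(flat, flat, p=2)   # (num_filters, num_filters)
    return pairwise.sum(dim=1)
```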

Concurrently, and from a totally different angle, He et al. [2018] frame the problem as automating the tuning work of node pruning - in practice usually performed by machine-learning engineers - and replace the expert with a reinforcement-learning agent (AMC). This agent prunes a pre-trained network and receives a reward for keeping its performance while reducing its FLOPs [He et al., 2018].

The GateDecorator In a different branch, Molchanov et al. [2016] developed a saliency criterion for filters based on the effect of a filter on the loss function - akin to SNIP [Lee et al., 2019] - but for pre-trained networks. For this, they derive the Sensitivity Criterion, albeit through a Taylor expansion [Molchanov et al., 2016] - which will be discussed further in Section 3.1. Similarly, in a follow-up work, You et al. [2019] evaluate the importance score of a filter using the same sensitivity score, with the same derivation. Yet, they instead apply it to introduced gates φi, which in turn are defined on top of the batch-norm layers:

$$x_{\text{out}} = \phi \cdot \left( \gamma \, \frac{x_{\text{in}} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \right), \qquad sc(\phi_i) = \frac{\partial L(\mathcal{D} \mid \theta)}{\partial \phi_i} \, \phi_i$$

They differ in that they update the gates φ for each batch-normalisation layer, and then calculate the change in loss under a change in these gates [You et al., 2019]. They then apply this sensitivity pruning of batch-norm scaling factors to pre-trained models in an iterative pruning algorithm called 'Tick-Tock', where they first (a) train the gates φ in isolation with an additional ℓ1-penalty, and then (b) prune and fine-tune the network, repeated in an iterative fashion. This method is coined the 'GateDecorator' [You et al., 2019].
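Computing such gate sensitivities is straightforward once the gates are explicit parameters. The following is a minimal sketch, assuming `gates` is a list of per-channel gate tensors inserted after each batch-norm layer; it is not the GateDecorator reference code and omits the Tick-Tock schedule and the ℓ1-penalty:

```python
import torch

def gate_sensitivities(loss, gates):
    """sc(phi_i) = (dL/dphi_i) * phi_i, computed for every gate tensor at once;
    channels with the smallest |score| are candidates for structured pruning."""
    grads = torch.autograd.grad(loss, gates)
    return [(g * phi.detach()).abs() for g, phi in zip(grads, gates)]
```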

Recent Works Currently, single-shot methods for structured pruning are being introduced that prune once on pre-trained models [Li et al., 2020, Yu and Huang, 2019, Li et al., 2019a]. Additionally, in Liu et al. [2019] the reinforcement-learning agent from AMC [He et al., 2018] is replaced by a heuristic algorithm, and the alternating direction method of multipliers (ADMM) is adopted for structured pruning [Liu et al., 2019]. This is further improved by Li et al. [2020], who convert the previous method into a single-shot post-training algorithm by utilising soft constraints in ADMM [Li et al., 2020].

2.2.6 Criticism on Pruning

To put things in perspective, the progress and efficacy of pruning methods have been called into question recently. First, the magnitude-based structured methods were criticised in Ye et al. [2018] and Liu et al. [2019]. In the former, the assumption that magnitude is an effective saliency criterion is disputed. They show that, because of the scaling between subsequent layers in a deep network, the network can always find a way to satisfy sparsity-inducing penalties without those penalties actually helping to achieve sparsity [Ye et al., 2018]. In the latter, it is demonstrated that (a) the assumption that specific weight values at initialisation or pruning time are important is unfounded, and (b) that we don't need to pre-train a large overparametrised network but can just train a pruned network from scratch and get the same performance [Liu et al., 2019].

Additionally, in Gale et al. [2019] it is demonstrated that pruning methods often perform inconsistently, and the authors call for better standardisation of tests and benchmarks. More recently, in Blalock et al. [2020], the lack of common benchmarks is once again brought up. They argue that the reporting of papers in the field is often flawed to the point that no valid conclusions can be drawn. Specifically, they indicate that for a large body of research not enough architecture-dataset pairs are used (and sometimes deliberately trivial ones), reproducibility is poor, results are not statistically significant, no control experiments are performed, and the comparison to state-of-the-art, as well as older methods, is largely ignored [Blalock et al., 2020].

2.3 Related Work

This section reflects on the preceding sections and positions this thesis with respect to the literature and definitions that were discussed. Subsequently, an overview of the to-be-introduced methods, together with the chosen baselines, is displayed in Table 1.

1. Firstly, this thesis is focused on pruning before or during training. It therefore distinguishes itself from most of the works introduced in Sections 2.2.1, 2.2.2 and 2.2.3, of which some prune during training but most prune after training. In this regard, it rather builds on the works from Section 2.2.4.

2. Secondly, it extends the works on the 'Sensitivity Criterion' [Mozer and Smolensky, 1989, Lee et al., 2019, Molchanov et al., 2016, You et al., 2019, Wang et al., 2020, Lee et al., 2020, Tanaka et al., 2020]. Contrary to Wang et al. [2020], Hayou et al. [2020] and Tanaka et al. [2020], this thesis doesn't disregard the criterion but instead applies it iteratively to solve its teething problems - i.e. sensitivity to layer scale and disconnection [Lee et al., 2020, Wang et al., 2020, Hayou et al., 2020] - and does so without adding much complexity.

3. Thirdly, it closely relates to two other very recent concurrent methods that also advocate iterative pruning before training: WoodFisher [Singh and Alistarh, 2020] and SynFlow [Tanaka et al., 2020]. The latter also offers a very similar hypothesis and explanation as to why this is an effective solution as is introduced in this thesis. However, they arrive there in a more theoretical way, whereas this thesis does so empirically; the works therefore complement each other. Both mentioned works were published simultaneously with this thesis. As such, the content of this thesis was produced prior to these concurrent methods, and their work is hence - although referred to and reflected upon - not part of the main body and at most used as a baseline.

4. Fourthly, this thesis extends the Sensitivity Criterion into a structured counterpart, thereby building on the works introduced in Section 2.2.5. Yet, unlike all those methods, it operates before training. This is, to the best of the writer's knowledge at the time of writing, a novel contribution.

5. Finally, this thesis connects the existing literature on the different derivations of the Sensitivity Criterion [Mozer and Smolensky, 1989, Molchanov et al., 2016, You et al., 2019, Lee et al., 2019, 2020, Hayou et al., 2020] to a framework of function elasticity [Sydsaeter and Hammond, 1995, Zelenyuk, 2013]. Another very recent method that also connects these approaches under one umbrella is WoodFisher, which generalises them as forms of second-order approximation [Singh and Alistarh, 2020]. Their work is, as mentioned before, simultaneous research to this thesis and is hence not part of the main body.

It is also informative to relate the baselines to the methods introduced in this thesis, since this provides perspective on how to compare them. Because of the wide variety of domains that are considered (see Section 2.1 and Table 1), there is also a wide variety of baselines to choose from. As such, it is sensible to choose closely related ones. Therefore, in the unstructured domain, SNIP [Lee et al., 2019], GraSP [Wang et al., 2020] and SynFlow [Tanaka et al., 2020] are chosen, because they are the most closely related baselines regarding research on the Sensitivity Criterion. Additionally, IMP [Frankle and Carbin, 2019, Frankle et al., 2019b] and HoyerSquare [Yang et al., 2020] are added, in order to also have baselines from the more distant areas of magnitude pruning (Section 2.2.1) and sparsity-inducing priors (Section 2.2.3).

For the same reasons, in structured pruning the comparison is made against GateDecorators [You et al., 2019], as it is the most recent structured method that uses a Sensitivity Criterion. Moreover, EfficientConvNets [Li et al., 2016], ℓ0-regularisation [Louizos et al., 2018] and Group-HS [Yang et al., 2020] are chosen as extra baselines, again to provide a different perspective from the mentioned research areas.

Table 1: Pruning domain with the baselines and methods introduced in this thesis - the latter are denoted with a '∗'.

Domain           | Structured                                                        | Unstructured                                        | Combined
Before Training  | SNAP-it∗                                                          | SNIP (2019), GraSP (2020), SynFlow (2020), SNIP-it∗ | CNIP-it∗
During Training  | ℓ0-Regularisation (2018)                                          | IMP (2019, 2019b), SNIP-it∗                         | CNIP-it∗
After Training   | EfficientConvNets (2016), Group-HS (2020), GateDecorators (2019)  |                                                     |
