
MSc Artificial Intelligence

Master Thesis

Efficient Coding for Learned Source Compression

by

Putri A. van der Linden

10768017

April 28, 2020

48 EC January 2019 - March 2020

Supervisor:

Karen Ullrich

Assessor:

Wilker Ferreira Aziz


Abstract

Motivated by the Minimum Description Length principle, in this work we propose the Sparse Bayesian Variational Autoencoder (SBVAE) in order to study sparsity in generative models. We show that this is possible to a limited extent, achieving up to 27% sparsity in the encoder and decoder models without significant loss of reconstruction performance. We investigated the effects of sparsity on the expression and behaviour of the latent space and observed no significantly deviating behaviour compared to non-sparse baselines, other than a marginal loss of precision in the latent space.


Contents

1 Introduction
2 Preliminaries and Background
2.1 Models and Hypotheses
2.2 Bayesian Inference
2.2.1 Approximate Bayesian Inference
2.2.2 Variational Inference
2.3 Information Theoretic Preliminaries
2.4 Source Coding
2.5 Rate-Distortion Theory
2.5.1 Quantization
2.6 The Minimum Description Length Principle
2.6.1 MDL for Neural Networks
2.6.2 Bits-Back Coding
2.7 Dropout
3 Related Work
3.1 Variational Autoencoders for Source Compression
3.1.1 Reparameterization Trick
3.1.2 Objective
3.2 Pruning
3.2.1 Variational Dropout for Function Compression
3.2.2 Structured Pruning
4 Joint Objective
4.1 Practical Objective by Monte-Carlo Sampling
5 Experiments
5.1 Performance and Compression Rates
5.2 Latent Space
5.2.1 Trajectories
5.2.2 Expressivity/Quantization of Latent Space
6 Conclusions and Discussion
A Network Architecture and Optimization
B.1 Reconstructions
B.2 Samples


List of Symbols

The following contains a list of symbols and functions that are commonly used in this document.

Abbreviations
ELBO  Evidence Lower Bound
MDL   Minimum Description Length
VAE   Variational Autoencoder

Functions
L(·)        ELBO objective
D_KL(·||·)  Kullback-Leibler divergence
log         Natural logarithm
log2        Base-2 logarithm
L(·)        Loss function
L_MC(·)     Monte-Carlo approximation of the loss function
Q(·)        Variational approximate distribution
q(·)        Variational approximate density

Information Theory
ℓ(·)        Codeword length of a codeword
ℓ_min(·)    Minimal code length
ℓ_C(·)      Average message length
C(·)        Symbol code for an image X and alphabet A; an injective function C : X → A*
H(·)        Entropy
H(·|·)      Conditional entropy
H_C(·, ·)   Cross-entropy
I(·; ·)     Mutual information

Probability Theory
θ           Model parameters
E[·]        Expected value
V[·]        Variance
D           Data set
N(·)        Gaussian distribution
P(·)        Powerset
X           Image of a random variable
H           Hypothesis
Ω           Sample space
P(·)        Discrete probability distribution
p(·)        Continuous probability density function

Symbols
φ           Variational parameters
w           Neural network weight


1 Introduction

Deep source compression methods have recently shown great successes in image and video compression [1, 27, 30, 32, 34]. However, a drawback is that they are often very costly in terms of compute, which, in the case of neural networks, usually entails the computation and storage of millions of parameters. This severely limits their application and deployment on lighter devices such as mobile phones.

In recent years, there has been a steady trend of neural network architectures becoming increasingly complex in terms of number of parameters and network depth. This trend is accompanied by a need for more powerful computational hardware and larger memory capacity. Indeed, as long as a neural network is sufficiently large with respect to the task at hand, it will often be possible to fit a dataset arbitrarily well, at the cost of an increased risk of overfitting. It has therefore become widely recognized that effective deployment of artificial neural networks will not only require increasingly better hardware, but also new methods for carefully constructing network architectures that limit arbitrary levels of complexity in order to prevent wasting compute [28, 29, 38, 33, 35].

The Minimum Description Length principle (MDL) provides an information-theoretic perspective on data compression and states that the best model of the data is the one that jointly minimizes the communication costs of both the model and the data misfit. Interpreting deep source compression methods under the MDL principle, it becomes apparent that they largely, if not solely, optimize for the data misfit costs, thereby neglecting the model cost.

If, for example, we use the objective of minimizing an error defined as the difference between the observed data and the corresponding prediction, then there is no penalty for the degree of complexity of the resulting network.

Motivated by the MDL principle, in the current research we derive an objective that seeks to incorporate model complexity into the loss function. This can naturally be achieved by employing Bayesian variational inference over the model parameters.

As such, our proposed method, the Sparse Bayesian Variational Autoencoder, forms an attempt to create an efficient and relatively cheap data compression model at test time.

An intuitive yet simple metric of model complexity for neural networks is their number of parameters. In general, neural networks are heavily overparameterized and hence contain a lot of redundancy in the weights [35]. Function compression methods attempting to tackle this redundancy adopt a Bayesian approach in order to learn a density over the model parameters [35, 28]. These methods generally incorporate variational inference to induce sparsity in the density estimates.

It is well established that variational inference has an MDL vindication and is directly related to the objective of minimizing communication costs of a model [35, 16, 38, 28, 29, 12]. As such, the MDL justification of variational inference provides an error measure not only for the data reconstruction but also for model complexity, which allows for data modelling while simultaneously constraining model redundancy.

Function compression using variational inference (i.e. Variational Dropout) has already been shown to reach significant amounts of sparsity in discriminative models [35, 28]. Classification tasks are quite tolerant to small perturbations in the predicted output distributions over the classes; as long as the correct class is given the majority of the predicted probability mass, the prediction is correct. Reconstruction tasks, however, are much less lenient towards these perturbations, since the intent is to model the input exactly. We note that, in general, sparsity in generative models is still poorly understood.


In this work, we aim to provide some initial insights on sparsity in generative models, thereby hoping to open up the field for compressed reconstruction models. Specifically, we will study the level of model compression that can be reached in reconstruction tasks by investigating the rate-distortion trade-off between model sparsity and reconstruction performance. Furthermore, we study the effects of model sparsity on the expression of the latent space: by pruning a network, we are effectively removing degrees of freedom from the model, thereby reducing its expressivity and flexibility. We hypothesize that this will be observable through reduced expressivity in the latent space, implying that points will be mapped to increasingly simpler subareas of the latent space.

• We propose the Sparse Bayesian Variational Autoencoder (SBVAE), which is able to efficiently compress images end-to-end.

• We show that, using this method, we can sparsify encoder and decoder models up to 27 percent.

• We observe that sparsity does not significantly influence the generative behaviour of VAEs.

• Finally, we have found some evidence that incorporating model complexity into generative


2 Preliminaries and Background

We assume the reader is familiar with basic concepts of probability theory and information theory, such as expected values, Bayes’ theorem, discrete and continuous marginal and conditional densities, and messages and codes in the context of information theory. The current section will reiterate some important concepts from these areas that are fundamental within this project.

2.1 Models and Hypotheses

In this work we adopt the definitions from [11] for hypothesis and model. Specifically, a simple hypothesis refers to a single probability function, and model refers to the set of probability functions with the same functional form. A simple hypothesis can thus be seen as an instance of a model. The general term hypothesis refers to both simple hypotheses and models.

2.2 Bayesian Inference

One of the main objectives for machine learning tasks is to find statistical models of some observed dataset with which it is possible to make accurate predictions about new datapoints. This is generally done by fitting a complex function to an observed training dataset while making sure the function generalizes to new datapoints as well.

Bayesian inference expresses the belief that a given hypothesis explains the observed data. Specifically, consider a dataset D and a hypothesis for the data in the form of a density p parameterized by θ. Through Bayes' theorem, probabilities can be interpreted in terms of an agent's prior belief about a model, p(θ), the likelihood of the observations under the model, p(D|θ), and the evidence or marginal likelihood of the model, p(D) [2].

$$\underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} \;=\; \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta)\, p(\theta)\, \mathrm{d}\theta} \;=\; \frac{\overbrace{p(\mathcal{D} \mid \theta)}^{\text{likelihood}} \;\; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathcal{D})}_{\text{evidence}}}$$

The prior represents the belief in the model before observing the evidence. The likelihood represents the probability that the current model generated the observed data. The evidence normalizes the density and thereby ensures that the posterior is a valid probability density.

When iteratively performing Bayesian inference on new data, the posterior gets adjusted to fit the observed data. This leads to better generalization as the size of the data increases [2].


2.2.1 Approximate Bayesian Inference

In practice, exact Bayesian inference is often infeasible for sufficiently complex functions. For example, consider the following probabilistic model: let D be an observed dataset of which we wish to model the density. Assume the samples x_i ∼ p(x) are generated by some latent variables z_i such that the marginal density is defined as

$$p(x) = \int p(x \mid z)\, p(z)\, \mathrm{d}z$$

(Fig. 1). The marginal requires integration over all possible values of z, which quickly becomes intractable if z is a high-dimensional multivariate random variable.

A range of approximate inference techniques have been proposed that enable computationally efficient approximation methods for learning densities. The current section will discuss Variational Bayesian Inference.

Figure 1: Graphical model for the latent variable model described above.


2.2.2 Variational Inference

Variational Inference is an approximate inference technique that uses optimization to find a computationally tractable approximate distribution over latent variables [3, 2]. This is done by introducing an approximate density q from a predefined family of distributions. Here, we exploit the efficiency acquired through the choice of the approximate parametric family: the goal is to choose the parametric family such that it is expressive enough to closely approximate the true distribution, but constrained enough for efficient traversal of the parameter space. Then, instead of maximizing the marginal likelihood directly, we maximize a lower bound on the evidence (ELBO) by minimizing the KL-divergence between the approximate and the unknown true distribution, while simultaneously optimizing for the reconstruction performance.

The Evidence Lower Bound has various equivalent derivations which allow for different interpretations of the optimization process. In this document we will give two of the most common vindications of variational inference and their corresponding interpretations.

One of the first derivations emerged as a consequence of Jensen’s inequality with respect to an expectation over a logarithmic function. The derivation is as follows:

$$\begin{aligned}
\log p(x) &= \log \int p(x \mid z)\, p(z)\, \mathrm{d}z = \log \int p(x \mid z)\, p(z)\, \frac{q(z)}{q(z)}\, \mathrm{d}z && \text{(introduce approximate distribution } q\text{)} \\
&= \log \int q(z)\, \frac{p(x \mid z)\, p(z)}{q(z)}\, \mathrm{d}z \;\geq\; \int q(z) \log \frac{p(x \mid z)\, p(z)}{q(z)}\, \mathrm{d}z && \text{(by Jensen's inequality)} \\
&= \int q(z) \big[ \log p(x \mid z) + \log p(z) - \log q(z) \big]\, \mathrm{d}z \\
&= \underbrace{\mathbb{E}_{q(z)}[\log p(x \mid z)]}_{\text{reconstruction loss}} \;-\; \underbrace{D_{KL}\big(q(z) \,\|\, p(z)\big)}_{\text{regularization loss}} = \mathcal{L}.
\end{aligned}$$

This yields the lower bound L on the evidence. Since the evidence is independent of q, maximizing the lower bound with respect to q leads to a better approximation of the log marginal, where equality holds iff q(z) equals the true posterior p(z|x).

Another common derivation emerges from the minimization of the KL-divergence between an approximate posterior and the intractable true posterior. Here, the lower bound follows from non-negativity of the KL-divergence. The derivation is as follows:

$$q^*(z) = \operatorname*{arg\,min}_{q(z)}\; D_{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$

This KL-term is intractable due to its dependence on the intractable marginal p(x):

$$D_{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q(z)}[\log q(z)] - \mathbb{E}_{q(z)}[\log p(z \mid x)].$$


However, the foregoing expression can be rewritten as follows:

$$\log p(x) = D_{KL}\big(q(z) \,\|\, p(z \mid x)\big) + \mathbb{E}_{q(z)}[\log p(x \mid z)] - D_{KL}\big(q(z) \,\|\, p(z)\big).$$

Since the KL-divergence is non-negative, it follows that:

$$\log p(x) \;\geq\; \mathbb{E}_{q(z)}[\log p(x \mid z)] - D_{KL}\big(q(z) \,\|\, p(z)\big) = \mathcal{L}.$$

In practice, this objective with an unconditional approximation q(z) is rarely used because it is inefficient: the approximation does not scale well with the dimensionality of z, and we would need an infeasibly large number of samples to approximate it accurately. Therefore, an approximate posterior q(z|x) is generally used instead:

$$\log p(x) \;\geq\; \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}\big(q(z \mid x) \,\|\, p(z)\big).$$

This objective pushes the prior towards the latent values that explain the observed data. Furthermore, it only requires learning subareas q(z|x) of the prior due to optimization of the reverse KL-divergence (see Section 2.3). Hence, the density can be approximated more efficiently using fewer samples, see Fig. 2.

Figure 2: Two-dimensional latent samples and prior and posterior contours. The posterior requires learning a subarea of the prior, and hence needs fewer samples to estimate it accurately. Image taken from https://uvadlc.github.io/


2.3 Information Theoretic Preliminaries

In the current section we reiterate some concepts from information theory that are fundamental within this project. For extensive definitions and proofs, we refer the reader to [4]. Please note the following:

• All quantities in the following are expressed in bits, which is a primary unit for measuring amounts of information based on the binary logarithm. Similarly, it is possible to express information quantities in nats; note that in any scenario a conversion between bits and nats is possible using a simple change of logarithmic base. For notational ease, we omit the base specification for the natural logarithm and leave it implicit.

• We provide definitions for the case of discrete random variables. In most cases these definitions extend naturally to the continuous case by integrating over the continuous random variable rather than summing over its discrete counterpart.

• Note that for any event that has zero probability, its surprisal value contains a term log(0) and is therefore technically undefined, and consequently so is its entropy. For these cases the convention 0 · log(0) = 0 is generally used; this report complies with that convention.

• Probability distributions will often be subscripted by their corresponding random variable (i.e. P_X). We leave the subscript implicit in cases where there is no ambiguity about the random variable in question.

• Similarly, the entropy will sometimes be subscripted by the distribution of the corresponding random variable (i.e. H_P) in cases where there might be ambiguity.

Surprisal Value

The surprisal value expresses the (un)likelihood of an event in units of information. Let Ω be a sample space and P(Ω) its powerset. For an event A ∈ P(Ω) with probability P (A) the surprisal value is given by the log reciprocal of the occurrence probability:

$$\log_2 \frac{1}{P(A)}.$$

Note that the surprisal value thus assigns more information to events that are less likely to occur and vice versa. This property is a core concept in compression and communication: if we seek to communicate information in the most effective and shortest way possible, we can assign the shortest codewords to the values that are most likely to occur, and only use longer codewords on the few occasions that unlikely values occur.

Shannon Entropy & Differential Entropy

Let X be a discrete random variable with image X and probability distribution P . For discrete random variables the entropy can be expressed in terms of the Shannon entropy H(X):

$$H(X) := -\sum_{x \in \mathcal{X}} P(x) \log_2 P(x) = \sum_{x \in \mathcal{X}} P(x) \underbrace{\log_2 \frac{1}{P(x)}}_{\text{surprisal value}},$$


It is then straightforward to see that the Shannon entropy can be interpreted as the expected surprisal value under a distribution:

$$H(X) = \mathbb{E}\left[\log_2 \frac{1}{P(X)}\right].$$

The Shannon entropy has the following lower and upper bounds: it is minimized if X is a constant random variable with P(X = c) = 1 for some event c and 0 otherwise; this minimizes the surprisal value and yields a Shannon entropy of 0. Conversely, the Shannon entropy is maximized if all events are equally likely, i.e. the probability distribution is uniform, causing maximum uncertainty. The Shannon entropy is therefore upper bounded by log2(|X|):

$$0 \;\leq\; H(X) \;\leq\; \log_2(|\mathcal{X}|).$$

For a continuous random variable X with density function p and support set X, the differential entropy h(X) will be used throughout this document and is defined by:

$$h(X) := -\int_{\mathcal{X}} p(x) \log_2 p(x)\, \mathrm{d}x.$$
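To make these definitions concrete, the following small numerical sketch (added here for illustration, not part of the original text; it assumes only NumPy) computes the Shannon entropy of a discrete distribution in bits and checks the bounds stated above:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution p (array of probabilities)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -np.sum(nz * np.log2(nz))

p = np.array([0.5, 0.25, 0.125, 0.125])
H = shannon_entropy(p)                 # = 1.75 bits
print(H)

# Bounds: 0 <= H(X) <= log2(|X|)
assert 0.0 <= H <= np.log2(len(p))
print(shannon_entropy([1.0, 0.0, 0.0]))      # degenerate source: 0 bits
print(shannon_entropy(np.ones(4) / 4))       # uniform source: log2(4) = 2 bits
```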

Codes and Code Lengths

A code C for a distribution P with image X is an injective function C : X → A*, where A denotes a finite alphabet and ·* denotes the Kleene star. A code is thus a function that maps samples from P to strings over A, which are referred to as the codewords.

Let C be a code for a source with probability distribution P such that every element x ∈ X of the image is mapped to a codeword C(x). We denote the length of a codeword by ℓ(C(x)). The average code length for a code C is then defined as:

$$\ell_C(P) := \mathbb{E}[\ell(C(X))].$$

Let $\mathcal{C}$ be the code space for a source with probability distribution P. The minimal code length ℓ_min(P) is the average length under a code C ∈ $\mathcal{C}$ that minimizes the average message length for a given source:

$$\ell_{\min}(P) := \min_{C \in \mathcal{C}} \ell_C(P).$$
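As a small illustration of these quantities (added here, not part of the original text; the source distribution and prefix code are made up and the sketch assumes only NumPy), the average code length of a binary prefix code can be computed and compared against the entropy of the source:

```python
import numpy as np

# Source distribution and a prefix (Huffman-style) code over the alphabet {0, 1}.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Average code length  l_C(P) = E[ l(C(X)) ]
avg_len = sum(p[x] * len(code[x]) for x in p)

# Entropy of the source in bits, the lower bound on the minimal code length.
entropy = -sum(px * np.log2(px) for px in p.values())

print(avg_len, entropy)   # 1.75 1.75 -> this code is optimal for this source
```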

Information Inequality

The information inequality states that for any source P, and any other distribution Q such that Q ≠ P, the following inequality holds:

$$\mathbb{E}_P[-\log_2 Q(X)] \;>\; \mathbb{E}_P[-\log_2 P(X)].$$

That is, the expected surprisal value under any distribution different from the true one is always higher than when using the true distribution P .


Shannon’s Source-Coding Theorem and Optimal Codes

Shannon's Source-Coding Theorem provides theoretical bounds on the minimal code length that can be achieved in a lossless compression setting. For any source P there exists a code whose minimal code length satisfies the following bound:

$$H(X) \;\leq\; \ell_{\min}(P) \;\leq\; H(X) + 1.$$

It follows that for any source P, the codeword length assigned to a value x under an optimal code is essentially its surprisal value:

$$-\log_2 P(x).$$

A code for which the average length satisfies these bounds is said to be an optimal code.

Cross Entropy

The cross entropy expresses the amount of information for a source with true distribution P if we assume the random variable X to be distributed according to probability distribution Q:

$$H_C(P, Q) := -\sum_{x \in \mathcal{X}} P(x) \log_2 Q(x).$$

In a coding setting, it expresses the average number of bits needed when coding source P using a code that is optimal for Q. From the information inequality (Section 2.3), it can easily be seen that this quantity is minimized when P = Q.

Kullback-Leibler Divergence

The Kullback-Leibler divergence (KL-divergence) is an asymmetric similarity measure for a distribution Q with respect to a reference distribution P. It is defined as:

$$D_{KL}(P \,\|\, Q) := -\sum_{x \in \mathcal{X}} P(x) \log_2 \frac{Q(x)}{P(x)}.$$

It is the expected logarithmic difference between P and Q if we take P as the reference distribution. In a coding setting, it is the expected excess cost incurred when coding source P using a code that is optimal for Q:

$$D_{KL}(P \,\|\, Q) = H_C(P, Q) - H_P(X).$$

This quantity too is minimized when P = Q.

The KL-divergence is a weighted logarithmic difference between two distributions. Note that the weighting is determined by the reference distribution; the asymmetry of the divergence clearly follows when evaluating values where either P (x) = 0 or Q(x) = 0. Consequently, the asymmetry yields two different behaviours when minimizing the divergence, which are referred to as the forward and reverse KL-divergence.

Forward KL-divergence. Minimizing the forward KL, D_KL(P||Q), is said to be zero-avoiding, since it ignores the divergence between P and Q in regions where Q places mass but P does not, while heavily penalizing regions where P places mass and Q does not. The forward KL therefore pushes the approximate Q towards covering the support of P.


Reverse KL-divergence. The reverse KL, D_KL(Q||P), is said to be zero-forcing. It only penalizes values in the support of Q, and hence pushes the approximate distribution to accurately model P where the supports of P and Q overlap.

Hence the reverse KL focuses more on accurately approximating subareas of P, while the forward KL focuses more on covering the entire support of P with probability mass.
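To make the asymmetry concrete, the following sketch (added for illustration, not part of the original text; the two distributions are made up and only NumPy is assumed) evaluates both directions of the KL-divergence for a pair of discrete distributions and verifies the cross-entropy identity above:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                         # convention: 0 * log(0/q) = 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# P spreads its mass over three values, Q nearly ignores the third value.
P = np.array([0.5, 0.4, 0.1])
Q = np.array([0.6, 0.39, 0.01])

print(kl(P, Q))   # forward KL: larger, penalizes Q for missing mass where P has support
print(kl(Q, P))   # reverse KL: smaller, only evaluated where Q places mass

# The identity D_KL(P||Q) = H_C(P, Q) - H_P(X) from the text:
H_P  = -np.sum(P * np.log2(P))
H_PQ = -np.sum(P * np.log2(Q))
assert np.isclose(kl(P, Q), H_PQ - H_P)
```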

Mutual Information and Conditional Entropy

The Mutual Information between two random variables X, Y with corresponding supports X , Y, respectively, is a symmetric measure of the information we gain about X if we know Y and vice versa.

It is defined as:

$$I(X; Y) := H(X) - H(X \mid Y),$$

where H(X|Y) is the conditional entropy:

$$H(X \mid Y) := \sum_{y \in \mathcal{Y}} P(Y = y) \cdot \Big( -\sum_{x \in \mathcal{X}} P(X = x \mid Y = y) \log_2 P(X = x \mid Y = y) \Big).$$

The conditional entropy quantifies the expected uncertainty about X when knowing Y .

Discrete Communication Channel

A (discrete) channel is a tuple (X, P_{Y|X}, Y) consisting of finite input and output alphabets X and Y, respectively, and a conditional probability distribution P_{Y|X} which specifies the transition probabilities between X and Y. Thus, P_{Y|X}(y|x) defines the probability of receiving y given that input x was sent over the channel.

(M, n)-Code

An (M, n)-code for a channel (X, P_{Y|X}, Y) encapsulates an index set [M] = {1, 2, ..., M} enumerating all possible input messages, a (possibly probabilistic) injective encoding function and a decoding function:

$$\mathrm{enc} : \{1, 2, \ldots, M\} \to \mathcal{X}^n,$$

where n represents the number of channel uses necessary to send a single input message. The encoding function maps elements of the index set to elements of the codebook, which is the set of all possible codewords {enc(1), enc(2), ..., enc(M)}. Furthermore, X^n denotes the set of all possible messages of length n consisting of elements from the finite alphabet X. Similarly, the decoding function maps from the output alphabet to the index set:

$$\mathrm{dec} : \mathcal{Y}^n \to \{1, 2, \ldots, M\}.$$

An (M, n)-code thus encodes M possible inputs into messages of n channel symbols; or equivalently, a (2^k, n)-code maps input messages of k bits to messages of n channel symbols.


Rate

The rate of a code defines the amount of information transmitted per channel use. For an (M, n)-code, the rate is defined as:

$$R := \frac{\log_2 M}{n}.$$

Distortion

The distortion between a variable x and its reconstruction x̂ is a non-negative measure of the dissimilarity between the two. Hence, the distortion function d is a function d : X × X → R+. Many possible distance functions can be used as the distortion function (e.g. Hamming distance, Itakura-Saito distance [4]); here we will use the mean squared error (MSE) as a distortion function. For a single value x and its reconstruction x̂, the MSE distortion is defined as follows:

$$d(x, \hat{x}) := (x - \hat{x})^2.$$

Distortion is more generally applied on a symbol-by-symbol basis between sequences. Let x^n denote a sequence of length n formed by concatenating elements of a finite alphabet. Then the distortion applied to sequences is defined as follows:

$$d(x^n, \hat{x}^n) := \frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{x}_i).$$


2.4 Source Coding

Source coding is the act of encoding source values such that they can be reliably decoded; its objective is to find encoding and decoding functions for a source so as to retain most information about the source in the process (Fig. 3).

Figure 3: Communication setup. Image taken from https://github.com/cschaffner/InformationTheory/

Let P_X be a source for which we wish to encode and decode the samples x ∼ P_X. A source coding setup for this setting is then to find some encoding and decoding functions, enc(·, P_X) and dec(·, P_X) respectively, such that

$$m = \mathrm{enc}(x, P_X), \qquad \hat{x} = \mathrm{dec}(m, P_X).$$

One of the main and arguably most interesting use cases for source coding is compression; if we constrain the encodings to contain less information than the source, then the encoding forms a compressed representation of the source.

Here we make a distinction between lossless and lossy compression. Lossless compression, as the name suggests, requires that no information is lost in the coding scheme; upon decoding, the source can always be retrieved exactly, implying that we require x̂ = x. In lossy compression, we do allow some loss of information. Examples of lossless coding methods include Huffman coding [19] and Golomb coding [9]. An example of a lossy compression scheme is an auto-encoding framework [17].

Stochastic Source Coding

In source coding, the encoding function is an injective function which ensures unique decodability of the code. However, in [7] it is shown that it is possible to efficiently encode a source using a one-to-many code, even though the average code length under such codes is often longer than that of one-to-one codes. What can be gained through the excess costs is model uncertainty, which allows us to do model selection.


Figure 4: One-to-one (a) and one-to-many (b) codes. In (b) a message is chosen by using a stream of random bits to sample from the posterior. Clearly, for any non-degenerate distribution, (b) has a higher expected communication cost compared to that of (a). Also note that both methods are equivalent if (b) contains only degenerate posteriors that are maximized at the optimal values.

The one-to-many code introduces stochasticity by requiring a stream of random bits to sample codewords from a prior over possible messages. The random bits get absorbed into the message and thus, under such a model, the average message length generally increases (Fig. 4).

The authors of [7] describe a communication scheme through which they demonstrate how it is possible to get back some of the excess cost such that, for the purpose of communicating the data, the cost is no higher than when using one-to-one codes. In addition, however, we gain knowledge about the model through the information hidden in the extra costs. As a concrete example, assume the source p(x) to be a Gaussian mixture model, whose values we will code using different Gaussian likelihoods (Fig. 5). Sampling a model with which to code a source value will often be less optimal than always using the likelihood that minimizes the communication costs of the value.


Figure 5: Mixture-of-Gaussians marginal. The dashed line represents a source value x0. It is clear that source values near x0 can be communicated quite cheaply using G5 or G6. If we had instead used G1, the description length would be much longer. Sampling a Gaussian to encode x0 with will often be less optimal than just using G5 for source values near x0 all the time.

The intuition is the following: if it is possible to find subsets of the source that are efficiently encoded using the same likelihoods, then we can efficiently communicate the source if we first communicate which subset the source value belongs to and then communicate the value using the efficient likelihood for that subset. A common way to find such subsets is to introduce latent variables i such that every source value forms a pair (x, i). The source is thus assumed to factorize as follows:

$$p(x) = \sum_i p(x, i) = \sum_i p(x \mid i)\, p(i).$$

Section 2.6.2 will describe how it is possible to find these subsets and how to uncover the information about the model hidden in the extra costs using the bits-back coding scheme.


2.5 Rate-Distortion Theory

Continuous sources can emit infinite precision samples. Indeed, when using finite representations it is impossible to reconstruct such samples exactly, hence we have to allow some distortion in order to describe them [4].

Given that exact representation is impossible, rate-distortion theory formalizes the following questions: How well can one represent a source given a finite amount of information, and conversely, how much information would one need to describe a source given that we would only allow at most some fixed level of distortion?

Rate-distortion theory formalizes this trade-off by providing the theoretical bounds of the rate at which a source can be compressed without exceeding some performance or error threshold on the reconstruction. The rate R quantifies the amount of information needed to express a data sample, and it is often represented as a function of the level of data misfit or distortion that we allow.

Figure 6: Example Rate-Distortion curve of a Gaussian source. The gray area represents all achievable (R, D) pairs for the given Gaussian source. The RD-curve represents the best compression-reconstruction performance possible for a given source. In contrast, all (R, D) pairs in the gray region are suboptimal in the sense that for any given pair, either a better rate or distortion can be achieved without hurting the other.

We turn to the setting of rate-distortion theory. Let x^n ∈ X^n be a sequence of length n formed by concatenating elements of a finite alphabet X. A (2^{nR}, n)-rate-distortion code consists of encoding and decoding functions such that:

$$\mathrm{enc}_n : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\},$$
$$\mathrm{dec}_n : \{1, 2, \ldots, 2^{nR}\} \to \hat{\mathcal{X}}^n.$$

Thus, in a rate-distortion setting, all possible different input messages X^n must be represented by at most 2^{nR} different representations. Note that enc_n can therefore not be injective if |X^n| > 2^{nR}.

We can then get a notion of the expected distortion for a given (2^{nR}, n)-rate-distortion code. D denotes the expected distortion applied element-wise to the sequence, and is defined as follows:

$$D = \sum_{x^n \in \mathcal{X}^n} p(x^n) \cdot d\big(x^n, \mathrm{dec}_n(\mathrm{enc}_n(x^n))\big) = \mathbb{E}\Big[ d\big(X^n, \mathrm{dec}_n(\mathrm{enc}_n(X^n))\big) \Big].$$


A rate-distortion pair (R, D) is achievable if (2^{nR}, n)-rate-distortion codes exist whose encoding and decoding functions satisfy [4]:

$$\lim_{n \to \infty} \mathbb{E}\Big[ d\big(X^n, \mathrm{dec}_n(\mathrm{enc}_n(X^n))\big) \Big] \leq D.$$

That is, (2^{nR}, n)-rate-distortion codes for which the expected distortion of increasingly longer sequences never exceeds D.

As a rough definition, the rate-distortion function R(D) is the set of achievable rates R that are minimal for given values of D (Fig. 6).

The current chapter will provide the intuition behind rate-distortion theory and its relation to quantization. Furthermore, we will build up the intuition by turning to the case of representing a single Gaussian random variable using a single bit.

2.5.1 Quantization

Quantization is the process of representing a (possibly continuous) set of input values by a smaller set of output values through a surjective function. Since the cardinality of the output values is smaller than that of the input, there is no one-to-one correspondence between the two, meaning that we cannot represent each input value exactly. Conversely, multiple different inputs might have to share the same representation. The goal is to find the output values such that they form the best possible representations of the input.

Rate-distortion theory is directly related to the objective of effective quantization: underlying both is the problem of representing possibly infinite-precision values using finite-precision representations. To provide some intuition, we closely follow the example from [4] of representing a zero-mean, fixed-variance Gaussian source x ∼ N(x; 0, σ²) using a single bit. Note that we can represent at most 2 values, and it is straightforward to see that the bit should encode whether x > 0 or x ≤ 0.


Assuming a squared-error distortion function, the function X̂ that minimizes the expected distortion corresponds to:

$$\hat{X}(x) = \begin{cases} \sqrt{\dfrac{2}{\pi}}\,\sigma & \text{if } x \geq 0, \\[4pt] -\sqrt{\dfrac{2}{\pi}}\,\sigma & \text{otherwise.} \end{cases}$$
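As a sanity check (an illustration added here, not from the original text; it assumes only NumPy), the following sketch verifies by Monte Carlo that ±σ·sqrt(2/π) is the conditional mean of each half of the Gaussian and that it minimizes the expected squared-error distortion among symmetric one-bit quantizers:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=1_000_000)   # samples from N(0, sigma^2)

# Conditional mean of the positive half-Gaussian: E[X | X >= 0] = sigma * sqrt(2/pi)
print(x[x >= 0].mean(), sigma * np.sqrt(2 / np.pi))

def distortion(a):
    """Expected squared error of the 1-bit quantizer mapping x to +a (x >= 0) or -a (x < 0)."""
    xhat = np.where(x >= 0, a, -a)
    return np.mean((x - xhat) ** 2)

# Among a grid of symmetric reproduction points, the minimizer is (approximately)
# the point closest to sigma * sqrt(2/pi).
candidates = np.linspace(0.5 * sigma, 1.5 * sigma, 11)
best = min(candidates, key=distortion)
print(best, sigma * np.sqrt(2 / np.pi), distortion(sigma * np.sqrt(2 / np.pi)))
```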

More generally, let us turn to the case of representing a single random variable X using R bits. Using R bits, we can construct 2^R different bitstreams and consequently represent at most 2^R different values. Let X̂ denote the function that maps X to its representation using at most R bits: X̂ : X → X̂. The set of possible representations X̂ is then referred to as the reproduction points.

A good quantizer should possess the following two properties [4]:

• For a given set of reproduction points X̂, the quantizer should map a source value x to the representation value x̂ ∈ X̂ that minimizes the distortion between the two. Each reconstruction point x̂ ∈ X̂ has an associated reconstruction region; together these regions form a Voronoi partition of the input support: all input values within a reconstruction region share the same reconstruction point as their representation.

• The reconstruction points should minimize the expected distortion for all values in their corresponding reconstruction regions.

The goal is to find the reproduction points and regions that minimize the expected distortion. The precise definition of the rate-distortion function is as follows:

$$R(D) = \min_{p(\hat{x} \mid x)\,:\; \sum_{(x, \hat{x})} p(\hat{x} \mid x)\, p(x)\, d(x, \hat{x}) \,\leq\, D} I(X; \hat{X}).$$

The rate-distortion function thus corresponds to the conditional distribution that minimizes the mutual information between the inputs and the reconstruction points, subject to the constraint that the expected distortion under that conditional does not exceed D. For more thorough proofs and definitions we refer the reader to [4].


2.6 The Minimum Description Length Principle

The current section will cover the Minimum Description Length principle. We will provide a predominantly conceptual/intuitive coverage of the subject matter and explain how it can be used in the context of deep learning. As such, it is not meant to be exhaustive.

The Minimum Description Length principle is a concept in model selection and communication that states that the least complex model that best explains the data is the best one to use [36, 11]. The principle is based on the idea that learning about data is related to finding regularities in the data. The amount of regularity in a dataset is subsequently related to the extent to which the data can be compressed. Hence, the MDL principle asserts that the model that is best able to compress the data has found most regularity in it, and is therefore its best explanation [11]. It can be seen as a direct application of Occam’s razor, which, more generally, states that for any problem the simplest explanation is usually the most likely one. MDL learning helps prevent overfitting by implicitly coupling model- and data complexity: given that two models explain the data equally well, MDL will give preference to the simpler one [11].

The remainder of this chapter will cover how MDL can be used for efficient density estimation, which is the setting in which MDL is used in this work. Under the MDL principle, density estimation is framed in a lossless communication setting, where it is furthermore argued that the best model is also the one that is cheapest to communicate in a communication scheme. Quantitatively, this can be interpreted as the model that requires the least amount of information to describe both the model itself and the data misfit achieved under that model. Let L(·) denote the amount of information used to express its argument, and H be the hypothesis space. The MDL principle then states that the best model H∗ ∈ H minimizes the total cost when first communicating the model and then using that model to retrieve the data misfit.

$$L(\mathcal{D}) = \min_{H \in \mathcal{H}} \Big( \underbrace{L(H)}_{\text{model cost}} + \overbrace{L(\mathcal{D} \mid H)}^{\text{misfit cost}} \Big)$$
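The following toy sketch (added for illustration, not part of the original text) shows this two-part trade-off in action for a hypothetical polynomial model class. The model cost of 32 bits per parameter and the known noise level are crude assumptions made purely for the example; it only requires NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 100)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.25, x.size)   # quadratic data + noise

def description_length(degree, bits_per_param=32, noise_std=0.25):
    """Two-part code length in bits: L(H) + L(D|H) for a polynomial hypothesis."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    model_cost = bits_per_param * (degree + 1)        # crude fixed-precision model cost
    # Misfit cost: negative log-likelihood of the residuals under N(0, noise_std^2), in bits.
    nll_nats = 0.5 * np.sum((resid / noise_std) ** 2) \
               + resid.size * np.log(noise_std * np.sqrt(2 * np.pi))
    return model_cost + nll_nats / np.log(2)

for d in range(6):
    print(d, round(description_length(d), 1))
# The total code length is typically minimized near degree 2: higher degrees barely
# reduce the misfit cost but keep paying the per-parameter model cost.
```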

The following sections will discuss a practical use case of MDL model selection in the context of neural networks.

2.6.1 MDL for Neural Networks

Suppose a generic discriminative setup where the objective is to predict some target variable Y given inputs X using a neural network f_θ, such that f_θ : X → Ŷ. Here, a model refers to the functional form or architecture of the neural network, denoted by f, and a simple hypothesis corresponds to a specific parameterization θ of the network, denoted by f_θ, that is, an instantiation of network weights.

Assume the architecture f and the inputs X to be known by both parties (and therefore they do not need to be incorporated into the communication costs). Accordingly, we define the model cost as the description length of the network parameters θ. The data misfit can then be defined as the discrepancy between the predictions of the simple hypothesis and the targets: $Y_\Delta = \{y_i - \hat{y}_i\}_{i=1}^{N}$, where $\hat{y}_i = f_\theta(x_i)$.

An MDL scheme for this setting would thus require communicating the weights θ (model cost) and the prediction errors Y_Δ (misfit cost) between a sender and a receiver. The sender sends both of these to the receiver.


Both Y_Δ and θ are assumed to be normally distributed around zero, and hence we will model them using zero-mean, fixed-variance Gaussians.

MDL for Discrete Random Variables

For a discrete source X and a hypothesis H in the form of a probability distribution P parameterized by a set of parameters θ, Shannon's Source Coding Theorem (Section 2.3) provides the cost for an optimal code [18]:

$$L(X) = L(\theta) + L(X \mid \theta) = -\log P(\theta) - \log P(X \mid \theta).$$

MDL for Continuous Random Variables

In accordance with Chapter 2.5.1, we require quantization of the source density in order to describe and communicate it. For sufficiently small bin sizes, a Gaussian density can in practice be well approximated by a Riemann integral [16].

Figure 8: Riemann approximation of a Gaussian error distribution with fixed bin size t. Courtesy of [16].

If the tolerances are sufficiently small compared to the first-order model moments (i.e. ε ≪ σ in the case of a Gaussian model), the true description lengths can be well approximated by

$$L(\theta) \approx -\log\big(\epsilon_\theta \cdot p(\theta)\big), \qquad L(X \mid \theta) \approx -\log\big(\epsilon_X \cdot p(X \mid \theta)\big),$$

where ε_θ and ε_X denote the quantization tolerances of the model and the data misfit, respectively. Minimizing these costs leads to mode matching of the true and approximate distributions. Furthermore, note that using these approximations, there is a clear trade-off between communication costs and the level of precision of the approximation: the smaller the tolerances ε_θ, ε_X, the better the approximations, but the larger the communication costs.

However, since the quantization tolerances are fixed in advance for all parameters, there is an implicit independence assumption between the model variances and the data misfit variances: if the tolerances are fixed in advance for all parameters (and they are small), then there might be some values for which the precision is unnecessarily high, resulting in unnecessarily large communication costs for these values (Fig. 9).


(a) Precise parameter, communicated precisely. (b) Imprecise parameter, communicated imprecisely.

(c) Precise parameter, communicated imprecisely. This is cheap to communicate, but causes extra costs in the data misfit.

(d) Imprecise parameter, communicated precisely. This is unnecessarily expensive to communicate.

Figure 9: Different cases of densities and approximate distributions for a single parameter θ with different relative quantization tolerances. Here, (a) and (b) are desirable. (c) is undesirable but gets optimized by minimizing the data misfit costs. (d) is the case where we unnecessarily waste bits.

We only wish to spend large communication costs on parameters that need to be communicated very precisely: it would be wasteful to spend a lot of bits describing an imprecise parameter very precisely.

Note that the width or precision of the true density expresses the importance of the parameter for modelling the data. A broad posterior over a parameter implies the weight is unimportant: it can be sampled from a very large range without harming model performance. On the other hand, a narrow posterior implies a precise parameter because the parameter can only take on a specific value, where only small deviations are tolerated. Communication would be much more efficient if it is possible to couple the parameter precision and the quantization tolerances, since they are, in practice, not independent.

The following section will touch upon Bits-back coding, a communication scheme that has a direct MDL vindication. The scheme quantifies the two-part message setup in terms of its number of bits, and provides an efficient method for communicating a source while gaining information about the model.


The bits-back argument asserts that we can argue about these individual tolerances if we evaluate the message lengths of these variables under a posterior and prior distribution.

2.6.2 Bits-Back Coding

The Bits-back principle states that communicating auxiliary bits might lengthen the communication costs for a dataset initially but can result in implicit model selection while still communicating the data optimally [25, 16]. The auxiliary bits can be retrieved from the message at no extra cost; if the auxiliary bits contain useful information, this side-information can be communicated nearly for free. The auxiliary bits are used as a source of stochasticity for sampling codewords.

The scheme is as follows: consider a sender that seeks to communicate a dataset X to a receiver according to an MDL scheme; this implies the two-part message setup where the sender first needs to communicate the model and then the data misfit achieved under that model. Let H be a hypothesis for source X parameterized by θ. The sender fits a posterior density p(θ|X) to the data and uses this to choose a value for the parameters θ*. However, by reasoning similar to that of Section 2.2.2, this posterior is intractable and we therefore instead assume that the sender has fitted an approximate posterior q(θ). The message then requires communication of θ* and the prediction error achieved under p(X|θ*).

The resulting average message length turns out to be an expectation under the approximate posterior:

$$\mathbb{E}_{q(\theta)}[L(X)] = \mathbb{E}_{q(\theta)}[L(\theta)] + \mathbb{E}_{q(\theta)}[L(X \mid \theta)].$$

By sampling the parameters we are communicating noisy weights. By doing so, we have to communicate a mean and variance per weight, which is more expensive to communicate than just sending fixed values.

However, the amount of random bits needed to sample a value carries information on the choice of model for the source. The prior over the parameters needs to be broad and precise enough such that it can communicate all possible parameter values. On the other hand, the posterior maps parameter values conditioned on the input to subsets/subareas of the prior due to optimization of the reverse KL-divergence between the approximate posterior and the prior. The amount of bits needed to sample from the posterior is equal to the entropy of that area.

Parameter values for which the posterior is already close to the prior are thought to be non-informative for explaining the observations. This implies p(θ) ≈ q(θ), which implies a weak dependence of the parameter on the source.

Furthermore, the stream of random bits used for sampling parameters is independent from the source and thus does not contribute to the description length of the source. The scheme shows how it is possible for the receiver to recover these excess bits from the message.


Figure 10: Bits-back communication scheme. Here enc(·, p) denotes an encoding function of value · using probability distribution p. Similarly, dec(m, p) denotes a decoding of encoding m using probability distribution p. The sender constructs the two-part message setup and communicates it using the shared knowledge. The receiver can decode the messages using the shared knowledge and retrieve the same fitted model as the sender.

The scheme is as follows. Assume the architecture f to be known by sender and receiver. Both parties have also agreed on a prior over the parameters p(θ). Note that the sender cannot encode the parameters using the approximate posterior q(θ) since the receiver does not have access to the source X and hence can not use this posterior to decode the message. Instead, the sender needs to encode the parameters θ according to their agreed upon prior p(θ). This scheme is visualized in Fig. 10.

Once the receiver has decoded the model and data misfits, they can reconstruct the source exactly: they can use the decoded model to produce model predictions, and they can retrieve the source by adding the data misfits to the predictions.

Now that the receiver knows X, they are able to learn the same posterior density q(θ) over the parameters as the sender. The receiver can now calculate the message lengths had the model been encoded using the posterior instead of the prior. By evaluating the discrepancy in length between these encodings, the receiver can retrieve which messages needed many bits to collapse the posterior.

The extra bits added in the sampling procedure only convey information about the model and do not contribute to the description of the data. The number of extra bits is equal to the entropy of θ under q:

$$H_q(\theta) = -\sum_{\theta} q(\theta) \log q(\theta).$$


Any arbitrary sequence of random bits could have been used for this procedure, and this sequence is independent of the source.

The receiver can retrieve the amount of random bits used to collapse the posterior for a specific weight by evaluating the discrepancy between the description length under the true posterior and the description length under the collapsed posterior. This amount corresponds to:

$$R = \mathbb{E}_{q(\theta)}[\log q(\theta)] - \mathbb{E}_{q(\theta)}[\log p(\theta)] = \int q(\theta) \log \frac{q(\theta)}{p(\theta)}\, \mathrm{d}\theta.$$

This yields the following expected average description length for the bits-back coding scheme:

$$\begin{aligned}
L(X) &= \mathbb{E}_{q(\theta)}[-\log p(X \mid \theta)] + \mathbb{E}_{q(\theta)}[-\log p(\theta)] - H_q(\theta) \\
&= -\sum_{\theta} q(\theta) \log \frac{p(X \mid \theta)\, p(\theta)}{q(\theta)} \\
&= D_{KL}\big(q(\theta) \,\|\, p(\theta)\big) - \mathbb{E}_{q(\theta)}[\log p(X \mid \theta)] \;=\; D_{KL}\big(q(\theta) \,\|\, p(\theta \mid X)\big) - \log p(X).
\end{aligned}$$

From the properties of the KL-divergence (Chapter 2.3), we know that this quantity is minimized if q(θ) = p(θ|X), so it follows that the best distribution to use is the true one.
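To make the bits-back argument concrete, the following toy sketch (added for illustration, not part of the original text; the distributions are made up and only NumPy is assumed) computes the expected description length of the scheme, in nats, for a three-valued parameter and shows that it is minimized, and equals −log p(X), exactly when q(θ) is the true posterior:

```python
import numpy as np

# Toy discrete model: 3 possible parameter values theta, a prior p(theta),
# and a likelihood p(X | theta) for one fixed observed dataset X.
p_theta = np.array([0.5, 0.3, 0.2])           # prior
p_X_given_theta = np.array([0.05, 0.6, 0.2])  # likelihood of the observed X under each theta

def bits_back_length(q):
    """Expected description length (in nats) of the bits-back scheme under sampling dist q."""
    model_cost  = np.sum(q * -np.log(p_theta))          # E_q[-log p(theta)]
    misfit_cost = np.sum(q * -np.log(p_X_given_theta))  # E_q[-log p(X|theta)]
    bits_back   = -np.sum(q * np.log(q))                # H_q(theta), refunded to the sender
    return model_cost + misfit_cost - bits_back

# True posterior p(theta | X) and the optimal description length -log p(X).
joint = p_theta * p_X_given_theta
p_X = joint.sum()
posterior = joint / p_X

for q in [np.array([1/3, 1/3, 1/3]), p_theta, posterior]:
    print(bits_back_length(q))
print(-np.log(p_X))   # the bits-back length equals -log p(X) exactly when q is the true posterior
```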


2.7 Dropout

The fact that neural networks are heavily overparameterized has long been known [ref]. The expressiveness of deep neural networks is one of the main features that allows them to model very complex scenarios, which is often where interesting real-world applications reside. However, in practice this expressiveness is often excessive with respect to the task at hand. The result of such over-expressiveness is that networks will have the capacity to memorize large portions of the training data rather than finding interesting generalizing features [16, 10].

Part of the variation in an observed dataset is a result of the stochasticity of the source: these relations emerge as a consequence of noise in the data [37]. Large networks are able to capture these relations, even though they do not represent features of the underlying distribution and are only specific to the training instances. Overfitting is therefore a threat that comes with the advantages of overly expressive networks.

One particularly successful and efficient attempt at preventing overfitting and reducing weight redundancy has been the introduction of dropout [15, 37]. In dropout, random neurons are dropped out of the training process in order to prevent co-adaptation of neurons. Specifically, sub-networks are sampled at train time by sampling a dropout mask from a Bernoulli distribution during training. The mask sets random weights to zero, therefore 'dropping' them from the training process.

Figure 11: Conceptual representation of generic feedforward (a) and dropout feedforward (b) training procedures. In dropout, thinned networks are sampled according to a Bernoulli prior. Courtesy of [15].

For a generic feedforward layer containing a weight matrix W ∈ R^{O×I}, input vector x, bias b and any element-wise activation function a(·), the output y corresponds to:

$$y = a(Wx + b).$$


The probability of dropping a weight is referred to as the dropout rate α. The mask is applied to the weights before calculating the activations, resulting in training with thinned networks (Fig. 11). In its most common form, the feedforward procedure corresponds to:

$$m \sim \mathrm{Bern}(p),$$
$$y = f\big(W (x \odot m) + b\big),$$

where ⊙ denotes element-wise multiplication. At inference time, no weights are dropped; instead, they are weighted by their respective dropout rates. Thus, the inference feedforward procedure corresponds to an expectation over the training procedure.
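The following NumPy sketch (added for illustration; it is not the thesis implementation, and the layer sizes, keep probability and ReLU activation are assumptions made for the example) contrasts the stochastic training-time forward pass with the deterministic inference-time pass in which the mask is replaced by its expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
I, O = 5, 3                      # input and output dimensions
W = rng.normal(size=(O, I))      # weight matrix W in R^{O x I}
b = np.zeros(O)
x = rng.normal(size=I)
p = 0.8                          # Bernoulli keep probability

relu = lambda a: np.maximum(a, 0.0)

# Training-time forward pass: sample a Bernoulli mask and thin the inputs.
m = rng.binomial(1, p, size=I)
y_train = relu(W @ (x * m) + b)

# Inference-time forward pass: no sampling; scale the inputs by the keep
# probability, i.e. use the expectation of the training-time mask.
y_test = relu(W @ (x * p) + b)

print(y_train, y_test)
```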

This method naturally led to the exploration of more carefully chosen dropout distributions, such as Gaussian dropout [39], where network weights are multiplied by a Gaussian noise mask sampled from m_ij ∼ N(1, α) at train time [35, 39]. In Gaussian dropout, the dropout rate α = (1 − p)/p is often used, which corresponds to the odds of dropping a unit under the Bernoulli scheme [39, 35]. In practice it was shown that adding multiplicative noise to layer inputs during training prevents overfitting [23].

This method suffers from an important limitation that resides in the choice of the dropout rates α: finding their optimal values generally entailed extensive hyperparameter search, which made optimizing for individual dropout rates per weight infeasible [35]. As an example, for binary dropout over K model parameters, optimizing for the dropout rate per parameter would require evaluating 2^K configurations of dropout rates [28, 21].

It was later shown that randomly dropping out weights during train time had a Bayesian justification as being mathematically related to Deep Gaussian Processes and Bayesian Regularization [8, 39, 23].

Ultimately, the dropout method was generalized by efficiently finding more flexible dropout distributions through the use of variational inference [23, 35], which avoids extensive hyperparameter search over the dropout rates and instead allows them to be learned during optimization. Chapter 3.2.1 will discuss Variational Dropout, which is the approach the current research adopts in order to model network complexity. The relation between training using variational dropout and the MDL principle has widely been made explicit [10].


3 Related Work

There have been several studies on deep source compression and deep function compression in parallel. The current chapter will highlight some of the recent advances in these areas while building up to the joint method that the current research proposes.

A fully Bayesian approach to variational autoencoders has already been hinted at in Appendix F of the original VAE paper [24]. There, the authors propose the use of variational inference to model the density of the true parameters θ of the decoder p_θ(x|z). In our research we additionally show that it is in practice possible to use this method on the variational parameters φ of the encoder q_φ(z|x) as well, despite the hierarchical variational stochasticity that this entails. We adapt and extend the fully Bayesian approach to fit the purpose of model sparsification by assuming sparsity-inducing priors over the model densities.

3.1 Variational Autoencoders for Source Compression

[24] introduced the Autoencoding Variational Bayes (AEVB) framework, where variational inference is used to construct an encoder-decoder architecture that learns a mapping from the observed samples x to some (lower-dimensional) density over latent samples z (the encoder) and a reverse, but not inverse, mapping from z to x (the decoder), such that the encoder approximates p(z|x) and the decoder approximates p(x|z). The intractable true posterior p_θ(z|x), parameterized by θ, is approximated using some known approximate posterior q_φ(z|x), parameterized by φ, that is closest to the true posterior in terms of KL-divergence. Such models are commonly referred to as Variational Autoencoders (VAE). The graphical model as provided by [24] represents this scenario (Fig. 12).

Figure 12: Graphical model from [24] for Variational Autoencoders. Dashed lines represent variational approximations.

Specifically, the goal is to learn a density over the latent values z; as such, the encoder q_φ(z|x) is constructed to learn the distribution parameters for a given pair (x_i, z_i). The decoder p_θ(x|z) then has to learn to reconstruct x given a noisy sample z. As an example, assume the approximate posterior q_φ(z|x) = N(z; µ_z, σ_z I), where µ_z, σ_z are learned parameterizations of z conditioned on inputs x. Let f denote the encoder and decoder as neural networks, parameterized by their respective subscripts. Then a forward procedure can be visualized as follows:

(a) Encoding procedure. The encoder learns the parameterization of z conditioned on the observation xi.

(b) Sampling procedure. The conditional parameters are used to sample an observation z_i.

(c) Decoding procedure. The decoder learns to map the latent observations to reconstructed inputs.

Figure 13: Example of a forward procedure with Gaussian approximate posterior.

3.1.1 Reparameterization Trick

The procedure as previously described has an important caveat: the sampling step yields the stochastic source z ∼ q_φ(z|x), on which the decoder is conditioned. End-to-end optimization with respect to the encoder parameters φ is therefore impossible, since sampling causes a discontinuity in the procedure [5]. For this purpose, the reparameterization trick was introduced. This trick allows certain distributions, subject to a specific set of constraints [24], to be rewritten in terms of a deterministic function g of their inputs and an independent noise variable ε, such that z = g_φ(x, ε) with ε ∼ p(ε).

As an example, let us again consider the case where q_φ(z|x) is Gaussian. Then the regular sampling procedure

$$z \sim q_\phi(z \mid x), \quad \text{where } q_\phi(z \mid x) = \mathcal{N}(z; \mu, \sigma),$$

can be replaced by the reparameterized sampling procedure

$$z = g_\phi(x, \epsilon) = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim p(\epsilon),$$

where g is differentiable with respect to φ. The reparameterized samples allow for end-to-end backpropagation. For rigorous proofs and a list of conditions for distributions that can be reparameterized, we refer the reader to [24].
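A minimal NumPy sketch of the two sampling procedures (added for illustration only; in practice the gradient bookkeeping that motivates the trick is handled by an autodiff framework, and the encoder outputs below are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = np.array([0.5, -1.0]), np.array([0.2, 0.8])   # hypothetical encoder outputs for one x

# Regular sampling: draw z directly from q(z|x); the draw is not a
# differentiable function of (mu, sigma).
z_direct = rng.normal(mu, sigma)

# Reparameterized sampling: the stochasticity lives in eps ~ N(0, I), and z is a
# deterministic, differentiable function of (mu, sigma) given eps.
eps = rng.normal(size=mu.shape)
z_reparam = mu + sigma * eps

print(z_direct, z_reparam)
```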

3.1.2 Objective

Then, instead of maximizing the likelihood directly, they maximize the lower bound L(θ, φ) on the evidence (ELBO). The lower bound given a single sample x^i then corresponds to:

$$\log p_\theta(x^i) \;\geq\; -D_{KL}\big(q_\phi(z \mid x^i) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid x^i)}\big[\log p_\theta(x^i \mid z)\big] = \mathcal{L}(\theta, \phi; x^i).$$


Here, p_θ(x|z) and q_φ(z|x) are modeled using neural networks, where the objective L(θ, φ; x^i) is optimized w.r.t. θ and φ, the parameters of the respective networks. The Monte-Carlo approximation derived by [24] for this objective then turns out to be:

$$\log p_\theta(x^i) \;\geq\; -D_{KL}\big(q_\phi(z \mid x^i) \,\|\, p_\theta(z)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x^i \mid z^{(i,l)}\big) = \mathcal{L}(\theta, \phi; x^i),$$

where the authors assume a fully factorized marginal over X = {x^i}_{i=1}^N:

$$\log p_\theta(X) = \sum_{i=1}^{N} \log p_\theta(x^i) \;\geq\; \sum_{i=1}^{N} \mathcal{L}(\theta, \phi; x^i).$$

An MC-approximation for the expected reconstruction loss over the full dataset is also given by [24] (Eq. 8). For batches of size M and L samples per datapoint:

$$\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] \;\simeq\; \frac{N}{M} \sum_{i=1}^{M} \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x^{(i)} \mid z^{(i,l)}\big).$$

For known distributions, the KL-divergence term can often be calculated analytically. As an example, assume the latent observations z to be in R^J, such that z = [z_1, z_2, ..., z_J]^T. Then the KL-divergence for the two Gaussians p(z_j) = N(0, 1) and q(z_j) = N(µ_j, σ_j²) can be calculated as follows (see [24], Appendix B, for a full derivation):

$$-D_{KL}\big(q_\phi(z) \,\|\, p_\theta(z)\big) = \frac{1}{2} \sum_{j=1}^{J} \Big(1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2\Big). \tag{1}$$
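As a quick numerical check of Eq. (1) (an added illustration, not part of the thesis; the values of µ and σ are arbitrary and only NumPy is assumed), the closed-form expression can be compared against a Monte-Carlo estimate of the same KL-divergence:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.0])
sigma = np.array([0.3, 1.5, 1.0])

# Analytic negative KL from Eq. (1): -KL(q || p) with q = N(mu, sigma^2), p = N(0, I)
neg_kl_analytic = 0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte-Carlo estimate: E_q[log p(z) - log q(z)]
z = mu + sigma * rng.normal(size=(200_000, mu.size))
log_q = -0.5 * (((z - mu) / sigma) ** 2) - np.log(sigma) - 0.5 * np.log(2 * np.pi)
log_p = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
neg_kl_mc = np.mean(np.sum(log_p - log_q, axis=1))

print(neg_kl_analytic, neg_kl_mc)   # the two values should agree closely
```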


3.2 Pruning

A wide array of effective methods for neural network weight pruning exist for discriminative models, which have been shown to reach significant amounts of sparsity in large models [20, 13, 31]. Iterative methods usually alternate between some regular loss optimization step and a pruning step, where pruning is generally based on some criterion of the weights gathered during the optimization step [13]. Such criteria can for example include weight values (i.e. magnitude based pruning) or their derivatives (i.e. pruning the weights by gradient magnitude [13, 31]).

Bayesian methods often induce network sparsity by imposing a penalty on network complexity through carefully designed model priors. These methods are generally motivated by MDL and have the advantage that pruning and optimization happen jointly. Examples include L0-regularization methods [29, 33, 38] and Variational Dropout (see chapter 3.2.1) [35].

Weight-based pruning methods suffer from the drawback that they do not actually reduce the amount of computation. Structured pruning methods have been devised that carefully prune groups of weights at, for example, the filter level [28, 31] or the row/column level of a weight matrix [28], which allows these weights to be dropped from computation completely.

Yet other methods operate under the assumption that well-performing sparse networks already exist within larger networks at initialization; this assumption is called the Lottery Ticket Hypothesis [6]. Methods in this direction generally focus on finding these subnetworks and assert that minimal training is needed.

3.2.1 Variational Dropout for Function Compression

In [23] variational dropout was introduced as a means to incorporate Bayesian model uncertainty into neural networks. Here, Bayesian posterior inference over the model parameters is used to find the optimal model structure for an observed dataset.

Assume a dataset D with N i.i.d. observations x_i and corresponding targets y_i such that D = {(x_i, y_i)}_{i=1}^N. A common discriminative machine learning objective is to model the conditional of the targets given the observed data, p(y|x). In their approach, they treat the parameters of the model – which we will denote w, a set of weights – as latent random variables in order to learn a density over them. They use variational inference to approximate a density over the model parameters: a known distribution qφ(w) is used to approximate the true density over the weights, using a derivation similar to that of Eq. (1). They derive the following lower bound on the objective:

$$\mathcal{L}(\phi) = -D_{KL}\big(q_\phi(w)\,\|\,p(w)\big) + \underbrace{\sum_{i=1}^{N}\mathbb{E}_{q_\phi(w)}\big[\log p(y_i|x_i, w)\big]}_{L(\phi)}.$$

Where L(φ) is the expected log-likelihood [23]. For batches of size M, the unbiased differentiable minibatch-based Monte-Carlo estimator from [23] (Eq. 3) is then given by:

$$L(\phi) \simeq L_{MC}(\phi) = \frac{N}{M}\sum_{m=1}^{M}\log p\big(y_m|x_m, w = f(\epsilon_m, \phi)\big).$$


Where f(ε, φ) applies the Reparameterization Trick [24], the Local Reparameterization Trick [23] or Additive Noise Reparameterization [35] (see section 3.2.1).

They assume a Gaussian approximate posterior over individual weights w s.t. q(w|θ, α) = N(w|θ, αθ²). For the purpose of model sparsification, they furthermore assume a log-uniform distribution p(|w|) ∝ 1/|w| over the magnitudes of the weights. The log-uniform distribution is sparsity inducing since it assigns increasingly smaller likelihoods as weight magnitudes grow, and hence explicitly forces weights to be close to zero when maximizing the likelihood.

Here, φ = (θ, α) and the KL-divergence derivation obtained by [35] can be used (4.2). For a weight matrix w ∈ R^{I×O} with I input neurons and O output neurons:

$$D_{KL}\big(q_\phi(w|\theta, \alpha)\,\|\,p(w)\big) = \sum_{i=1}^{I}\sum_{j=1}^{O} D_{KL}\big(q_\phi(w_{ij}|\theta_{ij}, \alpha_{ij})\,\|\,p(w_{ij})\big), \quad \text{where}$$
$$-D_{KL}\big(q_\phi(w_{ij}|\theta_{ij}, \alpha_{ij})\,\|\,p(w_{ij})\big) \approx k_1\,\sigma(k_2 + k_3\log\alpha_{ij}) - 0.5\log\big(1 + \alpha_{ij}^{-1}\big) + C \tag{2}$$

and the following estimates for the constants were provided: k_1 = 0.63576, k_2 = 1.87320, k_3 = 1.48695.
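A sketch of this approximation in code is given below; setting C = −k1 is an assumption on our part (a common choice, so that the KL vanishes for large α), since the text above leaves C unspecified:

import torch

K1, K2, K3 = 0.63576, 1.87320, 1.48695

def neg_kl_variational_dropout(log_alpha):
    # -KL(q(w_ij) || p(w_ij)) per weight, Eq. (2), with C = -k1 (assumption, see text)
    C = -K1
    per_weight = (K1 * torch.sigmoid(K2 + K3 * log_alpha)
                  - 0.5 * torch.log1p(torch.exp(-log_alpha)) + C)
    return per_weight.sum()   # summed over all weights of the layer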

Local Reparameterization

In practice the reparameterization trick as described in 3.1.1 yields high-variance MC estimates of the expected log-likelihood L(φ), which can significantly slow down optimization and convergence [35, 23]. The Local Reparameterization trick was introduced as an alternative parameterization which yields lower-variance gradients and estimates of the log-likelihood. They note that the weights w_ij influence the expected log-likelihood L(φ) only through their corresponding layer activations. The alternative parameterization is efficient since the noise source p(ε) – which, in the current setting, would yield noise samples ε_ij for every weight w_ij – is now localized to the layer output activations, which typically have much lower dimensionality and hence require significantly fewer samples [35, 23].

As an example, let us again turn to the case where we assume the factorized posterior over the weights to be Gaussian, i.e. qφ(w_ij) = N(w_ij|µ_ij, σ_ij²). Consider a layer with weight matrix W ∈ R^{I×O} with I input neurons and O output neurons, and batches of size M. Let a_mi denote the i-th input activation for batch element m, so that the entire set of inputs corresponds to A ∈ R^{M×I}. Similarly, let B ∈ R^{M×O} denote the set of output activations of said layer. Then it turns out that the factorized posterior over the layer output activations conditioned on the inputs is Gaussian as well [35, 23]. The locally reparameterized sampling procedure then corresponds to:

$$q_\phi(b_{mj}|A) = \mathcal{N}\big(b_{mj}|\gamma_{mj}, \delta_{mj}\big), \quad \text{where} \quad \gamma_{mj} = \sum_{i=1}^{I} a_{mi}\,\mu_{ij}, \qquad \delta_{mj} = \sum_{i=1}^{I} a_{mi}^2\,\sigma_{ij}^2.$$

Using this parameterization, we only need M × O noise samples, whereas we would require M × I × O samples in the original setting.
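A minimal sketch of this locally reparameterized forward pass for a fully connected layer could look as follows; parameter names are illustrative:

import torch

def local_reparam_linear(a, mu, log_sigma2):
    # a: [M, I] inputs; mu, log_sigma2: [I, O] posterior means and log-variances of the weights
    gamma = a @ mu                              # means of the output activations
    delta = (a * a) @ torch.exp(log_sigma2)     # variances of the output activations
    eps = torch.randn_like(gamma)               # only M x O noise samples are required
    return gamma + torch.sqrt(delta + 1e-8) * eps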

For a convolutional layer the procedure is defined similarly. Let A_m ∈ R^{H×W×C} be an input tensor with dimensions (height × width × channels). A single filter w_k ∈ R^{h×w×C} is parameterized by its mean and variance, denoted µ_k^{h×w×C} and σ_k^{h×w×C}, respectively. Then the output activations b_mk ∈ R^{H'×W'} for a given filter correspond to [35, 23]:

$$\mathrm{vec}(b_{mk}) \sim \mathcal{N}\big(\gamma_{mk}, \delta_{mk}\big), \quad \text{where} \quad \gamma_{mk} = \mathrm{vec}(A_m * \mu_k), \qquad \delta_{mk} = \mathrm{diag}\big(\mathrm{vec}(A_m^2 * \sigma_k^2)\big).$$
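Analogously, a sketch of the locally reparameterized convolution (with illustrative parameter names, and assuming channels-first tensors as in PyTorch) is:

import torch
import torch.nn.functional as F

def local_reparam_conv2d(a, mu, log_sigma2, stride=1, padding=0):
    # a: [M, C, H, W] inputs; mu, log_sigma2: [K, C, h, w] posterior filter parameters
    gamma = F.conv2d(a, mu, stride=stride, padding=padding)                         # output means
    delta = F.conv2d(a * a, torch.exp(log_sigma2), stride=stride, padding=padding)  # output variances
    eps = torch.randn_like(gamma)
    return gamma + torch.sqrt(delta + 1e-8) * eps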

3.2.2 Structured Pruning

The variational dropout method from the previous section showed how to learn individual dropout rates for every weight. In practice, however, one cannot simply drop individual weights during training, since this changes the dimensionality of the features in the weight matrices. [28] created a practical method for pruning by coupling the variance (or scale) of the weights per input or output neuron, which enables dropping input and output connections all at once. This results in weight matrices from which entire rows or columns can be dropped. They introduce a prior over the scales of the weights, and specify the weight and scale distributions as follows:

Assume each weight w to be drawn from a zero-mean Gaussian whose scale is in turn assumed to come from some density:

$$z \sim p(z), \qquad w \sim \mathcal{N}(w; 0, z^2).$$

Such densities are generally referred to as Gaussian scale mixtures. By assuming a log-uniform distribution over the scales and marginalizing them out, they show that the resulting prior over the weights is again log-uniform [28]:

$$p(w) \propto \int \frac{1}{|z|}\,\mathcal{N}\big(w|0, z^2\big)\,dz = \frac{1}{|w|}.$$

Hence we arrive at the same prior distribution over the weights as in 3.2.1, along with its desired sparsifying properties. However, by treating the scales of the weights as random variables, it is possible to share scales among certain groups of weights. Specifically, they group scales according to their corresponding output neuron, such that the joint probability for a weight matrix W ∈ R^{I×O} with I input features and O output features is as follows:

$$p(W, z) \propto \prod_{j=1}^{O}\frac{1}{|z_j|}\prod_{i,j}^{I,O}\mathcal{N}\big(w_{ij}|0, z_j^2\big).$$

Furthermore, they assume the variational posterior on z to be Gaussian as well, and the approximate joint to factorize as follows:

$$q(W, z) = \prod_{j=1}^{O}\mathcal{N}\big(z_j|\mu_{z_j}, \mu_{z_j}^2\alpha_j\big)\prod_{i,j}^{I,O}\mathcal{N}\big(w_{ij}|z_j\mu_{ij}, z_j^2\sigma_{ij}^2\big).$$

The following ELBO objective is then optimized over this joint formulation:

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z)q_\phi(W|z)}\big[\log p(\mathcal{D}|W)\big] - \mathbb{E}_{q_\phi(z)}\big[D_{KL}(q_\phi(W|z)\,\|\,p(W|z))\big] - D_{KL}\big(q_\phi(z)\,\|\,p(z)\big).$$

They show that under these specifications the KL-divergence does not depend on the scales, and that it can be calculated as:

$$D_{KL}\big(q_\phi(W|z)\,\|\,p(W|z)\big) = \frac{1}{2}\sum_{i,j}^{I,O}\Big(-\log\sigma_{ij}^2 + \sigma_{ij}^2 + \mu_{ij}^2 - 1\Big). \tag{3}$$
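After training, the shared scales make structured pruning straightforward: all weights attached to an output neuron can be removed at once when its scale posterior indicates a high dropout rate. The sketch below illustrates this with a simple threshold on log α_j; the threshold value and function names are hypothetical choices for illustration, not prescribed by [28]:

import torch

def prune_output_neurons(W_mu, log_alpha_z, threshold=3.0):
    # W_mu: [I, O] posterior weight means; log_alpha_z: [O] log dropout rates of the scales
    keep = log_alpha_z < threshold          # output neurons to retain
    return W_mu[:, keep], keep              # pruned weight matrix and the boolean mask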

4 Joint Objective

We now join the objectives for representation compression and function compression. Assume an i.i.d. dataset D = {x_i}_{i=1}^N where the samples x_i are generated by corresponding latent variables z_i. Furthermore, assume an encoder model parameterized by φ such that z_i ∼ pφ(z|x) and, similarly, a decoder model parameterized by θ such that x_i ∼ pθ(x|z). We wish to optimize the marginal likelihood of the observed dataset while jointly optimizing for the latent variables as well as the parameters of the networks. The graphical model for this scenario can be found in Fig. 14.

(a) inference model (b) generative model

Figure 14: Graphical models for the sparse Bayesian variational autoencoder

Here we implicitly perform bits-back coding on the model parameters θ, φ as well as the latents z. We derive the following lower bound for this architecture:

$$\log p(x) = \log\left(\iiint \frac{q(z|x, \phi)\, q_\xi(\theta)\, q_\chi(\phi)}{q(z|x, \phi)\, q_\xi(\theta)\, q_\chi(\phi)}\; p(x|z, \theta, \phi)\, p(z)\, p(\theta)\, p(\phi)\; dz\, d\theta\, d\phi\right)$$
$$\geq \underbrace{\mathbb{E}_{q(z,\theta,\phi|x)}\big[\log p(x|z, \theta, \phi)\big]}_{\text{expected reconstruction error}} - \underbrace{\mathbb{E}_{q_\chi(\phi)}\big[D_{KL}(q(z|x, \phi)\,\|\,p(z))\big]}_{\text{latents regularizer}} - \underbrace{D_{KL}\big(q_\xi(\theta)\,\|\,p(\theta)\big)}_{\text{decoder complexity}} - \underbrace{D_{KL}\big(q_\chi(\phi)\,\|\,p(\phi)\big)}_{\text{encoder complexity}} = \mathcal{L}(\xi, \chi).$$

For the full derivation, we refer the reader to Appendix C.

4.1 Practical Objective by Monte-Carlo Sampling

For the latent variable models qφ(z|x) and pθ(z) we follow the specification of [24] and assume the prior over the latents to be standard normal, p(z) = N(z; 0, 1), and the approximate posterior to be Gaussian, q(z|x) = N(z|µ, σ).

For both regularization terms qξ(θ) and qχ(φ) we use the approach of [28] and induce group sparsity over the output activations in the encoder and decoder parameters. For an encoder with K layers and a decoder with L layers, the parameterizations θ = {W_θ^{(1)}, ..., W_θ^{(L)}} and φ = {W_φ^{(1)}, ..., W_φ^{(K)}} factorize as

$$q_\xi(\theta) = \prod_{l=1}^{L} q_\xi\big(W_\theta^{(l)}, s_\theta^{(l)}\big), \qquad q_\chi(\phi) = \prod_{k=1}^{K} q_\chi\big(W_\phi^{(k)}, s_\phi^{(k)}\big).$$

Then, for weight matrices W_φ, W_θ of the encoder and decoder respectively, and their corresponding scale variables s_φ, s_θ, we specify the approximate densities as in [28]:

$$q_\xi(W_\theta, s_\theta) = \prod_{j=1}^{O}\mathcal{N}\big(s^\theta_j|\mu^\theta_{s_j}, (\mu^\theta_{s_j})^2\alpha^\theta_j\big)\prod_{i,j}^{I,O}\mathcal{N}\big(w^\theta_{ij}|s^\theta_j\mu^\theta_{ij}, (s^\theta_j)^2(\sigma^\theta_{ij})^2\big),$$
$$q_\chi(W_\phi, s_\phi) = \prod_{j=1}^{O}\mathcal{N}\big(s^\phi_j|\mu^\phi_{s_j}, (\mu^\phi_{s_j})^2\alpha^\phi_j\big)\prod_{i,j}^{I,O}\mathcal{N}\big(w^\phi_{ij}|s^\phi_j\mu^\phi_{ij}, (s^\phi_j)^2(\sigma^\phi_{ij})^2\big).$$

We also follow [28] in their specification of the true densities:
$$p(W_\theta, s_\theta) \propto \prod_{j=1}^{O}\frac{1}{|s^\theta_j|}\prod_{i,j}^{I,O}\mathcal{N}\big(w^\theta_{ij}|0, (s^\theta_j)^2\big), \qquad p(W_\phi, s_\phi) \propto \prod_{j=1}^{O}\frac{1}{|s^\phi_j|}\prod_{i,j}^{I,O}\mathcal{N}\big(w^\phi_{ij}|0, (s^\phi_j)^2\big),$$

where optimization is done with respect to ξ_W = (µ^θ, log(σ^θ)²), ξ_s = (µ^θ_s, log(σ^θ_s)²), χ_W = (µ^φ, log(σ^φ)²) and χ_s = (µ^φ_s, log(σ^φ_s)²).

Furthermore, assume q(z, θ, φ|x) = q(z|φ, x) · qξ(s_θ) · qξ(W_θ|s_θ) · qχ(s_φ) · qχ(W_φ|s_φ). Then we estimate the expected log-likelihood through Monte-Carlo estimation:

$$\mathbb{E}_{q(z,\theta,\phi|x)}\big[\log p(x|z, s_\theta, W_\theta, s_\phi, W_\phi)\big] \simeq L_{MC}(\xi_W, \xi_s, \chi_W, \chi_s).$$

Additionally we specify an MC objective for the expected log-likelihood as follows:

$$L_{MC}(\xi_W, \xi_s, \chi_W, \chi_s) = \frac{N}{M\,B\,C\,D\,E\,F}\sum_{m,b,c,d,e,f}\log p\big(x^{(m)}|z^{(m,b)}, s_\theta^{(m,c)}, W_\theta^{(m,d)}, s_\phi^{(m,e)}, W_\phi^{(m,f)}\big),$$

Where the sum ranges over all tuples (m, b, c, d, e, f) with 1 ≤ m ≤ M, 1 ≤ b ≤ B, etc., and

$$z^{(m,b)} = g_\phi\big(x^{(m)}, \epsilon_z^{(m,b)}\big), \quad s_\theta^{(m,c)} = g\big(\xi_s, \epsilon_{s_\theta}^{(m,c)}\big), \quad W_\theta^{(m,d)} = g\big(\xi_W, \epsilon_{W_\theta}^{(m,d)}\big),$$
$$s_\phi^{(m,e)} = g\big(\chi_s, \epsilon_{s_\phi}^{(m,e)}\big), \quad W_\phi^{(m,f)} = g\big(\chi_W, \epsilon_{W_\phi}^{(m,f)}\big).$$

In this case gφ applies the reparameterization trick (see chapter 3.1.1) and all other g apply the local reparameterization trick (chapter 3.2.1). Furthermore, all noise terms are standard normal: ε ∼ N(ε; 0, 1). The full objective then becomes:

$$\mathcal{L}(\xi_W, \xi_s, \chi_W, \chi_s) = L_{MC}(\xi_W, \xi_s, \chi_W, \chi_s) - \mathbb{E}_{q_\chi(s_\phi)q_\chi(W_\phi|s_\phi)}\big[D_{KL}(q(z|x, W_\phi)\,\|\,p(z))\big]$$
$$- \mathbb{E}_{q_\chi(s_\phi)}\big[D_{KL}(q_\chi(W_\phi|s_\phi)\,\|\,p(W_\phi|s_\phi))\big] - D_{KL}\big(q_\chi(s_\phi)\,\|\,p(s_\phi)\big)$$
$$- \mathbb{E}_{q_\xi(s_\theta)}\big[D_{KL}(q_\xi(W_\theta|s_\theta)\,\|\,p(W_\theta|s_\theta))\big] - D_{KL}\big(q_\xi(s_\theta)\,\|\,p(s_\theta)\big).$$

The second term on the right-hand side is an expectation, under the encoder parameters, of the divergence between the approximate and true latent distributions; it can be calculated analytically by Eq. (1). The divergence between the weight densities conditioned on their scales can be calculated as in Eq. (3), and the divergence of the scale densities by Eq. (2).
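To summarize how these pieces fit together in practice, the sketch below assembles a minibatch training loss from the terms above; `model`, `kl_weight_terms` and `kl_scale_terms` are hypothetical helpers (the latter two standing in for Eqs. (3) and (2)), and `elbo_terms` refers to the earlier sketch:

def joint_loss(x, model, N):
    # x: a minibatch from a dataset of N datapoints; model draws one joint sample of (z, W, s)
    x_logits, mu_z, log_var_z = model(x)
    # reconstruction term plus latents regularizer, scaled to the full dataset
    data_bound = N * elbo_terms(x, x_logits, mu_z, log_var_z)
    # encoder and decoder complexity terms, added once per dataset
    weight_kl = sum(kl_weight_terms(layer) + kl_scale_terms(layer)
                    for layer in model.encoder_layers + model.decoder_layers)
    return -data_bound + weight_kl          # negative lower bound, to be minimized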
