
M.Sc. Artificial Intelligence

Machine Learning Track

Master Thesis

Smart Regularization of Deep Architectures

by Christos Louizos (10653007)

July 2015, 42 EC (January - July 2015)

Supervisor/Examiner: prof. dr. Max Welling

Assessor: dr. Joris Mooij

Machine Learning Group University of Amsterdam


Acknowledgements

I would like to thank my supervisor prof. Max Welling for his excellent guidance throughout this thesis. I would not have been able to carry out this work without the help and encouragement I received these last seven months. I would also like to thank Diederik Kingma, Kevin Swersky and prof. Richard Zemel. The interesting discussions we had proved to be really helpful and inspiring for this thesis. Furthermore, I would also like to thank Joris Mooij for agreeing to be in my thesis defence committee. Finally, I would also like to thank my friends and parents for their support.


Abstract

In this thesis I explored two “smart” regularization approaches for deep neural networks. The first one was about Spike and Slab priors; they provide a principled way of removing parameters from a model, thus leading to improved generalization. This greatly helps to combat overfitting, particularly on datasets where the number of instances is far less than the number of features. The second one was about “Invariant Representations”. The purpose here was to describe the generation of the data in such a way that some “sensitive” or “irrelevant” observed factors of variation are “removed”. In other words, we transform the data to a new representation that fulfills two criteria: it is maximally informative about an observed random variable (e.g., class label) and minimally informative about the “sensitive” or “irrelevant” variables. This inductive bias would in turn better regularize the network and provide representations that are more appropriate for tasks such as Domain Adaptation or “Fair” Classification. The validity of both approaches is confirmed through extensive experimentation.


Contents

Acknowledgements

Abstract

Contents

List of Figures

List of Tables

1 Introduction
   1.1 Motivation
   1.2 Contributions
   1.3 Organization

2 Preliminaries
   2.1 Variational inference
   2.2 Spike and Slab distributions
   2.3 Variational auto-encoders and the reparametrization trick

3 Spike and Slab priors and posteriors
   3.1 Dropout and DropConnect
   3.2 Variational Spike and Slab Neural Network
      3.2.1 Delta slab posteriors
      3.2.2 Slab posterior
      3.2.3 Adaptive drop rate for weights
      3.2.4 Adaptive drop rate for hidden units

4 Learning Invariant Representations
   4.1 Variational Fair Auto-Encoder
      4.1.1 Unsupervised model
      4.1.2 Semi-Supervised model
      4.1.3 Further invariance via Maximum Mean Discrepancy
         4.1.3.1 Maximum Mean Discrepancy
      4.1.4 Fast MMD via Random Fourier Features
   4.2 Related work

5 VSSNN Experiments
   5.1 Datasets
      5.1.1 IS-PRO
      5.1.2 Gene Expression Data
   5.2 Experimental Setup
   5.3 Classification experiments
      5.3.1 Small datasets
      5.3.2 MNIST
   5.4 Exploratory experiments

6 VFAE Experiments
   6.1 Datasets
      6.1.1 Fairness Datasets
      6.1.2 Amazon Reviews Dataset
   6.2 Experimental Setup
   6.3 Invariance experiments
      6.3.1 Fair classification
      6.3.2 Domain adaptation

7 Conclusion

8 Future Work

A KL-divergence between diagonal Gaussians

B Moments of a linear combination


List of Figures

1.1 A deep neural network

2.1 Spike and Slab distribution

3.1 Graphical model representation of the Spike and Slab (SS) neural network. The SS distribution is only placed on the weights of the network, with w = v ⊙ z. The biases b have Gaussian distributions. H corresponds to the number of layers.

3.2 Spike and Slab neural network compared to normal and Bayesian neural networks. (A): Neural network with either point or distribution estimates for the parameters. (B): The same networks with the inclusion of the spike variables; when the spikes become zero the connection is dropped. Original image taken and altered from [22].

4.1 Unsupervised Model

4.2 Semi-Supervised Model

4.3 Encoder-Decoder diagram for the proposed VAE model. Each ellipse corresponds to an intermediate variable. Solid arrows correspond to deterministic neural network transitions that accept variables as input and produce as output the parameters of the probability distribution over the target. Dashed lines correspond to optional transitions/distributions.

5.1 Log-normalized frequency for the Spike probabilities from the binary IS-PRO dataset. Both VSSNN models were using a diagonal Gaussian slab

5.2 Weight histogram from all models in the binary IS-PRO dataset

6.1 Results on the Adult dataset

6.2 Results on the German dataset

6.3 Results on the Health dataset

6.4 t-SNE [44] visualizations from the Adult dataset on: (A) the original x, (B) the latent z1 without s and MMD, (C) the latent z1 with s and without MMD, (D) the latent z1 with s and MMD. Blue corresponds to males, red to females

6.5 t-SNE [44] visualizations from the German dataset on: (A) the original x, (B) the latent z1 without s and MMD, (C) the latent z1 with s and without MMD, (D) the latent z1 with s and MMD. Blue corresponds to males, red to females

6.6 t-SNE [44] visualizations from the Health dataset on: (A) the original x, (B) the latent z1 without s and MMD, (C) the latent z1 with s and without MMD, (D) the latent z1 with s and MMD. Blue corresponds to individuals with age > 65, red to individuals with age ≤ 65

6.7 t-SNE [44] visualizations on the Books and Kitchen domains from the Amazon reviews dataset, where the datapoints correspond to embeddings from: (A) the original x, (B) the latent z1 without s and MMD, (C) the latent z1 with s and without MMD, (D) the latent z1 with s and MMD. Blue corresponds to the Books domain, red to the Kitchen domain

6.8 Proxy A-distances (PAD) for the Amazon reviews dataset: (A) on x and z1 from VFAE, (B) on x and z1 from VAE, (C) on z1 from VAE and z1 from VFAE


List of Tables

5.1 Dataset statistics

5.2 Classification accuracy on small datasets

5.3 Pairwise Poisson Binomial test results

5.4 Classification error on MNIST

6.1 Random chance accuracies for the S on each dataset

6.2 Results on the Amazon reviews dataset


Chapter 1

Introduction

Neural networks are powerful non-linear functions that can be used to approximate any function of interest. Their main strength stems from their highly flexible parametric form. This form essentially provides a lot of degrees of freedom, allowing the network to “adapt” to almost any problem. Due to their biological inspiration they are conceptually simple to understand; they are composed of “layers”, where each layer is composed of a set of “neurons”, or “units”. The information is allowed to “flow” through the network according to the network “connections”. These connections can be seen as “weights”, where each weight corresponds to the “strength” of the connection between “neurons” from consecutive layers.

Lately, there has been an increasing interest in Neural Networks in Machine Learning, and particularly in “deep” neural networks. “Deep” refers to networks that consist of multiple layers of computation. These layers essentially compose information at increasing levels of abstraction; e.g. if a deep network is trained on images, then we would expect the first layers to identify edges and corners that are present in the image, and as we move deeper in the architecture we arrive at layers that represent concepts, such as a cat or a human face. An example of such a deep network can be seen in figure 1.1.¹

Adapting the network to perform well on a task requires altering the above-mentioned “connections” according to an objective. However, due to the sheer number of “connections” that need to be tweaked, particularly in the case of deep neural networks, overfitting is a serious issue. Overfitting limits the “generalization” ability of the network to unseen data; the network essentially adapts to very specific quirks that are only present in the training data. Therefore, in this case some form of “regularization” is necessary. Regularization can be seen as employing prior assumptions about the nature of the problem so as to introduce an inductive bias that provides improved generalization.

¹ Taken from: http://scyfer.nl/wp-content/uploads/2014/05/Deep_Neural_Network.png

Figure 1.1: A deep neural network

There are two ways that regularization can be introduced to the network, “explicit” and “implicit”. Explicit regularization deals with the parameters, or “connections”, of the network. It can be seen as injecting prior information into the parameters, which in turn makes the network more robust against overfitting. For example, we might consider a case where the number of datapoints is far smaller than the number of parameters of the network. In this scenario we face the danger of the network simply “memorizing” the training set. Therefore we can introduce the prior assumption that the parameters should be sparse, i.e. only a small subset has a value different from zero. As a result, the network would be forced to “switch off” a lot of the parameters, thus limiting its ability to model training-set-specific noise, and consequently improving generalization.

Implicit regularization deals with the intermediate layers of the network. These layers can be seen as providing a new, latent, representation for the data. This representation encodes information that is relevant for the specific task that the network tries to solve. By injecting prior information about the problem in these intermediate layers we can perform implicit regularization of the entire network; the parameters of the network are forced to generate representations that are more appropriate for the task at hand. For example we might consider a case where we want to optimize an objective while avoiding “sensitive” information that is present in the data. By introducing this prior assumption, the network would try to remove as much “sensitive” information as possible from these latent representations.


1.1 Motivation

Let’s consider a problem where “explicit” regularization arises naturally. Consider a scenario where the objective is to predict whether a patient has a disease or is healthy. Also assume that the available amount of data is limited; this is usually the case in datasets that originate from a medical domain. Furthermore, assume that the number of features is very large, far exceeding the number of datapoints. This results in a severely underdetermined problem, since there are simply not enough data to robustly estimate the parameters of a model.

Neural networks perform especially poorly in this case. The number of parameters grows with the number of units in a layer, in contrast to simple linear models where the number of parameters is usually equal to the number of covariates, or features. The danger of overfitting is particularly evident, since there are many excess degrees of freedom that the network could use to model training-set-specific “noise”. Therefore one naturally thinks about aggressively regularizing the network, hoping it will use just a small subset of its parameters.

Thus we could view “explicit” regularization as a way to remove, or “switch off”, some parameters of a model, i.e. we can impose the assumption that the parameters are sparse, with only a handful being active. This would alleviate overfitting and consequently improve generalization. A way to achieve this effect is by placing a sparsity-favouring prior distribution on the parameters. Spike and slab priors [1–3], which from a Bayesian perspective are the gold standard for sparse estimation, can naturally handle this task. They assume that each parameter is composed of two parts: a continuous slab variable that corresponds to a probability distribution over the value of the parameter, and a binary spike variable that corresponds to the probability of “keeping” that particular parameter.

Besides “explicit” regularization, there is also “implicit” regularization, as we previously mentioned. By “implicit” we refer to the scenario where we impose some form of regularization on the layers, or the latent representations that the neural network provides, which essentially affects all the parameters of the network. Let’s similarly motivate the need for this kind of regularization via an example. Imagine that we are faced with the task of predicting whether a given individual is engaged in a particular criminal activity. Let’s also assume that the dataset that we are provided with is extremely imbalanced towards a particular “sensitive” variable that denotes the race of the individual. Simply training the network on the raw data thus faces the danger of producing decisions that discriminate against the individuals that belong to the “sensitive” group. This effect is obviously not desired in a real application.


As a consequence, we need a way to force the network to explicitly avoid encoding these discriminative bits of information in its latent representations. Handling such a case can be done naturally by considering the neural network as a generative model; we can assume that the data are generated in such a way that there is separation, or independence, between the latent representations that the network provides and the “sensitive” variables. More specifically, we can employ a factorized prior that admits two independent sources of variation for our data. As a result, the subsequent prediction task from this purged, or invariant with respect to the sensitive information, latent representation would be unbiased towards, in our example, the race of the individual.

1.2 Contributions

There are two main contributions in this thesis. The first contribution is presented in chapter 3 and is on the “explicit” regularization front. We propose a way to improve the generalization ability of neural networks, particularly on problems where the number of instances is significantly smaller than the number of features. We will show how we can employ Spike and Slab priors on the parameters of the network, and how this leads to learning a Spike and Slab posterior distribution with only a small subset of the parameters active. This process can thus be seen as a general framework that is able to automatically adapt to the complexity of the task at hand by “switching off” unnecessary parameters.

The second contribution is presented in chapter 4 and corresponds to the “implicit” regularization front. We will show how we can train a neural network to provide representations that are independent of a-priori known “nuisance” or “sensitive” variables. This framework can be seen as a generative process that involves two separate factors of variation for our data; the first corresponds to the “nuisance” or “sensitive” variables whereas the second corresponds to all the remaining information. The second factor of variation can thus be seen as a latent representation that is invariant with respect to these “nuisance” or “sensitive” variables. This inductive bias would in turn greatly improve the performance of the network, particularly on problems that require unbiased, or fair, representations for the data, one example being Domain Adaptation. Furthermore, we will also provide a straightforward extension that allows the network to jointly learn the invariant representations and perform classification according to this new invariant representation.


1.3 Organization

This thesis is organized as follows: in chapter 2 we provide a brief introduction to the main concepts that the proposed models are based upon. These correspond to Variational Inference [4, 5], Spike and Slab distributions [1–3] and Variational Auto-Encoders [6, 7]. Chapter 3 provides details about the neural network that utilises Spike and Slab distributions, coined the Variational Spike and Slab Neural Network (VSSNN). We provide connections with two recently introduced regularization techniques for neural networks, namely Dropout [8, 9] and DropConnect [10]. We show how our model is essentially a generalization of the two aforementioned methods, since it can facilitate adaptive drop rates via the Spike and Slab posterior.

Chapter 4 introduces the proposed framework for constructing and utilising invariant representations. We show how this can be formulated as a general probabilistic model which admits a generative process with two independent factors of variation. Furthermore, we introduce a straightforward extension that allows the incorporation of the class labels (if available) in the generative process. We similarly show how this model corresponds to a generalization of a recently proposed model for semi-supervised classification that utilises Variational Auto-Encoders [11]. We also show how we can incorporate an extra regularization term called Maximum Mean Discrepancy (MMD) [12] into the lower bound of the Variational Auto-Encoder, so as to ensure that we remain as invariant as possible with respect to the “nuisance” or “sensitive” variables.

Continuing to chapters 5 and 6, we provide the results obtained from the experiments performed with both models and an extensive discussion of these results. Finally, in chapters 7 and 8 we present the conclusions drawn and provide possible directions for future research.


Chapter 2

Preliminaries

In this chapter we give a brief introduction to the main concepts that are used throughout this thesis. In section 2.1 we provide information about Variational Inference [4, 5], which is extensively used for the derivation of the models in chapters 3 and 4. In section 2.2 we give more details about the Spike and Slab distributions [1–3], which form the basis of the methods proposed in chapter 3, as they will be used as priors and posteriors on the parameters of a neural network. In section 2.3 we cover Variational Auto-Encoders, which in turn form the basis for the model proposed in chapter 4. They provide a principled generative framework that allows us to incorporate prior assumptions, such as independence between variables.

2.1 Variational inference

Inference is a process where we compute probability distributions over unobserved, or latent, variables based on observed quantities. Denoting by h and o the hidden and observed variables, the required distribution over the unobserved variables can simply be estimated via Bayes rule:

\[ p(h \mid o) = \frac{p(h, o)}{p(o)} = \frac{p(h, o)}{\sum_h p(h, o)} \tag{2.1} \]

As can be seen, inference requires calculation of the marginal probability p(o) of the observed variables o, also known as the data likelihood or evidence.

However, obtaining this quantity for complex models is challenging, since the marginalization over the unobserved variables is usually intractable due to the computation of expensive sums (for discrete hidden variables h) or integrals (for continuous hidden variables h). Therefore in this context we usually seek an approximation to the distribution of interest (p(h|o) in this case). Variational methods are a way to obtain a deterministic approximation q(h|o) to p(h|o) by maximizing the Evidence Lower Bound (ELBO) with respect to our approximate distribution q(h|o):

\[
\begin{aligned}
\log p(o) &= \sum_h q(h \mid o) \log \frac{p(h, o)\, q(h \mid o)}{p(h \mid o)\, q(h \mid o)} = \sum_h q(h \mid o) \log \frac{p(h, o)}{q(h \mid o)} + \sum_h q(h \mid o) \log \frac{q(h \mid o)}{p(h \mid o)} \\
&= \mathbb{E}_{q(h \mid o)}[\log p(h, o) - \log q(h \mid o)] + D_{KL}(q(h \mid o) \,\|\, p(h \mid o)) \\
&\geq \mathbb{E}_{q(h \mid o)}[\log p(h, o) - \log q(h \mid o)] = \mathcal{L}(q)
\end{aligned}
\tag{2.2}
\]

where the last step is due to the non-negativity of the Kullback-Leibler (KL) divergence. Thus we have the freedom to restrict q(h|o) to a simple family of distributions (e.g. fully factorized distributions) so as to make the summations or integrals tractable. It is clear that when our approximation q(h|o) is the same distribution as p(h|o), then D_KL(q(h|o)||p(h|o)) = 0 and consequently the lower bound is equal to the log-evidence, i.e. L(q) = log p(o).
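To make the decomposition above concrete, the following is a minimal numpy sketch (not part of the thesis) for a toy discrete model with two hidden states: it computes the ELBO and the KL term for an arbitrary q(h|o) and verifies that they sum to the log-evidence. The joint probabilities and the choice of q are illustrative values.

```python
import numpy as np

# Hypothetical joint p(h, o) for a fixed observation o, over hidden states h in {0, 1}.
p_joint = np.array([0.3, 0.1])            # p(h=0, o), p(h=1, o)
log_evidence = np.log(p_joint.sum())      # log p(o)
p_posterior = p_joint / p_joint.sum()     # exact posterior p(h | o)

q = np.array([0.6, 0.4])                  # an arbitrary approximate posterior q(h | o)
elbo = np.sum(q * (np.log(p_joint) - np.log(q)))      # E_q[log p(h, o) - log q(h | o)]
kl = np.sum(q * (np.log(q) - np.log(p_posterior)))    # D_KL(q(h | o) || p(h | o))

# elbo + kl recovers log p(o), and elbo <= log p(o) with equality iff q = p(h | o).
print(elbo, kl, elbo + kl, log_evidence)
```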

2.2 Spike and Slab distributions

Spike and Slab distributions [1–3] provide a principled Bayesian way of performing variable selection. They are two-point mixture distributions composed of a continuous random variable (the slab) and a binary discrete variable (the spike), where essentially the latter either ‘activates’ or ‘deactivates’ the slab part. Their formulation can be seen as the following hierarchical Bayesian model:

v ∼ N (µ, σ) (2.3)

z ∼ Bern(π) (2.4)

w = v z (2.5)

where w is the spike and slab distributed variable, with a two-point marginal distribution (marginalizing over z):

\[ p(w) = (1 - \pi)\, \delta_0(w) + \pi\, \mathcal{N}(w \mid \mu, \sigma) \tag{2.6} \]

Figure 2.1: Spike and Slab distribution

It should be noted that the distribution formulation is not limited to Gaussian slab variables and can be generalized to arbitrary continuous distributions. These distributions are particularly effective in problems where the number of features is much bigger than the number of instances (n ≪ d). By assuming that only a few of the covariates are relevant for the prediction, a Spike and Slab model is forced to have a lot of coefficients exactly zero, which in turn achieves more robust parameter estimation as well as reduced overfitting. They have been successfully used in the context of simple linear models [13–15] as well as RBMs [16–18]. We refer any interested reader to [19] for more information.
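As a small illustration (not part of the thesis), the hierarchical construction of equations (2.3)-(2.5) can be sampled directly; the hyperparameters below are arbitrary illustrative values. Most draws are exactly zero while the remainder follow the Gaussian slab, which is the two-point marginal pictured in figure 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, pi = 0.0, 1.0, 0.2                    # slab mean/std and spike "keep" probability

n_samples = 10000
v = rng.normal(mu, sigma, size=n_samples)        # slab: continuous value
z = rng.binomial(1, pi, size=n_samples)          # spike: keep (1) or drop (0)
w = v * z                                        # spike and slab distributed variable

print("fraction exactly zero:", np.mean(w == 0.0))   # roughly 1 - pi
```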

2.3 Variational auto-encoders and the reparametrization trick

Consider a directed generative model with parameters θ given by pθ(x, z) = p(z)pθ(x|z) where z is a set of continuous latent variables. For flexible likelihood functions employing neural networks, inference and learning in this model can be extremely challenging. Recent work [6,7] has addressed these problems using a variational lower bound,

\[ \log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[-\log q_\phi(z \mid x) + \log p_\theta(x, z)] \tag{2.7} \]
\[ \mathcal{L}(\theta, \phi; x) = -D_{KL}(q_\phi(z \mid x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] \tag{2.8} \]


where qφ(z|x) is an approximate posterior. The model can be interpreted as a probabilistic auto-encoder, where qφ(·) represents the probabilistic encoder network and pθ(·) represents the probabilistic decoder network.

To learn the variational (φ) and model (θ) parameters we can perform gradient optimization on the lower bound 2.8. To successfully obtain the gradients for the variational parameters φ from E_{qφ(z|x)}[log pθ(x|z)] we can use the “reparametrization trick” [6, 7].

More specifically, we can sample auxiliary noise variables ε from a noise distribution p(ε) and then transform those into posterior samples via a deterministic function. For example, if we have a location-scale family of distributions (such as a Gaussian) we can sample the standard distribution and then transform that sample via a simple linear transformation that involves the parameters of the posterior:

\[ \epsilon \sim p(\epsilon) \tag{2.9} \]
\[ z = \mu_z + \sigma_z \odot \epsilon \tag{2.10} \]

where µ_z and σ_z correspond to the mean and standard deviation of the posterior distribution qφ(z|x). Therefore the lower bound 2.8 can be transformed into an expectation over the auxiliary noise distribution p(ε):

\[ \log p_\theta(x) \geq -D_{KL}(q_\phi(z \mid x) \,\|\, p(z)) + \mathbb{E}_{p(\epsilon)}[\log p_\theta(x \mid \mu_z + \sigma_z \odot \epsilon)] \tag{2.11} \]

It is now clear that the sampling process in the expectation is independent of the parameters of the posterior; therefore we can proceed to perform joint optimization of the variational parameters φ and the generative model parameters θ by backpropagation, where the true gradients are approximated by stochastic gradients:

\[ \frac{\partial \mathcal{L}(\theta, \phi; x)}{\partial \phi} = \frac{\partial \left(-D_{KL}(q_\phi(z \mid x) \,\|\, p(z))\right)}{\partial \phi} + \mathbb{E}_{p(\epsilon)}\left[\frac{\partial \log p_\theta(x \mid \mu_z + \sigma_z \odot \epsilon)}{\partial \phi}\right] \tag{2.12} \]

Finally, to approximate all the expectations the authors in [6, 7] use Monte Carlo integration, e.g.:

\[ \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x \mid z^{(l)}) \tag{2.13} \]
\[ z^{(l)} = \mu_z + \sigma_z \odot \epsilon^{(l)} \tag{2.14} \]
\[ \epsilon^{(l)} \sim p(\epsilon) \tag{2.15} \]
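The following is a minimal numpy sketch (not from [6, 7] or the thesis) of the reparametrized Monte Carlo estimate in equations (2.13)-(2.15). The function f below is an illustrative stand-in for the log-likelihood term log pθ(x|z), and mu_z, sigma_z stand in for the encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_z = np.array([0.5, -1.0])                     # posterior mean
sigma_z = np.array([0.3, 0.8])                   # posterior standard deviation

def f(z):
    # stand-in for log p_theta(x | z)
    return -0.5 * np.sum(z ** 2, axis=-1)

L = 1000
eps = rng.standard_normal(size=(L, mu_z.shape[0]))   # eps^(l) ~ p(eps) = N(0, I)
z = mu_z + sigma_z * eps                             # z^(l) = mu_z + sigma_z * eps^(l)
mc_estimate = f(z).mean()                            # (1/L) sum_l f(z^(l))

# Because each z^(l) is a differentiable function of (mu_z, sigma_z), gradients of
# this estimate with respect to the posterior parameters can flow through the samples.
print(mc_estimate)
```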


Chapter 3

Spike and Slab priors and posteriors

Deep neural networks are powerful non-linear functions that can effectively disentangle complicated relationships between inputs and outputs due to their highly flexible parametric form. This flexible parametric form however has a cost; there are a lot of parameters that need to be optimized, which in turn requires large datasets so as to generalize effectively. This fact makes neural networks perform poorly on problems where the dataset is “grown in the opposite direction”, i.e. datasets with a handful of instances but with a huge number of features (n ≪ d). Overfitting is a serious issue in this scenario, since there are far more parameters than actual datapoints, giving the network a lot of freedom to model ‘noise’ that is only present in the training set.

Several techniques have been proposed to address this problem, with regularization being the most common and related to this chapter. It usually penalizes either the L2 or L1 norm of the weights thus forcing them to be either small or sparse. Regularization can also be seen from a probabilistic perspective; we can place a prior over the weights and consequently seek the maximum a-posteriori (MAP) value for the weights. From this perspective L2 and L1 regularization correspond to Gaussian and Laplace distributions over the weights respectively. However, even with regularization we are simply making a point estimate of our parameters which still faces the danger of overfitting. Ideally we would like to be Bayesian and obtain a proper posterior distribution and make predictions by using the predictive distribution, i.e. by marginalizing over the weight posterior distribution. This would avoid overfitting since essentially “we are not fitting” and the predictions are being made on the basis of the datapoints alone.

The remainder of this chapter is organised as follows. In section 3.1, Dropout [8, 9] and DropConnect [10], two of the most successful regularization techniques for deep neural networks, are described. In section 3.2, we provide details on the proposed models that essentially learn the drop rates for Dropout and DropConnect.

3.1 Dropout and DropConnect

Dropout [8, 9] and its generalization DropConnect [10] are two very recent techniques for regularizing neural networks. Both rely on a very strong Bernoulli multiplicative noise that essentially “drops” either hidden units for Dropout or weights for DropConnect:

\[ z \sim \mathrm{Bern}(1 - \text{drop rate}) \tag{3.1} \]
\[ h = z \odot g(WA + b) \quad \text{(for Dropout)} \tag{3.2} \]
\[ h = g((z \odot W)A + b) \quad \text{(for DropConnect)} \tag{3.3} \]

where g(·) corresponds to the activation function and A corresponds to the input. The introduction of this noise prevents co-adaptation of the network parameters and creates an exponentially large “ensemble” of thinned networks that share weights. This results in a model that is more robust against overfitting. Despite the fact that many of the sub-networks will never be visited, the weight sharing still allows the network to make useful predictions [8].
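For concreteness, a minimal numpy sketch (not from [8–10] or the thesis) of the two masking schemes in equations (3.2)-(3.3) is given below; the shapes, drop rate and ReLU activation are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return np.maximum(x, 0.0)                        # ReLU activation

drop_rate = 0.5
W = rng.standard_normal((50, 100))                   # weights (d_out x d_in)
A = rng.standard_normal((100, 32))                   # inputs, one column per datapoint
b = np.zeros((50, 1))

# Dropout: drop hidden units by masking the activations.
z_units = rng.binomial(1, 1.0 - drop_rate, size=(50, 32))
h_dropout = z_units * g(W @ A + b)

# DropConnect: drop individual weights by masking the weight matrix.
z_weights = rng.binomial(1, 1.0 - drop_rate, size=W.shape)
h_dropconnect = g((z_weights * W) @ A + b)
```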

However, at test time we would need to obtain predictions from an exponential number of models, which is infeasible; for this reason [8, 9] proposed a deterministic approximation that corresponds to a geometric average of all the thinned networks. Instead of this approximate averaging, the authors of DropConnect [10] employed moment matching and approximated the weighted sum of Bernoulli variables with a Gaussian distribution. Their justification for this alternative stemmed from the fact that for ReLU (g(x) = max(x, 0)) activations the Dropout approximate averaging is not justified mathematically.

Despite the fact that Dropout and DropConnect have been very successful in regularizing neural networks, learning their discrete drop rates has not been explored as much. To the best of our knowledge this has only been attempted in [20], where they introduce a binary belief network that learns the Dropout rates. This belief network essentially provides an approximation to the Bayesian posterior distribution over models for each input to the network:

\[ p(z_j \mid \{a_i : i < j\}) = \mathrm{sigmoid}\Big( \sum_{i : i < j} \pi_{ji}\, a_i \Big) \tag{3.4} \]

where z_j corresponds to the mask for a hidden unit j, a_i corresponds to the input unit i, π_ji corresponds to the weight of moving from unit i to unit j and sigmoid(·) corresponds to the logistic function, sigmoid(x) = 1/(1 + exp(−x)). Instead of using a belief network, we will approach the problem of learning the Dropout rates via a spike and slab distribution. Moreover, instead of learning input-specific drop rates we will show a way to learn “global” dropout rates, which can be seen as an instance of variable selection. To learn the parameters of the belief network the authors in [20] made several simplifications and approximations. This is in contrast to our proposed approach, which is better justified mathematically; the drop rates will arise naturally as a byproduct of variational inference. Nonetheless they report state-of-the-art results on the MNIST and NORB datasets.

3.2 Variational Spike and Slab Neural Network

As we saw in section 3.1, Dropout and DropConnect essentially randomly ‘zero’ hidden units and weights respectively. This is reminiscent of the SS priors discussed in section 2.2. In both Dropout and DropConnect the slab variable can be considered as a delta spike that is either at the Maximum Likelihood (uniform weight prior) or Maximum A-Posteriori (MAP) value of the parameters, whereas the spike variable follows a Bernoulli distribution with probability of success π = 1 − drop rate. So a reasonable question is: instead of treating the drop rates as fixed hyper-parameters, could we instead learn them through an SS distribution? The answer is yes; we can introduce the SS distribution as a prior over the neural network weights and consequently obtain the drop rate for each weight or hidden unit via the posterior SS distribution. Such a process can be formally defined as (where we also assume a separate distribution over the biases of the network):

\[ p(y, z, v, b \mid x) = \left( \prod_{h=1}^{H} p(b_h)\, p(v_h)\, p(z_h) \right) \prod_{n=1}^{N} p(y_n \mid x_n, w = v \odot z, b) \tag{3.5} \]

with h denoting the layers of the network and v_h, z_h and b_h denoting, similar to the nomenclature in section 2.2, the slab, spike and bias variables of layer h. Finally, x_n and y_n denote the n-th datapoint and its corresponding label. A graphical model representation for this model can be seen in figure 3.1.

In order to obtain the posterior distribution of this model we have to use Bayes rule:

\[ p(v, z, b \mid x, y) = \frac{p(y \mid x, v \odot z, b)\, p(v)\, p(z)\, p(b)}{\sum_z \int\!\!\int p(y \mid x, v \odot z, b)\, p(v)\, p(z)\, p(b)\, db\, dv} \tag{3.6} \]
\[ = \frac{p(y \mid x, v \odot z, b)\, p(v)\, p(z)\, p(b)}{p(y \mid x)} \tag{3.7} \]


Figure 3.1: Graphical model representation of the Spike and Slab (SS) neural network. The SS distribution is only placed on the weights of the network, with w = v ⊙ z. The biases b have Gaussian distributions. H corresponds to the number of layers.

where the explicit products over the variables are dropped for notational simplicity. Unfortunately the posterior distribution cannot be calculated analytically in this case, since both the integration over v, b and the summation over z are intractable. Following [21], we can instead use variational inference and hence obtain an approximation q(v, z, b) to the true posterior distribution p(v, z, b|x, y), by maximizing a lower bound on the log-evidence:

\[
\begin{aligned}
\log p(y \mid x) &= \log \sum_z \int\!\!\int p(y \mid x, w, b)\, p(v)\, p(z)\, p(b)\, db\, dv \\
&= \log \sum_z \int\!\!\int q_\phi(v, z, b)\, \frac{p(y \mid x, w, b)\, p(v)\, p(z)\, p(b)}{q_\phi(v, z, b)}\, db\, dv \\
&\geq \sum_z \int\!\!\int q_\phi(v, z, b) \log \frac{p(y \mid x, w, b)\, p(v)\, p(z)\, p(b)}{q_\phi(v, z, b)}\, db\, dv \quad \text{(from Jensen's inequality)} \\
&= \sum_z \int\!\!\int q_\phi(v, z, b) \log p(y \mid x, w, b)\, db\, dv + \sum_z \int\!\!\int q_\phi(v, z, b) \log \frac{p(v)\, p(z)\, p(b)}{q_\phi(v, z, b)}\, db\, dv
\end{aligned}
\tag{3.8}
\]

where w = v ⊙ z and φ are the parameters of the variational distribution q_φ(v, z, b). To simplify the estimation even further we can assume that the posterior q_φ(v, z, b) is also factorized over the spike, slab and bias variables, i.e. q_φ(v, z, b) = q_φ(v) q_φ(z) q_φ(b). With this assumption equation 3.8 becomes:

\[
\begin{aligned}
\log p(y \mid x) &\geq \sum_z \int\!\!\int q_\phi(v)\, q_\phi(z)\, q_\phi(b) \log p(y \mid x, w, b)\, db\, dv \\
&\quad - D_{KL}(q_\phi(v)\|p(v)) - D_{KL}(q_\phi(z)\|p(z)) - D_{KL}(q_\phi(b)\|p(b)) \\
&= \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}[\log p(y \mid x, w, b)] \\
&\quad - D_{KL}(q_\phi(v)\|p(v)) - D_{KL}(q_\phi(z)\|p(z)) - D_{KL}(q_\phi(b)\|p(b)) \qquad (3.9) \\
&= \mathcal{L}_E(y, x, v, z, b; \phi) + \mathcal{L}_C(v, z, b; \phi) \qquad (3.10)
\end{aligned}
\]


where the error loss and the complexity loss, similar to the notation in [21], are

\[ \mathcal{L}_E(y, x, v, z, b; \phi) = \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}[\log p(y \mid x, w, b)] \]
\[ \mathcal{L}_C(v, z, b; \phi) = -D_{KL}(q_\phi(v)\,\|\,p(v)) - D_{KL}(q_\phi(z)\,\|\,p(z)) - D_{KL}(q_\phi(b)\,\|\,p(b)) \]

Therefore, to obtain the bias and SS posterior distributions we can simply maximize equation 3.9 with respect to the variational parameters φ. To approximate all the expectations we use Monte Carlo integration, e.g.

\[ \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}[\log p(y \mid x, w, b)] \approx \frac{1}{L} \sum_{l=1}^{L} \log p(y \mid x, w^{(l)}, b^{(l)}), \]

where w^(l) = v^(l) ⊙ z^(l) with v^(l) ∼ q_φ(v), z^(l) ∼ q_φ(z) and b^(l) ∼ q_φ(b). In the following sections we will show how we can apply this framework with different choices of distributions for the continuous (slab) variables.

Figure 3.2: Spike and Slab neural network compared to normal and Bayesian neural networks. (A): Neural network with either point or distribution estimates for the parameters. (B): The same networks with the inclusion of the spike variables; when the spikes become zero the connection is dropped. Original image taken and altered from [22].

3.2.1 Delta slab posteriors

The simplest slab posterior distribution is a delta spike at the MAP estimate, similar to the delta posteriors presented in [21]. For a Gaussian prior we recover ordinary L2 regularization, whereas for Laplace and Uniform priors we recover L1 regularization and Maximum Likelihood respectively. Additionally, v, b = φ for the slab variables, since the only parameter is the point estimate of the weight. Furthermore, since the slab variable is deterministic, the expectation of the error loss depends only on q_φ(z). Obtaining the gradient for each of the continuous variables is pretty straightforward:

\[ \frac{\partial \mathcal{L}_E(y, x, v, z, b; \phi)}{\partial v} = \mathbb{E}_{q_\phi(z)}\left[ \frac{\partial \log p(y \mid x, w, b)}{\partial w} \odot z \right] \tag{3.11} \]
\[ \frac{\partial \mathcal{L}_C(v, z, b; \phi)}{\partial v} = -\frac{\partial D_{KL}(\delta(v)\,\|\,p(v))}{\partial v} \tag{3.12} \]
\[ \frac{\partial \mathcal{L}_E(y, x, v, z, b; \phi)}{\partial b} = \mathbb{E}_{q_\phi(z)}\left[ \frac{\partial \log p(y \mid x, w, b)}{\partial b} \right] \tag{3.13} \]
\[ \frac{\partial \mathcal{L}_C(v, z, b; \phi)}{\partial b} = -\frac{\partial D_{KL}(\delta(b)\,\|\,p(b))}{\partial b} \tag{3.14} \]


with ∂L_C(v, z, b; φ)/∂v = ∂L_C(v, z, b; φ)/∂b = 0 for a Uniform prior, whereas for Gaussian and Laplace priors the gradients can be found in [21].

3.2.2 Slab posterior

By having a “delta spike” distribution for the slab variables we still face the danger of overfitting. A more appropriate distribution is a diagonal Gaussian; we only introduce one extra parameter (the variance) and, furthermore, the KL-divergence between the Gaussian prior and posterior can be calculated in closed form (derivation in appendix A). To be able to efficiently draw samples from q_φ(v) and q_φ(b), so as to estimate the gradient of L_E(y, x, v, z, b; φ) with respect to the mean and the variance of q_φ(v), i.e. φ = (µ_v, σ_v), and q_φ(b), i.e. φ = (µ_b, σ_b), we use the “reparametrization trick” proposed in [6], appendix F. To get a sample from the variational posteriors q_φ(v) and q_φ(b) we simply draw a sample ε from the standard isotropic Gaussian distribution N(0, I) and transform it via v = f(µ_v, σ_v, ε) = µ_v + σ_v ⊙ ε and b = f(µ_b, σ_b, ε) = µ_b + σ_b ⊙ ε. It is clear that the sampling process does not depend on the variational parameters, and thus we can obtain stochastic gradients for both µ_v, µ_b and σ_v, σ_b. The same trick was also used for variational inference with Gaussian prior and posterior in [22]. The gradients are equally simple to obtain:

\[ \frac{\partial \mathcal{L}_E(y, x, v, z, b; \phi)}{\partial \mu_v} = \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \frac{\partial \log p(y \mid x, w, b)}{\partial w} \odot z \right] \tag{3.15} \]
\[ \frac{\partial \mathcal{L}_E(y, x, v, z, b; \phi)}{\partial \sigma_v} = \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \frac{\partial \log p(y \mid x, w, b)}{\partial w} \odot z \odot \epsilon \right] \tag{3.16} \]
\[ \frac{\partial \mathcal{L}_E(y, x, v, z, b; \phi)}{\partial \mu_b} = \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \frac{\partial \log p(y \mid x, w, b)}{\partial b} \right] \tag{3.17} \]
\[ \frac{\partial \mathcal{L}_E(y, x, v, z, b; \phi)}{\partial \sigma_b} = \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \frac{\partial \log p(y \mid x, w, b)}{\partial b} \odot \epsilon \right] \tag{3.18} \]

The gradients for the complexity loss can be again found in [21].
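To make the sampling procedure explicit, the following is a minimal numpy sketch (not from the thesis) of one forward sample from a layer with a Spike and Slab weight posterior and a diagonal Gaussian slab; the posterior parameters are illustrative stand-ins for learned variational parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 100, 50

mu_v = 0.05 * rng.standard_normal((d_in, d_out))     # slab posterior means
sigma_v = np.exp(-3.0) * np.ones((d_in, d_out))      # slab posterior standard deviations
pi = 0.5 * np.ones((d_in, d_out))                    # spike "keep" probabilities
mu_b, sigma_b = np.zeros(d_out), np.exp(-3.0) * np.ones(d_out)

# Reparametrized slab and bias samples: v = mu + sigma * eps with eps ~ N(0, I).
v = mu_v + sigma_v * rng.standard_normal((d_in, d_out))
b = mu_b + sigma_b * rng.standard_normal(d_out)
z = rng.binomial(1, pi)                              # one spike sample per weight
w = v * z                                            # Spike and Slab weight sample

x = rng.standard_normal((32, d_in))                  # a minibatch of inputs
h = np.maximum(x @ w + b, 0.0)                       # linear combination + ReLU
```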

3.2.3 Adaptive drop rate for weights

As we mentioned in section 3.2, by having an SS posterior on the weights we can readily learn the drop rate for each weight, i.e. we can learn to DropConnect. However, learning the spike posterior for the weights is a little more involved, as we cannot backpropagate through discrete variables. Therefore we have to use a different estimator for the gradient:

\[ \frac{\partial \mathcal{L}_E(y, x, v, z, b; \phi)}{\partial \pi} = \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \log p(y \mid x, w, b)\, \frac{\partial \log q_\phi(z)}{\partial \pi} \right] \tag{3.19} \]

where we used the identity ∂q_φ(z)/∂π = q_φ(z) ∂ log q_φ(z)/∂π, and with π = φ for the spike posterior. However, this estimator is notorious for its high variance [6, 23, 24]. For this reason, we follow [24, 25] and use control variates, more specifically baselines (c), to reduce it. Introducing this baseline still keeps the gradient unbiased [24]:

\[
\begin{aligned}
\mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}&\left[ (\log p(y \mid x, w, b) - c)\, \frac{\partial \log q_\phi(z)}{\partial \pi} \right] \\
&= \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \log p(y \mid x, w, b)\, \frac{\partial \log q_\phi(z)}{\partial \pi} \right] - \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ c\, \frac{\partial \log q_\phi(z)}{\partial \pi} \right] \\
&= \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \log p(y \mid x, w, b)\, \frac{\partial \log q_\phi(z)}{\partial \pi} \right] - c\, \frac{\partial\, \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}[1]}{\partial \pi} \\
&= \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \log p(y \mid x, w, b)\, \frac{\partial \log q_\phi(z)}{\partial \pi} \right]
\end{aligned}
\]



As for the choice of baseline, [24] proposed to use an exponentially moving average of the learning signal (log p(y|x, w, b) in our case) as a constant baseline, whereas [25] proposed to use a first-order Taylor expansion of the learning signal around the discrete sample z^(l), evaluated at a point z̃, as a non-constant baseline. We adopted the latter, and following [25] we also set z̃ = 1/2. Despite the fact that this estimator is biased for non-quadratic functions, it performed better in our experiments. The gradient estimator in our case becomes:

\[
\begin{aligned}
\frac{\partial \mathcal{L}_E(y, x, v, z, b; \phi)}{\partial \pi} &= \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \left( \log p(y \mid x, w, b) - \Big( \log p(y \mid x, w, b) + \frac{\partial \log p(y \mid x, w, b)}{\partial z}\big(\tfrac{1}{2} - z\big) \Big) \right) \frac{\partial \log q_\phi(z)}{\partial \pi} \right] \\
&= \mathbb{E}_{q_\phi(v) q_\phi(z) q_\phi(b)}\left[ \frac{\partial \log p(y \mid x, w, b)}{\partial z}\big(z - \tfrac{1}{2}\big)\, \frac{\partial \log q_\phi(z)}{\partial \pi} \right]
\end{aligned}
\tag{3.20}
\]
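The following toy numpy sketch (not from the thesis) illustrates the estimator of equation (3.20) for a single Bernoulli variable. The learning signal f below is an arbitrary quadratic stand-in for log p(y|x, w, b), so both the plain score-function estimator and the baseline-corrected one should match the exact gradient f(1) − f(0), with the latter showing much lower variance.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.3                                     # spike posterior probability q(z = 1)

def f(z):                                    # toy (quadratic) learning signal
    return -(2.0 * z - 0.7) ** 2

def df_dz(z):                                # its derivative with respect to z
    return -4.0 * (2.0 * z - 0.7)

L = 200000
z = rng.binomial(1, pi, size=L).astype(float)
score = z / pi - (1.0 - z) / (1.0 - pi)      # d log Bern(z; pi) / d pi

grad_plain = np.mean(f(z) * score)                       # plain score-function estimator
grad_baseline = np.mean(df_dz(z) * (z - 0.5) * score)    # estimator of eq. (3.20)

# Exact gradient of E[f(z)] with respect to pi; the baseline estimator is only
# exact for quadratic signals such as this one, which is the bias mentioned above.
print(grad_plain, grad_baseline, f(1.0) - f(0.0))
```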

3.2.4 Adaptive drop rate for hidden units

Instead of learning the drop rate for each weight we could instead learn a global drop rate for both the input x and the subsequent hidden units h, by imposing an SS distribution on both x and h (where on the input x we only learn the spike variable). This corresponds to learning to Dropout. Furthermore, instead of having an explicit slab variable for each hidden unit, we can have an implicit one by using the “local reparametrization trick” presented in [26] and obtaining the hidden unit distribution via the linear combination of the weight w and bias b distributions. If the weights and biases have delta posteriors then the hidden units also have a delta slab distribution p(h) = δ(w^T x + b) (where x is the input of the linear combination). When the weights and biases have diagonal Gaussian posterior distributions we obtain a Gaussian distribution p(h) = N(µ_w^T x + µ_b, √((σ_w²)^T x² + σ_b²)) for the hidden unit (where p(h) is the probability distribution of the linear combination before the activation). The derivation for this implicit slab distribution is shown in Appendix B. The spike and slab distributed hidden unit variable can thus be formulated as:

\[ h \sim \mathcal{N}\Big( \mu_w^T x + \mu_b,\ \sqrt{(\sigma_w^2)^T x^2 + \sigma_b^2} \Big) \tag{3.21} \]
\[ z \sim \mathrm{Bern}(\pi) \tag{3.22} \]
\[ \tilde{h} = h \odot z \tag{3.23} \]

where µ_w, σ_w² correspond to the mean and variance of the weight posterior q_φ(w), µ_b, σ_b² correspond to the mean and variance of the bias posterior q_φ(b), and π corresponds to the posterior probabilities from q_φ(z). It should be noted that in this setting the posterior distribution over spikes q_φ(z) is not datapoint specific, i.e. as in a latent variable model, but it can be seen as a posterior distribution over model architectures. We leave the extension to datapoint-specific spike variables (i.e. q_φ(z|x) instead of q_φ(z)) for future work. We expect that datapoint-specific spikes would be able to better capture the possible heterogeneity among instances that is present in difficult datasets. It should also be noted that we can use the “local reparametrization trick” even when we have drop rates for the weights instead of the hidden units, and it should also be preferred due to the lower variance of the gradients, as was pointed out in [26]. The distribution of the linear combination before the nonlinearity in this case is p(h) = N((z ⊙ µ_v)^T x + µ_b, √(((z ⊙ σ_v)²)^T x² + σ_b²)).

Obtaining stochastic gradients for the spike variables in this scenario can be done via the same estimator that was used for the weights, i.e. equation 3.20. Furthermore, in order to achieve lower variance in the gradients, we draw a different spike sample z for each datapoint in the minibatch.
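A minimal numpy sketch (not from the thesis) of this construction is given below: the pre-activation Gaussian of equation (3.21) is sampled directly from its mean and standard deviation, and a separate Bernoulli spike sample per datapoint then drops whole hidden units. All parameters are illustrative stand-ins for learned variational parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_out = 32, 100, 50

x = rng.standard_normal((N, d_in))                      # minibatch of inputs
mu_w = 0.05 * rng.standard_normal((d_in, d_out))        # weight posterior means
var_w = np.exp(-6.0) * np.ones((d_in, d_out))           # weight posterior variances
mu_b, var_b = np.zeros(d_out), np.exp(-6.0) * np.ones(d_out)
pi = 0.8 * np.ones(d_out)                               # spike "keep" probability per unit

# Implicit Gaussian slab over the pre-activation h ("local reparametrization trick").
mu_h = x @ mu_w + mu_b                                  # mean
sigma_h = np.sqrt((x ** 2) @ var_w + var_b)             # standard deviation
h = mu_h + sigma_h * rng.standard_normal((N, d_out))    # reparametrized sample

# A different spike sample z for each datapoint in the minibatch, then the activation.
z = rng.binomial(1, pi, size=(N, d_out))
h_tilde = np.maximum(h * z, 0.0)
```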


Chapter 4

Learning Invariant Representations

In “Representation Learning” one tries to find representations of the data that are informative for a particular task while removing the factors of variation that are uninformative and are typically detrimental for the task under consideration. Uninformative dimensions are often called “noise” or “nuisance variables”, while informative dimensions are usually called latent or hidden factors of variation. Many machine learning algorithms can be understood in this way: principal component analysis, nonlinear dimensionality reduction and latent Dirichlet allocation are all models that extract informative factors (dimensions, causes, topics) of the data. On the other hand, linear discriminant analysis and (deep) neural networks learn representations that are good for classification.

In this chapter we consider the scenario where we wish to learn latent representations from which (almost) all of the information about certain known factors of variation is purged, while still retaining as much information about the data as possible. In other words, we want a latent representation z that is maximally informative about an observed random variable y (e.g., class label) while minimally informative about a sensitive variable s. This problem was considered before by the authors of [27], who coined the term “fair representations”.

Fair representations can be useful for many problems. Imagine that you need to conduct random searches at an airport. You would like to search the persons that are most likely to commit a terrorist act. However, the law prohibits selecting people based on race and gender. Our algorithm can be used to learn representations that are cleansed of race and gender information but retain as much information as possible about terroristic tendencies.¹ This application is closely related to concerns around privacy: one may wish to publish information about users or patients while guaranteeing that all information about specific user or patient attributes is obfuscated (such as “disease”).

¹ There is actually a trade-off: completely removing sensitive information may result in a useless representation.

While in this example removing information about s will be detrimental to the task at hand, there are also situations where removing information can be helpful. Consider the case where one wants to learn a classifier to predict Alzheimer’s disease based on MRI images acquired in different hospitals. As it happens, one hospital treats patients with more advanced forms of Alzheimer’s than the other. Therefore, the variable “hospital ID” will be strongly correlated with the probability of detecting Alzheimer’s disease. However, this regularity in the data is unlikely to generalize to future patients in different hospitals and should be removed so as to improve the classifier. This is a form of “Domain Adaptation”.

Therefore, to accommodate both of these applications, we can formulate this problem as learning representations that are invariant with respect to a-priori known variations s. In the subsequent section we will show how this can be achieved in both an unsupervised and a (semi-)supervised way (incorporating label information) by a Variational Auto-Encoder [6, 7], i.e. a neural network that can be seen as a generative model.

4.1 Variational Fair Auto-Encoder

In this section we will introduce a novel model called the Variational Fair Auto-Encoder (VFAE), which is based on (deep) Variational Auto-Encoders (VAE) [6, 7]. These models can naturally encourage separation between latent variables z and sensitive variables s by using factorized priors p(s)p(z). Furthermore, since some dependencies may still remain when mapping data-cases to their hidden representation using the variational posterior qφ(z|x, s), we also employ a “Maximum Mean Discrepancy” term [12] that penalizes differences between all-order moments of the marginal posterior distributions qφ(z|s = k) and qφ(z|s = k′) (for discrete random variables s).

4.1.1 Unsupervised model

Figure 4.1: Unsupervised Model

Figure 4.2: Semi-Supervised Model

Factoring out undesired variations from the data can be easily formulated as a general probabilistic model which admits two distinct (independent) “sources”: an observed variable s, which denotes the variations that we want to remove, and a continuous latent variable z which models all the remaining information. This generative process can be formally defined as:

z ∼ p(z); x ∼ pθ(x|z, s)

where pθ(x|z, s) is an appropriate probability distribution for the data we are modelling. With this formulation we explicitly encode a notion of “invariance” in our model, since the latent representation is, marginally, independent of the factors of variation s. Therefore the problem of finding an invariant representation for a data point x and variation s can be cast as performing inference on this graphical model and obtaining the posterior distribution of z, p(z|x, s).

For our model we will employ a Variational Auto-Encoder architecture; namely, we will parametrize the generative model (decoder) pθ(x|z, s) and the variational posterior (encoder) qφ(z|x, s) as (deep) neural networks which accept as inputs z, s and x, s respectively and produce the parameters of each distribution after a series of non-linear transformations. Both the model (θ) and variational (φ) parameters will be jointly optimized with the SGVB [6] algorithm according to a lower bound on the log-likelihood. This parametrization will allow us to capture most of the salient information of x in our embedding z. Furthermore, the distributed representation of a neural network allows us to better resolve the dependencies between x and s, thus yielding a better disentangling between the independent factors z and s. We choose a Gaussian posterior qφ(z|x, s) and a standard isotropic Gaussian prior p(z), which subsequently provides the following lower bound (where the Kullback-Leibler divergence can be found analytically):

\[ \sum_{n=1}^{N} \log p(x_n \mid s_n) \geq \sum_{n=1}^{N} \mathbb{E}_{q_\phi(z_n \mid x_n, s_n)}[\log p_\theta(x_n \mid z_n, s_n)] - KL(q_\phi(z_n \mid x_n, s_n)\,\|\,p(z)) = \sum_{n=1}^{N} \mathcal{F}(\phi, \theta; x_n, s_n) \tag{4.1} \]


with qφ(zn|xn, sn) = N (zn|fφ(µ; xn, sn), fφ(σ; xn, sn)) and pθ(xn|zn, sn) = fθ(xn; zn, sn) with fθ(xn; zn, sn) being an appropriate probability distribution for the data we are modelling.
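As an illustration (not the thesis implementation), the sketch below computes a single-sample Monte Carlo estimate of the lower bound (4.1) for one minibatch, using a linear Gaussian encoder q(z|x, s), a linear Bernoulli decoder p(x|z, s) and the analytic KL term against the standard Gaussian prior; all weights are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_x, d_z = 16, 20, 5

x = rng.binomial(1, 0.5, size=(N, d_x)).astype(float)    # binary data
s = rng.binomial(1, 0.5, size=(N, 1)).astype(float)      # binary nuisance variable
xs = np.concatenate([x, s], axis=1)

# Encoder q(z | x, s): mean and log-variance of a diagonal Gaussian.
W_mu = rng.standard_normal((d_x + 1, d_z))
W_logvar = rng.standard_normal((d_x + 1, d_z))
mu_z, logvar_z = xs @ W_mu, xs @ W_logvar

# Reparametrized sample z ~ q(z | x, s).
z = mu_z + np.exp(0.5 * logvar_z) * rng.standard_normal((N, d_z))

# Decoder p(x | z, s): Bernoulli probabilities parametrized by logits.
W_dec = rng.standard_normal((d_z + 1, d_x))
logits = np.concatenate([z, s], axis=1) @ W_dec
log_px = np.sum(x * logits - np.logaddexp(0.0, logits), axis=1)   # log p(x | z, s)

# Analytic KL(q(z | x, s) || N(0, I)) per datapoint.
kl = 0.5 * np.sum(mu_z ** 2 + np.exp(logvar_z) - logvar_z - 1.0, axis=1)

lower_bound = np.mean(log_px - kl)
print(lower_bound)
```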

4.1.2 Semi-Supervised model

Factoring out variations in an unsupervised way can however be harmful in case that we want to use this invariant representation for a subsequent prediction task. In particular, if we have a situation where the nuisance variable s and the actual label y are correlated, then training an unsupervised model could yield random or degenerate representations with respect to y. Therefore it is more appropriate to try to ‘inject’ the information about the label during the feature extraction phase. This can be quite simply achieved by introducing a second ‘layer’ of latent variables to our generative model where we try to correlate z with the prediction task. Assuming that the invariant features are now called z1, we enrich the generative story by similarly providing two distinct (independent) sources for z1; a discrete variable y which denotes the label of the data point x and a continuous latent variable z2 which encodes the variation on z1 that is not explained by y (x dependent noise). The process now can be formally defined as:

y, z2 ∼ Cat(y)p(z2); z1 ∼ pθ(z1|z2, y); x ∼ pθ(x|z1, s)

Similarly to the unsupervised case we use a variational auto-encoder and jointly optimise the variational and model parameters. The lower bound now becomes:

\[ \sum_{n=1}^{N} \log p(x_n \mid s_n) \geq \sum_{n=1}^{N} \mathbb{E}_{q_\phi(z_{1n}, z_{2n}, y_n \mid x_n, s_n)}\big[ \log p(z_{2n}) + \log p(y_n) + \log p_\theta(z_{1n} \mid z_{2n}, y_n) + \log p_\theta(x_n \mid z_{1n}, s_n) - \log q_\phi(z_{1n}, z_{2n}, y_n \mid x_n, s_n) \big] \tag{4.2} \]

where we assume that the posterior q_φ(z_1n, z_2n, y_n | x_n, s_n) is factorised as q_φ(z_1n, z_2n, y_n | x_n, s_n) = q_φ(z_1n | x_n, s_n) q_φ(y_n | z_1n) q_φ(z_2n | z_1n, y_n), and where:

\[ q_\phi(z_{1n} \mid x_n, s_n) = \mathcal{N}(z_{1n} \mid f_\phi(\mu; x_n, s_n), f_\phi(\sigma; x_n, s_n)) \tag{4.3} \]
\[ q_\phi(y_n \mid z_{1n}) = \mathrm{Cat}(y_n \mid f_\phi(\pi; z_{1n})) \tag{4.4} \]
\[ q_\phi(z_{2n} \mid z_{1n}, y_n) = \mathcal{N}(z_{2n} \mid f_\phi(\mu; z_{1n}, y_n), f_\phi(\sigma; z_{1n}, y_n)) \tag{4.5} \]
\[ p_\theta(z_{1n} \mid z_{2n}, y_n) = \mathcal{N}(z_{1n} \mid f_\theta(\mu; z_{2n}, y_n), f_\theta(\sigma; z_{2n}, y_n)) \tag{4.6} \]
\[ p_\theta(x_n \mid z_{1n}, s_n) = f_\theta(x_n; z_{1n}, s_n) \tag{4.7} \]


with fθ(x_n; z_1n, s_n) again being an appropriate probability distribution for the data we are modelling. It is clear that, under this model, we can use qφ(y|z1) to make predictions about the label y from the representation z1 that is not polluted by the s-specific variations. A graphical representation of the flow of computation in the proposed encoder-decoder architecture is presented in figure 4.3.

Figure 4.3: Encoder-Decoder diagram for the proposed VAE model. Each ellipse corresponds to an intermediate variable. Solid arrows correspond to deterministic neural network transitions that accept variables as input and produce as output the parameters of the probability distribution over the target. Dashed lines correspond to optional transitions/distributions.

The model proposed here can be seen as an extension to the ‘stacked M1+M2’ model originally proposed in [11], where we have additionally introduced the nuisance variable s during the feature extraction. Thus following [11] we can also handle the ‘semi-supervised’ case, i.e. missing labels. In case that the label is observed the lower bound takes the following form (exploiting the fact that we can compute the Kullback-Leibler divergences explicitly in our case):

\[
\begin{aligned}
\sum_{n=1}^{N_s} \mathcal{L}_s(\phi, \theta; x_n, s_n, y_n) = \sum_{n=1}^{N_s} \Big( &\mathbb{E}_{q_\phi(z_{1n} \mid x_n, s_n)}\big[ -KL(q_\phi(z_{2n} \mid z_{1n}, y_n)\,\|\,p(z_2)) + \log p_\theta(x_n \mid z_{1n}, s_n) \big] \\
+\; &\mathbb{E}_{q_\phi(z_{2n} \mid z_{1n}, y_n)}\big[ -KL(q_\phi(z_{1n} \mid x_n, s_n)\,\|\,p_\theta(z_{1n} \mid z_{2n}, y_n)) \big] \Big)
\end{aligned}
\tag{4.8}
\]

and in the case that it is not observed we use q(yn|z1n) to ‘impute’ our data:

\[
\begin{aligned}
\sum_{m=1}^{M} \mathcal{L}_u(\phi, \theta; x_m, s_m) = \sum_{m=1}^{M} \Big( &\mathbb{E}_{q_\phi(z_{1m} \mid x_m, s_m)}\big[ -KL(q_\phi(y_m \mid z_{1m})\,\|\,p(y_m)) + \log p_\theta(x_m \mid z_{1m}, s_m) \big] \\
+\; &\mathbb{E}_{q_\phi(z_{1m} \mid x_m, s_m)\, q_\phi(y_m \mid z_{1m})}\big[ -KL(q_\phi(z_{2m} \mid z_{1m}, y_m)\,\|\,p(z_2)) \big] \\
+\; &\mathbb{E}_{q_\phi(y_m \mid z_{1m})\, q_\phi(z_{2m} \mid z_{1m}, y_m)}\big[ -KL(q_\phi(z_{1m} \mid x_m, s_m)\,\|\,p_\theta(z_{1m} \mid z_{2m}, y_m)) \big] \Big)
\end{aligned}
\tag{4.9}
\]


therefore the final objective function is:

\[ \mathcal{F}_{\mathrm{VAE}}(\phi, \theta; x_n, x_m, s_n, s_m, y_n) = \sum_{n=1}^{N} \mathcal{L}_s(\phi, \theta; x_n, s_n, y_n) + \sum_{m=1}^{M} \mathcal{L}_u(\phi, \theta; x_m, s_m) + \alpha \sum_{n=1}^{N} \mathbb{E}_{q_\phi(z_{1n} \mid x_n, s_n)}[-\log q_\phi(y_n \mid z_{1n})] \tag{4.10} \]

where the last term is introduced so as to ensure that the predictive posterior qφ(y|z1) learns from both labeled and unlabeled data. This semi-supervised model will be referred to as “VAE” in our experiments.

However, there is a subtle difference between [11] and our approach. Instead of training each layer of stochastic variables separately, we optimize the model jointly. The potential advantages of this approach are twofold: as we previously mentioned, if the label y and the nuisance information s are correlated, then training a (conditional) feature extractor separately poses the danger of creating a degenerate representation with respect to the label y. Furthermore, the label information will better ‘guide’ the feature extraction towards the more salient parts of the data, thus maintaining most of the (predictive) information.

4.1.3 Further invariance via Maximum Mean Discrepancy

Despite the fact that we have a model that encourages statistical independence between s and z1 a-priori, we might still have some dependence in the (approximate) marginal posterior qφ(z1|s). In particular, this can happen if the label y is correlated with the sensitive variable s, which can allow information about s to “leak” into the posterior. Thus instead we could maximise a “penalised” lower bound where we impose some sort of regularisation on the marginal qφ(z1|s). This technique is reminiscent of “posterior regularisation”, which has been explored in a different context in [28]. In the following we will describe one way to achieve this regularisation, through the Maximum Mean Discrepancy (MMD) [12] measure.

4.1.3.1 Maximum Mean Discrepancy

Consider the problem of determining whether two datasets {X} ∼ P0 and {X′} ∼ P1 are drawn from the same distribution, i.e., P0 = P1. A simple test is to consider the distance between empirical statistics ψ(·) of the two datasets:

\[ \left\| \frac{1}{N_0} \sum_{i=1}^{N_0} \psi(x_i) - \frac{1}{N_1} \sum_{i=1}^{N_1} \psi(x'_i) \right\|^2. \tag{4.11} \]


Expanding the square yields an estimator composed only of inner products, on which the kernel trick can be applied. The resulting estimator is known as the maximum mean discrepancy (MMD) [12]:

\[ \ell_{\mathrm{MMD}}(X, X') = \frac{1}{N_0^2} \sum_{n=1}^{N_0} \sum_{m=1}^{N_0} k(x_n, x_m) + \frac{1}{N_1^2} \sum_{n=1}^{N_1} \sum_{m=1}^{N_1} k(x'_n, x'_m) - \frac{2}{N_0 N_1} \sum_{n=1}^{N_0} \sum_{m=1}^{N_1} k(x_n, x'_m). \tag{4.12} \]

Asymptotically, for a universal kernel such as the Gaussian kernel k(x, x′) = exp(−γ‖x − x′‖²), ℓ_MMD(X, X′) is 0 if and only if P0 = P1. Equivalently, minimising MMD can be viewed as matching all of the moments of P0 and P1. Using MMD provides a way to use flexible, distributed representations and achieve invariance in the higher-order moments.
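The following is a minimal numpy implementation (not tied to any specific library) of the empirical estimator in equation (4.12) with the Gaussian kernel; the two sample sets and the bandwidth γ are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B.
    sq_dists = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq_dists)

def mmd(X, X_prime, gamma=1.0):
    n0, n1 = X.shape[0], X_prime.shape[0]
    return (gaussian_kernel(X, X, gamma).sum() / n0 ** 2
            + gaussian_kernel(X_prime, X_prime, gamma).sum() / n1 ** 2
            - 2.0 * gaussian_kernel(X, X_prime, gamma).sum() / (n0 * n1))

rng = np.random.default_rng(0)
same = mmd(rng.standard_normal((200, 5)), rng.standard_normal((200, 5)))
shifted = mmd(rng.standard_normal((200, 5)), 1.0 + rng.standard_normal((200, 5)))
print(same, shifted)   # the mean-shifted pair yields a clearly larger MMD
```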

Therefore, it is an appropriate penalty for removing sensitive information from a representation, since it can measure the “distance” between two distributions via the difference in their statistics. We can use it as an extra “regulariser” and force the model to try to match the moments between the marginal posterior distributions of our latent variables, i.e. qφ(z1|s = 0) and qφ(z1|s = 1) (in the case of binary nuisance information s). By adding the MMD penalty into the lower bound of our aforementioned VAE architecture we obtain our proposed model, the “Variational Fair Auto-Encoder” (VFAE):

\[ \mathcal{F}_{\mathrm{VFAE}}(\phi, \theta; x_n, x_m, s_n, s_m, y_n) = \sum_{n=1}^{N} \mathcal{L}_s(\phi, \theta; x_n, s_n, y_n) + \sum_{m=1}^{M} \mathcal{L}_u(\phi, \theta; x_m, s_m) + \alpha \sum_{n=1}^{N} \mathbb{E}_{q_\phi(z_{1n} \mid x_n, s_n)}[-\log q_\phi(y_n \mid z_{1n})] - \beta\, \ell_{\mathrm{MMD}}(Z_1^{s=0}, Z_1^{s=1}) \tag{4.13} \]

where:

\[
\begin{aligned}
\ell_{\mathrm{MMD}}(Z_1^{s=0}, Z_1^{s=1}) &= \left\| \mathbb{E}_{p(x \mid s=0)}\big[ \mathbb{E}_{q(z_1 \mid x, s=0)}[\psi(z_1)] \big] - \mathbb{E}_{p(x \mid s=1)}\big[ \mathbb{E}_{q(z_1 \mid x, s=1)}[\psi(z_1)] \big] \right\|^2 \\
&= \left\| \frac{1}{N_{s=0}} \sum_{m=1}^{N_{s=0}} \mathbb{E}_{q(z_{1m} \mid x_m, s_m=0)}[\psi(z_{1m})] - \frac{1}{N_{s=1}} \sum_{n=1}^{N_{s=1}} \mathbb{E}_{q(z_{1n} \mid x_n, s_n=1)}[\psi(z_{1n})] \right\|^2
\end{aligned}
\tag{4.14}
\]

In case we have more than two states for the nuisance information s, we minimise the MMD penalty between each marginal q(z|s = k) and q(z), i.e. Σ_{k=1}^{K} ℓ_MMD(Z_1^{s=k}, Z_1), for all K possible states of s.

One could argue that instead of having a variational auto-encoder with the marginal independence properties in the prior and the MMD penalty, we could have paired a regular VAE (without s) with the MMD penalty. However, in this case the objective function would be degenerate and harmful for the generative model, since removing the nuisance information from z would result in a significant loss in the reconstruction ability of the model. On the contrary, by introducing s into our generative model we do not have this effect, since the information that we lose from z is essentially “stored” in the parameters that correspond to the nuisance variables s.

Furthermore, while we motivated the addition of the MMD penalty to our lower bound as necessary for handling the cases where the nuisance variables s are correlated with the label y, we still expect improvement even when they are not. The MMD objective is entirely complementary; the VAE with the factorized prior essentially tries to provide a posterior that achieves marginal independence between z1 and s, or in other words removes information about s from z1. As a result, we obtain a posterior distribution qφ(z1|s) that is invariant with respect to s and, consequently, also satisfies the MMD penalty, i.e. when we have true marginal independence then ℓ_MMD will be 0.

4.1.4 Fast MMD via Random Fourier Features

A naive implementation of MMD in minibatch stochastic gradient descent would require computing an $M \times M$ Gram matrix for each minibatch during training, where M is the minibatch size. Instead, we can use random kitchen sinks [29] to compute a feature expansion such that the simple estimator (4.11) approximates the full MMD (4.12). To compute this, we draw a random $K \times D$ matrix W, where K is the dimensionality of the latent features and each entry of W is drawn from a standard Gaussian. The feature expansion is then given as:

$$\psi_W(x) = \sqrt{\frac{2}{D}} \cos\left( \sqrt{\frac{2}{\gamma}}\, x W + b \right). \tag{4.15}$$

where b is a D-dimensional random vector with entries drawn uniformly from [0, 2π]. The idea of using random kitchen sinks has been successfully applied to approximate MMD in [30]. This estimator is fairly accurate and typically much faster to compute than the full MMD penalty.
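A numpy sketch of this approximation is shown below: the expansion follows (4.15) and is plugged into the simple estimator (4.11). The variable names and the default number of random features are illustrative choices.

import numpy as np

def random_fourier_features(X, W, b, gamma=1.0):
    # Equation (4.15): psi_W(x) = sqrt(2/D) * cos(sqrt(2/gamma) * x W + b).
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(np.sqrt(2.0 / gamma) * (X @ W) + b)

def fast_mmd2(X0, X1, D=500, gamma=1.0, seed=0):
    # Linear-time MMD estimate: plug the random feature expansion into (4.11).
    rng = np.random.RandomState(seed)
    K = X0.shape[1]                          # dimensionality of the latent features
    W = rng.randn(K, D)                      # random K x D projection, standard Gaussian entries
    b = rng.uniform(0.0, 2.0 * np.pi, D)     # random phases in [0, 2*pi]
    mean0 = random_fourier_features(X0, W, b, gamma).mean(axis=0)
    mean1 = random_fourier_features(X1, W, b, gamma).mean(axis=0)
    return float(np.sum((mean0 - mean1) ** 2))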

4.2 Related work

There is a significant amount of similar work in the literature. Most related to our "fair" representations view is the work done in [27]. They proposed a neural-network based semi-supervised clustering model for learning fair representations. The idea was to learn a localized representation that maps each datapoint to a cluster in such a way that each cluster gets assigned roughly equal proportions of data from each group in s. Although their approach was successfully applied on several datasets, the restriction to clustering means that it cannot leverage the representational power of a distributed representation. Furthermore, this penalty does not account for higher order moments in the latent distribution. For example, if $p(z_k = 1|x_i, s=0)$ always returns 1 or 0, while $p(z_k = 1|x_i, s=1)$ returns values between 0 and 1, then the penalty could still be satisfied, but information could still leak through. Both of these issues were addressed in our model.

Domain adaptation can also be cast as learning representations that are "invariant" with respect to a discrete variable s, the domain. Most similar to our work are neural network approaches which try to match the feature distributions between the domains. This was performed in an unsupervised way with mSDA [31] by training denoising auto-encoders jointly on all domains, thus implicitly obtaining a representation general enough to explain both the domain and the data. This is in contrast to our approach, where we instead try to learn representations that explicitly remove domain information during the learning process. For the latter we find more similarities with "domain-regularized" supervised approaches that simultaneously try to predict the label for a datapoint and remove domain-specific information. This is done with either MMD [32, 33] or adversarial [34] penalties at the hidden layers of the network. In our model, however, the main "domain-regularizer" stems from the independence properties of the prior over the domain and the latent representations. We also employ MMD in our model, but from a different perspective, since we consider a slightly more difficult case where the domain s and the label y are correlated; we need to ensure that we remain as "invariant" as possible, since $q_\phi(y|z_1)$ might "leak" information about s.


Chapter 5

VSSNN Experiments

Our main motivation for deriving the Spike and Slab distribution for a Neural Network was generalization on under-determined problems, namely datasets where the number of features is significantly higher than the number of datapoints (n ≪ d). For this reason our experiments were performed primarily on datasets that correspond to this situation. Nevertheless, we also performed experiments on MNIST, which is the traditional benchmark dataset for classification. On this dataset we expect that the SS distribution would not be as beneficial, since there are enough datapoints to achieve robust estimation of the parameters.

5.1 Datasets

5.1.1 IS-PRO

This dataset is composed of 99 patients (datapoints), where each patient is represented by 2283 features. It was obtained from the Vrije University Medical Centre (VUmc), and more specifically from the IS-Diagnostics team¹ [35]. Each feature represents the abundance (intensity of a DNA fragment) of a specific bacterium in the patient's gut. The objective is to classify whether a patient is "Healthy", has "Crohn's" disease², or has "Colitis Ulcerosa"³, i.e. a three-class classification problem. We also experimented with a binary "Healthy" vs. "Non-Healthy" version of the problem, since, according to the IS-Diagnostics team, both "Crohn's" disease and "Colitis Ulcerosa" are subtypes of another disease, Inflammatory Bowel Disease (IBD)⁴.

¹ http://www.isdiagnostics.nl/team.html
² https://en.wikipedia.org/wiki/Crohn%27s_disease
³ https://en.wikipedia.org/wiki/Ulcerative_colitis
⁴ https://en.wikipedia.org/wiki/Inflammatory_bowel_disease


5.1.2 Gene Expression Data

We also experimented on the preprocessed gene expression datasets that were used in [36]. The specifications of these datasets fall well within our domain; we again have under-determined problems with significantly more features than datapoints. Table 5.1 summarizes the dataset characteristics. More information about these datasets can be found in [37].

Table 5.1: Dataset statistics

Dataset    # instances    # features    # classes
IS-PRO     99             2283          2-3
Prostate   102            6033          2
Colon      62             2000          2

5.2 Experimental Setup

For the IS-PRO, Prostate and Colon datasets the neural network hyper-parameters were the same; one hidden layer of 100 units with softplus (f(a) = log(1 + exp(a))) non-linearities. The prior for the slab variables v was set to a standard isotropic Gaussian N(0, I), whereas the prior for the spike variables z was set to a Bernoulli with a low probability of success, Bern(0.1). We used minibatches of 10 datapoints along with 10 posterior samples per minibatch. Optimization was done with Adam [38] using moment coefficients β1 = 0.1, β2 = 0.001 and learning rate α = 0.001 for the networks that did not have spike variables z, and α = 0.01 for the networks that did. We found empirically that a higher learning rate was needed in the latter case, so as to allow the parameters of q(z) to converge. The only preprocessing applied to the data was standardization, i.e. subtracting from each feature its mean and dividing it by its standard deviation.
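As an illustration, a minimal numpy sketch of the two concrete recipes mentioned here, the softplus nonlinearity and the per-feature standardization; computing the statistics on the training split only is an assumption, not something stated in the text.

import numpy as np

def softplus(a):
    # f(a) = log(1 + exp(a)), evaluated in a numerically stable way.
    return np.logaddexp(0.0, a)

def standardize(X_train, X_test):
    # Subtract each feature's mean and divide by its standard deviation
    # (statistics from the training split only -- an assumption here).
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8
    return (X_train - mu) / sigma, (X_test - mu) / sigma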

As a baseline we used two of the most popular algorithms that provide sparse parameter estimation: an L1 regularized Logistic Regression and an L1 regularized Support Vector Machine (SVM). Due to the flexibility in the choice of distributions for the Variational Spike and Slab Neural Network (VSSNN), we had many different settings in our experiments. "VSSNN HU" corresponds to a VSSNN where a spike and slab distribution is placed on the hidden units, whereas "VSSNN W" corresponds to a VSSNN where the spike and slab distribution is placed on the weights W. "Dropout" and "DropConnect" similarly correspond to spike and slab distributions where the parameters of q(z) are not learned and stay fixed at π = 1 − drop rate. We used drop rate = 0.2 for the input and drop rate = 0.5 for the hidden layer in our experiments. "delta" and "slab" correspond to a delta and a Gaussian posterior distribution for the slab variable v, respectively. Finally, "BNN" and "L2Net" correspond to a neural network with either a Gaussian slab posterior or a delta slab posterior and no spike variables.
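A tiny sketch of how the fixed-rate masks for these baselines could be drawn, using the rates from the text; the shapes and everything else are illustrative.

import numpy as np

rng = np.random.RandomState(0)

def spike_mask(shape, drop_rate):
    # Fixed spike variables z ~ Bern(pi) with pi = 1 - drop_rate (not learned).
    return rng.binomial(1, 1.0 - drop_rate, size=shape)

# Rates used in the experiments: 0.2 at the input layer, 0.5 at the hidden layer.
input_mask = spike_mask((10, 2283), drop_rate=0.2)    # minibatch of 10 IS-PRO datapoints
hidden_mask = spike_mask((10, 100), drop_rate=0.5)    # 100 hidden units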

For the MNIST experiment we trained six models: the standard L2 regularized neural network (which constitutes our baseline), a Bayesian neural network with Gaussian posterior distributions on the weights and biases, and neural networks that had spike and slab distributions either on the weights or on the input and hidden units, with both deterministic and Gaussian slab variables. We experimented with a varying number of hidden layers and hidden units per layer. Our objective here is to show the relative gains in performance among the model choices and not the actual accuracy per se. The prior for the slab variables v was again set to p(v) = N(0, I), whereas the prior for the spike variables z now had a much higher probability of success (since we expect to "keep more parameters"), p(z) = Bern(0.5).

Sampling for the continuous parameters was done locally using the "local reparametrization trick" [26], as described in Section 3.2.4. For prediction we used the MAP estimate of our parameters, i.e. $\tilde{h} = \pi \mu_h$ for the adaptive Dropout, $\tilde{W} = \pi \mu_W$ for the adaptive DropConnect, and $\tilde{h} = \mu_h$, $\tilde{W} = \mu_W$ when we had no spike variables.
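A loose sketch of this prediction rule; here mu stands for the posterior mean of a weight matrix or hidden activation and pi for the corresponding spike (inclusion) probabilities, and the function is an illustration rather than the thesis implementation.

import numpy as np

def map_estimate(mu, pi=None):
    # MAP-style estimate used at prediction time: scale the posterior mean by the
    # inclusion probability (adaptive Dropout / DropConnect), or keep the mean
    # unchanged when there are no spike variables.
    return mu if pi is None else pi * mu

# Example (illustrative shapes): W_tilde = pi_W * mu_W for adaptive DropConnect,
# and h_tilde = mu_h when no spike variables are present.
mu_W = np.random.randn(2283, 100)
pi_W = np.full((2283, 100), 0.9)
W_tilde = map_estimate(mu_W, pi_W)
h_tilde = map_estimate(np.random.randn(10, 100))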

5.3 Classification experiments

5.3.1 Small datasets

Table 5.2 presents the results for the IS-PRO and Gene expression datasets, obtained after 10-fold stratified cross-validation. On average the best performing method across datasets was the neural network with a spike and slab distribution on the input and hidden units, followed by the neural network with a delta slab and spike distribution on the weights and the L1 Logistic Regression. Furthermore, the networks without the spike variables generally had lower performance than the networks with spike variables. This serves as an indication that the introduction of these parameter "switches" helps against overfitting, thus improving generalization and performance.

For the IS-PRO dataset the neural network with a spike and slab distribution on the hidden units performs better than the other methods. It achieved 8% and 7% relative improvement over the simple baselines for the binary and multi-class problems respectively. Interestingly, the Bayesian neural network with a Gaussian posterior performed poorly (even worse than the simple L2 regularized neural network) on this dataset. Probably the high dimensionality of the dataset, along with the introduction of noise to all of the parameters during the learning process, prevented the network from reaching a good minimum. This effect makes spike and slab distributions even more favourable for the n ≪ p setting; by including the spike variables in the model we are able to better guide the learning process towards more salient information, as well as reduce the noise that is injected into the slab posteriors.

Table 5.2: Classification accuracy on small datasets

Method              IS-PRO Binary   IS-PRO Multi   Prostate   Colon   Overall
L1 SVM              0.798           0.646          0.902      0.839   0.796
L1 LR               0.848           0.677          0.902      0.855   0.821
BNN                 0.808           0.626          0.892      0.823   0.787
L2Net               0.818           0.657          0.755      0.855   0.771
Dropout slab        0.828           0.657          0.902      0.839   0.807
Dropout delta       0.828           0.677          0.784      0.839   0.782
DropConnect slab    0.838           0.667          0.882      0.839   0.807
DropConnect delta   0.768           0.646          0.735      0.871   0.755
VSSNN HU slab       0.919           0.727          0.922      0.839   0.852
VSSNN HU delta      0.838           0.606          0.794      0.823   0.765
VSSNN W slab        0.838           0.636          0.931      0.855   0.815
VSSNN W delta       0.848           0.687          0.902      0.855   0.823

For the gene expression datasets (Prostate and Colon) the neural network with a spike and slab distribution on the weights performed better than the equivalent neural network that had the distribution on the hidden units. Nevertheless, the performance of both models was similar to (Colon dataset) or better than (Prostate dataset) the simple baselines. For the Colon dataset, it seems that having a delta slab is better than a Gaussian slab distribution. In addition, learning the drop rates for the spike variables did not help there, since "DropConnect delta" (which had fixed drop rates) achieved the best performance. However, for the Prostate dataset there is another interesting observation; it has the highest dimensionality among the small datasets, and it is exactly on this dataset that we see the biggest gap between the performance of the simple L2 regularized neural network and the neural network with a spike and slab distribution (an 18% drop in performance for the "L2Net"). Furthermore, all the neural networks with a delta slab distribution had worse performance than their equivalent Gaussian slab counterparts. This exactly demonstrates the severe overfitting of the point estimates of the parameters in the n ≪ p scenario.
