Collaborative Filtering with Variational Autoencoders and Normalizing Flows


MSc Artificial Intelligence

Track: Learning Systems

Master Thesis


by

Francesco Stablum

6200982

August 28, 2018

36 EC Supervisors: Christos Louizos, Msc Assessors: Dr. Mettes Pascal Prof. Dr. Max Welling Dr. Miguel Angel Rios Gaona Dr. Wilker Ferreira Aziz


Abstract

In this work we integrate collaborative filtering models that make use of Stochastic Gradient Variational Bayes with more recent improvements to the approximation of the posterior distribution, such as Planar and RealNVP Normalizing Flows. A model based on the AutoRec collaborative filtering autoencoder is used as a baseline, in order to compare it to our Variational-Autoencoder-based model, named VaeRec, and to its variant VaeRec-NF, which makes use of Normalizing Flows. Modifications to gradient-based parameter update algorithms are introduced in order to take into account the sparsity of the data tensors. An extensive hyperparameter search is performed and regularizing techniques are investigated, such as soft free bits, which applies an adaptive coefficient to the Kullback-Leibler divergence of the variational lower bound. Methods to prevent gradient explosion are also utilized. A novel collaborative filtering input schema that makes use of the concatenation of user and item vectors has been tried, alongside inputs that make use of solely the item or user vectors.


Preface

When the topic of collaborative filtering was proposed to me, I accepted with enthusiasm. The idea of being able to infer user-on-item preferences without any description of either users or items fascinated me. How would it be possible to do machine learning having only relational information between different entities? This lesser-known application of machine learning still puzzles me and makes me wonder about the incredible potential of these models.

This thesis has been a long journey with peaks and plateaus, in which I could experience both the excitement of attempting new ideas on how to solve the problem and the need to reconsider expectations. This is a normal part of the life of any research scientist and I’m glad to have had the opportunity to get to know what this is all about.

The main aspect that motivates me in this thesis and, in a broader scope, in Machine Learning and Artificial Intelligence is the sheer amount of new discoveries and techniques that are being relentlessly produced by the scientific community, together with my desire to combine the state of the art in terms of neural models, update algorithms, regularization techniques and probabilistic interpretations in order to push the boundary of the best achievable precision of the predictions. I have always been interested in the nature of human conceptualization and how this can be related to computability. This thesis allowed me to get an additional perspective on this matter.

I believe that Collaborative Filtering techniques will find a broader application that goes way beyond mere user/item rating prediction. They provide another way to model learning by association, in which observations of how objects interact lead to answers about what these objects actually are, also via interpretation of their location in the so-called latent space.

I hope that the reader will find interesting how techniques that are typically used for dimensionality reduction and probabilistic inference, with variational approximations, have been employed for attempting a solution of user/item rating modeling. I tried my best to derive all necessary math in order to lead the reader to understand, step by step, the topics of variational inference, the variational autoencoder and improvements to the approximation such as the normalizing flows.

I also included a description of the experiments and their results: many attempts at combining various algorithms in search of the synthesis that would lead to the best results, together with some attempts at explaining the different outcomes.

Acknowledgements

I would like to express my gratitude to my adviser Christos Louizos for the considerable patience and useful advice that got me unstuck many times, to professor Max Welling for the thesis topic, to the team behind the DAS4 supercomputer [Bal et al., 2016] for letting me use their computational resources very intensely, and to my company, ORTEC, and my colleagues for allowing me flexibility.


Chapter 1 Introduction

This work presents an exploration of the use of Variational Autoencoders for collaborative filtering. The baseline model has been chosen to be AutoRec [Sedhain et al., 2015], which uses a latent representation to reconstruct missing ratings. The natural evolution of this model has been considered to be a similar model based on a Variational Autoencoder, which we called VaeRec. Various extensions to this model have been examined. Specifically, Normalizing Flows transformations of the posterior approximation have been investigated, as well as regularization techniques.

The structure of this thesis reflects how this research has evolved in time: Chapter 2 offers an overview of the notions required to delve into the actual contributions of the model; Chapter 3 presents similar models that have been used as inspiration; Chapter 4 presents our models with their variants; Chapter 5 illustrates the experiments that have been performed on our models; Chapter 6 summarizes the contributions of the models with insights that emerged from all the experiments. The most detailed derivations and proofs, very useful for a beginner who is trying to figure out the mathematical details of the models, have been left to the Appendix.


Chapter 2 Background

2.1 Representation Learning

Representation Learning (RL) is a developing branch of Machine Learning that has one of its focuses on extracting representation codes $Z = \{z_i\}_{i=1}^{N}$ from the datapoints in a dataset $X = \{x_i\}_{i=1}^{N}$. It is usually desirable for these representations to be characterized by properties such as low dimensionality, clusterability, increased linear separability (especially when used for further classification tasks) and intuitive "semantic" explainability of the dimensions of the learned manifold.

One common attempt to achieve such properties is the use of Principal Component Analysis (PCA), which transforms the original features of the raw input into a set of linearly uncorrelated variables. The main drawback of PCA is the assumption that the explaining dimensions are linearly related to the directions of maximum variance in the data. This assumption is not true for most complex datasets, in which the original features of a dataset might actually be the result of arbitrarily complicated nonlinear unknown functions.

For this reason alternative approaches to RL are being employed, such as Autoencoders (AE) as specific forms of Artificial Neural Networks (ANN). AEs typically have a "diabolo" shape, as an arbitrarily high-dimensional input is progressively reduced to lower dimensionalities over layers of progressively shrinking sizes. This part of the neural network is an encoder, as its purpose is to generate low-dimensional compressed codes from a high-dimensional input. The last layer of the encoder is usually the smallest layer of the network, hence called the bottleneck. This is not always the case, for example with sparse or over-complete codes. A decoder network with layers of progressively increasing size is attached to the bottleneck. The last layer of the decoder matches the dimensionality of the input layer. The learning of the network’s parameters is performed by minimizing a loss function containing the error between the reconstructed output of the network and its input. This objective function is minimized via Gradient Descent and its variants.

2.2 Collaborative Filtering

Collaborative Filtering (CF) [Bobadilla et al., 2013] is a recommendation system technique that predicts user-item ratings solely via the sparse matrix R of the available ratings given by users to items, without using any information about either users or items. The main aspect that makes CF work is that similar users are recognizable as similar by having similar ratings on the same items. Hence, it’s possible to predict a missing rating of a user on an item by considering the ratings of the users that are similar to them.

\[
R = \begin{pmatrix}
r_{1,1} & \cdots & r_{1,M} \\
\vdots & r_{i,j} & \vdots \\
r_{N,1} & \cdots & r_{N,M}
\end{pmatrix}
\tag{2.1}
\]

Row $i$ of $R$ is the sparse user row $r_{i,\cdot}$ and column $j$ is the sparse item column $r_{\cdot,j}$.

2.3 Variational inference

Bayesian inference is concerned with updating an existing hypothesis on a statistical model of a data source, using data samples empirically obtained from that data source.

In this setting, the existing model hypothesis is described by a prior distribution p(M); the probability of the samples D under the model M is called the likelihood p(D|M). The true probability of the samples D, which is usually not available, is called the evidence p(D).

By using Bayes’ theorem it’s possible to obtain the posterior distribution p(M|D) of the model M after observation of the data D:

\[
p(M|D) = \frac{p(D|M)\,p(M)}{p(D)}
\tag{2.2}
\]

In representation learning it’s assumed that each datapoint $x$ is generated by unknown (latent) variables $z$. Hence the problem of finding the generative model of the data becomes learning the parameters of a system that, given instances of $z$, is able to produce as faithfully as possible the respective datapoints $x$. In this scenario, inference is concerned with the dual problem of finding a distribution over $z$ conditioned on the datapoint $x$. The initial hypothesis on how the latent variables are distributed, which is described by the prior distribution $p_\theta(z)$, is updated with the datapoint $x$ and the likelihood $p_\theta(x|z)$ within the framework of a generative model represented by $\theta$. This framework describes how $x$ relates to a certain latent variable assignment $z$. In this new context Bayes’ rule is used to infer the posterior distribution of an arbitrary setting of the latent variables $z$:

\[
p_\theta(z|x) = \frac{p_\theta(x|z)\,p_\theta(z)}{p_\theta(x)}
\tag{2.3}
\]

As the true posterior $p_\theta(z|x)$ is typically unavailable, because $p_\theta(x) = \int_z p_\theta(x|z)\,p_\theta(z)\,dz$ is intractable, an approximation $q(z|x)$ is sought via variational inference methods. Variational inference aims to minimize the discrepancy between the approximation and the true posterior [Fox and Roberts, 2012], which is typically done by minimizing the Kullback-Leibler divergence $KL[q(z|x)\,||\,p_\theta(z|x)]$.

The KL can be decomposed into:

\[
KL[q(z|x)\,||\,p_\theta(z|x)] = \mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p_\theta(x, z)}\right] + \log p_\theta(x)
\tag{2.4}
\]

We can use the shorthand $\mathcal{L}(x) = -\mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p_\theta(x, z)}\right]$. Since $\log p_\theta(x)$ is a fixed quantity with respect to $z$ and KL quantities are always non-negative, $\mathcal{L}(x)$ is a lower bound to $\log p_\theta(x)$, and the maximization of $\mathcal{L}(x)$ necessarily implies the minimization of $KL[q(z|x)\,||\,p_\theta(z|x)]$.

This is the basis of variational inference; refer to Appendix A.2, A.3 and A.4 for details on the derivations of the lower bound.
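The decomposition (2.4) can be checked numerically on a toy model. The sketch below is only an illustration, not part of the thesis code: it assumes a discrete latent variable with three states and made-up probability tables, so that every quantity can be computed exactly.

```python
import numpy as np

# Toy model (hypothetical numbers): discrete latent z in {0, 1, 2}, one fixed datapoint x.
prior = np.array([0.5, 0.3, 0.2])           # p(z)
likelihood = np.array([0.10, 0.40, 0.70])   # p(x | z) for the observed x
q = np.array([0.2, 0.3, 0.5])               # an arbitrary approximation q(z | x)

joint = likelihood * prior                  # p(x, z)
evidence = joint.sum()                      # p(x)
posterior = joint / evidence                # p(z | x)

kl = np.sum(q * np.log(q / posterior))      # KL[q(z|x) || p(z|x)]
elbo = -np.sum(q * np.log(q / joint))       # L(x), the lower bound
log_evidence = np.log(evidence)             # log p(x)

# (2.4) rearranged: log p(x) = L(x) + KL, hence L(x) <= log p(x).
print(kl, elbo, log_evidence)
```

Replacing `q` with `posterior` drives the KL to zero and makes the bound tight.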

2.4 The Variational Auto-Encoder

[Kingma and Welling, 2013] introduced a model aimed at posterior inference on datasets with high-dimensional datapoints.

The model is based on a generator network which outputs a conditional distribution $p_\theta(x|z)$ in datapoint-space given a realization of the latent variables $z$.

The posterior distribution $p_\theta(z|x)$ is intractable, as it requires the marginal likelihood $p_\theta(x) = \int_z p_\theta(x|z)\,p_\theta(z)\,dz$; hence an approximating recognition network $q_\phi(z|x)$ is introduced, whose parameters $\phi$ are optimized via variational inference. The optimization of $\phi$ happens simultaneously with that of the parameters $\theta$.

It was also shown experimentally how a Monte Carlo approximation of the ELBO (section A.3), obtained by sampling the posterior approximation, is sufficient to achieve good learning performance.

Moreover, [Kingma and Welling, 2013] experimentally demonstrated how just a single Monte Carlo sample might achieve a good approximation.

Since values of $z$ are being sampled, this would prevent gradients from flowing in a backpropagation-like way. To circumvent this problem, a reparameterization trick has been employed by using a sample $\epsilon$ which is always drawn from a $\mathcal{N}(0, I)$ Normal distribution. By using the transformation:

\[
\hat{z} = \mu_\phi + \sigma_\phi \odot \epsilon
\tag{2.5}
\]

a sample is obtained from the Normal distribution with mean $\mu_\phi$ and standard deviation $\sigma_\phi$.
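As a minimal sketch of the reparameterization in (2.5), the NumPy snippet below draws the noise from a standard Normal and then shifts and scales it. The function name and the example values of the mean and standard deviation are assumptions made for illustration; in a real model they would be outputs of the encoder network.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z_hat = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(size=np.shape(mu))
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])      # hypothetical encoder output (mean)
sigma = np.array([0.5, 2.0])    # hypothetical encoder output (standard deviation)
samples = np.stack([reparameterize(mu, sigma, rng) for _ in range(20000)])
```

The randomness is confined to `eps`, so the sample is a deterministic, differentiable function of `mu` and `sigma`.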

The sum-based form that allows for SGD-like updates described in section A.3 and the fact that a Monte Carlo approximation is used for the approximation of one datapoint term are the reason that [Kingma and Welling, 2013] gave Stochastic Gradient Variational Bayes as a name for this technique.

2.5 Normalizing Flows

The original VAE model is characterized by having a simple diagonal-covariance Gaussian posterior approximation. In order to achieve more complex distribution forms, multi-step transformations of the latent variable z are employed.

In our work we focused on two forms: Planar Flows and RealNVP.

2.5.1 Planar Transformations

It has been proposed [Rezende and Mohamed, 2015] to achieve a more complex posterior approximation by using a type of transformation with the following form:

\[
t(z) = z + u\,h(w^\top z + b)
\tag{2.6}
\]

This transformation can be applied to a simpler distribution, such as the diagonal-covariance Gaussian introduced in [Kingma and Welling, 2013].

The parameters are: $b$, which is a scalar, $w \in \mathbb{R}^D$ and $u \in \mathbb{R}^D$; $h$ is an element-wise nonlinearity, such as a tanh.

The expression $w^\top z + b$ is a scalar value, and $h(w^\top z + b)$ can be seen as one perceptron layer with a single output unit. $u$ is a parameter that acts as a coefficient vector representing the amount of the transformation $h(w^\top z + b)$ applied to the input vector $z$.

The derivations in Appendix A.1 show how just the determinant of the Jacobian of the transformation is used in order to express the probability of the transformed variable as a function of the probability of the original variable $z_0$. For the derivation of the Jacobian please refer to Appendix A.5.
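A single planar step can be sketched as follows, assuming h = tanh. For this transformation the Jacobian is $I + u\,\psi(z)^\top$ with $\psi(z) = h'(w^\top z + b)\,w$, so by the matrix determinant lemma its determinant is $1 + u^\top \psi(z)$. The function name and argument layout are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply t(z) = z + u * tanh(w.z + b); return t(z) and log|det J|."""
    a = np.dot(w, z) + b                    # scalar pre-activation
    t = z + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w       # h'(a) * w, with h = tanh
    log_det = np.log(np.abs(1.0 + np.dot(u, psi)))
    return t, log_det
```

The log-determinant costs only a dot product, instead of a full D x D determinant, which is what makes stacking several such steps affordable.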

2.5.2 RealNVP Transformations

[Dinh et al., 2016] introduced a very simple invertible function of the form:

\[
\begin{aligned}
t(z)_{1:d} &= z_{1:d} \\
t(z)_{d+1:K} &= z_{d+1:K} \odot \exp\left(s(z_{1:d})\right) + a(z_{1:d})
\end{aligned}
\tag{2.7}
\]

The inverse can be trivially obtained as:

\[
\begin{aligned}
z_{1:d} &= t(z)_{1:d} \\
z_{d+1:K} &= \left(t(z)_{d+1:K} - a(t(z)_{1:d})\right) \odot \exp\left(-s(t(z)_{1:d})\right)
\end{aligned}
\tag{2.8}
\]

$s(\cdot)$ can be any dimensionality-preserving nonlinear function, such as a neural network with nonlinear activations. $a(\cdot)$ is an affine transformation. In this work’s implementation $d$ is set to $K/2$.

The main advantage of using such transformations is that the Jacobian matrix is triangular, hence its determinant is obtained from the diagonal, culminating with the form $\exp\left(\sum_j s(z_{1:d})_j\right)$.

Another great advantage over planar flows is that, while planar flows force the transformation to be channeled through a scalar value, RealNVP does not have this restriction, as the nonlinearity is applied to a dimensionality which is the same as the latent variable’s.
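The coupling transformation (2.7), its inverse (2.8) and the log-determinant can be sketched in a few lines. The functions `s` and `a` below use small fixed weight matrices that are purely hypothetical; any nonlinear `s` and affine `a` with matching shapes would do.

```python
import numpy as np

def coupling_forward(z, d, s, a):
    """Eq. (2.7): keep z[:d], scale and shift z[d:] using functions of z[:d]."""
    z1, z2 = z[:d], z[d:]
    log_scale = s(z1)
    t = np.concatenate([z1, z2 * np.exp(log_scale) + a(z1)])
    return t, np.sum(log_scale)             # log|det J| = sum_j s(z1)_j

def coupling_inverse(t, d, s, a):
    """Eq. (2.8): recover z from t(z), reusing the untouched first part."""
    t1, t2 = t[:d], t[d:]
    return np.concatenate([t1, (t2 - a(t1)) * np.exp(-s(t1))])

# Hypothetical s and a with fixed weights, only for illustration (K = 4, d = 2).
W_s = np.array([[0.3, -0.2], [0.1, 0.4]])
W_a = np.array([[1.0, 0.5], [-0.5, 2.0]])
s = lambda h: np.tanh(W_s @ h)
a = lambda h: W_a @ h + 0.1
```

Neither `s` nor `a` ever needs to be inverted, which is why they can be arbitrary networks.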

2.6 Variational posterior approximation collapse

It was observed [Kingma, 2017] [Chen et al., 2016] that in the initial phases of training, due to the weakness of the term $p_\theta(x|z)$, the term $KL[q_\phi(z|x)\,||\,p_\theta(z)]$ promotes the collapse of $q_\phi(z|x)$ to the prior $p_\theta(z)$.

If the latent variables are independent, then this phenomenon can be diagnosed by looking at the individual Kullback-Leibler divergences at each latent dimension, as shown in A.50 and, for the diagonal-covariance Normal, in A.51, A.48.

The $KL[q_\phi(z|x)\,||\,p_\theta(z)]$ term of $\mathcal{L}(x)$, if seen in the context of averaging within a minibatch $\mathcal{M}$, as in $\mathbb{E}_{x \sim \mathcal{M}}[KL[q_\phi(z|x)\,||\,p_\theta(z)]]$, can be interpreted as an approximation to a mutual information term $I(z; x)$. The implied minimization of the mutual information during optimization of the ELBO forces the approximate posteriors of the datapoints $x$ to stay close to the prior $p_\theta(z)$, leading to over-regularization of $q_\phi(z|x)$.

2.6.1 KL Annealing

[Bowman et al., 2015] have done extensive experiments with variational autoencoders in recurrent neural networks, and point out that it’s very likely that the KL term is much easier to optimize and is quickly brought to 0, forcing the $q_\phi(z|x)$ term to collapse to the prior $p(z)$. They propose annealing of the KL term to prevent this phenomenon, by lowering the contribution of the term in the initial phases of the learning.

A simple implementation for the annealing is the following:

\[
\gamma = \frac{\min(t, T)}{T}
\tag{2.9}
\]

where $t$ is the current epoch number, $T$ is the number of epochs required to reach the steady-state regime and $\gamma$ is the coefficient to the KL term.
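A linear warm-up of the KL coefficient can be written as a one-line function. This sketch assumes that the coefficient is normalized by T, so that it grows from 0 to 1 and then stays at 1; the function name is hypothetical.

```python
def kl_annealing_coefficient(t, T):
    """Linear warm-up: 0 at epoch 0, 1 at epoch T, and 1 for every later epoch."""
    return min(t, T) / T
```

In a training loop the coefficient would multiply the KL term of the objective, for example `loss = reconstruction_error + kl_annealing_coefficient(epoch, T) * kl`, where `reconstruction_error` and `kl` stand for quantities computed elsewhere.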

2.6.2 Free Bits and Soft Free Bits

In order to prevent the collapse of the posterior approximation to the prior, the gradients of the KL term can be zeroed by setting a lower-bound value to the nats expressed from that term, as in:

\[
\max\left[\lambda,\ \mathbb{E}_{x \sim \mathcal{M}}\left[KL[q_\phi(z|x)\,||\,p_\theta(z)]\right]\right]
\tag{2.10}
\]

Alternatively, as described in a revision of [Chen et al., 2016], Soft Free Bits can be used: a KL annealing rate γ is adapted by updating it at every iteration. γ is hence repeatedly multiplied by 1 + ε or 1 − ε, according to the KL being, respectively, larger or lower than the target λ. This is described by the following algorithm:

Algorithm 1 Soft Free Bits
Require:
(1) Initial annealing rate γ (to the KL)
(2) ε value to adjust the annealing rate
(3) λ desired target nats from the KL
Ensure:
(1) The annealing rate γ will be adjusted to ease the convergence of the KL to the target value λ
1: if KL > λ then
2: γ ← γ * (1 + ε)
3: else
4: γ ← γ * (1 − ε)
5: end if
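One iteration of Algorithm 1 can be sketched as a small pure function. The argument names are assumptions chosen for readability: `target_nats` plays the role of λ and `eps` is the adjustment value.

```python
def soft_free_bits_update(gamma, kl, target_nats, eps):
    """One step of Algorithm 1: nudge the KL coefficient towards the target."""
    if kl > target_nats:
        return gamma * (1.0 + eps)   # KL above target: weigh the KL term more
    return gamma * (1.0 - eps)       # KL at or below target: weigh it less
```

Called once per iteration with the current minibatch KL, the coefficient settles around the value at which the KL hovers near the target.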


Chapter 3 Related work

3.1 Probabilistic Matrix Factorization

Probabilistic Matrix Factorization (PMF) [Salakhutdinov and Mnih, 2008] is a dimensionality-reduction technique for the CF problem that learns a factorization of the sparse matrix of observed ratings $R$ into two low-dimensional factor matrices $U \in \mathbb{R}^{D \times N}$ and $V \in \mathbb{R}^{D \times M}$, where $D$ is the size of the low dimensionality. Hence, $R \approx U^\top V$.

The learning algorithm is based on a probabilistic assumption:

\[
p(R|U, V, \sigma^2) = \prod_{i=1}^{N} \prod_{j=1}^{M} \left[\mathcal{N}\left(R_{ij} \,|\, U_i^\top V_j, \sigma^2\right)\right]^{I_{ij}}
\tag{3.1}
\]

Here $I_{ij}$ is 0 if $R_{ij}$ is not set and 1 if it is set.

In PMF, the log-likelihood is a sum of terms, each dependent on a specific user and item with $I_{ij} = 1$. This allows for SGD-like updates of the vectors $U_i$ and $V_j$, which are progressively refined through the iterations.
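Taking the logarithm of (3.1) turns the product into a masked sum of Gaussian log-densities. The sketch below evaluates it with dense NumPy arrays, which is only practical for small examples; the function name is an assumption.

```python
import numpy as np

def pmf_log_likelihood(R, I, U, V, sigma2):
    """Log of (3.1): Gaussian log-density summed over the observed ratings only."""
    residual = R - U.T @ V                        # shape (N, M)
    log_density = -0.5 * np.log(2.0 * np.pi * sigma2) - residual ** 2 / (2.0 * sigma2)
    return np.sum(I * log_density)                # I_ij = 0 drops missing ratings
```

Because each observed rating contributes its own term, a stochastic update can use any subset of the (i, j) pairs with $I_{ij} = 1$.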

3.2 AutoRec

AutoRec [Sedhain et al., 2015], differently from PMF, does not store learned latent vectors, but is able to produce them on-the-fly via an encoder-decoder neural network architecture.

This model is particularly interesting as a single query with an entire sparse ratings vector results in all the missing ratings being estimated at once.

The missing ratings are not provided to the encoder but an estimation of those is nevertheless provided by the decoder, making use of the representation learning and denoising capabilities intrinsic to an autoencoder with a low-dimensional bottleneck layer.

The loss function is hence the error between the user (or item) vector r and its reconstruction, but considering, via element-wise multiplication with the vector mask M, only the existing ratings, otherwise the learning would be incorrectly taking account of the 0 placeholders for the missing ratings in the sparse matrix:

\[
\min \sum_k \left\| \left[r_k - \mathrm{Dec}\left(\mathrm{Enc}(r_k)\right)\right] \odot M_k \right\|_2^2
\tag{3.2}
\]

Even with this model, the sum allows for SGD-like updates.
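The masked error of (3.2) for a single ratings vector can be sketched as below. The reconstruction `r_hat` is a made-up stand-in for the decoder output, and missing ratings are assumed to be stored as 0.

```python
import numpy as np

def masked_reconstruction_loss(r, r_hat, mask):
    """Squared error of (3.2) restricted to the observed ratings."""
    return np.sum(((r - r_hat) * mask) ** 2)

r = np.array([5.0, 0.0, 3.0, 0.0])         # 0 marks a missing rating
mask = (r > 0).astype(float)               # 1 where a rating exists
r_hat = np.array([4.5, 2.0, 3.5, 1.0])     # hypothetical decoder output
loss = masked_reconstruction_loss(r, r_hat, mask)
```

The predictions at the masked positions (2.0 and 1.0 here) do not contribute to the loss; they are the estimates of the missing ratings.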

3.3 Matrix Factorizing Variational Autoencoder

MFVA [van Baalen, 2016] makes use of the findings in [Kingma and Welling, 2013]: variational autoencoders are used in order to yield posterior distribution approximations as diagonal Gaussians $q(u_i|r_{i\cdot})$ and $q(v_j|r_{\cdot j})$.

The decoder/recognition functions that have been used, differently from AutoRec, output the single $r_{ij}$ rating. A dot-product between $u_i$ and $v_j$ as well as an MLP have been employed for the task, with $r_{ij}$ being expressed either with a Gaussian distribution or with a multinomial distribution. The lower bound assumes this form:

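\[
\mathcal{L} = -\sum_i KL\left[Q_u(u_i|r_{i,\cdot})\,||\,p(u_i)\right] - \sum_j KL\left[Q_v(v_j|r_{\cdot,j})\,||\,p(v_j)\right] + \sum_{i,j} \mathbb{E}_{Q_u, Q_v}\left[\log p(r_{ij}|u_i, v_j)\right]
\tag{3.3}
\]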


Chapter 4 Method

This chapter mostly presents the original contributions of this thesis. Our models, as is typical in machine learning, are constituted of many "building blocks", which will be individually elucidated through the following sections.

4.1 VAERec

VAERec is one of the contributions of this thesis. It extends the AutoRec model making use of the VAE framework.

It has been implemented in three variants:

U-VAERec assumes the presence of latent variables $u_i$, which represent a specific user $i$ in latent space, whose observed ratings are represented by the sparse row $r_{i\cdot}$. This model reconstructs user rows by learning the conditional distribution $p_\theta(r_{i\cdot}|u_i)$, assumed to be a diagonal-covariance Gaussian, and, jointly, the variational approximation to the posterior $q_\phi(u_i|r_{i\cdot})$, also assumed to be a diagonal-covariance Gaussian.

I-VAERec is dual to U-VAERec. It assumes the presence of latent variables $v_j$ which represent a specific item in latent space, whose observed ratings are represented by the sparse column $r_{\cdot j}$. Hence, the target of the learning are the parameters of the distributions $p_\theta(r_{\cdot j}|v_j)$ and $q_\phi(v_j|r_{\cdot j})$.

UI-VAERec reconstructs a vector consisting of the concatenation of a user row and an item column. It learns $p_\theta(r_{i\cdot}, r_{\cdot j}|z)$ and $q_\phi(z|r_{i\cdot}, r_{\cdot j})$. This differs from the MFVA model with the MLP decoder proposed by [van Baalen, 2016], as a distribution on a single latent vector $z_{ij}$ representing the user-item pairing is produced instead of having two distinct distributions on $u_i$ and $v_j$.

4.1.1 Sampling the ratings for the UI variants

Training the UI-VAERec would require very long epochs, as the number of training datapoints would be the number of ratings, $O(N \cdot M)$. To prevent problems related to excessive memory usage, as one rating would be stored as a concatenation of its user vector $r_{i\cdot}$ and item vector $r_{\cdot j}$, epochs have been implemented as random samplings of a fixed amount (5000) of the ratings. The validation set is composed of a similar sampling, on different ratings. It’s worth noting that in our implementation the ratings selected in the training set will never be present in the vectors of the validation set and vice versa. Moreover, ratings are split between training ratings and validation ratings at the very beginning and this split is kept unchanged through the epochs, effectively creating two non-overlapping sparse matrices $R^{(t)}$ for training and $R^{(v)}$ for validation.
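The split into two non-overlapping matrices and the per-epoch sampling can be sketched as follows. This is an illustrative reading of the procedure, not the thesis code: the function names, the use of 0 as the missing-rating placeholder and sampling without replacement are all assumptions.

```python
import numpy as np

def split_ratings(R, val_fraction, rng):
    """Split the observed entries of R into disjoint training and validation matrices."""
    rows, cols = np.nonzero(R)                        # observed ratings only
    is_val = rng.random(rows.size) < val_fraction     # decided once, before training
    R_train, R_val = np.zeros_like(R), np.zeros_like(R)
    R_train[rows[~is_val], cols[~is_val]] = R[rows[~is_val], cols[~is_val]]
    R_val[rows[is_val], cols[is_val]] = R[rows[is_val], cols[is_val]]
    return R_train, R_val

def sample_epoch(R_split, n_samples, rng):
    """Draw the (i, j) index pairs of one epoch from the observed entries of a split."""
    rows, cols = np.nonzero(R_split)
    idx = rng.choice(rows.size, size=min(n_samples, rows.size), replace=False)
    return rows[idx], cols[idx]
```

A UI datapoint for a sampled pair `(i, j)` would then be `np.concatenate([R_train[i], R_train[:, j]])`, built on the fly so that only the index pairs need to be stored.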

4.2 VAERec with Normalizing Flows

The VAERec-NF model extends the VAERec by improving the posterior approximation with Normalizing Flows [Rezende and Mohamed, 2015] as explained in section 2.5.1, A.6 and A.7.

4.2.1 Normalizing Flow using RealNVP’s invertible transformation

In this VAERec-NF variant, a transformation of the type previously described in section 2.5.2 is introduced. This transformation was considered interesting because of its ease of implementation and the very simple determinant of its Jacobian. Specifically, the function s(·) is implemented as a single perceptron layer with nonlinear activation function tanh. The function a(·), which is required to be an affine transform, is implemented as a single perceptron layer with linear activation function.

All the parameters of the transformation are produced as output of the encoder network, exactly as happens with the parameters of $q_\phi(z|x)$ in a Variational Auto-Encoder. This differs from the model of section 2.5.2, whose network parameters are not given as a function of the inputs, but rather belong to a globally initialized network which is the same for every input. The weights of the $a(\cdot)$ and $s(\cdot)$ layers are implemented as vectors of size $K^2$, then reshaped into $(K, K)$ dimensions. This limits the model to very low latent dimensionalities.

4.2.2 Masking

To ease the implementation, the selection of the first and second parts of z has been implemented with random hyperparameter masks. These masks are unique for each transformation step, and are computed as follows:

Algorithm 2 Half-full random masks for RealNVP transformations
Require:
(1) Latent dimensionality K
(2) Number of transformation steps k
Ensure:
(1) Random masks m_1 ... m_k which have half of their elements set at 1
1: for i ∈ {1 ... k} do
2: a_j ← 1 if j < K/2, 0 if K/2 ≤ j < K
3: m_i = shuffle(a)
4: end for
5: return m_1 ... m_k

The invertible function, for a transformation step i, is hence implemented as:

\[
z_{i+1} = z_i \odot m_i + (1 - m_i) \odot \left[z_i \odot \exp\left(s(m_i \odot z_i)\right) + a(m_i \odot z_i)\right]
\tag{4.1}
\]

The masks are computed before training and are left unchanged, as they should be considered as hyperparameters.
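Algorithm 2 and the masked step (4.1) can be sketched with NumPy as below. The function names are hypothetical, and `s` and `a` are passed in as plain callables acting on K-dimensional vectors, standing in for the perceptron layers of the model.

```python
import numpy as np

def half_full_masks(K, k, rng):
    """Algorithm 2: k random binary masks of length K, each with half of its entries at 1."""
    base = (np.arange(K) < K / 2).astype(float)   # 1 for j < K/2, 0 otherwise
    return [rng.permutation(base) for _ in range(k)]

def masked_coupling_step(z, m, s, a):
    """Eq. (4.1): entries with m = 1 pass through, the others are scaled and shifted."""
    return z * m + (1.0 - m) * (z * np.exp(s(m * z)) + a(m * z))
```

Since `s` and `a` only ever see `m * z`, the transformed entries depend on the untouched ones alone, which preserves the invertibility of the coupling step.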


4.3 Tackling underfitting

It’s very common, when working on new models, to have difficulties in getting the model to even learn from the training data. In other words, the model may be configured in such a way that, even before overfitting arises, underfitting is still an issue.

This might be caused by many factors: limited width or depth of the networks, over-regularization, low quality datasets, learning rate too small.

In our models, the attempts to tackle underfitting consisted of widening the network and adding more hidden layers, as well as KL annealing, described in the following section.

4.3.1 KL annealing

In VAERec models, one source of regularization is the KL divergence in the ELBO.

In our models, a slow linear annealing coefficient on the KL has been tried, with the following rule:

\[
a = \frac{\min(t, T)}{T}
\tag{4.2}
\]

Where a is the value of the annealing coefficient to the KL, t is the current epoch and T is the epoch number from which the coefficient will be 1.0.

4.4 Dealing with overfitting

One of the major issues in some AutoRec/VaeRec models is overfitting. Notwithstanding the fact that VAE models are plagued by over-regularization caused by the posterior approximation collapse described in section 2.6, under specific circumstances overfitting is prominent, as when using UI (user+item) datapoint schemas. In these cases, the model performs very well on the training set but unsatisfactorily on the test set.

Dropout on the input layer: Dropout [Srivastava et al., 2014] is a technique aimed at preventing overfitting by randomly dropping units and their connections during training. This yields a neural network which can function even when parts of it are deactivated. Moreover, it prevents a single unit from becoming entirely representative of a certain aspect of the training data.

It has been observed that by applying Dropout on just the input layer of the VaeRec models, overfitting can be prevented to a certain degree, possibly through a mechanism similar to that of denoising autoencoders [Vincent et al., 2010a].

Narrowing the bottleneck: This is a common technique, typically used with regular autoencoders. Information is channeled through a limited number of nodes, forcing the neural network to lose hopefully irrelevant information which might be dataset noise.

Regularization: Regularization has been tested with either L1 or L2 norms, without significant improvements.

Increasing the depth: Depth has been studied in [Mhaskar et al., 2017], [Poggio et al., 2015] as a method to improve representations. Better representations usually lead to a better identification and removal of overfitting noise.

4.5 Optimization

The application of gradients from the objective function to the parameters is nontrivial. Problems such as the choice of learning rate, the prevention of exploding and vanishing gradients, as well as avoiding getting stuck in local minima, are well-known challenges of model training. This section describes a few optimization methods, as well as implementation details that tackle sparsity.

4.5.1 RPROP update algorithm

RPROP [Riedmiller and Braun, 1993] is a gradient-based parameter update schema that does not take into account the magnitudes of the gradients, but only their sign.

The idea is simple: if the gradient keeps pointing towards the same (either positive or negative) direction, then the parameter-specific update delta needs to be increased; otherwise, in case the gradient for a parameter keeps changing sign, the update delta needs to decrease, in order to fine-tune the parameter to its optimum.

The change of direction is detected by the product of the gradient of a parameter $w_i$, calculated to minimize an objective function $J$ at time step $t$, by the gradient of that same parameter at the previous time step $t-1$:

\[
p = \left(\frac{\partial J}{\partial w_i}\right)^{(t-1)} \cdot \left(\frac{\partial J}{\partial w_i}\right)^{(t)}
\tag{4.3}
\]

The sign of $p$ determines the increase or decrease of the parameter-specific delta $\Delta_i$:

\[
\Delta_i^{(t)} =
\begin{cases}
\min\{\eta^+ \cdot \Delta_i^{(t-1)}, \Delta_{max}\} & \text{if } p > 0 \\
\max\{\eta^- \cdot \Delta_i^{(t-1)}, \Delta_{min}\} & \text{if } p < 0 \\
\Delta_i^{(t-1)} & \text{if } p = 0
\end{cases}
\tag{4.4}
\]

where $0 < \eta^- < 1 < \eta^+$.

Typical parameter settings are $\eta^- = 0.5$, $\eta^+ = 1.2$, $\Delta_{min} = 10^{-6}$, $\Delta_{max} = 50.0$ and initial delta values $\Delta_0 = 0.1$.

RPROP has been used specifically for AutoRec, as suggested in the original paper [Sedhain et al., 2015].
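The delta adaptation of (4.3) and (4.4) can be sketched as a vectorized step. This is a simplified variant written for illustration: it applies the sign-based update directly and leaves out the weight-backtracking refinement of the full algorithm. The function name and signature are assumptions.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, delta,
               eta_minus=0.5, eta_plus=1.2, delta_min=1e-6, delta_max=50.0):
    """One sign-based update following (4.3) and (4.4), without weight backtracking."""
    p = prev_grad * grad                                   # (4.3), one value per parameter
    delta = np.where(p > 0, np.minimum(eta_plus * delta, delta_max), delta)
    delta = np.where(p < 0, np.maximum(eta_minus * delta, delta_min), delta)
    w = w - np.sign(grad) * delta                          # step against the gradient sign
    return w, delta
```

Only the sign of `grad` enters the parameter update, so the step size is insensitive to the scale of the objective.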

4.5.2 Adam

Given its properties of adaptive momentum, Adam [Kingma and Ba, 2014] has been chosen as the optimization algorithm.

The VAERec, similarly to the AutoRec, needs to be selective on which parameters need to be updated: both in the first layer of the encoder and the last layer of the decoder, only the weights that are connected to existing ratings can be updated.

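Provided a binary mask $M$ of a parameter tensor $\theta$, the two assignments of $\hat{m}_t$ and $\hat{v}_t$ in Adam are modified from the original algorithm as follows:

\[
\begin{aligned}
\hat{m}_t &\leftarrow \frac{m_t}{1 - \beta_1^t} \odot M + m_{t-1} \odot (1 - M) \\
\hat{v}_t &\leftarrow \frac{v_t}{1 - \beta_2^t} \odot M + v_{t-1} \odot (1 - M)
\end{aligned}
\tag{4.5}
\]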
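The two masked assignments of (4.5) can be transcribed literally as below. The sketch covers only these two assignments; how the moments are accumulated and how the result enters the parameter update is left out, and the function name and default decay rates are assumptions.

```python
import numpy as np

def masked_corrected_moments(m_t, v_t, m_prev, v_prev, M, t, beta1=0.9, beta2=0.999):
    """Transcription of (4.5): bias-correct where M = 1, fall back to step t-1 where M = 0."""
    m_hat = m_t / (1.0 - beta1 ** t) * M + m_prev * (1.0 - M)
    v_hat = v_t / (1.0 - beta2 ** t) * M + v_prev * (1.0 - M)
    return m_hat, v_hat
```

For the first encoder layer, the mask would have ones in the rows of the weight matrix that correspond to the ratings present in the current input vector.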


4.5.3 Learning rate annealing

Annealing of the learning rate is a technique that uses the progressive reduction of the learning rate in order to facilitate achieving an optimum of the parameters.

Intuitively one might think that, as the learning progresses, the parameters get progressively nearer to the optimum and, as a consequence, the parameter adjustment needs to be of smaller magnitude than the initial one. This desirable aspect of the learning is achieved by progressively reducing the learning rate over the epochs.

This intuition is supported by [Robbins and Monro, 1951], who established conditions on the sequence of learning rates that ensure reaching an optimum.

The learning rate annealing schema that has been chosen is described by the following formula [Orr, 1999]:

\[
\gamma^{(t)} = \frac{\gamma^{(0)}}{1 + t/T}
\tag{4.6}
\]

where $\gamma^{(t)}$ is the learning rate at epoch $t$, $\gamma^{(0)}$ is the initial learning rate and $T$ is a hyperparameter equal to the number of epochs it takes for the learning rate to halve.
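The schedule (4.6) amounts to a one-line function; the name and argument order below are assumptions made for illustration.

```python
def annealed_learning_rate(initial_lr, t, T):
    """Eq. (4.6): the rate halves at epoch T and keeps decaying afterwards."""
    return initial_lr / (1.0 + t / T)
```

With an initial rate of 0.01 and T = 50, the rate is 0.005 at epoch 50 and 0.0025 at epoch 150, so the decay slows down as training proceeds.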


Chapter 5 Experiments

This chapter presents experiments on our models. At first some contextual information is given, such as details on technologies that have been employed, dataset used, technicalities about scaling of learning rate and regularization coefficients and hyperparameter settings.

The results of the experiment runs follow. Interesting information on how the learning of the model progresses can be obtained by observing the RMSE over the epochs, both for the training set and the validation set. Analysis of the charts provides insight on issues such as hyperparameter choice, overfitting and underfitting.

5.1 Technologies used

Python version 3: The Python programming language is well suited for data science applications. The large number of available libraries helps to avoid re-inventing the wheel. Its conciseness makes it very readable and hands-on.

Theano [Al-Rfou et al., 2016]: A Python library useful to create computational graphs with automatic differentiation, specifically using tensors.

DAS4 [Bal et al., 2016]: Grid computing environment from a partnership between Dutch universities. It allows the use of concurrent jobs, also on nodes that are provided with GPUs to speed up deep learning computations.


5.2 The Movielens datasets

To test the models the Movielens datasets [Harper and Konstan, 2015] have been used. Specifically, the small dataset was used for local debugging, and the 1M dataset was used for the main experiments. The reason that the 1M has been chosen as dataset for the main results was that most papers report results for their model being trained on this specific dataset.

Statistics on the two Movielens datasets follow.

number of users N      668
number of items M      10325
average rating         3.51685
standard deviation     1.04487

Table 5.1: small Movielens dataset statistics

number of users N      6040
number of items M      3706
average rating         3.58156
standard deviation     1.1171

Table 5.2: 1M Movielens dataset statistics

Figure 5.2: 1M Movielens dataset ratings distribution

5.3 Scaling issues and regularization

In order to achieve the same intensity of learning per epoch when varying the minibatch size, it is necessary to re-scale some hyperparameters.

Let's consider a complete objective for a typical learning task of quantities $y_i$ from respective datapoints $x_i$, using a dataset $D = \{x_i, y_i\}_{i=1}^{|D|}$. As the datapoints are independent and identically distributed, the objective can be expressed as a sum over all the datapoints. One step of learning from this objective is usually referred to as an "epoch".

$$J = \sum_{i=1}^{|D|} \ell(x_i, y_i) + \lambda \Omega(\Theta) \tag{5.1}$$

where $\ell$ is the per-datapoint loss, Ω is a regularization term, Θ is the set of the regularizable parameters and λ is a fixed hyperparameter that determines the regularization amount.

This objective is subject to Gradient Descent learning on the trainable parameters:

$$\Theta_{t+1} = \Theta_t - \gamma \nabla_\Theta J_t \tag{5.2}$$

Where γ is the learning rate hyperparameter.

As J is defined as a sum over independent datapoints, it is possible to use Stochastic Gradient Descent learning strategies, which take into account only a limited number of datapoints at each time.

It's desirable to consider the average contribution $J_a$ of each datapoint to the objective $J$ of (5.1):

$$J_a = \frac{1}{|D|} J = \frac{1}{|D|} \sum_{i=1}^{|D|} \ell(x_i, y_i) + \frac{\lambda}{|D|} \Omega(\Theta) \tag{5.3}$$

If learning using $J_a$ is repeated $|D|$ times within an epoch, then the algorithm achieves the same learning intensity as with $J$ by keeping the same learning rate γ.

By splitting the dataset and the objective $J$ of (5.1) into a number of minibatches of size $B$, an approximation to $J_a$, useful for SGD-like algorithms, can be obtained:

$$J_b = \frac{1}{B} \sum_{i=1}^{B} \ell(x_i, y_i) + \frac{\lambda}{|D|} \Omega(\Theta) \tag{5.4}$$

An important consequence of using $J_b$ is that the intensity of the learning is altered, because fewer updates are applied to Θ at each epoch. In order to balance this phenomenon, the learning rule can be modified as follows:

$$\Theta_{t+1} = \Theta_t - B \gamma \nabla_\Theta J_{b,t} \tag{5.5}$$

The presence of the $B$ coefficient cancels out the effect of the $\frac{1}{B}$ coefficient in the $J_b$ average over the datapoints:

$$\Theta_{t+1} = \Theta_t - \gamma \left[ \sum_{i=1}^{B} \nabla_\Theta \ell(x_i, y_i) + \frac{B \lambda}{|D|} \nabla_\Theta \Omega(\Theta) \right] \tag{5.6}$$
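The cancellation of the 1/B factor can be checked numerically. The sketch below works on a single scalar parameter and takes the per-example gradients as plain numbers; the function names and the example values are illustrative assumptions, not part of the thesis code.

```python
def scaled_minibatch_step(theta, example_grads, reg_grad, lr, lam, dataset_size):
    """One update following (5.5): theta - B * lr * grad(J_b)."""
    batch_size = len(example_grads)
    # Gradient of J_b as in (5.4): mean of the data terms plus the rescaled regularizer.
    grad_jb = sum(example_grads) / batch_size + lam / dataset_size * reg_grad
    return theta - batch_size * lr * grad_jb


def expanded_minibatch_step(theta, example_grads, reg_grad, lr, lam, dataset_size):
    """The same update in expanded form: the 1/B factor has cancelled out."""
    batch_size = len(example_grads)
    data_term = sum(example_grads)
    reg_term = batch_size * lam / dataset_size * reg_grad
    return theta - lr * (data_term + reg_term)


# Illustrative values: a minibatch of 4 per-example gradients, L2 penalty gradient 2 * theta.
theta = 1.0
grads = [0.5, -0.25, 1.0, 0.75]
step_a = scaled_minibatch_step(theta, grads, 2.0 * theta, lr=0.01, lam=100.0, dataset_size=1000)
step_b = expanded_minibatch_step(theta, grads, 2.0 * theta, lr=0.01, lam=100.0, dataset_size=1000)
```

Both functions return the same value up to floating-point rounding, which shows that the regularization term is weighted by Bλ/|D| in each minibatch update.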


5.4 Prevention of exploding gradients

It is possible, under specific circumstances, that the gradients may become unstable and compromise the parameters of the model with infinities or ”not a number” values.

In order to prevent this phenomenon, a few "tricks" have been implemented:

Gradient clipping [Markou, 2017] has been implemented with the following norm-based scaling algorithm:

Algorithm 3 Norm-based gradient clipping
Require:
(1) Gradient tensor g
(2) Threshold θ (defaulted to value 10)
Ensure:
(1) Scaled gradients ĝ whenever their L2 norm surpasses a threshold θ.
1: if ||g||2 > θ then
2:   ĝ ← (θ / ||g||2) g
3: else
4:   ĝ ← g
5: end if
6: return ĝ
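Algorithm 3 translates almost line by line into Python. The following is a minimal sketch that treats the gradient as a flat list of floats; the actual experiments operate on Theano tensors, so the function name and the list-based representation are assumptions made for illustration.

```python
import math


def clip_by_norm(gradient, threshold=10.0):
    """Rescale `gradient` so that its L2 norm does not exceed `threshold`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > threshold:
        scale = threshold / norm
        # The direction is preserved, only the magnitude is reduced.
        return [g * scale for g in gradient]
    return list(gradient)


clipped = clip_by_norm([30.0, 40.0])   # norm 50, rescaled to norm 10
unchanged = clip_by_norm([3.0, 4.0])   # norm 5, below the threshold
```

Because the whole vector is multiplied by a single scalar, clipping changes the step length but not the descent direction.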

Scaled tanh activation function. Some layers have "log σ" outputs. As the output of these layers is exponentiated in the likelihood function, keeping the activation linear carries a great risk of instability and value explosion. For this reason a "pseudo-linear" soft-bounded activation function has been implemented by re-scaling a tanh activation function. A tanh activation function has its co-domain bounded between -1 and +1. Moreover, its derivative is approximately 1 near the origin. It follows that tanh suits the role of pseudo-linear bounded activation function if it's rescaled as follows:

$$f(x) = K \tanh\left(\frac{x}{K}\right) \tag{5.7}$$

where K is a small integer greater than 1. In this way the activation function is bounded between −K and +K. A good value for K might be 5, in order to obtain σ-values bounded between 0.0067 and 148.4.

For layers that are supposedly linear in their outputs, such as Planar Flow's w, b, and u quantities, as well as the means µ of the Gaussian distributions, a "pseudo-linear" function along the same lines has been implemented with K = 20.
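A small Python sketch makes the bounds concrete. It assumes the rescaling has the form K · tanh(x / K), which is the form consistent with the description above (unit slope near the origin, outputs bounded by ±K); the function name is illustrative.

```python
import math


def scaled_tanh(x: float, bound: float) -> float:
    """Pseudo-linear activation K * tanh(x / K), bounded in [-K, +K]."""
    return bound * math.tanh(x / bound)


# Near the origin the function behaves like the identity.
near_origin = scaled_tanh(0.01, 5)

# For a "log sigma" output with K = 5, sigma = exp(output) stays within [exp(-5), exp(5)].
sigma_min = math.exp(scaled_tanh(-1e6, 5))
sigma_max = math.exp(scaled_tanh(1e6, 5))
```

With K = 5 the extreme values of σ are exp(−5) and exp(5), which correspond to the interval quoted in the text.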

Learning rate "warm-up" has been implemented to prevent immediate divergence in the first epochs due to steep gradients. The learning rate is raised from a very small value to its regular value during the course of the very first epochs.

Algorithm 4 Learning rate warm-up
Require:
(1) Initial learning rate γ
(2) Current epoch number t
(3) Number of initial warm-up epochs K
(4) Base value of the warm-up coefficient B, which has to be less than 1
Ensure:
(1) Adjusted and progressively increasing learning rate γ̂ during the first K epochs
1: if t >= K then
2:   γ̂ = γ
3: else
4:   γ̂ = B^(K−t) γ
5: end if
6: return γ̂

Appropriate parameter settings have been found as K = 3 and B = 0.1.
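The warm-up rule of Algorithm 4 can be sketched as follows. The defaults mirror the settings reported above (K = 3, B = 0.1); the function and argument names are chosen for this illustration only.

```python
def warmup_learning_rate(base_lr, epoch, warmup_epochs=3, warmup_base=0.1):
    """Learning rate with warm-up: B ** (K - t) * base_lr during the first K epochs."""
    if not 0.0 < warmup_base < 1.0:
        raise ValueError("warmup_base (B) must lie strictly between 0 and 1")
    if epoch >= warmup_epochs:
        return base_lr
    return warmup_base ** (warmup_epochs - epoch) * base_lr


# With K = 3 and B = 0.1 the first epochs use 0.001, 0.01 and 0.1 times the base rate.
factors = [warmup_learning_rate(1.0, t) for t in range(5)]
```

From epoch K onwards the base learning rate is returned unchanged, so the warm-up composes naturally with the annealing schedule of section 4.5.3 if the annealed rate is passed in as `base_lr`.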

5.5 Soft Free Bits settings

During experiments with VaeRec, it was noted that the marginal KL values differ greatly from the overall KL. This is because, as the latent dimensionality increases, it gets harder to match the prior and the posterior. For this reason, for larger latent dimensionalities, a posterior collapse can be observed through the KL marginals, even if the overall KL still returns values that are reasonably high.

Within our Soft Free Bits implementation (see section 2.6), our solution was simply to set λ linearly proportional to the latent dimensionality K, as in λ = 2 ∗ K.

The annealing was set, as suggested in [Chen et al., 2016], to the value 0.05. The value of γ was updated at every minibatch learning iteration.
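To give an intuition of the feedback loop between the KL coefficient and the KL value, the sketch below uses a simple multiplicative controller. This is an assumption made for illustration: the exact update rule is the one described in section 2.6, and here the value 0.05 is merely assumed to act as a relative step size. The function name and arguments are hypothetical.

```python
def update_kl_coefficient(coefficient, kl_value, target_kl, rate=0.05):
    """Hypothetical multiplicative update of the KL coefficient (assumed rule).

    The coefficient grows when the measured KL exceeds the target amount of
    free bits and shrinks otherwise, so the KL is pushed towards the target.
    """
    if kl_value > target_kl:
        return coefficient * (1.0 + rate)   # KL too large: regularize more
    return coefficient / (1.0 + rate)       # KL at or below target: regularize less


# One call per minibatch update, with target = 2 * K for latent dimensionality K = 5.
increased = update_kl_coefficient(1.0, kl_value=12.0, target_kl=10.0)
decreased = update_kl_coefficient(1.0, kl_value=8.0, target_kl=10.0)
```

Under this assumed rule, a KL value that can never fall below the target would make the coefficient grow without bound.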

Figure 5.3: Soft Free Bits: evolution of the KL annealing coefficient vs. the KL divergence. The values are sampled after every minibatch update. This plot has been obtained with about 50 epochs of VaeRec, without Normalizing Flows and with latent dimensionality K=1. In blue: annealing coefficient value; in green: KL measure

Figure 5.4: Zoom-in of the last part of the previous figure 5.3. It is noticeable how the KL divergence measure successfully converges towards the desired amount of 2. The annealing coefficient keeps oscillating, reflecting the dynamic nature of the annealing-vs-KL system. In blue: annealing coefficient value; in green: KL measure

Figure 5.5: Similar plot to 5.3, but with the more interesting case of K=5. The KL divergence also successfully converges to the target value 10. In blue: annealing coefficient value; in green: KL measure

Figure 5.6: Zoom-in of the last part of the previous figure 5.5. The annealing coefficient converges, with oscillations, towards small values, as in the case with K=1. In blue: annealing coefficient value; in green: KL measure

5.5.1 Soft free bits settings in a deeper model with high latent dimensionality

In order to verify the effectiveness of the linear relationship between the latent dimensionality and the amount of "soft free bits" λ, a different VaeRec model was chosen, deeper and with higher latent dimensionality (without Normalizing Flows).

In this model, the latent dimensionality is 250 and there are two hidden layers in both the encoder and the decoder. L2 regularization has been used with coefficient 100; an initial learning rate of 0.000006 and annealing with T = 10 have been employed.

The following plot and table compare different settings for the λ:

Figure 5.7: ”Deep” model with different settings for λ: 0.5 ∗ K, 1 ∗ K and 2 ∗ K

Table 5.3: Best results for deep VaeRec under different ”soft free bits” λ settings

λ best training RMSE best testing RMSE

0.5 ∗ K 0.8003 0.8565

1 ∗ K 0.4951 0.8536

2 ∗ K 0.3766 0.8598

While λ = 2 ∗ K is obviously under-regularized, there might be a "sweet spot" between the under-regularized λ = K and the apparently over-regularized λ = 0.5 ∗ K. The latter case is interesting, as it does not seem to have converged to a definitive optimum of testing RMSE even after 400 epochs.

5.6 Choice of hyperparameters

The search for an acceptable model via hyperparameter search was a very long process. Since most experiments took about 8 days to complete, choosing the right hyperparameter settings took many months.

At the end of this process it was understood that most hyperparameters were located at a specific trade-off minimum in a convex curve of validation error. For example, using a minibatch of size 1 often gave unsatisfactory results. Better results were obtained with a minibatch of size 64, but increasing that value, for example to 128 or 256, gave worse results.

Other hyperparameters that were problematic to set were those dedicated to L2 regularization of the network weights. A good balance was found by using 100 or 200 as L2 regularization coefficient.

The learning rate was also located at a minimum in a convex curve. High learning rates might initially progress faster but may "jump over" good minima, while lower learning rates might converge to a better minimum but take a larger number of epochs. These effects can be mitigated by the use of a moment-based descent algorithm, such as Adam [Kingma and Ba, 2014], and by the use of the learning rate annealing described in section 4.5.3 [Orr, 1999], with parameter T set at 10, meaning that the initial halving of the learning rate happens after the first 10 epochs (further decay is much slower). A good initial learning rate was found to be 2e-6.

The ideal number of epochs would have been 1000, but reaching this target was highly impractical because of the sheer amount of time that the training required. This was aggravated by the limit on the number of concurrent jobs imposed by the administration of the distributed supercomputer DAS4, which was about 10-20 long-running jobs at the same time. For these reasons many reported experiments have been trained for a lower number of epochs.

The depth of the network was finally chosen to be 1 hidden layer, as it eases the creation of useful intermediate-level representation values, as opposed to not using hidden layers at all. Latent dimensionalities explored were 5, 250 and 500.


5.7 Normalizing Flows

Experiments with planar and RealNVP normalizing flows have been performed.

A base VaeRec model has been chosen with 1 hidden layer of dimensionality 1000 and latent dimensionality 5. The reason that such a limited model was chosen, in terms of latent dimensionality, is that RealNVP requires the production of K*K parameters for each produced weight matrix within the RealNVP step.

5.7.1 Experiments with Planar transformations

The efficacy of planar transformations was difficult to assess, because numerical instabilities remained one of the hardest challenges.

The following plot shows the progress curves for a model without ”soft free bits”, but having fixed KL coefficient set to 1.

Figure 5.8: VaeRec with and without single planar transformation step

The normalizing flow step introduces a conspicuous element of variability in the learning outcomes, which can be observed in the scattered learning curves, both for training and for testing. Especially when observing the RMSE on the training set, it seems that either the introduction of the planar transformation step has a regularizing effect, or the tuning of the produced transformation parameters makes the learning difficult, as it might require a larger number of epochs.

The following table summarizes the minima in RMSE:

Table 5.4: Best results for VaeRec with and without single planar transformation step

best training RMSE best testing RMSE

without transformation step 0.7071 0.8620

with transformation step 0.8038 0.87053

It is interesting to compare the ELBO of the two models. If the additional step has the effect of improving the posterior approximation, then the objective function on the training set should be lower in the latter case. In these experiments the objective function is exactly the negative ELBO, as there was no L2 regularization set.

Figure 5.9: VaeRec with and without single planar transformation step: median of objective function (negative ELBO)

The objective function reaches lower values with the planar transformation step, hence the posterior approximation is necessarily more accurate.

5.7.2 Experiments with RealNVP transformations

To experiment with RealNVP transformation, a variant with learning rate annealing T=10, and soft free bits’ λ = 2 ∗ K = 10 has been used.

Figure 5.10: VaeRec with RealNVP Normalizing flows

Table 5.5: Best results for deep VaeRec under 0, 1 and 5 RealNVP steps

Number of RealNVP steps best training RMSE best testing RMSE

0 steps 0.8191 0.8574

1 step 0.8159 0.8566

5 steps 0.8127 0.8554

These results show that an increase in the number of RealNVP steps leads to a better testing accuracy, although the improvement is modest.

5.8 Equivalences between AutoRec and VaeRec models

In order to perform an adequate comparison between the AutoRec and VaeRec models it's important to establish whether there are any available equivalences. In other words, it is interesting to see if a specific choice of hyperparameters of the VaeRec leads to a model that is similar to the AutoRec both in its definition and in its performance.

Luckily such a model can be found in the VaeRec by setting the KL coefficient to 0. This way that extra regularization term is absent and the VaeRec model becomes analogous to the AutoRec model.

The per-datapoint ELBO of a VAE whose posterior distributions are diagonal-covariance Gaussians, without the KL divergence, becomes:

$$\mathcal{L}\left(x^{(i)}\right) = \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta\left(x^{(i)}|z\right)\right] \tag{5.8}$$

If $p_\theta(x^{(i)}|z)$ has a spherical Gaussian form (identity covariance matrix I), then this objective, which comprises only the likelihood term, becomes very similar to the reconstruction error of a regular autoencoder, but differs in that $z^{(i)}$ is stochastic and drawn from a distribution determined by the encoder.

Since the KL term is absent, the unregularized $q_\phi(z|x^{(i)})$ will tend to collapse to distributions that are centered on specific µ's in latent space but have σ's that tend to 0, so that a random sample from $q_\phi(z|x^{(i)})$ is always approximately µ.
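The link with the autoencoder reconstruction error can be made explicit by expanding the log-density of a spherical Gaussian. In the following, $\mu_\theta(z)$ denotes the mean produced by the decoder and $D$ the dimensionality of $x^{(i)}$; both symbols are introduced here for illustration only, and the masking of unobserved ratings is ignored for simplicity.

```latex
\log p_\theta\left(x^{(i)}|z\right)
  = \log \mathcal{N}\left(x^{(i)};\, \mu_\theta(z),\, I\right)
  = -\frac{1}{2}\left\lVert x^{(i)} - \mu_\theta(z) \right\rVert_2^2
    - \frac{D}{2}\log(2\pi)
```

Maximizing this term is therefore the same as minimizing a squared reconstruction error, up to an additive constant. When σ tends to 0 the sample z coincides with the encoder mean µ, and the objective reduces to that of a deterministic autoencoder.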

Hence, a hypothesis can be formulated about the similarity of VaeRec without KL and AutoRec:

Experimental results confirm the hypothesis by showing similarity of testing error:

Column order: minibatch size, hidden layer width, number of hidden layers, latent z dimensionality, AutoRec (RProp) testing RMSE, VaeRec (Adam) testing RMSE

64 1000 1 250 0.8700 0.8335
64 1000 2 250 0.8341 0.8365
64 1000 1 500 0.8767
64 1000 2 500 0.8696 0.8511

Table 5.6: Comparison of similar VaeRec and AutoRec models

The testing errors achieved by the AutoRec and VaeRec models are very similar under similar hyperparameter settings.

5.9 User+Item concatenation vs traditional Item or User datapoints

5.9.1 User+Item concatenation on AutoRec

For this experiment the base model AutoRec was used, with the purpose to observe difference in learning outcomes between using just item vectors versus the concatenation of user and item vectors.

For this experiment, a latent dimensionality of 250 was used, with hidden layer dimensionality set at 500. These hyperparameters reflect typical settings from the original AutoRec paper [Sedhain et al., 2015]. L2 regularization was set at 200.

Figure 5.11: AutoRec: comparing item learning vs user+item learning

Unfortunately the user+item version is not able to generalize well on the dataset. Nevertheless it's interesting to see how the user+item version overfits more than the item version. This indicates that using user+item concatenation datapoints might lead to better performance on the test set if a suitable regularization method is found.

One disadvantage of using user+item was the much longer training time, likely because of the increase in datapoint dimensionality.

5.9.2 User+Item concatenation on VaeRec

Similar comparison experiments have been performed on VaeRec, with different model hyperparameters. Specifically, these experiments differ by having used a much lower latent dimensionality (5), which might have regularizing effects, as well as L2 regularization set at 100 and hidden layer dimensionality set at 1000. Moreover, the learning rate annealing T parameter has been set to 10 and soft free bits have been employed with λ = 2 ∗ K = 10.

Figure 5.12: VaeRec: comparing item learning vs user+item and user learning

Similarly to the previous AutoRec experiment, it can be observed how user+item overfits and performs poorly on the testing set.

Interestingly, the 'user' version of VaeRec performs better than the baseline 'user' variant of AutoRec as reported in their paper.

Table 5.7: VaeRec: performance on the test set of item learning vs user+item and user learning, compared to the reported AutoRec outcomes [Sedhain et al., 2015]

                      VaeRec training   VaeRec testing   AutoRec testing
item, λ = 10          0.8240            0.8599           0.831
user, λ = 10          0.8262            0.8598           0.874
user+item, λ = 10     0.8241            1.0893           N/A
user+item, λ = 1      0.9813            0.9930

In these experiments using λ = 1 very soon leads to a NaN (Not a Number) interruption. This led to the observation that there exists a lower bound to the KL divergence, in this case about 2.46, which causes the annealing coefficient to diverge towards infinity.

5.10 Experiments with regularization techniques

5.10.1 Dropout layer on the input of an AutoRec model

Denoising autoencoders [Vincent et al., 2010b] improve the quality of the representations by forcing resilience of the neural network: noise is added to the input while the original datapoint is used in the objective function.

We tried a similar mechanism on our AutoRec re-implementation by adding a Dropout [Srivastava et al., 2014] layer with parameter p = 0.1 on the input.
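The corruption step can be sketched as follows. This is a simplified illustration on a plain list of ratings: entries are zeroed independently with probability p and no rescaling of the surviving entries is applied, which is an assumption (dropout implementations differ on this point). The function name and the example ratings are hypothetical.

```python
import random


def dropout_input(ratings, drop_probability=0.1, rng=random):
    """Zero each input entry independently with probability `drop_probability`."""
    return [0.0 if rng.random() < drop_probability else r for r in ratings]


rng = random.Random(0)                  # seeded generator for reproducibility
ratings = [4.0, 3.0, 5.0, 1.0, 2.0]     # illustrative observed ratings
corrupted = dropout_input(ratings, 0.1, rng)
# The network receives `corrupted` as input, while the reconstruction error
# is still computed against the original `ratings`.
```

Keeping the original vector as the target is what makes the mechanism denoising-like: the model has to recover the dropped ratings from the remaining ones.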

The following plot shows that an improvement in generalization is indeed obtained:


Figure 5.13: Dropout layer on the input of an AutoRec model

5.10.2 Dropout layer on the input of a deep VaeRec model

Additional experiments to verify the effectiveness of adding a dropout layer to the input have been performed on a deeper model, as described in section 5.5.1. This base model has been set with "soft free bits" λ = 1 ∗ K, hence slightly under-regularized.

Figure 5.14: Dropout layer on the input of a deep VaeRec model

Table 5.8: Best results for deep VaeRec under different p settings of a dropout layer applied on the input

p best training RMSE best testing RMSE

0 0.5102 0.8536

0.5 0.6413 0.8443

0.8 0.8469 0.8861

The use of a dropout layer on the input to enforce a denoising-like behavior seems to be beneficial. While p = 0 results in overfitting and p = 0.8 results in underfitting, with p = 0.5 the VaeRec achieves a testing RMSE of 0.8443, the best among the settings tried.

5.10.3 Tradeoff between KL divergence and weights regularization

Both weight decay and KL divergence are regularizers that enable the model to achieve generalization. The contributions of these two terms compound, so the coefficients need to be tuned in order to avoid over-regularization.

As an example, this can be seen in the following plot, where a VaeRec model using L2 regularization with coefficient 200 is tested with, and without the KL term:

Figure 5.15: VaeRec with and without KL divergence, minibatch size set at 1

For additional comparison, here are the results obtained with a minibatch update schema with size set at 64:

Figure 5.16: VaeRec with and without KL divergence, minibatch size set at 64

Using the minibatch shows considerable improvements for both variants (with and without KL divergence). It is interesting how the model with KL improves in such a drastic way by using minibatch learning. This is probably caused by the KL regularizer being very noisy with individual samples, causing over-regularization, as is typical with SGD (without minibatch) schemas. By using a minibatch the KL regularizer becomes less noisy thanks to the smoothing effect of averaging.

Chapter 6 Conclusion

VAERec models introduce a straightforward extension to the AutoRec models. Probabilistic information on latent variables representing users or items was examined by [van Baalen, 2016].

AutoRec has been chosen as the base model for its capability to reconstruct an entire sparse vector of ratings belonging to a user, or to an item, by estimating all its missing values during a single query.

Our work introduces additional explorations through the use of variational autoencoders specifically made to handle sparse input, implemented via parameter masking.

Moreover, posterior approximation improvements have been added, in the form of planar flows [Rezende and Mohamed, 2015] and novel use of the more powerful invertible transformation introduced by RealNVP [Dinh et al., 2016]. Comparisons between different hyperparameter settings have been illustrated.

Experiments show that adding transformations to the posterior approximations leads to a higher ELBO, hence the posterior approximation is necessarily improved. Additional experiments show how RealNVP transformations improve the generalization capability of the model.

Overfitting and underfitting were some of the major obstacles in the attempt to obtain models with good generalization capabilities. In VAE models these phenomena can be tackled by altering the coefficient of the KL regularizer term. An adaptive method, named Soft Free Bits [Chen et al., 2016], has been employed in order to dynamically alter the KL coefficient according to the value of the KL term.

The novel use of datapoints comprised of a concatenation of user and item vectors indicated some promising prospects for AutoRec-like models. This input variant leads to a better fitting than item or user-based models under identical circumstances. The drawback of overfitting seems to be overcome by regularization techniques, in the case of the VaeRec by careful handling of the coefficients of the KL divergence, with techniques such as soft free bits.

6.1 Future work

The field of representation learning and autoencoders is currently the object of growing interest from researchers. Specifically, methods to improve posterior approximations of VAEs are being researched and could be applied to the base model VAERec. For instance, of particular interest is Autoregressive Flow [Kingma et al., 2016].

Specific to this work, the RealNVP transformation could be further improved by changing the function a(·) into a nonlinear function instead of an affine transformation.

The decoder, or generator network, has been chosen as having a spherical Gaussian form with identity covariance matrix I on pθ(x|z). A more informative model with an arbitrarily-valued diagonal covariance matrix could be employed in order to give a measure of uncertainty on the estimated ratings.

With an eye on different models, Generative Adversarial Networks [Goodfellow et al., 2014] seem to be well suited for collaborative filtering. The Generator-Discriminator networks might help in obtaining predicted ratings that are as "real" as they could possibly be.

Hyperparameter search for VaeRec models needs to be further investigated. Specifically, computation-intensive improvements such as an increase in depth and width of the networks should be looked into. Alternative settings of the λ parameter for soft free bits need to be tested in order to find a proper optimum. Libraries other than Theano [Al-Rfou et al., 2016] should also be tried, as Theano's development is currently discontinued, in favor of TensorFlow [Abadi et al., 2015] or PyTorch [Paszke et al., 2017], which might have better support for sparse tensors.

Appendix A

Derivations

A.1 Density of a transformed multivariate random variable

A random variable $z_0$ is transformed via an invertible transformation $T : \mathbb{R}^n \to \mathbb{R}^n$:

$$z_1 = T(z_0) \tag{A.1}$$

Then, it's possible to express the pdf of $z_1$ by using the distribution of the original variable $p_{z_0}(z_0)$ and the invertible transformation $T$.

This can be demonstrated by making use of the cdf of $z_1$, denoted $F_{z_1}(z_1)$, showing its relationship with the cdf of $z_0$ and the inverse of $T$ [Watkins, 2009]:

$$F_{z_1}(\gamma) = P(z_1 \le \gamma) = P(T(z_0) \le \gamma) = P\left(z_0 \le T^{-1}(\gamma)\right) = F_{z_0}\left(T^{-1}(\gamma)\right) \tag{A.2}$$

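The integral is expanded and integration by substitution is used to highlight the role of the pdf of $z_0$ and the determinant of the Jacobian matrix of the inverse of $T$:

$$F_{z_1}(z_1) = F_{z_0}\left(T^{-1}(z_1)\right) = \int^{T^{-1}(z_1)} p_{z_0}(z_0)\, dz_0 = \int^{z_1} p_{z_0}\left(T^{-1}(z_1)\right) \cdot \left|\det \frac{\partial T^{-1}(z_1)}{\partial z_1}\right| dz_1 \tag{A.3}$$

Next, the derivative with respect to $z_1$ is applied to the form of the cdf of $z_1$ just obtained, in order to get a convenient formula for $p_{z_1}(z_1)$:

$$\begin{aligned} p_{z_1}(z_1) &= \frac{\partial^n}{\partial z_{11} \cdots \partial z_{1n}} F_{z_1}(z_1) \\ &= \frac{\partial^n}{\partial z_{11} \cdots \partial z_{1n}} \int^{z_1} p_{z_0}\left(T^{-1}(z_1)\right) \cdot \left|\det \frac{\partial T^{-1}(z_1)}{\partial z_1}\right| dz_1 \\ &= p_{z_0}\left(T^{-1}(z_1)\right) \cdot \left|\det \frac{\partial T^{-1}(z_1)}{\partial z_1}\right| \end{aligned} \tag{A.4}$$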

The matrix inverse of the Jacobian matrix of an invertible function, such as $T$, is the Jacobian matrix of $T^{-1}$ [inv, 2017]:

$$\left(\frac{\partial T(z_0)}{\partial z_0}\right)^{-1} = \frac{\partial T^{-1}(z_1)}{\partial z_1} \tag{A.5}$$

The well known property of the determinant of an inverse matrix, $\det(A^{-1}) = 1/\det(A)$, can be used in order to calculate just the Jacobian of $T$ instead of the Jacobian of $T^{-1}$. This leads to:

$$p_{z_1}(z_1) = p_{z_0}\left(T^{-1}(z_1)\right) \cdot \left|\det \frac{\partial T(z_0)}{\partial z_0}\right|^{-1} \tag{A.6}$$

This form can eventually be expressed as a function of $z_0$:

$$p_{z_1}(z_1) = p_{z_0}(z_0) \cdot \left|\det \frac{\partial T(z_0)}{\partial z_0}\right|^{-1} \tag{A.7}$$
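The result (A.7) can be verified numerically in the simplest possible setting. The sketch below assumes a one-dimensional affine map T(z0) = scale · z0 + shift applied to a standard normal variable, for which the transformed density is known in closed form; the function names are illustrative.

```python
import math


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def transformed_density(z1, scale, shift):
    """p_{z1}(z1) for z1 = T(z0) = scale * z0 + shift with z0 ~ N(0, 1), via (A.7)."""
    if scale == 0:
        raise ValueError("T must be invertible, so scale cannot be zero")
    z0 = (z1 - shift) / scale    # T^{-1}(z1)
    jacobian_det = scale         # dT/dz0 for a one-dimensional affine map
    return normal_pdf(z0) / abs(jacobian_det)


# An affine map of a standard normal is N(shift, scale^2), so both values must agree.
via_formula = transformed_density(1.3, scale=2.0, shift=0.5)
closed_form = normal_pdf(1.3, mean=0.5, std=2.0)
```

The same bookkeeping is what a normalizing flow performs at every step: the density of the base sample is divided by the absolute Jacobian determinant of the transformation.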

A.2 Variational Expectation Lower Bound

Given a dataset $X = \{x^{(1)}, \dots, x^{(N)}\}$ and the respective unobserved latent variables $Z$, it is possible to derive a quantity to be maximized by considering the minimization of the Kullback-Leibler divergence $KL[q_\phi(Z|X) \,||\, p_\theta(Z|X)]$ and decomposing it as follows:

$$\begin{aligned} KL[q_\phi(Z|X) \,||\, p_\theta(Z|X)] &= \mathbb{E}_{q_\phi(Z|X)}\left[\log \frac{q_\phi(Z|X)}{p_\theta(Z|X)}\right] \\ &= \mathbb{E}_{q_\phi(Z|X)}\left[\log q_\phi(Z|X) - \log p_\theta(X|Z) - \log p_\theta(Z) + \log p_\theta(X)\right] \end{aligned} \tag{A.8}$$

As the integrand inside the expectation $\mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X)]$ is a constant w.r.t. $Z$, then $\mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X)] = \log p_\theta(X)$. Hence, the previous expression can be rewritten as:

$$\log p_\theta(X) = KL[q_\phi(Z|X) \,||\, p_\theta(Z|X)] + \mathcal{L}(X) \tag{A.9}$$

where we made use of this shorthand:

$$\mathcal{L}(X) = \mathbb{E}_{q_\phi(Z|X)}\left[-\log q_\phi(Z|X) + \log p_\theta(X|Z) + \log p_\theta(Z)\right] \tag{A.10}$$

As $\log p_\theta(X)$ is a constant w.r.t. the variational parameters φ, and $KL[q_\phi(Z|X) \,||\, p_\theta(Z|X)]$ is always non-negative, the quantity $\mathcal{L}(X)$ can be interpreted as a lower bound to $\log p_\theta(X)$ whose maximization implies the minimization of $KL[q_\phi(Z|X) \,||\, p_\theta(Z|X)]$.

The lower bound $\mathcal{L}(X)$ can also be expressed, as in [Kingma and Welling, 2013], by grouping some terms into a negative Kullback-Leibler divergence:

$$\mathcal{L}(X) = -KL[q_\phi(Z|X) \,||\, p_\theta(Z)] + \mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X|Z)] \tag{A.11}$$

The Kullback-Leibler divergence $KL[q_\phi(Z|X) \,||\, p_\theta(Z)]$ can also be expressed by an entropy and a cross-entropy term, hence:

$$\mathcal{L}(X) = -H[q_\phi(Z|X), p_\theta(Z)] + H[q_\phi(Z|X)] + \mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X|Z)] \tag{A.12}$$

More commonly, the lower bound is re-arranged by making use of the entropy of the variational approximation and the expectation over the joint probability:

$$\mathcal{L}(X) = H[q_\phi(Z|X)] + \mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X, Z)] \tag{A.13}$$

This is described in [Hoffman and Johnson, 2016] as Average negative energy plus entropy.
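The decomposition of the log-evidence into a KL divergence plus the lower bound can be checked on a toy model where everything is computable exactly. The sketch below assumes a latent variable with three discrete states and a single fixed observation; the probability values are arbitrary and serve only as an example.

```python
import math


def elbo_and_kl(prior, likelihood, q):
    """Return (log p(x), KL[q || posterior], ELBO) for a discrete latent variable.

    prior[k] = p(z = k), likelihood[k] = p(x | z = k) for the observed x,
    q[k] = approximate posterior probability of z = k.
    """
    evidence = sum(p * l for p, l in zip(prior, likelihood))
    posterior = [p * l / evidence for p, l in zip(prior, likelihood)]
    kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior) if qz > 0)
    # ELBO as an expectation under q of: -log q(z) + log p(x|z) + log p(z).
    elbo = sum(qz * (math.log(l) + math.log(p) - math.log(qz))
               for qz, p, l in zip(q, prior, likelihood) if qz > 0)
    return math.log(evidence), kl, elbo


log_px, kl, elbo = elbo_and_kl(prior=[0.5, 0.3, 0.2],
                               likelihood=[0.1, 0.6, 0.3],
                               q=[0.2, 0.5, 0.3])
```

The three returned quantities satisfy log p(x) = KL + ELBO, and since the KL term is non-negative the ELBO never exceeds the log-evidence.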

A.3 ELBO as sum of terms dependent on individual datapoints

As used by [Kingma and Welling, 2013], the ELBO can be decomposed into a sum of terms, each dependent only on an individual datapoint. This follows from the assumption that each datapoint, generated by a certain latent variable realization, is independent from the other datapoints:

$$p_\theta(X|Z) = \prod_{i=1}^{N} p_\theta\left(x^{(i)}|z^{(i)}\right) \tag{A.14}$$

The same assumption is made on the prior distribution on the latent variables:

$$p_\theta(Z) = \prod_{i=1}^{N} p_\theta\left(z^{(i)}\right) \tag{A.15}$$

Hence this is the form for the joint probability:

$$p_\theta(X, Z) = p_\theta(X|Z)\, p_\theta(Z) = \prod_{i=1}^{N} p_\theta\left(x^{(i)}|z^{(i)}\right) p_\theta\left(z^{(i)}\right) = \prod_{i=1}^{N} p_\theta\left(x^{(i)}, z^{(i)}\right) \tag{A.16}$$

For convenience, the chosen form for $\mathcal{L}(X)$ will be (A.12).

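It's possible to make use of information-theoretical properties [Bergstrom, 2008]:

$$\begin{aligned} H[q_\phi(Z|X)] &= H\left[q_\phi\left(z^{(1)}|x^{(1)}\right)\right] + H\left[q_\phi\left(Z^{-(1)}|X^{-(1)}\right) \,\middle|\, q_\phi\left(z^{(1)}|x^{(1)}\right)\right] && \text{chain rule for joint entropy} \\ &= H\left[q_\phi\left(z^{(1)}|x^{(1)}\right)\right] + H\left[q_\phi\left(Z^{-(1)}|X^{-(1)}\right)\right] && \text{independence of datapoints} \\ &= \sum_i H\left[q_\phi\left(z^{(i)}|x^{(i)}\right)\right] && \text{recursion} \end{aligned} \tag{A.17}$$

Similarly, for $H[q_\phi(Z|X), p_\theta(Z)]$:

$$H[q_\phi(Z|X), p_\theta(Z)] = \sum_{i=1}^{N} H\left[q_\phi\left(z^{(i)}|x^{(i)}\right), p_\theta\left(z^{(i)}\right)\right] \tag{A.18}$$

For the third term $\mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X|Z)]$:

$$\begin{aligned} \mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X|Z)] &= \int_{z^{(1)}} \cdots \int_{z^{(N)}} \prod_{i} q_\phi\left(z^{(i)}|x^{(i)}\right) \sum_{i=1}^{N} \log p_\theta\left(x^{(i)}|z^{(i)}\right) dz^{(N)} \cdots dz^{(1)} \\ &= \int_{z^{(1)}} q_\phi\left(z^{(1)}|x^{(1)}\right) \cdots \int_{z^{(N)}} q_\phi\left(z^{(N)}|x^{(N)}\right) \sum_{i=1}^{N} \log p_\theta\left(x^{(i)}|z^{(i)}\right) dz^{(N)} \cdots dz^{(1)} \\ &= \sum_{i=1}^{N} \int_{z^{(i)}} q_\phi\left(z^{(i)}|x^{(i)}\right) \log p_\theta\left(x^{(i)}|z^{(i)}\right) dz^{(i)} \\ &= \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z^{(i)}|x^{(i)})}\left[\log p_\theta\left(x^{(i)}|z^{(i)}\right)\right] \end{aligned} \tag{A.19}$$

By plugging these forms into the ELBO (A.12), it can be shown to be a sum of individual objective terms, each of which depends on only a single datapoint: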

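$$\mathcal{L}(X) = \sum_{i=1}^{N} \left( -H\left[q_\phi\left(z^{(i)}|x^{(i)}\right), p_\theta\left(z^{(i)}\right)\right] + H\left[q_\phi\left(z^{(i)}|x^{(i)}\right)\right] + \mathbb{E}_{q_\phi(z^{(i)}|x^{(i)})}\left[\log p_\theta\left(x^{(i)}|z^{(i)}\right)\right] \right) \tag{A.20}$$

It's noteworthy that $\mathcal{L}(X)$ can be expressed by the following mutual information term:

$$I(z; x) = \mathbb{E}_x\left[KL[q_\phi(z|x) \,||\, p_\theta(z)]\right] \approx \frac{1}{N} \sum_{i=1}^{N} KL\left[q_\phi\left(z^{(i)}|x^{(i)}\right) \,||\, p_\theta\left(z^{(i)}\right)\right] \tag{A.21}$$

$$\mathcal{L}(X) = -N \cdot I(z; x) + \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z^{(i)}|x^{(i)})}\left[\log p_\theta\left(x^{(i)}|z^{(i)}\right)\right] \tag{A.22}$$

This is reminiscent of the Average term-by-term reconstruction minus KL to prior interpretation of the ELBO formulated in [Hoffman and Johnson, 2016].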

A.4 Rearranging the ELBO

The term $KL[q_\phi(z|x^{(i)}) \,||\, p_\theta(z)]$ has an analytical solution within the original VAE framework with Gaussian approximation of the posterior, hence it's not subject to Monte-Carlo sampling.

Unfortunately, when using Normalizing Flows, the KL cannot be determined analytically, so it has to be estimated by Monte-Carlo sampling. The lower bound $\mathcal{L}(x)$ can be interpreted as a negative free energy $-\mathcal{F}(x)$, so that maximizing $\mathcal{L}(x)$ corresponds to minimizing the free energy $\mathcal{F}(x)$.
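The Monte-Carlo estimate of a KL divergence is simply the average of log q(z) − log p(z) over samples drawn from q. The sketch below uses a univariate Gaussian q and a standard normal prior, a case chosen only because the analytical value is available for comparison; with Normalizing Flows the same estimator applies, with log q(z) obtained from the flow. Function names, sample count and seed are illustrative.

```python
import math
import random


def log_normal_pdf(z, mean, std):
    """Log-density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((z - mean) / std) ** 2


def monte_carlo_kl(mean, std, num_samples=50000, seed=0):
    """Estimate KL[N(mean, std^2) || N(0, 1)] by sampling from the first distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mean, std)
        total += log_normal_pdf(z, mean, std) - log_normal_pdf(z, 0.0, 1.0)
    return total / num_samples


def analytic_kl(mean, std):
    """Closed-form KL[N(mean, std^2) || N(0, 1)]."""
    return 0.5 * (std ** 2 + mean ** 2 - 1.0 - math.log(std ** 2))


estimate = monte_carlo_kl(1.0, 0.5)
exact = analytic_kl(1.0, 0.5)
```

The estimate fluctuates around the exact value with an error that shrinks as the number of samples grows, which is the price paid for dropping the closed-form expression.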

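It's useful to reduce the free energy into its "atomic" probability components:

$$\begin{aligned} \mathcal{F}\left(x^{(i)}\right) &= -\mathcal{L}\left(x^{(i)}\right) \\ &= -\mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta\left(x^{(i)}, z\right) - \log q_\phi\left(z|x^{(i)}\right)\right] \\ &= \mathbb{E}_{q_\phi(z|x^{(i)})}\left[-\log p_\theta\left(x^{(i)}|z\right) - \log p_\theta(z) + \log q_\phi\left(z|x^{(i)}\right)\right] \end{aligned} \tag{A.23}$$

The random multivariate variable $z$ can be interpreted as being the result of a transformation $z = T(z_0)$ of an initial random multivariate variable $z_0$ which has a simple distribution, such as a multivariate Gaussian with diagonal covariance matrix.

By the law of the unconscious statistician (LOTUS) [Ringner, 2009] the energy can be written with expectations over the simpler distribution of $z_0$:

$$\mathcal{F}\left(x^{(i)}\right) = \mathbb{E}_{q_0(z_0|x^{(i)})}\left[-\log p_\theta\left(x^{(i)}|T(z_0)\right) - \log p_\theta(T(z_0))\right] + \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log q_\phi\left(z|x^{(i)}\right)\right]$$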

