Deep Matrix Factorization for Recommendation

MSc Artificial Intelligence

Track: Natural Language Processing and Learning

Master’s Thesis

by

Mart van Baalen

6144136

September 30, 2016

42 EC

Supervisors:

Prof. Dr. M. Welling

Thomas Kipf, MSc

Assessors:

Dr. P. H. Rodenburg

Dr. E. Kanoulas

Abstract

Matrix factorization (MF) based methods have proven hugely successful in modern recommendation systems. MF methods learn a latent representation of users and items that, when combined in the dot-product, produce an approximation of the rating that a user would give to an item. Recent forays into deep learning based MF methods have shown interesting results.

In this thesis we expand upon these methods. We explore four different but related avenues in autoencoding research: 1) we introduce the Matrix Factorizing Autoencoder, which encodes sparse user and item rating vectors into latent user and item factors, 2) the Matrix Factorizing Variational Autoencoder, which learns variational posterior distributions over latent user and item factors given their respective sparse rating vectors, 3) the Matrix Factorizing Graph Autoencoder, which encodes the graph structure formed by a rating dataset into user and item factors, and 4) the Matrix Factorizing Variational Graph Autoencoder framework, which learns variational posterior distributions over latent user and item factors by encoding the graph structure of the rating data.

We run a number of experiments on these models on three MovieLens datasets, the Netflix dataset and two proprietary datasets that contain user clicks on a Dutch news website. We reach competitive results with some of these models. More importantly, we show that these autoencoder frameworks, especially the Graph Autoencoder framework, are well suited to the problem of rating prediction and might reach state-of-the-art results in future work.

List of Abbreviations

CF Collaborative Filtering

MF Matrix Factorization

AEVB Auto Encoding Variational Bayes

GCN Graph Convolutional (Neural) Network

List of Symbols

rij Rating from user i for item j

D A dataset of observed ratings containing triples {(i, j, rij)}

R Sparse matrix containing ratings, Rij = rij if (i, j, rij) ∈ D

ri· Sparse vector of observed ratings from user i

r·j Sparse vector of observed ratings for item j

r̂ij Predicted rating from user i for item j

ui Latent factor for user i

vj Latent factor for item j

U Matrix of latent user factors

V Matrix of latent item factors

Nu Number of users

Nv Number of items

Nr Number of ratings

gφ(·) Encoder function with parameters φ

fθ(·) Decoder function with parameters θ

We use lowercase bold face characters (e.g. v) to indicate vectors. We treat all vectors, including vectors that correspond to rows of matrices, as column vectors, unless explicitly stated otherwise.

We use uppercase bold face characters (e.g. M) to indicate matrices. The only exception to this rule is that, in the discussion of variational Bayesian methods, we use Z to denote the set of latent variables and parameters, all of which are considered stochastic variables.

In general, unless otherwise noted, we use i to indicate a user, and j to indicate an item.

Acknowledgements

I would like to thank my supervisors, Max Welling and Thomas Kipf for their knowledge, guidance and encouragement. Without their support this thesis would never have been possible.

I would also like to thank Scyfer B.V., in particular its CEO and my internship supervisor Jörgen Sandig, for allowing me to use the BigScyfer server for the many experiments I ran for this thesis, and for letting me write this thesis as part of an internship. This was invaluable in completing my thesis.

On a personal note I would like to thank my parents, Jieles and Paulien, for their support throughout my academic career, and in all other aspects of my life. I would also like to thank my girlfriend Kate for her patience during the long process of completing my thesis.

Notice

Chapter 1 Introduction

In navigating the vast ocean of information available on the modern internet, recommendation systems are indispensable in helping users find relevant items. Recommendation systems make personalized item recommendations for users based on which items the system judges to be relevant to the user [38]. Both the size of inventories of web services, i.e. the items that web services can consider recommending to users, and the amount of web traffic necessitate that the recommendations be automated. While in a brick-and-mortar store an employee might point a customer to interesting items based on personal experience, it is infeasible to let humans recommend items in a web-scale operation. For example, Amazon has 488 million items for sale in the United States, while Netflix has thousands of titles available for streaming and millions of monthly users [12].

Navigating inventories of this size without any sense of direction is a daunting task, and companies have a large financial incentive to help users find the items they need. If a visitor to a webstore is recommended an item that they are interested in but did not specifically search for, they might purchase that item as well as the item(s) for which they initially visited the website. Likewise, if a member of a subscription video streaming service is regularly unsuccessful in finding a video to watch they might eventually end their subscription. In order to point users to useful items in their inventories, online stores like Amazon and online video streaming services such as Netflix invest heavily in recommendation systems. In 2009 Netflix awarded a $1 million prize to a research team that increased the performance of their proprietary recommendation system by more than 10% [4], and the company claims that its current recommendation system is worth $1 billion per year [12].

Many successful recommendation systems treat the problem of recommending items to users as a matrix factorization (MF) problem ([1], [26], [28], [34], [35], [47]): known ratings given by users to specific items are elements in a sparse $N_u \times N_v$ rating matrix $\mathbf{R}$, where $N_u$ is the number of observed users and $N_v$ is the number of items in the system’s inventory. Element $r_{ij}$ is the (observed) rating given to item $j$ by user $i$. MF recommenders decompose $\mathbf{R}$ into low-rank user and item factor matrices $\mathbf{U}$ and $\mathbf{V}$, of dimensionality $N_u \times D$ and $N_v \times D$ respectively, with $D \ll N_u, N_v$. We use $\hat{r}_{ij}$ to indicate a predicted rating given by user $i$ to item $j$. Predictions for ratings $\hat{r}_{i'j'}$ can be made by taking the dot product of the corresponding user and item factors for a previously unobserved user/item pair $i'/j'$ in $\mathbf{U}$ and $\mathbf{V}$ respectively: $\hat{r}_{i'j'} = \mathbf{u}_{i'}^T \mathbf{v}_{j'}$. However, the sparsity of $\mathbf{R}$ makes it non-trivial to find a decomposition that both fits the training data and also generalizes well to new ratings ([35], [34]).

Recent research has extended the MF framework into the realm of deep learning. For example, Strub et al. [41] use an autoencoder to learn a factorization of $\mathbf{R}$. Dziugaite and Roy [19] extend the basic MF framework by using a deep neural network to predict a rating from latent user and item factors, instead of the standard dot product. That is, instead of using a prediction function

$$\hat{r}_{ij} = \mathbf{u}_i^T \mathbf{v}_j$$

the authors use a deep neural network to predict a rating from $\mathbf{u}_i$ and $\mathbf{v}_j$¹:

$$\hat{r}_{ij} = \mathrm{NN}(\mathbf{u}_i, \mathbf{v}_j)$$

where $\mathrm{NN}(\cdot, \cdot)$ indicates a neural network that takes as input a user factor and an item factor.

¹ This is a simplified version of the system described in [19] that functions as an example.

1.1 Contributions

In this thesis we extend the previously mentioned research by applying new advances in deep learning. Our extensions of current methods lead to the following three contributions:

1. We build an autoencoder-based matrix factorization model that includes encoders for the sparse rating vectors of both the users and the items. This is in contrast to Strub et al. [41], who only encode either the user or the item rating vectors. This is also in contrast to Dziugaite and Roy [19], who use gradient descent to find the latent user and item factors and thus have no explicit encoding function for the rating vectors. The benefit of using encoders for both user and item rating vectors is that a mapping from user and item ratings to latent user and item factors is learned. This can be used to add new users or items to the system as soon as a small number of ratings for these users/items is available.

2. We apply the Autoencoding Variational Bayes (AEVB) algorithm [21] to the problem of matrix factorization. This will bridge the gap between the autoencoding formulation of matrix factorization introduced in point 1 and the probabilistic formulation of matrix factorization introduced in e.g., [28], [1], [34].

3. We also apply very recent research in Graph Convolutional Networks by Kipf and Welling [22]. This research employs a novel approximation to previously difficult-to-compute convolutions over graph-structured data. Kipf and Welling build an autoencoder that encodes both item features and the graph structure. We apply this method in encoding the graph structure of the rating data in an autoencoding setting, similar to the other two methods. We apply this method both to the regular autoencoder and within the AEVB framework.

1.2 Thesis Contents

The remainder of this thesis is organized as follows. Chapter 2 introduces core concepts on which this research builds in more depth. Chapter 3 introduces the Matrix Factorizing Autoencoder, the Matrix Factorizing Variational Autoencoder, the Matrix Factorizing Graph Autoencoder and the Matrix Factorizing Variational Graph Autoencoder algorithms. Chapter 4 describes a number of experiments run on real-life rating and click data and presents and analyzes the results. Chapter 5 places this research in the broader context of Matrix Factorization for recommendation and describes similarities and differences between this and other recent work. Chapter 6 gives a summary of the work presented in this thesis, leads for future work, and concludes this thesis.

Chapter 2 Background

This chapter introduces important concepts that the original research in this thesis builds upon.

2.1 Recommendation Systems

Many recommendation systems utilize a user’s history of explicit or implicit preference indications to select items that might be relevant to the user. Preference indications can be explicit, as is the case when a user rates items, or implicit, as is the case when a recommendation system has access to a user’s click behavior and uses that to infer preference. An example of the former is Netflix, where users can give a rating of 1 to 5 stars to videos they have watched. An example of the latter is Amazon, which exploits observed user behavior, such as past search terms, item clicks and time spent on a clicked item, as well as purchase history, to infer which items were interesting to a user and which were not.

Recommendation systems have traditionally been categorized into Content-based Filtering and Collaborative Filtering systems. Content-based filtering systems combine the history of a user’s preference indications with features of the items, such as item categories or text in the item description, to find new items that are similar to the items the user has shown a preference for in the past. Collaborative filtering (CF) systems use the preference indications given by all users to find items that are relevant to a target user. The term ‘filtering’ stems from the idea that recommendation systems help filter relevant items in an inventory. The term ‘collaborative’ indicates that all users ‘collaborate’ in finding interesting items. Modern CF systems learn some internal representation of users and items and combine user and item representations to predict relevance. In recent years the distinction between CF and content-based filtering has become less strict as many CF systems incorporate side information about users or items (e.g. age of user, genre of a movie) to mitigate the cold-start problem. However, the term ‘Collaborative Filtering’ is still often used to describe systems in which the preference indication history of all users is utilized to recommend items.

A drawback of basing recommendations on previously observed preference indications is that no recommendations can be made for users who have not interacted with any items, e.g. new users. This problem is referred to as the cold-start problem. Note that this is a bigger problem in CF than in content-based filtering: since content-based filtering depends on item features and user preference indications it is not necessary for an item to be rated before it can be recommended. CF, however, depends on preference indications of other users for items, which implies that an item will not be recommended if it has no observed preference indications.

While there are many possible ways to use past interactions to predict relevant items, in this thesis we will focus on the problem of using CF for rating prediction, using past ratings to predict new ratings. By supposition, higher predicted ratings correspond to more relevant items. Learning-to-rank recommendation systems that predict a ranking of relevance to a user for all items do exist (e.g., [2], [46]), but are not discussed in this thesis.

Collaborative Filtering and Matrix Factorization

As stated in the introduction, many successful recommendation systems approach the problem of rating prediction as a problem of matrix factorization (for example: [42], [1], [23], [25]). A set of ratings $r_{ij}$ of a user $i$ for an item $j$ can be represented as a sparse rating matrix $\mathbf{R}$. Element $R_{ij}$ of matrix $\mathbf{R}$ corresponds to rating $r_{ij}$ of user $i$ for item $j$. Figure 2.1 gives a toy example of a sparse rating matrix.

[Figure 2.1 image: rating matrix $\mathbf{R}$ with users as rows and items as columns; row $\mathbf{r}_{i\cdot}$ and column $\mathbf{r}_{\cdot j}$ are highlighted.]

Figure 2.1: Illustration of the sparse rating matrix $\mathbf{R}$. To make the illustration less overwhelming only the ratings for user $i$ and item $j$ are shown. Note that user $i$ has not rated item $j$. $\mathbf{r}_{i\cdot}$ indicates the sparse vector of observed ratings for user $i$, i.e. row $i$ in matrix $\mathbf{R}$; $\mathbf{r}_{\cdot j}$ indicates the sparse vector of observed ratings for item $j$, i.e. column $j$ in matrix $\mathbf{R}$.

The recurring theme in the family of MF methods is the decomposition of the sparse rating matrix $\mathbf{R}$ into user and item matrix factors $\mathbf{U}$ and $\mathbf{V}$ of dimensionality $N_u \times D$ and $N_v \times D$ respectively, with $D \ll N_u, N_v$, such that $\mathbf{U}\mathbf{V}^T$ most closely reconstructs the previously observed ratings, while still generalizing well to new ratings. A new rating $\hat{r}_{ij}$ for a previously unobserved user/item pair $i, j$ is predicted as the dot product of the latent user factor $\mathbf{u}_i$ and the latent item factor $\mathbf{v}_j$:

$$\hat{r}_{ij} = \mathbf{u}_i^T \mathbf{v}_j \tag{2.1}$$

Figure 2.2 gives a schematic view of matrix factorization.
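As a small illustration of eq. (2.1), the sketch below computes predictions from latent factors with NumPy. The sizes `n_users`, `n_items` and `n_factors`, as well as the randomly drawn factors, are assumptions made for this example and are not taken from the experiments in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 5, 4, 2  # toy sizes, assumed for illustration

# Rows of U and V are the latent user factors u_i and item factors v_j.
U = rng.normal(size=(n_users, n_factors))
V = rng.normal(size=(n_items, n_factors))

def predict(i, j):
    """Predicted rating for user i and item j: the dot product u_i^T v_j."""
    return U[i] @ V[j]

# All predictions at once: entry (i, j) of U V^T equals u_i^T v_j.
R_hat = U @ V.T
```

In a trained model U and V would be fitted to the observed ratings; here they only serve to show the shape of the computation.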

If $\mathbf{R}$ were fully observed one could use Singular Value Decomposition (SVD) to find low-rank matrix factors $\mathbf{U}$ and $\mathbf{V}$. However, $\mathbf{R}$ is only partially observed. A naive approach in which zeros are used as placeholders for the unobserved values will cause the SVD algorithm to fit the zeros and predict (a value close to) 0 for unobserved ratings. Modifying SVD to optimize the factors $\mathbf{U}$ and $\mathbf{V}$ w.r.t. the sum-squared distance between the predicted and observed ratings for only the observed ratings leads to a non-convex optimization problem [39].

[Figure 2.2 image: rating matrix $\mathbf{R}$ (users by items) $\approx$ $\mathbf{U}\mathbf{V}^T$, with $\hat{r}_{ij} = \mathbf{u}_i^T \mathbf{v}_j$.]

Figure 2.2: Schematic overview of the factorization of matrix $\mathbf{R}$ into user and item matrix factors $\mathbf{U}$ and $\mathbf{V}^T$. User $i$ is represented by the latent user factor $\mathbf{u}_i$, i.e. row $i$ in $\mathbf{U}$; item $j$ is represented by the latent item factor $\mathbf{v}_j$, i.e. column $j$ in $\mathbf{V}^T$. The unobserved rating $\hat{r}_{ij}$, indicated by the purple square in $\mathbf{R}$, is predicted from the dot product of $\mathbf{u}_i$ and $\mathbf{v}_j$. Adapted from [45].
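The effect of zero-filling can be made concrete with a small synthetic experiment. The sketch below is an illustration under assumed settings (a rank-2 positive rating matrix, roughly 30% observed entries, a fixed random seed) and is not one of the experiments of this thesis. It applies a truncated SVD to the zero-filled matrix and compares the average prediction on unobserved entries with the average true rating.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, rank = 50, 40, 2  # toy sizes, assumed for illustration

# Synthetic low-rank "true" ratings, all strictly positive.
U_true = rng.uniform(0.5, 1.5, size=(n_users, rank))
V_true = rng.uniform(0.5, 1.5, size=(n_items, rank))
R_full = U_true @ V_true.T

# Observe roughly 30% of the entries; unobserved entries are set to zero.
observed = rng.random((n_users, n_items)) < 0.3
R_zero_filled = np.where(observed, R_full, 0.0)

# Rank-`rank` truncated SVD of the zero-filled matrix.
P, s, Qt = np.linalg.svd(R_zero_filled, full_matrices=False)
R_hat = (P[:, :rank] * s[:rank]) @ Qt[:rank]

mean_true = R_full[~observed].mean()
mean_pred = R_hat[~observed].mean()
print(f"unobserved entries: true mean {mean_true:.2f}, predicted mean {mean_pred:.2f}")
```

Because most entries of the zero-filled matrix are zero, the low-rank reconstruction is pulled towards zero on the unobserved entries, which is the behaviour described above.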

Probabilistic Matrix Factorization

An early successful approach to matrix factorization is Probabilistic Matrix Factorization (PMF, [35]). We will describe this method in some detail to illustrate matrix factorization for recommendation.

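The PMF approach assigns a zero-mean Gaussian prior distribution with spherical covariance $\sigma_f^2 \mathbf{I}$ to the rows of the matrix factors $\mathbf{U}$ and $\mathbf{V}$, and assumes a rating $r_{ij}$ is a normally distributed variable with mean $\mathbf{u}_i^T \mathbf{v}_j$ and some fixed variance $\sigma^2$. The full data log-likelihood of this model is:

$$
\begin{aligned}
\log p(\mathcal{D}, \mathbf{U}, \mathbf{V}) &= \log \left[ p(\mathbf{U})\, p(\mathbf{V})\, p(\mathcal{D} \mid \mathbf{U}, \mathbf{V}) \right] \\
&= \sum_i \log p(\mathbf{u}_i) + \sum_j \log p(\mathbf{v}_j) + \sum_{(i,j,r_{ij}) \in \mathcal{D}} \log p(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j) \\
&= \sum_i \log \mathcal{N}(\mathbf{u}_i \mid \mathbf{0}, \sigma_f^2 \mathbf{I}) + \sum_j \log \mathcal{N}(\mathbf{v}_j \mid \mathbf{0}, \sigma_f^2 \mathbf{I}) + \sum_{(i,j,r_{ij}) \in \mathcal{D}} \log \mathcal{N}(r_{ij} \mid \mathbf{u}_i^T \mathbf{v}_j, \sigma^2)
\end{aligned}
\tag{2.2}
$$

where $\sigma_f^2$ is the variance of the latent factors and $\sigma^2$ is the variance of the ratings. If we assume $\sigma_f^2$ and $\sigma^2$ are fixed, then minimizing the negative log-likelihood is equivalent to minimizing the regularized squared error on the data:

$$
\operatorname*{argmin}_{\mathbf{U},\mathbf{V}} \; -\log p(\mathcal{D}, \mathbf{U}, \mathbf{V}) = \operatorname*{argmin}_{\mathbf{U},\mathbf{V}} \sum_{(i,j,r_{ij}) \in \mathcal{D}} (r_{ij} - \mathbf{u}_i^T \mathbf{v}_j)^2 + \lambda \left( \|\mathbf{U}\|_2^2 + \|\mathbf{V}\|_2^2 \right)
\tag{2.3}
$$

where we have defined $\lambda = \sigma^2 / \sigma_f^2$ and $\|\mathbf{A}\|_2^2 = \sum_{ij} A_{ij}^2$ is the squared Frobenius norm [35]. The regularized squared error of eq. (2.3) can easily be optimized by gradient-based optimization methods [35].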
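To make the gradient-based optimization of eq. (2.3) concrete, the following is a minimal sketch of full-batch gradient descent in NumPy. The synthetic data, the values of `lam`, `lr` and `n_steps`, and the use of plain (rather than stochastic) gradient descent are assumptions made for this example; this is not the training procedure of [35].

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 30, 20, 3   # toy sizes (assumed)
lam, lr, n_steps = 0.1, 0.005, 500        # lambda, step size, iterations (assumed)

# Synthetic ratings from random "true" factors; about 40% are observed.
R = rng.normal(size=(n_users, n_factors)) @ rng.normal(size=(n_items, n_factors)).T
mask = rng.random((n_users, n_items)) < 0.4

def objective(U, V):
    """Regularized squared error of eq. (2.3), summed over observed ratings only."""
    err = mask * (R - U @ V.T)
    return np.sum(err ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))

U = 0.1 * rng.normal(size=(n_users, n_factors))
V = 0.1 * rng.normal(size=(n_items, n_factors))
losses = [objective(U, V)]
for _ in range(n_steps):
    err = mask * (R - U @ V.T)              # residuals, zero where unobserved
    grad_U = -2 * err @ V + 2 * lam * U     # gradient of the objective w.r.t. U
    grad_V = -2 * err.T @ U + 2 * lam * V   # gradient of the objective w.r.t. V
    U, V = U - lr * grad_U, V - lr * grad_V
    losses.append(objective(U, V))
```

The mask ensures that only observed ratings contribute to the error, so unobserved entries are never fitted.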

Note that it is not necessarily straightforward to add new users or items to $\mathbf{U}$ or $\mathbf{V}$. For example, if ratings for a new user $i'$ are observed we could optimize their latent factor $\mathbf{u}_{i'}$ by finding

$$
\operatorname*{argmin}_{\mathbf{u}_{i'}} \sum_{j : r_{i'j} \in \mathcal{D}} (r_{i'j} - \mathbf{u}_{i'}^T \mathbf{v}_j)^2 + \lambda \|\mathbf{u}_{i'}\|_2^2
\tag{2.4}
$$

where $\|\mathbf{u}_{i'}\|_2^2$ indicates the squared vector norm of $\mathbf{u}_{i'}$. This is convex in $\mathbf{u}_{i'}$ and can be solved using regularized least squares. However this solution would (presumably) have an effect on the error surface of eq. (2.3) w.r.t. the rows in $\mathbf{V}$ that correspond to items rated by user $i'$. Optimizing $\mathbf{V}$ will then create a second-order effect on the optimal value of $\mathbf{U}$. This means that gradient descent updates w.r.t. the full matrices $\mathbf{U}$ and $\mathbf{V}$ have to be done. In other words, one or more potentially expensive iterations of gradient descent have to be performed to update the factors.
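The regularized least squares solution of eq. (2.4) can be sketched as follows. The function name `fold_in_user` and the toy values are hypothetical and chosen for this example. Note that the sketch only computes the new user's factor with the item factors held fixed; it does not perform the subsequent updates of the full matrices discussed above.

```python
import numpy as np

def fold_in_user(ratings, V_rated, lam):
    """Solve eq. (2.4) in closed form for a single new user.

    ratings : (n,) observed ratings of the new user
    V_rated : (n, D) fixed item factors v_j of the rated items
    lam     : regularization strength lambda
    """
    D = V_rated.shape[1]
    # Setting the gradient of eq. (2.4) to zero gives the normal equations
    # (V^T V + lam * I) u = V^T r.
    A = V_rated.T @ V_rated + lam * np.eye(D)
    return np.linalg.solve(A, V_rated.T @ ratings)

rng = np.random.default_rng(0)
V_rated = rng.normal(size=(6, 3))      # factors of six rated items (toy values)
ratings = rng.uniform(1, 5, size=6)    # the new user's ratings (toy values)
lam = 0.5
u_new = fold_in_user(ratings, V_rated, lam)

# Gradient of eq. (2.4) evaluated at the solution; it should vanish.
grad = -2 * V_rated.T @ (ratings - V_rated @ u_new) + 2 * lam * u_new
```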

A variational Bayesian algorithm for Matrix Factorization was introduced by Lim and Teh [28]. This algorithm is described at a high level in Section 2.2.2. Many other algorithms based on a probabilistic interpretation of matrix factorization that are not directly relevant to this thesis have been developed. A selection of these algorithms is discussed in Chapter 5.

New Users in the Matrix Factorization Paradigm

We explicate some concepts that are relevant to the remainder of this thesis:

1. We use the term ‘Known Users’ or ‘Known Items’ to denote users or items that were seen during training.

2. We use the term ‘New Users’ or ‘New Items’ to denote users or items that were not seen during training.

Note that this implies that, when new users and items appear after training, new users can rate new items. Figure 2.3 shows how known users and items and new users and items, as well as their ratings, relate to each other.

Also note that, in this terminology, new users differ from cold-start users in that we assume they have already rated a number of items. Similarly new items differ from cold-start items in that we assume they have already been rated by a number of users. This distinction will be important later on in this thesis.

2.2 Autoencoding Variational Bayes

This section describes the Autoencoding Variational Bayes (AEVB) algorithm introduced by Kingma and Welling [21]. This section is more extensive than the previous sections in this chapter, as the AEVB algorithm requires some background in both autoencoders and variational Bayesian (VB) methods. In this section we will first briefly describe autoencoders and basic variational Bayesian methods and then use these descriptions to introduce the Autoencoding Variational Bayes algorithm of Kingma and Welling [21].

2.2.1 Autoencoders

Autoencoders are unsupervised learning methods that are used to learn latent representations of input data [3]. Autoencoders consist of two parts:

1. An encoder function $g_\phi : \mathbb{R}^N \to \mathbb{R}^M$, parametrized by a set of parameters $\phi$

2. A decoder function $f_\theta : \mathbb{R}^M \to \mathbb{R}^N$, parametrized by a set of parameters $\theta$

[Figure 2.3 image: block matrix with rows split into known and new users and columns split into known and new items; the top left block is $\mathbf{R}$.]

Figure 2.3: Illustration of the difference between ratings from new users for known items or from known users on new items on the one hand, and ratings from new users for new items on the other hand. In this figure, submatrix $\mathbf{R}$ contains the ratings used for training. The bottom left submatrix contains ratings from new users on known items. The top right submatrix contains ratings from known users on new items. The bottom right matrix contains ratings from new users on new items.

The encoder encodes an input vector $\mathbf{x} \in \mathbb{R}^N$ (e.g. an image vector) into a latent representation $\mathbf{z} \in \mathbb{R}^M$, while the decoder decodes a latent vector $\mathbf{z} \in \mathbb{R}^M$ and attempts to reconstruct the original input vector $\mathbf{x} \in \mathbb{R}^N$.

For some dataset $\mathcal{D}$ an autoencoder is trained with the following objective:

$$
\operatorname*{argmin}_{\phi,\theta} \sum_{\mathbf{x} \in \mathcal{D}} \mathrm{Err}\left[\mathbf{x}, f_\theta(g_\phi(\mathbf{x}))\right] + \mathrm{Reg}(\phi, \theta)
\tag{2.5}
$$

where $\mathrm{Err}\left[\mathbf{x}, f_\theta(g_\phi(\mathbf{x}))\right]$ denotes the reconstruction error for an input $\mathbf{x}$. The reconstruction error is a measure of how much the original input to an encoder differs from the output predicted by the autoencoder. $\mathrm{Reg}(\phi, \theta)$ is the regularization penalty for the parameters $\phi$ and $\theta$.

Autoencoders are usually used in Deep Learning. In this context the encoder $g_\phi(\cdot)$ and the decoder $f_\theta(\cdot)$ are both Neural Networks (e.g. simple Multilayer Perceptrons) with parameters $\phi$ and $\theta$, respectively. The objective function of eq. (2.5) is then minimized with respect to $\phi$ and $\theta$ using gradient-based optimization methods, such as Stochastic Gradient Descent (SGD).

Hybrid CF with Autoencoders

Relevant to the work in this thesis is recent work by Strub et al. [41]¹. Strub et al. approach the MF problem as an autoencoder problem. Their Collaborative Filtering Network (CFN) model is a very basic autoencoder that encodes either the rows (sparse user rating vectors) or columns (sparse item rating vectors) of $\mathbf{R}$ into a latent space of dimensionality $D$. Their decoder then predicts a dense rating vector, i.e. including predictions for unobserved ratings. In the following discussion we will describe their U-CFN model, which encodes the sparse vector of user ratings $\mathbf{r}_{i\cdot}$, i.e. the rows of $\mathbf{R}$, with the understanding that their V-CFN model, which encodes the sparse item rating vectors $\mathbf{r}_{\cdot j}$, i.e. the columns of $\mathbf{R}$, mirrors the U-CFN model.

¹ Research for this thesis had begun before [41] was released. The work in this thesis is

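Their architecture has the form:

$$
\hat{\mathbf{r}}_{i\cdot} = \mathbf{W}^{(2)} \sigma(\mathbf{W}^{(1)} \mathbf{r}_{i\cdot} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}
\tag{2.6}
$$

where $\hat{\mathbf{r}}_{i\cdot}$ indicates the reconstructed (dense) user rating vector; $\mathbf{W}^{(1)}, \mathbf{b}^{(1)}$ indicate the $D \times N_v$-dimensional weight matrix and the $D$-dimensional bias vector of the encoder function; $\mathbf{W}^{(2)}, \mathbf{b}^{(2)}$ indicate the $N_v \times D$-dimensional weight matrix and the $N_v$-dimensional bias vector of the decoder; and $\sigma(\cdot)$ indicates an arbitrary activation function. The encoder and decoder thus have the following form:

$$
\begin{aligned}
\text{encoder}: \quad g_\phi(\mathbf{r}_{i\cdot}) &= \sigma(\mathbf{W}^{(1)} \mathbf{r}_{i\cdot} + \mathbf{b}^{(1)}) \\
\text{decoder}: \quad f_\theta(g_\phi(\mathbf{r}_{i\cdot})) &= \mathbf{W}^{(2)} g_\phi(\mathbf{r}_{i\cdot}) + \mathbf{b}^{(2)}
\end{aligned}
$$

The encoder $g_\phi(\cdot)$ maps the sparse rating vector to a latent representation. The decoder $f_\theta(\cdot)$ reconstructs the dense rating vector from the latent representation.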

The authors note that there is a strong similarity between their autoencoding formulation of matrix factorization and classic matrix factorization methods. To make the link more apparent we shall refer to the output vector of the encoder function $g_\phi$ with input $\mathbf{r}_{i\cdot}$ as $\mathbf{u}_i$. Thus:

$$
\mathbf{u}_i = g_\phi(\mathbf{r}_{i\cdot}) = \sigma(\mathbf{W}^{(1)} \mathbf{r}_{i\cdot} + \mathbf{b}^{(1)})
\tag{2.7}
$$

The similarities become clear when focusing on the single rating $\hat{r}_{ij}$:

$$
\hat{r}_{ij} = f_\theta(g_\phi(\mathbf{r}_{i\cdot}))_j = {\mathbf{W}^{(2)}_j}^T \mathbf{u}_i + b^{(2)}_j
\tag{2.8}
$$

where we use $\mathbf{W}^{(2)}_j$ to indicate the $j$th row in $\mathbf{W}^{(2)}$ and $b^{(2)}_j$ to denote the $j$th element in $\mathbf{b}^{(2)}$. If we use $\mathbf{v}_j$ to denote $\mathbf{W}^{(2)}_j$, and $b_j$ to denote $b^{(2)}_j$, the rating prediction becomes an item-bias corrected matrix factorization prediction:

$$
\hat{r}_{ij} = \mathbf{v}_j^T \mathbf{u}_i + b_j
\tag{2.9}
$$

[Figure 2.4 image: the sparse row $\mathbf{r}_{i\cdot} \in \mathbb{R}^{N_v}$ of $\mathbf{R}$ is encoded by $g_\phi$ into $\mathbf{u}_i \in \mathbb{R}^D$, which $f_\theta$ multiplies with $\mathbf{W}^{(2)} \in \mathbb{R}^{N_v \times D}$ to produce the dense prediction vector $\hat{\mathbf{r}}_{i\cdot} \in \mathbb{R}^{N_v}$.]

Figure 2.4: Schematic (and simplified) representation of the U-CFN architecture. The encoder, $g_\phi(\cdot)$, encodes the sparse rating vector $\mathbf{r}_{i\cdot}$ of user $i$ into a (dense) latent factor $\mathbf{u}_i$. The decoder $f_\theta(\cdot)$ decodes the encoded factor $\mathbf{u}_i$ into a dense rating vector reconstruction $\hat{\mathbf{r}}_{i\cdot}$ by multiplying the latent factor $\mathbf{u}_i$ with a learned matrix $\mathbf{W}^{(2)}$. In this case the rows of $\mathbf{W}^{(2)}$ encode the sparse item rating vectors, i.e. the columns of $\mathbf{R}$. The predicted rating $\hat{r}_{ij}$ is indicated in the rightmost column vector by the purple square.

Note that in this case, $\mathbf{v}_j$ is fixed after training, while $\mathbf{u}_i$ is a function of the sparse rating vector $\mathbf{r}_{i\cdot}$. To emphasize the similarity to standard MF approaches we shall refer to the weight matrix $\mathbf{W}^{(2)}$ as $\mathbf{V}$. Figure 2.4 gives a schematic representation of the U-CFN autoencoder architecture. The bias vector $\mathbf{b}^{(2)}$ is left out of this figure for simplicity.
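The following sketch illustrates the forward pass of such a user autoencoder and checks that a single decoded rating equals the item-bias corrected dot product described above (cf. eq. (2.8)). It is an illustration only, not the implementation of [41]: the toy sizes, the random (untrained) parameters, the choice of tanh for the activation function, and the use of zeros for unobserved ratings in the input vector are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_factors = 6, 3   # N_v and D, toy sizes (assumed)

# Randomly initialized parameters; in practice these would be learned.
W1, b1 = rng.normal(size=(n_factors, n_items)), np.zeros(n_factors)       # encoder
W2, b2 = rng.normal(size=(n_items, n_factors)), rng.normal(size=n_items)  # decoder

def encoder(r):
    """g_phi: sparse rating vector -> latent user factor u_i (tanh assumed)."""
    return np.tanh(W1 @ r + b1)

def decoder(u):
    """f_theta: latent user factor -> dense vector of predicted ratings."""
    return W2 @ u + b2

r_i = np.array([1.0, 0.0, 4.0, 0.0, 5.0, 0.0])  # zeros stand for unobserved ratings
u_i = encoder(r_i)
r_hat = decoder(u_i)

# Matrix factorization view of one prediction: row j of W2 plays the role
# of the item factor v_j and element j of b2 the role of the item bias b_j.
j = 1
v_j, b_j = W2[j], b2[j]
r_hat_ij = v_j @ u_i + b_j
```

The same `encoder` can be applied to the rating vector of a user that was not seen during training, as long as that user has only rated known items.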

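The authors optimize their model by performing gradient descent on the regularized error on the observed ratings. The target is thus to find:

$$
\operatorname*{argmin}_{\phi,\theta} \sum_i \sum_{j : r_{ij} \in \mathcal{D}} \mathrm{Err}\left(r_{ij}, f_\theta(g_\phi(\mathbf{r}_{i\cdot}))_j\right) + \mathrm{Reg}(\phi, \theta)
\tag{2.10}
$$

where $\mathrm{Err}(\cdot, \cdot)$ is the reconstruction error, $\mathrm{Reg}(\cdot)$ is a regularization penalty on the parameters, and $f_\theta(g_\phi(\mathbf{r}_{i\cdot}))_j$ is used to denote the $j$th element in the predicted rating vector.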
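A possible instantiation of the objective of eq. (2.10) is sketched below, assuming a squared error for Err and an L2 penalty for Reg; both choices, the function name `masked_objective` and the toy values are assumptions made for this example. The key point is the mask: predictions for unobserved ratings do not contribute to the error.

```python
import numpy as np

def masked_objective(R, observed, R_hat, params, lam):
    """Squared reconstruction error on observed ratings plus an L2 penalty.

    R        : (N_u, N_v) rating matrix, arbitrary values where unobserved
    observed : (N_u, N_v) boolean mask, True where r_ij is in the dataset
    R_hat    : (N_u, N_v) dense predictions, one row per user
    params   : iterable of parameter arrays to regularize
    lam      : regularization strength
    """
    err = np.where(observed, R - R_hat, 0.0)   # unobserved entries contribute 0
    reg = sum(np.sum(p ** 2) for p in params)
    return np.sum(err ** 2) + lam * reg

R = np.array([[5.0, 0.0, 3.0],
              [0.0, 2.0, 0.0]])
observed = R > 0
R_hat = np.array([[4.5, 1.0, 3.5],
                  [2.0, 2.0, 4.0]])
W = [np.ones((2, 2))]
loss = masked_objective(R, observed, R_hat, W, lam=0.1)
```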

An interesting aspect of this approach is that only the item factor matrix $\mathbf{W}^{(2)}$ (i.e. $\mathbf{V}$ in this example) is directly optimized. The model learns a mapping $g_\phi(\cdot)$ from user ratings to a latent user factor. Therefore, the latent user matrix $\mathbf{U}$ can be constructed by feeding $\mathbf{R}$ into the encoder function:

$$
\mathbf{U} = g_\phi(\mathbf{R})
\tag{2.11}
$$

But this also means that a new user $i'$, provided they have only rated known items, can easily be added to the system by feeding their observed rating vector $\mathbf{r}_{i'\cdot}$ into the encoder $g_\phi(\cdot)$.

Note that Sedhain et al. [37] propose a model that is very similar to the CFN models proposed by Strub et al.

2.2.2 Variational Bayes

This subsection is mostly a brief summary of the material in chapter 10 of Bishop [6], unless otherwise noted. Variational Bayesian methods are used for finding approximate posterior distributions in complex (generally Bayesian) probabilistic models where there is no analytical solution to the true posterior.

We consider models in which there are observed variables denoted by $X$. We assume that the model has latent variables as well. In true Bayesian fashion we also treat the parameters of distributions as latent variables with appropriate priors. We use $Z$ to denote the set of all latent variables and parameters, and $z_i$ to denote a single latent variable or parameter. Since, by supposition, there is no analytical solution to the true posterior distribution of the latent variables, variational Bayesian methods introduce a variational distribution $Q(Z)$ over the latent variables and parameters. Variational methods are optimized by making the variational distribution $Q(Z)$ approximate the true but intractable posterior distribution $p(Z \mid X)$ as closely as possible.

One can verify that the marginal log probability of the data, $\log p(X)$, can be written as:

$$
\log p(X) = \int Q(Z) \log \frac{p(X, Z)}{Q(Z)} \, dZ - \int Q(Z) \log \frac{p(Z \mid X)}{Q(Z)} \, dZ
\tag{2.12}
$$

We define

$$
\begin{aligned}
\mathcal{L}(Q) &= \int Q(Z) \log \frac{p(X, Z)}{Q(Z)} \, dZ \\
D_{KL}\left(Q(Z) \,\|\, p(Z \mid X)\right) &= -\int Q(Z) \log \frac{p(Z \mid X)}{Q(Z)} \, dZ
\end{aligned}
$$

Thus

$$
\log p(X) = \mathcal{L}(Q) + D_{KL}\left(Q(Z) \,\|\, p(Z \mid X)\right)
\tag{2.13}
$$

where $D_{KL}$ is the Kullback-Leibler (KL)-divergence and the distribution $Q(Z)$ is the variational distribution over the latent variables. The KL-divergence $D_{KL}(Q \,\|\, P)$ is a measure of how much a probability density function (pdf), or probability mass function (pmf) for discrete variables, $Q$ differs from a pdf (or pmf) $P$ over the same variables. The KL-divergence is always $\geq 0$, with equality iff $Q = P$. Therefore the KL-divergence can be used as a measure of how much the variational approximation $Q(Z)$ differs from the true posterior $p(Z \mid X)$.

Since the KL-divergence is nonnegative and 0 iff $Q(Z) = p(Z \mid X)$, $\mathcal{L}(Q)$ is a lower bound on $\log p(X)$, and is often referred to as the Evidence Lower BOund, or ELBO. The difference between $\mathcal{L}(Q)$ and $\log p(X)$ is $D_{KL}\left(Q(Z) \,\|\, p(Z \mid X)\right)$. Because $\log p(X)$ does not depend on $Q$, this implies that maximizing $\mathcal{L}(Q)$ is equivalent to minimizing the KL-divergence between the approximate posterior and the true posterior. The ELBO can itself be further decomposed into:

$$
\mathcal{L}(Q) = \int Q(Z) \log \frac{p(X, Z)}{Q(Z)} \, dZ = \mathbb{E}_Q[\log p(X, Z)] + H[Q]
\tag{2.14}
$$

where H[Q] indicates the entropy of Q.
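The decompositions of eqs. (2.13) and (2.14) can be checked numerically on a toy model. The sketch below assumes a single discrete latent variable with `K` states, so that the integrals become sums; the joint probabilities and the variational distribution are random values chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of states of a single discrete latent variable (assumed)

# Joint p(X = x, Z = k) for one fixed observation x, as a vector over k.
joint = rng.random(K) * 0.1
log_px = np.log(joint.sum())          # log marginal likelihood log p(X)
posterior = joint / joint.sum()       # true posterior p(Z | X)

# An arbitrary variational distribution Q(Z).
Q = rng.random(K)
Q /= Q.sum()

elbo = np.sum(Q * np.log(joint / Q))            # L(Q)
kl = -np.sum(Q * np.log(posterior / Q))         # D_KL(Q || p(Z | X))
entropy = -np.sum(Q * np.log(Q))                # H[Q]
expected_log_joint = np.sum(Q * np.log(joint))  # E_Q[log p(X, Z)]
```

For any choice of Q the sum `elbo + kl` equals `log_px`, and `elbo` never exceeds `log_px`.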

Assuming the approximate posterior Q(Z) is parametrized by a set of parameters Ψ, one can perform gradient ascent on L(Q) w.r.t. the parameters of Ψ [30].

However, the expectation of the log-joint $\mathbb{E}_Q[\log p(X, Z)]$, which decomposes into a sum of terms, possibly contains expectations that are intractable to compute [30]. Let $f(z_i)$ be a function that contains all terms in $\log p(X, Z)$ that depend on $z_i$, with $z_i$ chosen such that $\mathbb{E}_{q_\psi}[f(z_i)]$ is an intractable expectation. Let $q_\psi(z_i)$ be the approximate posterior distribution over $z_i$, parametrized by a subset of the variational parameters $\psi \subseteq \Psi$. Paisley et al. [30] note that $\nabla_\psi \mathbb{E}_{q_\psi}[f(z_i)] = \mathbb{E}_{q_\psi}[f(z_i) \nabla_\psi \log q_\psi(z_i)]$. The latter expectation can then be approximated by using Monte Carlo integration [30], thus solving the issue of intractability:

$$
\mathbb{E}_{q_\psi}[f(z_i) \nabla_\psi \log q_\psi(z_i)] \approx \frac{1}{S} \sum_{s=1}^{S} f(z_i^{(s)}) \nabla_\psi \log q_\psi(z_i^{(s)})
\tag{2.16}
$$

where $z_i^{(s)}$ indicates the $s$th sample drawn from $q_\psi(z_i)$, out of a total of $S$ samples.

Note that stochastic integration of $\mathbb{E}_{q_\psi}[f(z_i)]$ is possible by using

$$
\mathbb{E}_{q_\psi}[f(z_i)] = \int q_\psi(z_i) f(z_i) \, dz_i \approx \frac{1}{S} \sum_{s=1}^{S} f(z_i^{(s)})
\tag{2.17}
$$

with $z_i^{(s)}$ drawn from $q_\psi(z_i)$, but that the gradient of $f(z_i^{(s)})$ w.r.t. $\psi$ is 0, as the sample $z_i^{(s)}$ is constant once drawn. The gradient w.r.t. $\psi$ is nonzero in eq. (2.16).

The authors perform gradient ascent using this stochastic gradient estimator. However, this estimator exhibits very high variance ([21], [30]), which makes it impractical in many settings.
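The estimator of eq. (2.16) can be illustrated on a case where the gradient is known in closed form. The sketch below assumes a Gaussian variational distribution with mean `mu` and unit variance, and the toy function f(z) = z^2, for which the expectation is mu^2 + 1 and the true gradient w.r.t. `mu` is 2 mu. These choices and the sample size are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, S = 1.5, 200_000   # variational parameter and number of samples (assumed)

def f(z):
    return z ** 2      # toy integrand; E_q[f] = mu^2 + 1, so the true gradient is 2 * mu

z = rng.normal(loc=mu, scale=1.0, size=S)   # samples z^(s) from q = N(mu, 1)
score = z - mu                              # gradient of log q w.r.t. mu (unit variance)
terms = f(z) * score                        # per-sample terms of eq. (2.16)

grad_estimate = terms.mean()
grad_true = 2 * mu
per_sample_std = terms.std()                # spread of a single-sample estimate
```

Even in this simple setting the standard deviation of a single-sample estimate is larger than the gradient itself, so many samples are needed for an accurate estimate.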

Variational Bayesian Matrix Factorization

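We illustrate the variational Bayesian approach by briefly describing the variational Bayesian Matrix Factorization algorithm of Lim and Teh [28]. As in the PMF algorithm, the authors assume ratings $r_{ij}$ are normally distributed around the dot product of a latent user and item factor:

$$
p(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j, \sigma) = \mathcal{N}(r_{ij} \mid \mathbf{u}_i^T \mathbf{v}_j, \sigma^2)
\tag{2.18}
$$

where $\sigma^2$ is the variance of the ratings.

The latent factors are given zero-mean Gaussian priors with diagonal (but not necessarily spherical) covariance:

$$
\begin{aligned}
p(\mathbf{U} \mid \boldsymbol{\sigma}_u) &= \prod_i \mathcal{N}(\mathbf{u}_i \mid \mathbf{0}, \mathrm{diag}(\boldsymbol{\sigma}_u^2)) \\
p(\mathbf{V} \mid \boldsymbol{\sigma}_v) &= \prod_j \mathcal{N}(\mathbf{v}_j \mid \mathbf{0}, \mathrm{diag}(\boldsymbol{\sigma}_v^2))
\end{aligned}
\tag{2.19}
$$

where $\boldsymbol{\sigma}_u^2$ and $\boldsymbol{\sigma}_v^2$ are vectors and $\mathrm{diag}(\boldsymbol{\sigma}_u^2)$ and $\mathrm{diag}(\boldsymbol{\sigma}_v^2)$ indicate square diagonal matrices with the values of $\boldsymbol{\sigma}_u^2$ and $\boldsymbol{\sigma}_v^2$ on the diagonal, respectively.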

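The full data likelihood of this model is thus

$$
p(\mathcal{D}, \mathbf{U}, \mathbf{V} \mid \sigma, \boldsymbol{\sigma}_u, \boldsymbol{\sigma}_v) = p(\mathcal{D} \mid \mathbf{U}, \mathbf{V}, \sigma) \, p(\mathbf{U} \mid \boldsymbol{\sigma}_u) \, p(\mathbf{V} \mid \boldsymbol{\sigma}_v)
\tag{2.20}
$$

Note that the variances $\sigma^2$, $\boldsymbol{\sigma}_u^2$ and $\boldsymbol{\sigma}_v^2$ are not assigned prior distributions. The authors are interested in finding the posterior distribution $p(\mathbf{U}, \mathbf{V} \mid \mathcal{D})$. However, this posterior distribution cannot be computed in closed form. The authors instead define an approximate posterior distribution $Q(\mathbf{U}, \mathbf{V})$ that they assume factorizes into distributions over $\mathbf{U}$ and $\mathbf{V}$:

$$
Q(\mathbf{U}, \mathbf{V}) = Q(\mathbf{U}) Q(\mathbf{V})
\tag{2.21}
$$

The ELBO of this model is then:

$$
\begin{aligned}
\mathcal{L}(Q) &= \int Q(\mathbf{U}, \mathbf{V}) \log \frac{p(\mathcal{D} \mid \mathbf{U}, \mathbf{V}, \sigma) \, p(\mathbf{U} \mid \boldsymbol{\sigma}_u) \, p(\mathbf{V} \mid \boldsymbol{\sigma}_v)}{Q(\mathbf{U}, \mathbf{V})} \, d\mathbf{U} \, d\mathbf{V} \\
&= -D_{KL}\left(Q(\mathbf{U}) \,\|\, p(\mathbf{U} \mid \boldsymbol{\sigma}_u)\right) - D_{KL}\left(Q(\mathbf{V}) \,\|\, p(\mathbf{V} \mid \boldsymbol{\sigma}_v)\right) + \mathbb{E}_{Q(\mathbf{U})Q(\mathbf{V})}\left[\log p(\mathcal{D} \mid \mathbf{U}, \mathbf{V}, \sigma)\right]
\end{aligned}
\tag{2.22}
$$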

Lim and Teh optimize the ELBO of eq. (2.22) by iterating between optimizing w.r.t. $\mathbf{U}$, optimizing w.r.t. $\mathbf{V}$ and optimizing w.r.t. the variances $\sigma$, $\boldsymbol{\sigma}_u$ and $\boldsymbol{\sigma}_v$.

2.2.3 Autoencoding Variational Bayes

Kingma and Welling [21] extend both variational Bayesian methods and previous work on autoencoders in the Autoencoding Variational Bayes (AEVB) algorithm. Their work considers models in which there is a latent variable $\mathbf{z}_i$ for each data point $\mathbf{x}_i$. The datapoint $\mathbf{x}_i$ is assumed to be generated by some conditional distribution $p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)$. The latent variables are endowed with prior distributions $p(\mathbf{z}_i)$. The parameters of this generative distribution are produced by a (deterministic) function $f_\theta(\cdot)$ of the latent variable $\mathbf{z}_i$. $f_\theta(\cdot)$ is parametrized by the set of parameters $\theta$, with $\theta$ shared amongst all distributions. For example, if $p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)$ is Gaussian, then the mean $\boldsymbol{\mu}_i$ and diagonal variance $\boldsymbol{\sigma}_i^2 \mathbf{I}$ are computed from $\mathbf{z}_i$ using $f_\theta$:

$$
[\boldsymbol{\mu}_i, \log \boldsymbol{\sigma}_i] = f_\theta(\mathbf{z}_i)
\tag{2.23}
$$

Furthermore, the variational distribution over the latent variables is conditioned on the data, i.e. the variational distribution is expressed as $Q_\phi(Z \mid D)$, and factorizes over the independent datapoints, i.e.

$$Q_\phi(Z \mid D) = \prod_i q_\phi(z_i \mid x_i)$$

The parameters of each distribution $q_\phi(z_i \mid x_i)$ are a (deterministic) function $g_\phi(\cdot)$ of $x_i$, where $g_\phi(\cdot)$ is parametrized by $\phi$. Similarly to the role $f_\theta(\cdot)$ plays in the generative conditional distribution $p_\theta$, the function $g_\phi(\cdot)$ produces the parameters of each of the variational distributions. The distributions $q_\phi(z_i \mid x_i)$ are referred to as the recognition model.

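The ELBO for this model decomposes into a sum over ELBOs for individual datapoints. The ELBO for a single datapoint is [21]:

$$\mathcal{L}(\theta, \phi; x_i) = -D_{KL}\big(q_\phi(z_i \mid x_i) \,\|\, p(z_i)\big) + \mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p_\theta(x_i \mid z_i)\big] \tag{2.24}$$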

In case the expectation $\mathbb{E}_q[\log p_\theta(x_i \mid z_i)]$ is intractable, one can resort to stochastic integration to approximate it [21]. By drawing $S$ samples from $q_\phi(z_i \mid x_i)$, the expectation can be approximated as

$$\mathbb{E}_q\big[\log p_\theta(x_i \mid z_i)\big] = \int q_\phi(z_i \mid x_i) \log p_\theta(x_i \mid z_i) \, dz_i \approx \frac{1}{S} \sum_{s=1}^{S} \log p_\theta(x_i \mid z_i^{(s)}) \tag{2.25}$$

where $z_i^{(s)}$ denotes sample $s$ drawn from $q_\phi(z_i \mid x_i)$. The term in the ELBO that depends on the single datapoint $x_i$ can then be approximated by [21]:

$$\mathcal{L}(\theta, \phi; x_i) \approx \tilde{\mathcal{L}}(\theta, \phi; x_i) \overset{\text{def}}{=} -D_{KL}\big(q_\phi(z_i \mid x_i) \,\|\, p(z_i)\big) + \frac{1}{S} \sum_{s=1}^{S} \log p_\theta(x_i \mid z_i^{(s)}) \tag{2.26}$$

This yields the Stochastic Gradient Variational Bayes (SGVB) estimator for a single datapoint, denoted by $\tilde{\mathcal{L}}(\theta, \phi; x)$. The sum of eq. (2.26) over all datapoints $x_i$ approximates the ELBO [21].

Note that the gradient of $p_\theta(x_i \mid z_i)$ is 0 w.r.t. $\phi$ in eq. (2.26). The reason for this is that a sample $z_i^{(s)}$ of $q_\phi(z_i \mid x_i)$ is constant once it is drawn. This means that the SGVB estimator cannot be directly optimized w.r.t. the parameters $\phi$ of the recognition model.

To solve this issue, Kingma and Welling introduce the reparametrization trick. This entails the introduction of a deterministic function $h(\epsilon, g_\phi(x_i))$ that maps a random noise vector $\epsilon$ and the parameters of $q_\phi(z_i \mid x_i)$, computed by $g_\phi(x_i)$, to a sample from the posterior $q_\phi(z_i \mid x_i)$ [21]. This yields:

$$\tilde{\mathcal{L}}(\theta, \phi; x_i) = -D_{KL}\big(q_\phi(z_i \mid x_i) \,\|\, p(z_i)\big) + \frac{1}{S} \sum_{s=1}^{S} \log p_\theta\big(x_i \mid h(\epsilon^{(s)}, g_\phi(x_i))\big) \tag{2.27}$$

This separates the sample $\epsilon^{(s)}$ from the parameters $\phi$, which allows $\tilde{\mathcal{L}}$ to be differentiated w.r.t. both $\theta$ and $\phi$, assuming that $h$ is differentiable w.r.t. both $\theta$ and $\phi$.

In case the KL-divergence term can be computed in closed form (which is often the case, e.g. when both the prior and approximate posterior distribution are Gaussian), the SGVB estimator can be expressed in closed form, and can be optimized by performing gradient ascent using the gradient $\nabla_{\theta,\phi} \tilde{\mathcal{L}}(\theta, \phi; x)$ [21].
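The estimator of eq. (2.27) can be sketched in a few lines of NumPy. The sketch below is illustrative only: it assumes a standard-normal prior, a diagonal Gaussian recognition model with $h(\epsilon, (\mu, \log\sigma)) = \mu + \sigma \odot \epsilon$, and a caller-supplied log-likelihood. The names `gaussian_kl`, `sgvb_estimate` and `log_lik` are hypothetical and do not come from [21]. It only evaluates the estimator; in practice the gradients would be obtained from an automatic differentiation library.

```python
import numpy as np

def gaussian_kl(mu, log_sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - 2.0 * log_sigma)

def sgvb_estimate(x, mu, log_sigma, log_lik, num_samples=1, rng=None):
    """Evaluate the SGVB estimator of eq. (2.27) for one datapoint x.

    mu, log_sigma: outputs of the recognition network g_phi(x) (assumed given).
    log_lik(x, z): log p_theta(x | z), supplied by the caller (hypothetical).
    """
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(num_samples):
        eps = rng.standard_normal(mu.shape)   # noise, independent of phi
        z = mu + np.exp(log_sigma) * eps      # z = h(eps, g_phi(x))
        total += log_lik(x, z)
    return -gaussian_kl(mu, log_sigma) + total / num_samples
```

Because the noise `eps` is drawn independently of `mu` and `log_sigma`, the sampled `z` is a differentiable function of the recognition model's outputs, which is the point of the reparametrization.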

Relationship to Autoencoders

The authors apply the SGVB estimator in the Autoencoding Variational Bayes (AEVB) algorithm. The relationship between autoencoders as described in Section 2.2.1 and the SGVB estimator becomes clear when one looks at the form of eq. (2.27). The distribution $q_\phi(z_i \mid x_i)$ can be seen as an encoding distribution that gives a probability distribution on the encoded vector $z_i$ for input vector $x_i$. The KL-divergence between $q_\phi(z_i \mid x_i)$ and $p(z_i)$ can be interpreted as a regularization term, and the expectation $\mathbb{E}_{q_\phi(z_i \mid x_i)}[\log p_\theta(x_i \mid z_i)]$ can be interpreted as the expected negative reconstruction error.

2.3 Graph-based Recommendation Systems and Graph Convolutional Networks

An alternative path in recommendation systems research is to view recommendation systems as graphs. In this view, users and items form nodes in a bipartite graph $G = (\mathcal{U}, \mathcal{V}, \mathcal{E})$, where $\mathcal{U}$ is the set of nodes associated with users, $\mathcal{V}$ is the set of nodes associated with items, and $\mathcal{E}$ is the set of edges connecting users to items. Edges can denote binary preference indications, or ratings if the edges are weighted [7], [10]. Note that the bipartiteness of the graph follows from the fact that users can only rate items, not other users, and that items cannot rate other items. Therefore edges only exist between users and items.

Traditionally the graph structure of the recommendation data is used to determine the similarity between nodes. This can then be used either by directly exploiting the similarity between a user node $i$ and an item node $j$ to predict the rating the user would give the item, or by exploiting the similarity of a target user $i$ with all other users and their ratings on a target item $j$:

$$\hat{r}_{ij} = \frac{\sum_{i' \in D_j} \operatorname{sim}(i, i')\, r_{i'j}}{\sum_{i' \in D_j} \operatorname{sim}(i, i')} \tag{2.28}$$

where $D_j$ is the subset of $D$ containing ratings for item $j$ (and we assume that user $i$ has not yet rated item $j$, as that rating should be excluded from this set) [9].
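As a small illustration of eq. (2.28), the sketch below computes a similarity-weighted average of other users' ratings. Cosine similarity between raw rating vectors is used purely as a stand-in; the graph-based methods cited here derive their similarities from random walks or graph kernels instead. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors (0 marks 'unrated')."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict_rating(R, i, j, sim=cosine_sim):
    """Similarity-weighted average of other users' ratings for item j, eq. (2.28).

    R is a dense (users x items) array in which 0 marks a missing rating.
    """
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == i or R[other, j] == 0:   # keep only users in D_j, excluding i
            continue
        s = sim(R[i], R[other])
        num += s * R[other, j]
        den += s
    return num / den if den != 0 else float("nan")

# Toy rating matrix: 3 users x 3 items, user 0 has not rated item 2.
R = np.array([[5, 3, 0],
              [4, 0, 4],
              [1, 1, 5]], dtype=float)
r_hat = predict_rating(R, i=0, j=2)
```

The prediction always lies between the lowest and highest rating given to item `j` by the other users as long as all similarities are non-negative.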

Much of this work focuses on random walk properties of the recommendations graph to determine similarities between users or between users and items [7], [10]. Recent work extends this to more general graph kernels [9].

2.3.1 Graph Convolutional Nets

Kipf and Welling [22] introduce an efficient approximation to localized spectral filters on graphs in a Graph Convolutional Network (GCN). These filters can be interpreted as local graph feature extractors. For each node, these feature extractors encode the local neighborhood structure of the node and transform the features of the node and of the nodes in its neighborhood. This is analogous to Convolutional Neural Networks (CNNs), in which the consecutive application of learned, localized filters allows the networks to learn complex features of signals, e.g. images. The GCN framework can be seen as a generalization of CNNs: GCNs are also sensitive to the local structure of a graph, something that does not hold for standard CNNs, as these only work on regular graphs (i.e. the lattice of pixels). However, like CNNs, consecutive applications of the graph convolution allow higher order neighborhoods of nodes to be encoded.

Kipf and Welling [22] propose the following propagation rule:

$$H^{(l)} = \sigma\big(\hat{A} H^{(l-1)} W^{(l)}\big) \tag{2.29}$$

where $H^{(l)}$ is the matrix of activations in layer $l$; $W^{(l)}$ is the weight matrix of layer $l$; $\hat{A} \overset{\text{def}}{=} \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$; $\tilde{A} \overset{\text{def}}{=} A + I$, where $A$ is the adjacency matrix; $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; and $\sigma(\cdot)$ is an activation function. In the input layer $l = 0$, the matrix of activations $H^{(0)}$ is the feature matrix $X$, in which row $i$ contains the features for node $i$ [22]. The adjacency matrix $A$ is a symmetric matrix where $A_{ij} = A_{ji} = w_{ij}$ is the weight of an edge between node $i$ and node $j$, and 0 if there is no edge between $i$ and $j$. If the edges in the graph are unweighted, then $A_{ij} = A_{ji} = 1$ if there is an edge between node $i$ and node $j$, and 0 otherwise.
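The propagation rule of eq. (2.29) can be sketched as follows. This is a minimal dense NumPy illustration, not the implementation of [22]: the function names, the toy graph, the random weights and the choice of `tanh` as activation are all assumptions made for the example.

```python
import numpy as np

def normalize_adjacency(A):
    """Return A_hat = D~^{-1/2} (A + I) D~^{-1/2} for a symmetric adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = A_tilde.sum(axis=1)                   # D~_ii = sum_j A~_ij
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W, activation=np.tanh):
    """One application of eq. (2.29): H_out = sigma(A_hat H W)."""
    return activation(A_hat @ H @ W)

# Toy bipartite graph: 2 users (nodes 0, 1) and 2 items (nodes 2, 3).
A = np.array([[0, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0]], dtype=float)
A_hat = normalize_adjacency(A)
X = np.eye(4)                                 # featureless nodes: one-hot features
rng = np.random.default_rng(0)
H1 = gcn_layer(A_hat, X, rng.standard_normal((4, 3)))
```

Stacking two such layers lets each user node receive information from other users who rated the same items, which is the higher order neighborhood effect described above.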

Kipf and Welling motivate the form of the propagation rule of eq. (2.29) from a first-order convolution over the full graph. A convolution over the full graph generally requires a full eigenvalue decomposition of the normalized graph Laplacian matrix $L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ [22], which is expensive to compute.

Chapter 3 Autoencoding Variational Matrix Factorization

This chapter describes the Autoencoding Variational Matrix Factorization (AEVMF) algorithm. For a description of the Autoencoding Variational Bayes algorithm we refer the reader to Section 2.2 and the references cited therein.

In this chapter we will first introduce the Matrix Factorizing Autoencoder that functions as a baseline model for the Variational Matrix Factorizing Autoencoder, which we introduce next. At the end of this chapter we introduce a related but slightly different model that is based on graph convolutions, the Graph Convolutional Matrix Factorizing Autoencoder, and its variational adaptation.

3.1 Matrix Factorizing Autoencoder

Our simplest model is the Matrix Factorizing Autoencoder (MFAE). This model consists of the following three components:

1. A user rating vector encoder $g_{\phi_u} : \mathbb{R}^{N_v} \to \mathbb{R}^{D_u}$, parametrized by $\phi_u$

2. An item rating vector encoder $g_{\phi_v} : \mathbb{R}^{N_u} \to \mathbb{R}^{D_v}$, parametrized by $\phi_v$

3. A decoder $f_\theta : \mathbb{R}^{D_u} \times \mathbb{R}^{D_v} \to \mathbb{R}$, parametrized by $\theta$

where $D_u$ and $D_v$ are the dimensionalities of the latent user and item factors respectively.

The encoder $g_{\phi_u}(\cdot)$ encodes a sparse user rating vector $r_{i\cdot} \in \mathbb{R}^{N_v}$, i.e. a row in matrix $R$, into a latent user representation $u_i$, while $g_{\phi_v}(\cdot)$ encodes a sparse item rating vector $r_{\cdot j} \in \mathbb{R}^{N_u}$, i.e. a column in $R$, into a latent item representation $v_j$. The decoder $f_\theta(\cdot, \cdot)$ takes as input a latent user representation $u_i$ and a latent item representation $v_j$ and predicts a rating $\hat{r}_{ij}$. Figure 3.1 gives a schematic overview of this model. The MFAE is trained to minimize the objective:

$$\operatorname*{argmin}_{\phi_u, \phi_v, \theta} \sum_{(i, j, r_{ij}) \in D} \operatorname{Err}\big(r_{ij}, f_\theta(g_{\phi_u}(r_{i\cdot}), g_{\phi_v}(r_{\cdot j}))\big) + \operatorname{Reg}(\phi_u, \phi_v, \theta) \tag{3.1}$$

where $\operatorname{Err}(\cdot)$ is an arbitrary error function, and $\operatorname{Reg}(\cdot)$ is some regularization term on the learnable parameters.

This might seem trivial, as for any triple $(i, j, r_{ij}) \in D$ the rating $r_{ij}$ is present in both $r_{i\cdot}$ and $r_{\cdot j}$, since $r_{i\cdot}$ and $r_{\cdot j}$ are both constructed from the ratings in $D$. However, the decoder only has access to the latent representations ($u_i$ and $v_j$) of the rating vectors ($r_{i\cdot}$ and $r_{\cdot j}$). Furthermore, even if the decoder did have access to the raw rating vectors, it has no way of learning which of the ratings in the rating vector is the target rating for a combination of two vectors $r_{i\cdot}$ and $r_{\cdot j}$.

This formulation provides a benefit over the CFN model of Strub et al. [41]. Recall that their model only encodes rows or columns of $R$. The side that is not encoded is represented by a fixed matrix $W^{(2)}$. For example, in the U-CFN model, user rating vectors are encoded. In this model the item factors remain fixed. If a new item is added to the system in which the U-CFN model is employed, even if it is only rated by known users, the model cannot readily compute a latent factor for this new item.

In the remainder of this section we present a number of increasingly sophisticated implementations of the MFAE, culminating in the Matrix Factorizing Variational Graph Autoencoder (MFVGAE).

Figure 3.1: Schematic overview of the Matrix Factorizing Autoencoder. Elliptical nodes represent real or vector-valued variables, while rectangular boxes denote function application. The arrows indicate the flow of information. Red is used to indicate user ratings and the latent factor for user $i$, blue is used to indicate item ratings and the latent factor for item $j$. Purple is used to indicate a predicted (and previously unobserved) $\hat{r}_{ij}$.

3.1.1 Supervised MFAE

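In this model we assume the matrix factorization into $U$ and $V$ is known and we train the encoder to learn a mapping from the sparse user and item rating vectors $r_{i\cdot}$ and $r_{\cdot j}$ to the user and item factors $u_i$ and $v_j$ for all $i$ and $j$. The training objective is to find

$$\operatorname*{argmin}_{\phi_u} \sum_i \operatorname{Err}\big(g_{\phi_u}(r_{i\cdot}), u_i\big) + \operatorname{Reg}(\phi_u), \qquad \operatorname*{argmin}_{\phi_v} \sum_j \operatorname{Err}\big(g_{\phi_v}(r_{\cdot j}), v_j\big) + \operatorname{Reg}(\phi_v) \tag{3.2}$$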

In order to run supervised training, we use the latent user and item factors produced by a different matrix factorization algorithm as targets to train the encoder.


Encoders

For all our encoders for the unsupervised MFAE we use a simple two-layer Neural Network. The networks have the following form:

$$\begin{aligned}
u_i &= g_{\phi_u}(r_{i\cdot}) = \sigma_2\big(\sigma_1(r_{i\cdot} W_u^{(1)} + b_u^{(1)}) W_u^{(2)} + b_u^{(2)}\big) \\
v_j &= g_{\phi_v}(r_{\cdot j}) = \sigma_2\big(\sigma_1(r_{\cdot j} W_v^{(1)} + b_v^{(1)}) W_v^{(2)} + b_v^{(2)}\big)
\end{aligned} \tag{3.3}$$

where $\sigma_1(\cdot)$ and $\sigma_2(\cdot)$ are arbitrary activation functions; $W_u^{(1)}$, $W_u^{(2)}$, $W_v^{(1)}$ and $W_v^{(2)}$ are the weight matrices of the user and the item rating vector encoders, with dimensionality $N_v \times D_h$, $D_h \times D$, $N_u \times D_h$ and $D_h \times D$, respectively; and $b_u^{(1)}$, $b_u^{(2)}$, $b_v^{(1)}$ and $b_v^{(2)}$ are the bias vectors of the user and item rating encoders, of dimensionality $D_h$, $D$, $D_h$ and $D$, respectively.

The number of free parameters $N_{params}$ for this model is

$$N_{params} = (N_u + N_v + 2) \times D_{h1} + 2 \times (D_{h1} + 1) \times D \tag{3.4}$$

where $N_u$ is the number of users, $N_v$ is the number of items, $D_{h1}$ is the number of hidden units in the hidden layer (and is assumed equal for the user and the item rating encoder), and $D$ is the number of latent factors. The term $2 \times D_{h1}$ accounts for the hidden-layer biases $b_u^{(1)}$ and $b_v^{(1)}$.

Note that each of the two encoders constitutes a simple Multilayer Perceptron (MLP) with one hidden layer:

$$h_u = \sigma_1(r_{i\cdot} W_u^{(1)} + b_u^{(1)}), \qquad h_v = \sigma_1(r_{\cdot j} W_v^{(1)} + b_v^{(1)}) \tag{3.5}$$

$$u_i = \sigma_2(h_u W_u^{(2)} + b_u^{(2)}), \qquad v_j = \sigma_2(h_v W_v^{(2)} + b_v^{(2)}) \tag{3.6}$$

where $h_u$ and $h_v$ are the vectors of activations of the hidden layers.
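A forward pass through the two encoders can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the initialisation scheme, the layer sizes, the `tanh` hidden activation, the identity output activation and the random toy rating matrix are all hypothetical choices, and `init_encoder` and `encode` are illustrative names.

```python
import numpy as np

def init_encoder(n_inputs, d_hidden, d_latent, rng, scale=0.01):
    """Randomly initialise W1, b1, W2, b2 for one encoder (illustrative init)."""
    return {
        "W1": scale * rng.standard_normal((n_inputs, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": scale * rng.standard_normal((d_hidden, d_latent)),
        "b2": np.zeros(d_latent),
    }

def encode(r, params, sigma1=np.tanh, sigma2=lambda a: a):
    """Hidden activations followed by the latent factor, as in a one-hidden-layer MLP."""
    h = sigma1(r @ params["W1"] + params["b1"])
    return sigma2(h @ params["W2"] + params["b2"])

rng = np.random.default_rng(0)
n_users, n_items, d_hidden, d_latent = 4, 6, 8, 3
R = rng.integers(0, 6, size=(n_users, n_items)).astype(float)  # 0 = unrated

phi_u = init_encoder(n_items, d_hidden, d_latent, rng)  # encodes rows of R
phi_v = init_encoder(n_users, d_hidden, d_latent, rng)  # encodes columns of R
u_0 = encode(R[0, :], phi_u)   # latent factor of user 0
v_2 = encode(R[:, 2], phi_v)   # latent factor of item 2
U = encode(R, phi_u)           # all user factors at once, shape (n_users, d_latent)
```

Note that the user encoder's input size is the number of items and the item encoder's input size is the number of users, which is why the first-layer weight matrices dominate the parameter count.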

While our formulation allows for differences in hidden layer dimensionality between the user rating and item rating encoder, we only include models in which the hidden layers of the two encoders have the same dimensionality.

We define a simplified encoder in which the weights in the output layer are shared. Thus we constrain $W_u^{(2)} = W_v^{(2)}$ and $b_u^{(2)} = b_v^{(2)}$. We refer to the shared weight matrix and bias of the output layer simply as $W^{(2)}$ and $b^{(2)}$.

This simplified encoder has the following form:

$$\begin{aligned}
u_i &= g_{\phi_u}(r_{i\cdot}) = \sigma_2\big(\sigma_1(r_{i\cdot} W_u^{(1)} + b_u^{(1)}) W^{(2)} + b^{(2)}\big) \\
v_j &= g_{\phi_v}(r_{\cdot j}) = \sigma_2\big(\sigma_1(r_{\cdot j} W_v^{(1)} + b_v^{(1)}) W^{(2)} + b^{(2)}\big)
\end{aligned} \tag{3.7}$$

The benefit of this encoder over the first encoder is that the second encoder has fewer parameters, and that the second encoder ensures that the hidden activations of the user and item encoder are projected onto the same space.

The number of parameters $N_{params}$ for the simplified encoder is

$$N_{params} = (N_u + N_v + 2) \times D_{h1} + (D_{h1} + 1) \times D \tag{3.8}$$
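Parameter counts of this kind are easy to check by summing the sizes of the weight matrices and bias vectors listed below eq. (3.3). The helper below is a hypothetical sketch; the dataset sizes are placeholders of roughly MovieLens-100K scale and are not taken from the experiments.

```python
def encoder_param_count(n_users, n_items, d_hidden, d_latent, shared_output=False):
    """Count weights and biases of the two-layer user and item encoders.

    First layers: W1_u is (n_items x d_hidden) and W1_v is (n_users x d_hidden),
    each with a bias of size d_hidden. Each output layer is
    (d_hidden x d_latent) with a bias of size d_latent.
    """
    first_layers = (n_items + 1) * d_hidden + (n_users + 1) * d_hidden
    output_layer = (d_hidden + 1) * d_latent
    n_output_layers = 1 if shared_output else 2
    return first_layers + n_output_layers * output_layer

# Hypothetical sizes, for illustration only.
n_users, n_items, d_hidden, d_latent = 943, 1682, 500, 50
separate = encoder_param_count(n_users, n_items, d_hidden, d_latent)
shared = encoder_param_count(n_users, n_items, d_hidden, d_latent, shared_output=True)
saved = separate - shared   # parameters saved by sharing the output layer
```

Sharing the output layer saves exactly one output layer's worth of parameters, which is small compared with the first-layer matrices whenever the number of users and items is much larger than the number of latent factors.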

Decoders

While the formulation of the general MFAE allows an arbitrary function to be used as a decoder, the initial matrix factorization on which the supervised MFAE is trained was found such that it minimizes the error when the dot-product function between a user and an item factor is used as the prediction function. It is therefore sensible to use the dot-product as a decoder in this setting:

$$\hat{r}_{ij} = f_\theta(u_i, v_j) = u_i^T v_j \tag{3.9}$$

This decoder has no learnable parameters.

The supervised MFAE model is used as a baseline model to investigate whether it is possible to learn a mapping from a raw rating vector to a user or item factor.

3.1.2 Unsupervised MFAE

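In this setting we do not have previously learned targets. Instead the training objective is to find

$$\operatorname*{argmin}_{\phi_u, \phi_v, \theta} \sum_{(i, j, r_{ij}) \in D} \operatorname{Err}\big(r_{ij}, f_\theta(g_{\phi_u}(r_{i\cdot}), g_{\phi_v}(r_{\cdot j}))\big) + \operatorname{Reg}(\phi_u, \phi_v, \theta) \tag{3.10}$$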

The error of the Supervised MFAE algorithm is an upper bound on the error of the Matrix Factorizing Autoencoder. This is caused by the facts that

1. The Supervised MFAE algorithm will only perform as well as the algorithm that produced the target factors in case the MFAE algorithm is able to predict the target factors perfectly.

2. In case the Supervised MFAE algorithm is unable to predict the target factors perfectly, the prediction error on the target factors of the Supervised MFAE algorithm will likely be exacerbated when user and item factors are combined to predict a rating.

By optimizing the reconstruction error between an original rating $r_{ij}$ and a predicted rating $\hat{r}_{ij}$ we circumvent this problem.

Encoders

We use the same encoders as those introduced in Section 3.1.1.

Decoders

In this section we present several decoder functions fθ.

Dot Product Decoder. This is the dot-product based decoder introduced in Section 3.1.1, eq. (3.9).

Matrix Dot Product Decoder. In this decoder we introduce a learnable $D_u \times D_v$ matrix $M$. This matrix projects $u_i$ onto the same space as $v_j$. This means that it is no longer required that $D_u = D_v$. Ratings are predicted as:

$$\hat{r}_{ij} = f_\theta(u_i, v_j) = u_i^T M v_j \tag{3.11}$$

Note that this decoder can equivalently be interpreted as projecting the latent item factor onto the space of the latent user factor. The learnable parameters for this decoder are the elements of the matrix $M$.

Multinomial Dot Product Decoder. The underlying assumption of a dot-product based decoder is that $r_{ij}$ is a real-valued variable. It has been noted in the literature that this is not necessarily a reasonable assumption ([26], [48], [44]). For this reason we introduce the family of Multinomial Dot Product Decoders.

This family of decoders treats rating $r_{ij}$ as a categorical variable of $D_r$ categories, where $D_r$ is the number of possible rating values. For a movie rating dataset where users can rate movies on a 1- to 5-star scale, $D_r$ would be 5. The initial output of the decoder is a $D_r$-dimensional vector $\mathbf{r}_{ij}$. Element $k$, with $1 \le k \le D_r$, of this vector is computed as:

$$r_{ij}^{(k)} = f_\theta^{(k)}(u_i, v_j) = u_i^T M^{(k)} v_j \tag{3.12}$$

Thus, for every rating category $k$ we introduce a learnable matrix $M^{(k)}$. The predicted probability of the rating taking value $k$ is computed using the softmax function:

$$p(\hat{r}_{ij} = k \mid u_i, v_j) = \frac{e^{r_{ij}^{(k)}}}{\sum_{k'} e^{r_{ij}^{(k')}}} \tag{3.13}$$

The predicted rating $\hat{r}_{ij}$ can then either be computed as the most probable rating, or as the expected rating [48]:

$$\hat{r}_{ij} = \operatorname*{argmax}_k \, p(\hat{r}_{ij} = k \mid u_i, v_j) \qquad \text{or} \qquad \hat{r}_{ij} = \mathbb{E}[k] = \sum_k k \, p(\hat{r}_{ij} = k \mid u_i, v_j) \tag{3.14}$$
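The multinomial decoder can be sketched compactly by stacking the per-category matrices into one array. The code below is an illustration under assumptions: the function name, the random factors and the random matrices are hypothetical, and rating values are taken to be the integers 1 to $D_r$.

```python
import numpy as np

def multinomial_decode(u, v, M):
    """Multinomial dot product decoder for one (user, item) pair.

    u: (D_u,) user factor, v: (D_v,) item factor,
    M: (D_r, D_u, D_v) stack of matrices M^(k), one per rating value.
    Returns class probabilities, the most probable rating and the expected rating.
    """
    scores = np.einsum("a,kab,b->k", u, M, v)   # r_ij^(k) = u^T M^(k) v
    scores = scores - scores.max()              # stabilise the softmax numerically
    probs = np.exp(scores) / np.exp(scores).sum()
    ratings = np.arange(1, M.shape[0] + 1)      # rating values 1..D_r
    return probs, int(ratings[np.argmax(probs)]), float(ratings @ probs)

rng = np.random.default_rng(0)
u, v = rng.standard_normal(4), rng.standard_normal(3)
M = rng.standard_normal((5, 4, 3))              # D_r = 5 for a 1- to 5-star scale
probs, r_map, r_exp = multinomial_decode(u, v, M)
```

The most probable rating is always a valid rating value, whereas the expected rating is real-valued and is usually the better choice when the evaluation metric is a squared error.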

Deep decoder. This decoder appends $v_j$ to $u_i$ to form one vector $z_{ij}$. This vector is then fed into an MLP whose output is a single node corresponding to $\hat{r}_{ij}$. Thus:

$$\hat{r}_{ij} = f_\theta(u_i, v_j) = \operatorname{MLP}([u_i, v_j]) \tag{3.15}$$

where MLP denotes an arbitrary multilayer perceptron and $[\cdot, \cdot]$ is used to denote vector concatenation. This model can easily be extended to predict ratings as a categorical variable by changing the output dimensionality from 1 to $D_r$. This model is a simplified version of the model presented by Dziugaite and Roy [19] that only takes user factors as input (not user and item specific matrices).
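A deep decoder of the form of eq. (3.15) might look as follows. This is only a sketch: the number of layers, the layer sizes, the `tanh` activation and the random weights are assumptions chosen for the example, and `deep_decode` is a hypothetical name.

```python
import numpy as np

def deep_decode(u, v, layers, activation=np.tanh):
    """Feed the concatenation [u, v] through an MLP, as in eq. (3.15).

    layers is a list of (W, b) pairs; the last layer is linear and has one
    output unit (or D_r units when ratings are treated as categorical).
    """
    z = np.concatenate([u, v])
    for W, b in layers[:-1]:
        z = activation(z @ W + b)
    W_out, b_out = layers[-1]
    return z @ W_out + b_out

rng = np.random.default_rng(0)
d_u, d_v, d_hidden = 4, 3, 8
layers = [
    (rng.standard_normal((d_u + d_v, d_hidden)), np.zeros(d_hidden)),
    (rng.standard_normal((d_hidden, 1)), np.zeros(1)),
]
r_hat = deep_decode(rng.standard_normal(d_u), rng.standard_normal(d_v), layers)
```

Because the factors are concatenated rather than multiplied, the user and item factors may have different dimensionalities here as well.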

3.2 Matrix Factorizing Variational Autoencoder

The Matrix Factorizing Variational Autoencoder (MFVAE) algorithm bridges the gap between the autoencoder formulation introduced in Section 3.1 and the variational Bayesian Matrix Factorization algorithm introduced by Lim and Teh [28]. In this algorithm, we assume observed ratings are i.i.d. generated from a conditional probability distribution:

$$p_\theta(D \mid U, V) = \prod_{(i, j, r_{ij}) \in D} p_\theta(r_{ij} \mid u_i, v_j) \tag{3.16}$$

The parameters of the conditional distribution are a deterministic function $f_\theta$ of $u_i$ and $v_j$. $f_\theta$ is in turn parametrized by a set of parameters $\theta$. The latent factors, collected in matrices $U$ and $V$, are unobserved stochastic variables. We assign a prior distribution $p(U, V)$ to the latent factors $U$ and $V$ that factorizes into prior distributions on the rows of $U$ and $V$:

$$p(U, V) = \prod_i p(u_i) \prod_j p(v_j) \tag{3.17}$$

The full data likelihood in this model is

$$p(U, V, D \mid \theta) = \prod_i p(u_i) \prod_j p(v_j) \prod_{(i, j, r_{ij}) \in D} p_\theta(r_{ij} \mid u_i, v_j) \tag{3.18}$$

Note that the matrix factorization likelihood of probabilistic matrix factorization [35] (Section 2.1) is recovered in case $\mathcal{N}(r_{ij} \mid u_i^T v_j, \sigma)$ is the generative distribution (and thus $f_\theta$ has no learnable parameters, since in this case $f_\theta(u_i, v_j) = u_i^T v_j$, and we assume $\sigma$ is fixed).

We are interested in the posterior distribution over the latent factors $p(U, V \mid D)$, which is intractable to compute for even mildly complicated functions $f_\theta(\cdot)$. We therefore introduce a variational approximate distribution $Q(U, V)$ to the posterior distribution on the latent variables $U$ and $V$. We assume that the variational distribution factorizes over the rows of $U$ and $V$. Note that [28] only assumes a factorization of $Q(U, V)$ into $Q(U)Q(V)$, but that the factorization of $Q(U)$ and $Q(V)$ into the rows of $U$ and $V$ follows from the form of the expectations in the ELBO (we refer the reader to [28] for details). Unlike [28] and following [21], we condition the posterior distribution on the data. Specifically, in the variational posterior distribution we condition each latent factor $u_i$ and $v_j$ on the observed ratings associated with that latent factor:

$$Q_\phi(U, V \mid D) = \prod_i q_{\phi_u}(u_i \mid r_{i\cdot}) \prod_j q_{\phi_v}(v_j \mid r_{\cdot j}) \tag{3.19}$$

The parameters of each distribution $q_{\phi_u}(u_i \mid r_{i\cdot})$ depend on the observed ratings $r_{i\cdot}$ through a deterministic function $g_{\phi_u}(\cdot)$, which is in turn parametrized by the set of parameters $\phi_u$. Likewise, the parameters of $q_{\phi_v}(v_j \mid r_{\cdot j})$ depend on $r_{\cdot j}$ through the deterministic function $g_{\phi_v}(\cdot)$ parametrized by $\phi_v$.

The ELBO for the MFVAE model is presented in eq. (3.20). In the matrix factorization setting, the ELBO does not decompose into a convenient sum over ELBOs of individual datapoints, due to the dyadic nature of the data. Instead, the lower bound on the marginal likelihood decomposes into a sum over KL-divergence terms of each of the latent factors and a sum over the expected log conditional probability of each of the ratings. For a derivation of the full data marginal likelihood we refer the reader to Appendix A.1.

$$\begin{aligned}
\mathcal{L}(\theta, \phi_u, \phi_v; R) &= -D_{KL}\big(Q(U, V \mid D) \,\|\, p(U, V)\big) + \mathbb{E}_{Q(U, V \mid D)}\big[\log p_\theta(D \mid U, V)\big] \\
&= -\sum_i D_{KL}\big(q_{\phi_u}(u_i \mid r_{i\cdot}) \,\|\, p(u_i)\big) - \sum_j D_{KL}\big(q_{\phi_v}(v_j \mid r_{\cdot j}) \,\|\, p(v_j)\big) \\
&\quad + \sum_{i, j \in D} \mathbb{E}_{q_{\phi_u}(u_i \mid r_{i\cdot})\, q_{\phi_v}(v_j \mid r_{\cdot j})}\big[\log p_\theta(r_{ij} \mid u_i, v_j)\big]
\end{aligned} \tag{3.20}$$

Note that this ELBO is highly similar to the ELBO for the Variational Bayesian Matrix Factorization model of Lim and Teh [28], which can be found in eq. (2.22). The major difference between this ELBO and the ELBO of the Variational Bayesian Matrix Factorization algorithm is the dependence of the approximate posterior distribution $Q(U, V \mid D)$ on the data $D$. This is typical of Variational Autoencoder-type applications [21].

The form of the ELBO implies that the use of mini-batches of data is only possible if subsets of users and items can be selected such that no user or item inside the subset has rated, or has been rated by, an item or user outside of it. Viewed from a graph perspective: mini-batches only make sense for connected components of the rating graph. While the connected components of the rating graph (if there is more than one) could be precomputed, we assume that the rating graph contains one very large connected component containing most user and item nodes, and a few very small connected components containing unpopular movies and users with few ratings. Defining mini-batches of equal size would thus require some sort of subsampling procedure.¹

Otherwise, the procedure for maximizing the variational lower bound of eq. (3.20) is identical to the procedure for maximizing the variational lower bound in the original AEVB algorithm. For convenience this is reproduced here.

¹ The connectivity of each of the rating datasets and the splits we create (i.e.

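The expectation $\mathbb{E}_{q_{\phi_u}(u_i \mid r_{i\cdot})\, q_{\phi_v}(v_j \mid r_{\cdot j})}\big[\log p_\theta(r_{ij} \mid u_i, v_j)\big]$ in eq. (3.20) can, by supposition, not be computed in closed form. However, we can estimate the expectation by drawing samples $u_i \sim q_{\phi_u}(u_i \mid r_{i\cdot})$ and $v_j \sim q_{\phi_v}(v_j \mid r_{\cdot j})$, and computing an approximation to the expectation:

$$\mathbb{E}_{q_{\phi_u}(u_i \mid r_{i\cdot})\, q_{\phi_v}(v_j \mid r_{\cdot j})}\big[\log p_\theta(r_{ij} \mid u_i, v_j)\big] \approx \frac{1}{S} \sum_{s=1}^{S} \log p_\theta(r_{ij} \mid u_i^{(s)}, v_j^{(s)}) \tag{3.21}$$

where $u_i^{(s)}$ indicates the $s$th sample from $q_{\phi_u}(u_i \mid r_{i\cdot})$ and likewise $v_j^{(s)}$ indicates the $s$th sample from $q_{\phi_v}(v_j \mid r_{\cdot j})$.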

Recognition models and Prior Distributions

In all our models we assume that the prior distribution for all latent factors $u_i$ and $v_j$ is a zero-mean Gaussian distribution with diagonal variance. In most models we assume that the priors have unit variance, but we allow some flexibility in setting the (spherical) variance in the prior. This has the effect of giving the KL-divergence in the ELBO a higher or lower weight, depending on whether the variance is decreased or increased, respectively; increasing the variance thus gives the approximate posterior more freedom to move away from the prior mean. Thus:

$$p(u_i) = \mathcal{N}(u_i \mid 0, \sigma_u^2 I), \qquad p(v_j) = \mathcal{N}(v_j \mid 0, \sigma_v^2 I) \tag{3.22}$$

where $\sigma_u$ and $\sigma_v$ are positive scalars.

We also use Gaussian recognition models. We use the notation $\mu_{u_i}$ and $\sigma_{u_i}^2$ to denote the mean and a vector representing the diagonal variance of the recognition model $q_{\phi_u}(u_i \mid r_{i\cdot})$, respectively, and we use $\mu_{v_j}$ and $\sigma_{v_j}^2$ to denote the mean and a vector representing the diagonal variance of the recognition model $q_{\phi_v}(v_j \mid r_{\cdot j})$, respectively. Thus:

$$q_{\phi_u}(u_i \mid r_{i\cdot}) = \mathcal{N}(u_i \mid \mu_{u_i}, \sigma_{u_i}^2 I) \tag{3.23}$$

$$q_{\phi_v}(v_j \mid r_{\cdot j}) = \mathcal{N}(v_j \mid \mu_{v_j}, \sigma_{v_j}^2 I) \tag{3.24}$$
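With Gaussian priors and Gaussian recognition models, the two kinds of terms in the ELBO of eq. (3.20) can be evaluated as sketched below. The KL term is available in closed form, and the expected log-likelihood is estimated by sampling as in eq. (3.21). The sketch assumes a Gaussian likelihood around the dot-product with a fixed noise variance, which is only one of the possible decoders; the function names and the `noise_var` parameter are hypothetical.

```python
import numpy as np

def kl_to_spherical_prior(mu, sigma2, prior_var=1.0):
    """Closed-form KL( N(mu, diag(sigma2)) || N(0, prior_var * I) )."""
    return 0.5 * np.sum(sigma2 / prior_var + mu**2 / prior_var
                        - 1.0 - np.log(sigma2) + np.log(prior_var))

def expected_log_lik(r, mu_u, sigma2_u, mu_v, sigma2_v, noise_var=1.0,
                     num_samples=1, rng=None):
    """Monte Carlo estimate of the expected log-likelihood of one rating.

    Assumes the likelihood N(r | u^T v, noise_var). Samples of u and v are
    drawn with the reparametrization mu + sigma * eps.
    """
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(num_samples):
        u = mu_u + np.sqrt(sigma2_u) * rng.standard_normal(mu_u.shape)
        v = mu_v + np.sqrt(sigma2_v) * rng.standard_normal(mu_v.shape)
        total += -0.5 * (np.log(2 * np.pi * noise_var) + (r - u @ v) ** 2 / noise_var)
    return total / num_samples
```

In a full implementation, each KL term would be added once per user and once per item, while the likelihood term would be added once per observed rating, mirroring the three sums in the ELBO.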