
MSc Artificial Intelligence

Master Thesis

Geometric Priors for Disentangling Representations

by

Samarth Bhargav

11859032

September 27, 2019

36 EC February - September 2019

Supervisor:

Andy Keller

Examiner:

Patrick Forré

Assessor:

Max Welling

University of Amsterdam

Faculty of Science


Abstract

Disentangled representations have been hypothesized to be key to improving the performance of A.I. algorithms on tasks where human performance is far superior to that of current A.I. approaches (Lake et al., 2017). In addition to being interpretable, they have been shown to improve performance in reasoning tasks (van Steenkiste et al., 2019) and to improve sample efficiency in zero-shot learning (Higgins et al., 2017b). A disentangled representation is a representation in which only one aspect of the representation is sensitive to changes in one generative factor, while being invariant to changes in the other factors.

We address the problem of learning disentangled representations in an unsupervised setting, on datasets that are generated from independent generative factors. Recent work (Locatello et al., 2018) has highlighted the need for inductive biases when learning disentangled representations. We propose an inductive bias in the form of the prior, chosen to suit the dataset being used, and investigate whether it is sufficient to produce disentangled representations. This prior is composed of independent factors, each of which is defined on a geometry. We show that while this method achieves limited success on smaller datasets, an appropriate prior alone might not be sufficient to disentangle representations for larger datasets with several factors of variation, and additional supervision might be needed to solve the task.


Acknowledgements

I would like to thank Andy Keller for his supervision and support. His constant motivation, patience and guidance were instrumental in finishing this thesis. I would also like to thank Patrick Forré for taking the time to answer several questions and for inspiring several ideas we had throughout the project. Thanks to Max Welling, Patrick and Andy for agreeing to read and assess my thesis. I am grateful to Giorgio Patrini for providing me with the opportunity to work in an exciting new field, which led to me working on this thesis and learning a lot.

Thanks to the following people who were supportive throughout the thesis, and kept me sane: Gulfaraz, Marco, Kishan, Akash, Tharangni, Harriët. Thanks Akash and Marco for proofreading the thesis. Finally, I wish to thank my family, who have been nothing but supportive during my Master's.


Contents

1 Introduction
  1.1 Motivation
  1.2 Aim of the thesis
  1.3 Organization
2 Background
  2.1 Representation Learning
  2.2 Notation
  2.3 Variational Inference
  2.4 Variational Autoencoders
  2.5 Optimal Transport
    2.5.1 Generative Modeling using Optimal Transport
3 Related Work
  3.1 Definition(s) of Disentanglement
  3.2 Evaluating disentanglement
  3.3 Supervised Models
  3.4 Unsupervised Models
  3.5 Literature on geometric latent spaces
4 Methodology
  4.1 Priors
    4.1.1 Hypercube
    4.1.2 Hypersphere
    4.1.3 Simplex
    4.1.4 Stochasticity
    4.1.5 Factorial Prior
    4.1.6 Suitability of priors
  4.2 The Wasserstein and Sinkhorn Autoencoder
  4.3 Evaluation
    4.3.1 Qualitative Evaluation
    4.3.2 Quantitative Evaluation
  4.4 Data
    4.4.1 dSprites
    4.4.2 Small Norb
5 Experiments
  5.1 dSprites
    5.1.1 dSpritesXY
    5.1.2 dSpritesXYOrientation
    5.1.3 dSpritesXYScale
    5.1.4 dSpritesXYShape
    5.1.5 dSprites
  5.2 SmallNorb
6 Discussion
7 Conclusion
A Hyperparameters


1 Introduction

The choice of representation of data has a fundamental impact on the performance of a predictive model trained and evaluated on that data. If a representation is well suited for a task, it can increase the efficacy and robustness of a model (Bengio et al., 2013). But the suitability of a representation is usually determined by its performance on a particular task, say, classification. Of course, a model can ‘overfit’ on a given dataset and task and perform very well. However, this doesn’t necessarily mean that it will perform well on related (or other) tasks. The field of Representation Learning deals with learning a representation that is well suited for several general tasks.

The resurgence of Deep Learning points to the importance of representations: since the success of Krizhevsky et al. (2012) on the ImageNet challenge, the field has exploded, causing a flurry of research and industry interest. Deep Learning, unlike earlier approaches, manages to learn representations that achieve state-of-the-art performance. Since Deep Learning requires at least several thousand data points, A.I. practitioners exploit a well-known aspect of deep representations: representations learned for one task can be fine-tuned for related tasks. This is called transfer learning (Pan and Yang, 2009) and has enabled the application of Deep Learning to several domains.

In addition to transferability, there are several other desiderata. Bengio et al. (2013) explores other desirable properties of a ‘useful’ representation. Besides the already familiar transferability, these include smoothness, expressivity, and so on (see Section 2.1). It has been hypothesized by several researchers that a representation that captures the factors of variation in the data in a disentangled manner is desirable. A disentangled representation captures the factors of variation in the data, also called the generative factors of the data, in an independent manner: if we change one factor of variation in the observation (e.g. the shape of an object) while keeping everything else fixed, only one aspect of the representation produced by an algorithm or model changes[1].

Consider the MNIST dataset (LeCun and Cortes, 2010) for instance. Early work in this area of research (Chen et al., 2016) managed to disentangle the digit, orientation and stroke thickness - see Figure 1.1. This makes the representation learned by the model interpretable and intuitive. Among other advantages which are explored in the next section, a disentangled representation has been thought to be crucial for building A.I that is able to reason like humans (Lake et al., 2017). Indeed, disentanglement has been shown to improve performance in tasks which require reasoning (van Steenkiste et al., 2019).

[1] A more precise definition is given in Section 3.1.

Figure 1.1: Manipulating a disentangled representation: (a) Digit, (b) Orientation and (c) Stroke Thickness are varied independently of each other. Each row shows one sample being manipulated: only one aspect changes, while the others remain fixed.


Recent work (Locatello et al., 2018) has shown that this task is more difficult than researchers initially thought. In their large scale study, they show that current state-of-the-art models cannot reliably learn disentangled representations in an unsupervised manner and are very sensitive to hyperparameter choices. They conclude that the task is difficult without certain modeling assumptions, i.e. inductive biases.

In this work, we take advantage of the fact that we know the nature of the factors of variation in the data. For instance, we know that the MNIST dataset has 10 digits, and has other variations like stroke thickness and orientation which are continuous. We exploit this knowledge by using priors composed of independent factors, where each factor is a distribution on a geometry suited for a factor of variation in the data. We perform experiments to see if this provides enough supervisory signal to reliably disentangle representations. Recently proposed optimization methods (Villani, 2008; Peyré et al., 2019) allow us to use arbitrarily complex factorial priors with little to no change in the optimization algorithm.

In the following section, we will motivate the need for disentangled representations. In Section 1.2, we will outline the aim of the thesis and end with Section 1.3, which outlines the organization of the thesis.

1.1 Motivation

Disentangled Representations   Lake et al. (2017) measure the progress of machine learning and artificial intelligence algorithms, contrasting it with human intelligence. They hypothesize that a representation that is disentangled could improve the performance of state-of-the-art A.I. approaches in situations where the performance of humans is far superior. Such tasks include knowledge transfer, where previously learned representations are reused and tuned for tasks other than the original task, and zero-shot inference, where reasoning about new data is enabled by recombining previously learned factors and discarding irrelevant factors. The key idea behind this conclusion is that human beings form causal models of the world that support explanation and understanding (Lake et al., 2017). Instead of solving a pattern recognition problem, the goal is to build representations that capture such a model, which would allow an A.I. to reason and harness the compositionality inherent in such a representation.

While there is little evidence for the claim that disentangled representations are useful for general downstream tasks like classification, recent research suggests that such a representation is indeed useful for abstract reasoning, a task that humans are able to perform with little effort. van Steenkiste et al. (2019) report that disentangled representations do indeed improve the performance of downstream reasoning tasks, in terms of sample efficiency and speed of learning.

Furthermore, Higgins et al. (2017c) report that a disentangled representation enables their algorithm to extrapolate from its training data distribution and ‘imagine’ novel visual concepts by the recombination of previously learnt concepts.

Disentangled representations have also been shown to improve performance in Reinforcement Learning: the DisentAngled Representation Learning Agent (DARLA) (Higgins et al., 2017b) was robust to many domain shifts, including simulation-to-reality transfer in robotics; Kurutach et al. (2018) use the InfoGAN framework (Chen et al., 2016), a framework that disentangles representations, to induce an informative model, and show that the representations learned by the model can generate convincing walkthrough sequences (for planning).


The Disentanglement Challenge (NeurIPS 2019) attempts to bring disentanglement algorithms to the ‘real world’, by transferring representations learned on simulated data to real data, and learning disentangled representations for datasets with complicated physical objects.

Disentangled representations have another advantage, other than the ones mentioned above.

Disentangled Representations are interpretable   While having a disentangled representation that aligns with human-interpretable intuitions is not an explicit goal of the task of disentangling a representation, it seems that these two factors are related (Burgess et al., 2018), and this relation formed the basis of early work in the area, before quantitative metrics were available. The nature of the relationship between interpretability and disentanglement is not understood (especially why some disentangled representations align with human intuition), but it can be clearly seen if we ‘traverse’ the space of representations (see Section 4.3.1). In fact, early work (Chen et al., 2016) relied entirely on qualitative human evaluation. However, a disentangled representation is not necessarily interpretable: while it may be clear that different factors encode different properties, a factor itself might be a high-dimensional space which does not allow for principled traversal (for instance an n-sphere with n ≥ 3).

Having an interpretable and disentangled representation presents obvious advantages. An interpretable representation allows a researcher to uncover what the model has understood. For instance, if we have an RL agent whose perception model is a neural network that produces an interpretable, disentangled representation, we can essentially ‘peer’ into the agent’s behaviour. This might also allow for better A.I. safety, since the perception model may be interpretable.

In summary, disentangled representation learning is an interesting direction of research, and shows promise in several tasks. However, recent work challenges some of the assumptions made by current approaches for disentanglement.

The need for Inductive Biases   Locatello et al. (2018) conduct an experimental study of state-of-the-art algorithms for disentangled representations, and highlight several issues in current approaches. In particular, they prove both theoretically and experimentally that, without inductive biases on the model or the data, it is impossible for an algorithm to learn a disentangled representation without supervision. The key intuition behind this result is as follows: given access to only the data (and no ground truth labels), there are infinitely many generative models which have the same marginal distribution over the observed data, and there is no known way to determine which of these models disentangle the representations. Some additional signal, via inductive biases which constrain the set of possible solutions to favor disentanglement, is needed. In this work, the supervisory signal is limited to the nature of the prior we employ when training the model (and the corresponding latent space induced by it).

Geometric latent spaces   The choice of prior and the latent space it induces is one of the most common inductive biases used in Machine / Deep Learning (Ridgeway, 2016). Research on geometric latent spaces shows that having a geometric latent space is advantageous and achieves better reconstruction (Falorsi et al., 2018; Davidson et al., 2018; Patrini et al., 2019; Ridgeway and Mozer, 2018). Falorsi et al. (2018) highlight that a continuous encoder network cannot embed a data manifold with a non-trivial topology in a one-to-one manner without creating ‘holes’ in the manifold. They show in particular that choosing a manifold that matches the topology of the latent data manifold is crucial to preserve the topological structure and learn a ‘well-behaved’ latent space. Therefore, choosing a well-suited latent space is crucial.

In this work, we exploit the fact that we know the generative factors of the datasets, and how they are distributed. For instance, the ‘Digit’ generative factor in the MNIST dataset is clearly discrete, and ‘Stroke Thickness’ is clearly continuous (see Figure 1.1). By constructing a prior composed of distributions on different geometries, we hope to nudge each generative factor to be captured on the geometry that is well suited for it.

The use of multiple, independent, low-dimensional geometric factors also makes it easy to visualize the latent spaces, allowing for effective qualitative evaluation of both disentanglement and semantic alignment.

Recent work in the field of Optimal Transport (Villani, 2008; Peyré et al., 2019; Cuturi, 2013; Tolstikhin et al., 2017; Genevay et al., 2017; Patrini et al., 2019) has opened up the possibility of using arbitrary priors with little to no change in the learning algorithm.

1.2 Aim of the thesis

The aim of the thesis is to solve the problems outlined above in an elegant manner. By using a prior composed of independent factors, each a distribution on a low-dimensional geometry, we attempt to learn an interpretable, disentangled latent space, where each factor captures a semantically meaningful factor of variation in the observed data.

First, we experiment with different geometries and show that there are natural choices for certain factors of variation in the data, and that choosing other geometries induces ‘holes’ in the manifold. Second, we investigate whether a prior composed of multiple independent factors, each suited to a generative factor of the data, provides a sufficient supervisory signal for the model to disentangle the learned representation. We show that while this can achieve disentanglement successfully for small datasets with a limited number of factors, the task remains difficult, and additional supervision might be needed to completely solve it.

1.3 Organization

The thesis is organized as follows:

• Section 2 provides background information needed for the thesis, introducing the variational inference framework (Section 2.3) and the Optimal Transport framework (Section 2.5).
• Section 3 presents the definitions of disentanglement, and positions the work presented here in the current research context.
• Section 4 describes the methods of the thesis, along with the datasets we use.
• Section 5 outlines the experiments we conducted and their results.
• Section 6 contains a discussion of the results.


2 Background

A brief overview of the information required to understand this thesis is presented in this section. Readers are assumed to be familiar with Machine Learning[2] and Deep Learning[3]. The rest of the section is organized as follows: Section 2.1 discusses Representation Learning; Sections 2.3 and 2.4 introduce Variational Inference and the Variational Autoencoder respectively; Section 2.5 introduces the field of Optimal Transport and provides some context for the models used in Sections 2 and 4.

2.1 Representation Learning

Before the current era of Deep Learning, much effort had been put into designing representations which captured some informative signal which is useful for some (usually) predefined task (Lowe et al., 1999; Bay et al., 2008; Dalal and Triggs, 2005). This ‘feature engineering’ proved to be tedious to work with, since it required tweaking an algorithm to fit a particular task. The emergence of Deep Learning allowed researchers to do away with much of this ‘engineering’, by training a sequence of neural network layers which were able to learn a set of hierarchical features directly from data. These features are learnt in an ‘end-to-end’ manner and proved to be very successful in several tasks and domains. The success of (deep) neural networks allowed for powerful task-specific representations to be learned.

Bengio et al. (2013) argues that the representation of data plays a very important role in the success of Deep Learning. In particular, the paper argues for several characteristics of a ‘good’ representation. Some of these properties include: smoothness, where a learned function f maps data points close in the data space, x ≈ y, to points close in the representation space, f(x) ≈ f(y); shared factors across tasks, where the learned representation generalizes across several tasks; expressiveness, meaning that the dimensionality of the representation is much smaller than the dimensionality of the data space, while still (ideally) capturing all combinations of variations in the data; hierarchical representations which capture different levels of abstraction; and so on. The importance of learning a good representation may be showcased by the emergence of conferences or tracks dedicated to Representation Learning, such as the International Conference on Learning Representations (ICLR).

2.2 Notation

We use the following notation. We use calligraphic letters (i.e. X ) for sets, capital letters (i.e. X) for random variables, and lower case letters (i.e. x) for their values. We denote probability distributions with capital letters (i.e. P (X)) and corresponding densities with lower case letters (i.e. p(x)).

2.3 Variational Inference

Variational Inference (VI) (Blei et al., 2017) is a technique that approximates intractable probability densities. It is an alternative to techniques like Markov chain Monte Carlo (MCMC) sampling (Gilks et al., 1995). VI has been applied in several domains like computer vision, computational neuroscience and large-scale document analysis (Srivastava and Sutton, 2017).

[2] See Bishop (2006) for a full overview.


Consider the following problem of modeling the joint density of the (continuous) latent variables Z (and values z ∈ Z ) and observations X (and values x ∈ X ),

p(x, z) = \underbrace{p(x|z)}_{\text{likelihood}} \; \underbrace{p(z)}_{\text{prior}} \qquad (2.1)

The latent variables help explain the distribution of the observations. Models described by (2.1) draw samples from the prior distribution and relate them to the likelihood of the observations. Inference in such models involves the computation of the posterior, P (Z|X). We can write the conditional density of this posterior distribution as follows,

p(z|x) = \frac{p(z, x)}{p(x)} \qquad (2.2)

where the denominator, called the evidence, is defined as p(x) = \int_{\mathcal{Z}} p(z, x)\,dz for all x ∈ X. Typically, the posterior is intractable, so we are left using other methods which approximate it, such as MCMC. While effective in certain settings, MCMC does not scale very well to high dimensions. VI can be employed in these circumstances. In VI, we hypothesize a family of distributions, D, over the latent variables, and then search for a member (the approximate posterior) that best explains the true posterior by minimizing a divergence between the ‘true’ posterior and the candidate distribution:

\hat{Q}_Z = \arg\min_{Q_Z \in \mathcal{D}} D_{KL}(Q_Z \,\|\, P(Z|X)) \qquad (2.3)

where \hat{Q}_Z is the approximate posterior. This is usually parametrized by a neural network.

Equation 2.3 can be further broken down:

D_{KL}(Q_Z \,\|\, P(Z|X)) = \mathbb{E}_{q(z)}[\log q(z)] - \mathbb{E}_{q(z)}[\log p(z|x)] = \mathbb{E}_{q(z)}[\log q(z)] - \mathbb{E}_{q(z)}[\log p(z, x)] + \underbrace{\log p(x)}_{\text{intractable}} \qquad (2.4)

Since the log evidence term log p(x) is intractable, we instead optimize a lower bound. First:

\log p(x) = \log \int_{\mathcal{Z}} p(x, z)\,dz = \log \int_{\mathcal{Z}} p(x|z)\,p(z)\,dz = \log \int_{\mathcal{Z}} \frac{q(z)}{q(z)}\,p(x|z)\,p(z)\,dz = \log \mathbb{E}_{q(z)}\!\left[\frac{p(x|z)\,p(z)}{q(z)}\right] \qquad (2.5)

Since the log function is concave, we can use Jensen’s inequality to get a lower bound, called the evidence lower bound (ELBO):


\log p(x) \overset{\text{JI}}{\geq} \mathbb{E}_{q(z)}\!\left[\log \frac{p(x|z)\,p(z)}{q(z)}\right] = \mathbb{E}_{q(z)}\!\left[\log \frac{p(z)}{q(z)}\right] + \mathbb{E}_{q(z)}[\log p(x|z)] = -D_{KL}(Q(Z|X) \,\|\, P_Z) + \mathbb{E}_{q(z)}[\log p(x|z)] \qquad (2.6)

From equations 2.4 and 2.6, we can establish a relation between the log evidence and the ELBO,

\text{ELBO} = \log p(x) - D_{KL}(Q_Z \,\|\, P(Z|X)), \qquad \text{ELBO} \leq \log p(x) \qquad (2.7)

since the Kullback-Leibler divergence is always non-negative. Hence, maximizing the ELBO is equivalent to minimizing the quantity D_{KL}(Q_Z \,\|\, P(Z|X)) up to a constant. If the approximate posterior Q_Z perfectly matches the posterior P(Z|X), i.e. D_{KL}(Q_Z \,\|\, P(Z|X)) = 0, then the ELBO is equal to \log p(x).

2.4 Variational Autoencoders

Figure 2.1: A schematic of the encoder-decoder architecture, also called an autoencoder. The encoder encodes a data point x_i into a (low dimensional) latent space, producing z_i. The decoder decodes z_i to produce x'_i, a ‘reconstructed’ data point. The learning goal is to encode z_i such that the reconstructed point is identical to x_i.

The Variational Autoencoder (VAE) (Kingma and Welling, 2013) has become ubiquitous for unsupervised learning (Doersch, 2016). The VAE employs neural networks to learn a generative model, a model which captures the ‘true’ distribution of the data. The VAE consists of two parts, the encoder and the decoder, both of which are neural networks. The encoder encodes the input data into a (low dimensional) latent space. The decoder decodes from this latent space into the data space, producing a ‘reconstructed’ data point. Part of this process is illustrated in Figure 2.1.

Since the training procedure (in theory) ensures that all examples are faithfully reconstructed from the latent space, the latent space should contain all the information about the data itself. Usually, this latent space is of a much lower dimension, allowing for some (lossy) compression of the data into a compact representation.

Following the notation introduced in the previous section, the generative process for a point x ∈ X is:

p(x) = \int_{\mathcal{Z}} p(x|z)\, p(z)\, dz \qquad \forall x \in \mathcal{X} \qquad (2.8)

The VAE uses the maximum likelihood principle to learn a generative model which is capable of generating samples similar to the observed training data. In the vanilla implementation of the VAE, a standard (isotropic) Gaussian is used as the prior P_Z. The approximate posterior distribution Q(Z|X) is typically a diagonal Gaussian. Following the derivation from the previous section, the ELBO is

\mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(Q(Z|X) \,\|\, P_Z) \qquad (2.9)

The first term in (2.9) amounts to the reconstruction error. The D_{KL} term can be interpreted as a regularization term that forces the variational distribution to match the prior (Doersch, 2016; Ghosh et al., 2019). The D_{KL} term has closed-form solutions for well-known distributions.

Although the expectation of the log likelihood can be approximated using MC estimates, we cannot backpropagate through samples. This issue is addressed with the reparameterization trick (Rezende et al., 2014; Kingma and Welling, 2013), which moves the sampling to an input layer, allowing gradient computation.
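As an illustration only (not code from the thesis), a minimal PyTorch sketch of the reparameterization trick for a diagonal Gaussian posterior, together with the corresponding negative ELBO of (2.9), assuming a decoder with a fixed-variance Gaussian likelihood so that the reconstruction term reduces to a squared error:

    import torch

    def reparameterize(mu, log_var):
        # z = mu + sigma * eps with eps ~ N(0, I); gradients flow through mu and log_var
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps

    def negative_elbo(x, x_recon, mu, log_var):
        # Reconstruction error plus the closed-form KL between N(mu, sigma^2) and N(0, I)
        recon = torch.nn.functional.mse_loss(x_recon, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl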

2.5 Optimal Transport

The field of Optimal Transport (OT) (Villani, 2008) deals with comparing two probability distributions, one of which we want to transport or morph into the other. If we consider these distributions as piles of sand, we can associate a cost with transporting a single grain of sand from one pile to the other. This ‘ground cost’ depends on the path this grain of sand takes, and is usually defined on a geometry. There are several ways of moving the grains of sand, but OT deals with the optimal way to transport this mass, i.e. we want to minimize the overall cost of moving all the grains of sand. Early uses of OT involved the Earth Mover’s Distance (EMD) (Peleg et al., 1989) on images (Rubner et al., 1998). More recently, it has been applied to Machine Learning, and in particular Generative Modeling, which we will explore in this section. The subsequent paragraphs introduce OT problems formally; the reader can refer to Villani (2008) and Peyré et al. (2019) for a more thorough introduction to the topic.

Monge Problem   The Monge problem (Monge, 1781) deals with finding an optimal mapping between two probability spaces X and Y. Intuitively, we want to ‘push’ the mass from one space to another, and we pick a mapping or transportation plan such that it minimizes a transportation cost c : X × Y → R_+. The push-forward of T : X → Y is denoted by T_\sharp. Given two probability measures µ and ν, we can formally define the Monge problem:

\inf_{T_\sharp \mu = \nu} \int_{\mathcal{X}} c(x, T(x)) \, d\mu(x) \qquad (2.10)

Example: Monge Problem for discrete measures   Consider two discrete measures \alpha = \sum_{i=1}^{n} a_i \delta_{x_i} and \beta = \sum_{j=1}^{m} b_j \delta_{y_j} for points x_1, \ldots, x_n \in \mathcal{X} and y_1, \ldots, y_m \in \mathcal{Y}. We want to find a mapping T that maps each point x_i to a single y_j, and which therefore must ‘push’ the mass of \alpha to \beta. Intuitively, we want to move each grain of sand to a unique location, while doing the least amount of work. The push-forward map T : \{x_1, \ldots, x_n\} \to \{y_1, \ldots, y_m\} must satisfy:

\forall j \in [[m]], \quad b_j = \sum_{i : T(x_i) = y_j} a_i

which can be denoted as T_\sharp \alpha = \beta. The objective, which is to minimize the work done, can be stated as follows. Given a cost function c(x, y) defined \forall (x, y) \in \mathcal{X} \times \mathcal{Y}, we want to pick a transportation plan T which minimizes the transportation cost:

\min_{T} \left\{ \sum_{i} c(x_i, T(x_i)) \; : \; T_\sharp \alpha = \beta \right\} \qquad (2.11)

Kantorovich Problem   The Kantorovich problem (Kantorovich, 1942) relaxes some of the constraints of the Monge problem. For instance, for discrete measures, instead of assigning each x_i to a single location y_j through T, we may now assign it to several locations. This amounts to moving away from the deterministic nature of the original problem to a probabilistic view. Therefore, instead of maps T, couplings are considered. Given two measures µ and ν, a coupling Γ ∈ P(X × Y) is constrained as follows:

\Pi(\mu, \nu) \overset{\text{def}}{=} \{\, \Gamma \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) \mid \forall A \subseteq \mathcal{X}, B \subseteq \mathcal{Y},\; \Gamma(A \times \mathcal{Y}) = \mu(A) \text{ and } \Gamma(\mathcal{X} \times B) = \nu(B) \,\} \qquad (2.12)

The Kantorovich problem can be formally written down as follows:

\inf_{\Gamma \in \Pi(\mu, \nu)} \iint c(x, y)\, P(dx, dy) = \inf_{P \in \Pi(\mu, \nu)} \mathbb{E}_{P}[c(X, Y)] \qquad (2.13)

Example: Kantorovich problem for discrete measures   For discrete measures, the transport plan becomes a coupling matrix P ∈ R_+^{n×m}, where each entry describes the amount of mass flowing from one location to another, i.e. P_{i,j} describes the amount of mass of x_i flowing to location y_j. P_{i,j} is constrained: all of the mass from x_i must be assigned to a location, and each location y_j must be ‘filled up’. Formally, given two weight vectors a, b, we can define the permissible couplings:

U(a, b) \overset{\text{def}}{=} \left\{ P \in \mathbb{R}_+^{n \times m} : P \mathbf{1}_m = a \text{ and } P^T \mathbf{1}_n = b \right\} \qquad (2.14)

where P \mathbf{1}_m = \left( \sum_j P_{i,j} \right)_i \in \mathbb{R}^n and P^T \mathbf{1}_n = \left( \sum_i P_{i,j} \right)_j \in \mathbb{R}^m.

The Kantorovich problem (2.13) reduces to:

\min_{P \in U(a, b)} \langle P, C \rangle \qquad (2.15)
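To make (2.15) concrete, here is a small sketch (ours, with toy weights and a random cost matrix) that solves the discrete Kantorovich problem as a linear program with SciPy; the equality constraints encode the marginal conditions P1_m = a and P^T 1_n = b:

    import numpy as np
    from scipy.optimize import linprog

    n, m = 3, 4
    a = np.full(n, 1.0 / n)        # source weights
    b = np.full(m, 1.0 / m)        # target weights
    C = np.random.rand(n, m)       # ground cost C[i, j] = c(x_i, y_j)

    # Equality constraints: row sums of P equal a, column sums equal b
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([a, b])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    P = res.x.reshape(n, m)        # optimal coupling; transport cost <P, C> = res.fun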


Wasserstein Distance   If c(x, y) = d(x, y) is a metric, then for p ≥ 1 we have the p-Wasserstein distance:

W_p \overset{\text{def}}{=} \left( \inf_{\Gamma \in \Pi(\mu, \nu)} \iint d(x, y)^p \, P(dx, dy) \right)^{1/p} \qquad (2.16)

W_p can be computed using several methods, including network flow solvers (Klein, 1967). With empirical measures \mu = \sum_{i=1}^{n} a_i \delta_{x_i} and \nu = \sum_{j=1}^{m} b_j \delta_{y_j}, we have:

W_p(\mu, \nu) = \min_{P \in U(a, b)} \langle P, C \rangle \qquad (2.17)

where C_{i,j} = d(x_i, y_j)^p. This problem is solvable using minimum cost flow solvers (Klein, 1967), which have a complexity of O(n^3 \log n). In addition, the solution is not stable and not always unique, which leads us to employ regularization.

Entropy Regularization for W_p   By using entropy regularization we can achieve a unique and stable solution to (2.17) (Cuturi, 2013):

W_\lambda(\mu, \nu) \overset{\text{def}}{=} \min_{P \in U(a, b)} \langle P, C \rangle - \lambda H(P) \qquad (2.18)

where H(P) \overset{\text{def}}{=} -\sum_{i,j=1}^{n,m} P_{ij} (\log P_{ij} - 1) is the entropy regularizer. This quantity is computable using the Sinkhorn algorithm (Cuturi, 2013), which is detailed in the next section and in Section 4.
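For illustration, a minimal NumPy sketch (ours, not the thesis implementation) of the Sinkhorn iterations for the entropy-regularized problem (2.18); a fixed iteration count is used instead of a convergence check:

    import numpy as np

    def sinkhorn(a, b, C, lam, n_iters=200):
        # Entropy-regularized OT (2.18) via Sinkhorn scaling (Cuturi, 2013)
        K = np.exp(-C / lam)              # Gibbs kernel
        u = np.ones_like(a)
        for _ in range(n_iters):
            v = b / (K.T @ u)             # match the column marginals
            u = a / (K @ v)               # match the row marginals
        P = u[:, None] * K * v[None, :]   # approximate optimal coupling
        return P, np.sum(P * C)           # coupling and transport cost <P, C>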

2.5.1 Generative Modeling using Optimal Transport

While the previous section was more general, in this section we focus on using OT to build generative models. In particular, we apply OT to probability distributions for learning a generative model of the data. OT offers a natural alternative to methods like Variational Inference. Several attempts to use OT for generative models have been explored (Bousquet et al., 2017; Genevay et al., 2017; Tolstikhin et al., 2017; Ambrogioni et al., 2018; Patrini et al., 2019).

Given two distributions P_X and P_Y and any measurable non-negative cost c : X × Y → R_+ ∪ {∞}, we can restate the Kantorovich problem:

W_c(P_X, P_Y) = \inf_{\Gamma \in \Pi(P_X, P_Y)} \mathbb{E}_{(X, Y) \sim \Gamma}[c(X, Y)] \qquad (2.19)

where \Pi(P_X, P_Y) is the set of all joint distributions of (X, Y) with P_X and P_Y as the marginals. If c(x, y) = d(x, y)^p for a metric d and p ≥ 1, then W_p = \sqrt[p]{W_c} is called the p-Wasserstein distance.

Let P_X denote the true data distribution on X. Similar to the VAE, we have a latent space Z and a prior distribution over this space, denoted by P_Z. The decoder G(X|Z) is parametrized by a neural network. A generative model uses samples from the latent prior P_Z to generate data in X. The induced marginal will be denoted by P_G. Therefore, learning this induced marginal, which approximates the true data distribution P_X, can be cast into an OT setting:

\min_{G} W_c(P_X, P_G) \qquad (2.20)

This quantity is intractable in general, because of the infimum (see (2.19)). Instead of two random variables living in the data space X, we consider a latent space Z. In this model, instead of finding a coupling between two random variables living in X, one distributed according to P_X and the other one according to P_G, it is sufficient to find a conditional distribution Q(Z|X) such that its Z marginal Q_Z is identical to the prior distribution P_Z. We then consider the encoder Q(Z|X), i.e. the posterior, and the aggregated posterior Q_Z:

Q_Z = \mathbb{E}_{X \sim P_X} Q(Z|X) \qquad (2.21)

Tolstikhin et al. (2017) show that if the decoder G(X|Z) is deterministic[4], i.e. P_G = G_\sharp P_Z, or in other words, if all the stochasticity of the generative model is captured by P_Z, then:

W_c(P_X, P_G) = \inf_{Q(Z|X) : Q_Z = P_Z} \mathbb{E}_{X \sim P_X} \mathbb{E}_{Z \sim Q(Z|X)}[c(X, G(Z))] \qquad (2.22)

[4] We focus on deterministic encoders in this work, but random encoders can be used, albeit with the reparametrization trick, similar to the VAE.

Therefore, learning a generative model G amounts to the following objective[5]:

\min_{G} \min_{Q(Z|X)} \mathbb{E}_{X \sim P_X} \mathbb{E}_{Z \sim Q(Z|X)}[c(X, G(Z))] + \beta \cdot D_Z(Q_Z, P_Z) \qquad (2.23)

where β > 0 is a Lagrange multiplier and D_Z is any divergence measure on probability distributions on Z, which is left open. In Tolstikhin et al. (2017), the WAE uses either the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) or a discriminator trained adversarially for D_Z:

GAN-Based D_Z   The first option is to choose D_Z(Q_Z, P_Z) = D_{JS}(Q_Z, P_Z) and use adversarial training (Goodfellow et al., 2016) to estimate it. A discriminator in the latent space Z tries to separate “true” points sampled from P_Z from “fake” ones sampled from Q_Z. This approach is called the WAE-GAN. Tolstikhin et al. (2017) report that this method was found to be harder to train because of the instability involved with training GANs (Goodfellow et al., 2016), in spite of having an adversary in the latent space instead of the much higher dimensional data space.

MMD-Based D_Z   For a positive-definite reproducing kernel k : Z × Z → R, the following expression is called the maximum mean discrepancy (MMD) (Gretton et al., 2012; Sriperumbudur et al., 2010, 2011):

\text{MMD}_k(P_Z, Q_Z) = \left\| \int_{\mathcal{Z}} k(z, \cdot)\, dP_Z(z) - \int_{\mathcal{Z}} k(z, \cdot)\, dQ_Z(z) \right\|_{\mathcal{H}_k} \qquad (2.24)

where \mathcal{H}_k is the RKHS of real-valued functions mapping Z to R. If k is characteristic, then \text{MMD}_k defines a metric and can be used as a divergence measure. MMD has an unbiased U-statistic estimator, which can be used in conjunction with the stochastic gradient descent (SGD) method. Tolstikhin et al. (2017) explored possible kernels for this estimator and found that the inverse multiquadratics (IMQ) kernel works best:

k(x, y) = \frac{C}{C + \|x - y\|_2^2} \qquad (2.25)
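As an illustration of how this estimator can look in practice, a hedged PyTorch sketch (ours) of the IMQ kernel (2.25) and the unbiased U-statistic estimate of the squared MMD between encoded samples z_q ~ Q_Z and prior samples z_p ~ P_Z; the constant C is a free choice:

    import torch

    def imq_kernel(x, y, C=1.0):
        # Inverse multiquadratics kernel k(x, y) = C / (C + ||x - y||^2), Eq. (2.25)
        return C / (C + torch.cdist(x, y, p=2) ** 2)

    def mmd_imq(z_q, z_p, C=1.0):
        # Unbiased U-statistic estimator of MMD^2 between samples of Q_Z and P_Z
        n, m = z_q.size(0), z_p.size(0)
        k_qq = imq_kernel(z_q, z_q, C)
        k_pp = imq_kernel(z_p, z_p, C)
        k_qp = imq_kernel(z_q, z_p, C)
        # drop diagonal terms for the unbiased within-sample averages
        return (k_qq.sum() - k_qq.diag().sum()) / (n * (n - 1)) \
             + (k_pp.sum() - k_pp.diag().sum()) / (m * (m - 1)) \
             - 2 * k_qp.mean()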

Sinkhorn Autoencoder   An alternative objective can be derived using entropy regularization of the Kantorovich problem for probability distributions (2.19). Consider two distributions P_X and P_Y, and a non-negative measurable cost c : X × Y → R_+ ∪ {∞}. Genevay et al. (2017, 2018) and Feydy et al. (2018) define the entropy-regularized OT cost with ε ≥ 0:

\tilde{S}_{c,\varepsilon}(P_X, P_Y) := \inf_{\Gamma \in \Pi(P_X, P_Y)} \mathbb{E}_{(X, Y) \sim \Gamma}[c(X, Y)] + \varepsilon \cdot D_{KL}(\Gamma, P_X \otimes P_Y) \qquad (2.26)

which is not a divergence due to its entropic bias. When we remove this bias, we arrive at the Sinkhorn divergence:

S_{c,\varepsilon} := \tilde{S}_{c,\varepsilon}(P_X, P_Y) - \frac{1}{2}\left( \tilde{S}_{c,\varepsilon}(P_X, P_X) + \tilde{S}_{c,\varepsilon}(P_Y, P_Y) \right) \qquad (2.27)

The Sinkhorn divergence S_{c,\varepsilon} interpolates between the OT divergences and MMD (2.24)

(Feydy et al., 2018). This quantity can be computed using the Sinkhorn algorithm (Cuturi, 2013), which allows for fast computation of this divergence, with time complexity close to O(M²). Therefore, we are close to the original objective with favorable computational and statistical properties, provided an appropriate ε is chosen.

Since this problem is difficult in general (Genevay et al., 2017), we can recast it in an autoencoder setting similar to the WAE: the problem reduces to using the Sinkhorn divergence in the latent space, i.e. D_Z is equal to the Sinkhorn divergence. Therefore, we can rewrite the objective (2.23) to get the Sinkhorn autoencoder (SAE) objective (Patrini et al., 2019):

\min_{G} \min_{Q(Z|X)} \mathbb{E}_{X \sim P_X} \mathbb{E}_{Z \sim Q(Z|X)}[c(X, G(Z))] + \beta \cdot S_{\tilde{c},\varepsilon}(Q_Z, P_Z) \qquad (2.28)

where the quantity S_{\tilde{c},\varepsilon}(Q_Z, P_Z) is defined with a cost in the latent space Z, \tilde{c} : Z × Z → R_+ ∪ {∞}, instead of the data space X. If we restrict the model to deterministic encoders/decoders and p-Wasserstein distances, we obtain[6]:

\min_{G} \min_{Q} \sqrt[p]{\mathbb{E}_{X \sim P_X}\!\left[\|X - G(Q(X))\|_p^p\right]} + \gamma \cdot S_{\|\cdot\|_p^p,\varepsilon}(Q_Z, P_Z)^{1/p} \qquad (2.29)

The computation of the MMD-based D_Z and the SAE-based D_Z is detailed in Section 4.2.


3 Related Work

In this section, we briefly explore work that is related to the proposed methods. We go over the proposed definitions of disentanglement in Section 3.1; in Section 3.2 we explore how to evaluate disentanglement; Sections 3.3 and 3.4 explore supervised and unsupervised models respectively; finally, in Section 3.5, we explore models which have geometric latent spaces.

3.1 Definition(s) of Disentanglement

In this section, we will explore what it means for a representation to be disentangled. While there have been several attempts to define disentanglement, we note that there is no consensus among researchers.

All definitions assume that there are several generative factors which control or influence the data generation process. Typically, these are attributes of the data that have a semantic meaning associated with them. For synthetic datasets like dSprites (Matthey et al., 2017), these generative factors are independent of each other. The dataset itself is generated from a graphics engine, and the ‘ground truth’ factors uniquely determine the pixel values of the image. The representations generated by a model are referred to as ‘latents’, since they are inferred from the observations and not directly observed during the learning process.

One of the first attempts to define a disentangled representation was in Bengio et al. (2013): a disentangled representation is a representation in which a single latent is influenced by a single generative factor, while being relatively invariant to changes in other factors. This is illustrated in Figure 3.1. For instance, consider the dSprites (Matthey et al., 2017) dataset. If we take one ground truth factor, shape, and vary it, only one latent dimension in the representation should vary. This is a widely accepted (albeit informal) definition that early work (Higgins et al., 2017a; Chen et al., 2016) used.

More recently, Higgins et al. (2018) attempted to formalize the definition of a disentangled representation[7]. They suggest that symmetry transformations are key to exploiting structure in the data. Symmetry transformations change only certain properties of the underlying world, while leaving all other properties invariant. Intuitively, they define a disentangled representation as follows:

A vector representation that can be decomposed into a number of subspaces, each of which is compatible with, and can be transformed independently by, a unique transformation.

A symmetry transformation is a transformation which changes certain aspects of the world, while keeping other aspects (specifically object identity) unchanged. A set of such transformations makes up a symmetry group. For instance, consider the Position X and Position Y generative factors in the dSprites dataset (Matthey et al., 2017). Changing one of these factors, say Position X, leaves other properties like Shape, Orientation or Scale unchanged. Therefore, we can say that the horizontal and vertical translations are symmetry transformations, and all of these transformations (under a certain set of formal assumptions) make up the symmetry group for the dSprites dataset. A group action is the effect of a transformation, manifested, for instance, in the observation of the sprite moving. Clearly, for this example, the symmetry group can be decomposed into subgroups, one that affects position along the x direction, one along the y direction, etc.

[7] Note that this section only details a very high level intuition of the formal definition outlined in the paper.

A vector representation is called a disentangled representation with respect to a particular decomposition of a symmetry group into subgroups, if it decomposes into independent subspaces, where each subspace is affected by the action of a single subgroup, and the actions of all the other subgroups leave the subspace unaffected.

Note that a symmetry group may have multiple decompositions, and not all of these decompositions may lead to a disentangled representation. It may be the case that the symmetry group decomposes into subgroups where position x and y are entangled, for instance. Locatello et al. (2018) show that finding the decomposition that results in a disentangled representation requires additional assumptions about the data and the model producing the representation, in the form of inductive biases.

Note that this is different from the one in Bengio et al. (2013) where each generative factor was intended to be encoded in a single latent variable. The new definition allows for independent subspaces, potentially with dimension greater than 1, to represent generative factors. This is in line with our proposal, in which we attempt to model generative factors with independent factors, each of which can have dimension greater than one. The next section focuses on quantifying the extent of disentanglement in a learned representation.

Figure 3.1: A model (β-VAE, Higgins et al. (2017a)) that produces a disentangled representation in the sense of Bengio et al. (2013). The data generative process is fed generative factors in which only one value (in this example, shape) varies. The resulting observations are fed to the β-VAE model, which produces representations (the mean of the encoder) in which all but one latent is fixed.

3.2 Evaluating disentanglement

Early work in the area relied on qualitative evaluation to see if a representation is disentangled. Several quantitative metrics have been proposed since, but qualitative metrics remain an important way to understand disentanglement.

Latent Traversals   The simplest way to perform a qualitative evaluation of models is to plot latent traversals. First, a single sample (e.g. from the prior) is taken, and multiple representations are generated from it, keeping all but one latent dimension (or factor) fixed. These representations are then fed to the decoder. The observations produced by the decoder should have only one generative factor, such as Shape, varying, with other factors, such as position, fixed. Additional methods and further explanation are provided in Section 4.3.1. Of course, this method of evaluation makes objective comparison of two different models (or even the same model trained with different hyperparameters) noisy and unreliable.
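A minimal sketch (ours) of such a traversal; the decoder, the chosen dimension and the value grid are illustrative assumptions:

    import torch

    def latent_traversal(decoder, z, dim, values):
        # Vary one latent dimension of a single code z, keep the rest fixed, and decode each variant
        images = []
        for v in values:
            z_mod = z.clone()
            z_mod[dim] = v
            images.append(decoder(z_mod.unsqueeze(0)))   # add a batch dimension
        return torch.cat(images, dim=0)

    # e.g. traverse dimension 2 over an evenly spaced grid:
    # grid = latent_traversal(decoder, z_sample, dim=2, values=torch.linspace(-2, 2, 10))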

Quantitative Metrics   The earliest work which attempted to quantify disentanglement is Higgins et al. (2017a). Kim and Mnih (2018) proposed an alternative metric which claims to solve problems with the metric proposed in Higgins et al. (2017a). We use the metric proposed in Kim and Mnih (2018) in this work. An explanation of how these metrics are computed is given in Section 4.3.2. Other proposed quantitative metrics are Disentanglement, Completeness and Informativeness (Eastwood and Williams, 2018); Mutual Information Gap (MIG) (Chen et al., 2018); Modularity and Explicitness (Ridgeway and Mozer, 2018); and Separated Attribute Predictability (SAP) (Kumar et al., 2017).
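As a rough illustration of the metric of Kim and Mnih (2018) (the precise procedure we use is described in Section 4.3.2), a simplified sketch; the sampling and encoding callables, and the vote counts, are assumptions made for illustration:

    import numpy as np

    def factor_vae_metric(n_factors, sample_fixed_factor_batch, encode, latent_std,
                          n_votes=800, batch_size=64):
        # One "vote" per batch in which a single ground-truth factor is held fixed:
        # the latent dimension with the smallest normalized variance is predicted
        # to capture that factor.
        votes = []
        for _ in range(n_votes):
            k = np.random.randint(n_factors)
            x = sample_fixed_factor_batch(k, batch_size)
            z = encode(x) / latent_std                 # normalize each latent dimension
            d = int(np.argmin(np.var(z, axis=0)))
            votes.append((d, k))

        # Majority-vote classifier: map each latent dimension to its most frequent factor,
        # and report the accuracy of that classifier over the collected votes.
        counts = np.zeros((len(latent_std), n_factors), dtype=int)
        for d, k in votes:
            counts[d, k] += 1
        return counts.max(axis=1).sum() / counts.sum()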

The metrics defined above are limited in that they can only be applied to synthetic datasets. Therefore, researchers rely on both quantitative measurements of disentanglement on synthetic datasets, and qualitative measurements on synthetic and real world datasets.

Application to ‘real world’ datasets   Evaluating on synthetic datasets, for which we have access to either all the ground truth generative factors or the generative process itself, is relatively straightforward. However, synthetic datasets are typically biased when it comes to the nature of the generative factors: they are often independent. In addition, they are not noisy. For real world datasets where the generation process is unknown or complex, there is no straightforward way to quantify disentanglement. The prevailing perspective for the evaluation of models trained on real world datasets is to only consider generative factors that are aligned to human intuition, i.e. factors that have a semantic meaning assigned to them. In practice, this means that generative factors are ‘split’ into two sets: entangled generative factors and independent generative factors. The representation should capture enough information about the data distribution, while still capturing the independent generative factors in the latent space. While quantifying disentanglement in this context remains an active research area, qualitative results are promising. For instance, Chen et al. (2016) capture rotation and stroke thickness in MNIST; Higgins et al. (2017a) and Kim and Mnih (2018) capture size, azimuth, back length, etc. in the 3D Chairs dataset (Aubry et al., 2014); and Kim and Mnih (2018) recover azimuth, lighting, etc. on the 3D Faces dataset (Paysan et al., 2009).

3.3 Supervised Models

The first attempts at disentangling representations were semi-supervised, requiring some knowledge of the underlying factors, or access to some labels: Hinton et al. (2011); Kingma et al. (2014); Reed et al. (2014); Denton et al. (2017); Narayanaswamy et al. (2017); Mathieu et al. (2016); Goroshin et al. (2015); Hsu et al. (2017); Karaletsos et al. (2015); Whitney et al. (2016); Rippel and Adams (2013); Zhu et al. (2014); Yang et al. (2015); Cheung et al. (2014). In the following paragraphs, we will highlight a few notable attempts.

The DC-IGN (Kulkarni et al., 2015) learned an interpretable representation of images which was disentangled only with respect to some prespecified transformations, such as out-of-plane rotations and lighting variations. This approach is semi-supervised because of the way the model is trained: a batch where only one ground-truth latent variable (azimuth, lighting, etc.) is varying is picked, and the representation produced by the model is nudged so that only the corresponding latent variable (which is picked beforehand) varies. Note that even with supervision, some ground truth factors weren't captured in the latent space (Kulkarni et al., 2015; Higgins et al., 2017a), highlighting the difficulty of the problem. Similar approaches use either (i) grouping, similar to the DC-IGN approach, or (ii) semi-supervised methods (Bouchacourt et al., 2018; Narayanaswamy et al., 2017; Mathieu et al., 2016; Ridgeway and Mozer, 2018).

To our knowledge, only Chen et al. (2016), Dupont (2018) and Rey et al. (2019) explicitly model the underlying factors by assuming some kind of prior. For instance, for modeling MNIST, Chen et al. (2016) assume that the data is generated by a mix of known and unknown generative factors. They attempt to model the ‘digit’ variable using a discrete distribution and two continuous distributions, meant to model stroke thickness and rotation; Rey et al. (2019) model the presence of glasses and gender.

3.4 Unsupervised Models

Current approaches to learning a disentangled representation focus on learning in a completely unsupervised manner. In this section, we go over general approaches which do not use any supervision, either in terms of known ground truth labels, or in terms of grouping the data (Kulkarni et al., 2015).

ICA   Independent Component Analysis (ICA) (Comon, 1994) involves techniques which separate a signal into independent components, a task known as ‘blind source separation’. The underlying assumption of this method is that there is a generative process in which the signal is composed of statistically independent (non-Gaussian) components. ICA attempts to recover these sources. The non-linear variant has also been extensively studied. It is well known that while the linear ICA problem is identifiable (Comon, 1994), the non-linear problem (in an unsupervised setting, at least) was shown to be unidentifiable in most settings (Hyvärinen and Pajunen, 1999). Locatello et al. (2018) extend this result to the general case of generative models and prove that it is not possible for arbitrary generative models using factorized priors to disentangle representations.

InfoGAN   InfoGAN (Chen et al., 2016) is a Generative Adversarial Network (GAN) (Goodfellow et al., 2016) that maximizes the mutual information between a small subset of the latent variables and the observation. By doing so, it learns interpretable representations which are disentangled. For instance, they model the MNIST dataset with three ‘codes’: the first is a categorical distribution, and the other two are uniform distributions. They find that by varying the categorical latent code, the digit changes. By varying the other two codes, they find that the first models the rotation and the second models the width (Figure 1.1). Like our approach, the InfoGAN enables a researcher to specify the precise distribution of the generative factors. However, they do not experiment with geometric spaces. Furthermore, it is well known that GANs are unstable and difficult to train (Goodfellow et al., 2016). Kim and Mnih (2018) experiment with an improved variant of the InfoGAN, which they call InfoWGAN-GP, which uses the WGAN objective (Arjovsky et al., 2017) along with gradient penalties (Gulrajani et al., 2017) to improve training stability. In spite of this, VAE-based methods outperform this variant (Kim and Mnih, 2018).

β-VAE   The β-VAE (Higgins et al., 2016, 2017a; Burgess et al., 2018) extends the VAE (Kingma and Welling, 2013) by adding a multiplier to the KL term. Tuning this β term, they find that the model learns representations that are axis-aligned with human intuition of the ground truth factors, compared to the standard VAE. They found that a high β > 1 is necessary for ‘good’ disentanglement. However, a high β forces the model to trade off between disentanglement and reconstruction, resulting in models which have either good reconstruction with poor disentanglement or good disentanglement with poor reconstruction. This is due to the additional pressure (from a high β) for the posterior to match the isotropic Gaussian. This constrains the implicit capacity of the latent bottleneck: it has to be factorized and still be sufficient to reconstruct the data. The β-VAE also requires reparametrization, is unable to properly model categorical data and orientation, and additionally suffers from poor reconstruction (Higgins et al., 2017a). The objective of the β-VAE is the same as that of the VAE (Equation (2.9)), except for a multiplicative factor β attached to the D_{KL} term.

In their follow-up work, Burgess et al. (2018) explore this further. They hypothesize that reconstructing with this latent bottleneck encourages embedding the data points on a set of representational axes where nearby points on the axes are also close in data space. In particular, they find that the β-VAE aligns latent dimensions with components that make different contributions to the reconstruction: the bottleneck might induce the model to align factors that help most with the reconstruction, for instance the position of the sprite in the dSprites dataset (see Section 4.4), which provides the most reduction in the reconstruction error. Continuing this intuition, they extend the β-VAE by explicitly controlling the latent information bottleneck. By gradually increasing the information capacity of the model, the model should first capture the factors which aid reconstruction most and, once this contribution diminishes, capture other factors which give better gains in reconstruction. In their experiments they find that the model first learns position, then scale, followed by shape and orientation. This new model is termed the AnnealedVAE. While the AnnealedVAE fixes problems with the reconstruction seen in the β-VAE, it is still unable to model categorical data and orientation robustly (Burgess et al., 2018). In addition, they report that some factors that were captured in the latent traversals of the β-VAE are lost in the AnnealedVAE (Burgess et al., 2018).

FactorVAE   Kim and Mnih (2018) argue that the approach taken by the β-VAE, i.e. penalizing the mutual information between the observation and the latent space (i.e. I(x; z)) more than the VAE already does, might be neither necessary nor desirable for disentangling. Instead, the FactorVAE aims to directly enforce a factorized aggregated posterior Q_Z in which each dimension is independent. This is done by adding another loss term to the VAE loss (Equation (2.9)): D_{KL}(Q_Z \,\|\, \bar{Q}_Z), where \bar{Q}_Z \overset{\text{def}}{=} \prod_{k=1}^{K} Q(Z_k). The authors approximate the density ratio that arises inside this KL term using the density-ratio trick (Nguyen et al., 2010; Sugiyama et al., 2012), implemented with a discriminator (Goodfellow et al., 2016).
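A hedged sketch of how the density-ratio trick can be implemented; the discriminator architecture and its training loop are omitted, and this is an illustration rather than the FactorVAE reference implementation:

    import torch

    def permute_dims(z):
        # Build samples from the product of marginals \bar{Q}_Z by independently
        # permuting each latent dimension across the batch.
        return torch.stack([z[torch.randperm(z.size(0)), j] for j in range(z.size(1))], dim=1)

    def total_correlation_estimate(z, discriminator):
        # Density-ratio estimate of D_KL(Q_Z || \bar{Q}_Z): the discriminator outputs two
        # logits, [joint, factorized], and their difference approximates log q(z) / q_bar(z).
        logits = discriminator(z)
        return (logits[:, 0] - logits[:, 1]).mean()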

While the FactorVAE does outperform the β-VAE in terms of reconstruction and certain disentanglement metrics, it is unable to disentangle some factors robustly (Kim and Mnih, 2018): the ‘Orientation’ factor and the discrete ‘Shape’ factor in dSprites (see Section 4.4).

Other VAE-based models   Several other models modify the original VAE objective or add other terms to induce a representation that is disentangled. Similar to the FactorVAE, the β-TCVAE (Chen et al., 2018) proposes to estimate the Total Correlation, but uses a biased Monte-Carlo estimate instead of the density-ratio estimation in Kim and Mnih (2018); the DIP-VAE (Kumar et al., 2017) uses a disentangled prior which pushes the aggregated posterior to match a factorized prior, and estimates this divergence by regularizing the deviation of the covariances of the encoder; Ansari and Soh (2019) use an inverse-Wishart prior on the covariance matrix of the latent code and tune it to control the independence of the learnt latent dimensions; Watters et al. (2019) introduce an alternative to the deconvolutions used in the decoder of VAE-based models, called the Spatial Broadcast decoder, which provides a bias for disassociating position and other features, and improves disentangling, reconstruction accuracy and generalization; Dupont (2018) successfully models discrete latent variables along with continuous factors using the Gumbel trick (Jang et al., 2016) - to our knowledge, this along with the InfoGAN (Chen et al., 2016) are the only models capable of successfully modeling categorical generative factors in a principled manner; while Paquet et al. (2018) do not explore disentanglement explicitly, they propose a factorial mixture prior composed of subspaces, each of which is quantized with a Gaussian mixture model, and manage to capture discrete properties like gender and presence of glasses in the latent space; Ridgeway and Mozer (2018) propose the F-statistic loss, which encourages the separation of two or more distributions, and achieve state-of-the-art results on few-shot learning tasks for the deep embedding task.

Disentangled Representations with OT-based models   Rubenstein et al. (2018a,b) present early work in applying OT to the task of learning disentangled representations. They report comparable performance to the β-VAE. Note that while they employ standard distributions and explore disentanglement in the context of Optimal Transport, we employ deterministic encoders instead of random encoders, and we use geometric latent spaces.

3.5 Literature on geometric latent spaces

The Hyperspherical VAE (HVAE) (Davidson et al., 2018) uses a hypersphere as a prior, with a von Mises-Fisher posterior. It has a complex posterior sampling procedure and uses reparametrization based on rejection sampling. They show improved performance compared to the VAE with a Gaussian prior in certain settings. Falorsi et al. (2018) extend the VAE with continuously differentiable symmetry groups (Lie groups). They highlight the importance of choosing a geometry that matches the true latent manifold for a well-behaved latent space. Rey et al. (2019) use the properties of Brownian motion to implement the reparametrization trick. They train on MNIST using a sphere as a latent space, similar to Davidson et al. (2018). In addition to the sphere, they also use a torus in R³ as the latent space. While this method allows for arbitrary manifolds as latent spaces, the reparametrization trick itself is complex and the authors observe that the model is difficult to train (Rey et al., 2019).

The models listed above use the reparametrization trick to obtain geometric latent spaces. This either makes the sampling procedure very complex or even impossible for some geometries. Furthermore, while the models listed above have geometric latent spaces, none of them study it in the context of disentanglement.


4 Methodology

In this section, the core methods for the proposed models are explored. The section is organized as follows: Section 4.1 specifies the priors used, along with how they are combined; Section 4.2 expands on content from Section 2.5 and explains the estimation of the divergences from Tolstikhin et al. (2017); Patrini et al. (2019); and finally, Section 4.3 details the methods of evaluation, including qualitative methods and quantitative metrics.

4.1 Priors

In this section, we briefly describe the different priors used in the experiments. The only constraints on each prior are: (i) there must be a way to sample from the prior, and (ii) the ground cost defined on samples from the prior must be non-negative and differentiable. Furthermore, since we want the output of the encoder to lie on a geometry, the vector output of the encoder is constrained; for instance, the output is fed through a softmax so that it lies on the Simplex.

The suitability of each prior for a given generative factor varies. For instance, a prior amenable to discrete variables might be better suited to Shape than other priors. This is further explored in Section 4.1.6. Furthermore, since datasets have several generative factors, we need a factorial prior to model them - this is explored in Section 4.1.5.

4.1.1 Hypercube

We also consider the ('filled') Hypercube in our experiments: in one dimension, a hypercube is a straight line; in two dimensions, it is a square; in three dimensions it is a cube, and so on. More concretely, we use an (independent) uniform distribution along each dimension of the hypercube. For a 1-D hypercube defined on [a, b], b ≥ a, this becomes a (continuous) univariate uniform distribution on the line:

$$U(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for } x < a \text{ or } x > b \end{cases} \qquad (4.1)$$

In essence, this prior spreads mass uniformly over [a, b], such that every point in this bounded region has equal density. In one dimension, this amounts to a uniform distribution over the line segment from a to b, with all points between a and b (inclusive) having equal density. For higher dimensions, we construct the prior by sampling each dimension independently. In this work, we consider hypercubes formed by all sign permutations of the coordinates (±1, ±1, ..., ±1): for a 1-D hypercube, this forms a line connecting the coordinates [-1, +1]; for a 2-D hypercube, this forms a square with coordinates {(-1, +1), (+1, +1), (+1, -1), (-1, -1)}, and so on. We denote such a prior as U^n, which forms an n-cube with mass spread uniformly inside the hypercube. Figure 4.1 contains samples from U^1, U^2 and U^3.

Constraining To ensure that the output of the encoder lies inside this hypercube, we constrain the output of the encoder using a tanh. This ensures that all points are constrained in [−1, 1] along all dimensions.
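For concreteness, the following is a minimal PyTorch-style sketch of sampling from U^n and of the tanh constraint on encoder outputs; the function names are ours and the snippet is illustrative rather than the exact implementation used in the experiments.

```python
import torch

def sample_hypercube_prior(batch_size: int, dim: int) -> torch.Tensor:
    """Draw samples from U^dim, i.e. uniform on [-1, 1]^dim."""
    return 2.0 * torch.rand(batch_size, dim) - 1.0

def constrain_to_hypercube(encoder_output: torch.Tensor) -> torch.Tensor:
    """Map unconstrained encoder outputs into [-1, 1]^dim via tanh."""
    return torch.tanh(encoder_output)

# Example usage
z_prior = sample_hypercube_prior(batch_size=64, dim=3)   # samples from U^3
z_post = constrain_to_hypercube(torch.randn(64, 3))      # stand-in for encoder outputs
```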


Figure 4.1: Samples from 1D, 2D and 3D filled Hypercubes. (a) A histogram of 10,000 samples from a 1D uniform, i.e. a line; (b) samples from a 2D filled hypercube, i.e. a square; (c) samples from a 3D filled hypercube.

Distance The standard Euclidean distance is typically used as the ground cost (Tolstikhin et al., 2017; Patrini et al., 2019):

$$c(z_i, z_j) = \sqrt{\sum_{d=1}^{D} (z_{i,d} - z_{j,d})^2} \qquad (4.2)$$

Note that for better numerical stability, we use the squared Euclidean distance instead:

$$c(z_i, z_j) = \sum_{d=1}^{D} (z_{i,d} - z_{j,d})^2 \qquad (4.3)$$

4.1.2 Hypersphere

An n-Sphere or Hypersphere is an n-dimensional manifold embedded in an (n + 1)-dimensional Euclidean space. It can be defined as8:

$$S^n = \{z \in \mathbb{R}^{n+1} : \|z\| = 1\} \qquad (4.4)$$

Similar to the Hyperspherical VAE (Davidson et al., 2018) and the SAE (Patrini et al., 2019), we assume a uniform distribution on the surface of the hypersphere as a prior. Figure 4.2 contains samples from S^1 and S^2.

Constraining To constrain the output of the encoder to lie on the hypersphere, we simply divide the output of the encoder by its norm. If $\tilde{z}$ is the output of the encoder, the corresponding point on the hypersphere, z, is obtained as:

$$z = \frac{\tilde{z}}{\|\tilde{z}\|} \qquad (4.5)$$

Distance There are several ways to compute a distance on the hypersphere. The simplest is the aforementioned squared Euclidean distance (4.3). On S^1, we can also compute the arc length, which is the minimum distance between two points on the circle and lies in [0, π]. This is simple to compute, since it is equal to the angle between the two points (in radians).

8 Generally speaking, the radius is a positive real number; we only consider unit hyperspheres as priors in this work.

Figure 4.2: Samples from S^n. (a) Samples from the unit S^1; (b) samples from the unit S^2.
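As an illustrative sketch (PyTorch, with helper names that are ours), sampling the uniform hypersphere prior, projecting encoder outputs onto the sphere as in (4.5), and computing the arc-length distance on S^1 could look as follows:

```python
import torch

def sample_hypersphere_prior(batch_size: int, dim: int) -> torch.Tensor:
    """Uniform samples on S^(dim-1): normalize standard Gaussian draws."""
    z = torch.randn(batch_size, dim)
    return z / z.norm(dim=-1, keepdim=True)

def constrain_to_hypersphere(encoder_output: torch.Tensor) -> torch.Tensor:
    """Project encoder outputs onto the unit hypersphere, cf. (4.5)."""
    return encoder_output / encoder_output.norm(dim=-1, keepdim=True)

def arc_length_s1(z_i: torch.Tensor, z_j: torch.Tensor) -> torch.Tensor:
    """Arc length between unit vectors on S^1, i.e. the angle in [0, pi]."""
    cos = (z_i * z_j).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.acos(cos)

# Example usage
z_prior = sample_hypersphere_prior(64, 2)              # S^1 lives in R^2
z_post = constrain_to_hypersphere(torch.randn(64, 2))
d = arc_length_s1(z_prior, z_post)                     # elementwise arc lengths
```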

4.1.3 Simplex

A simplex is a generalization of a triangle / tetrahedron to higher dimensions. An n-simplex is embedded in n + 1 dimensions. We consider only the probability simplex, in which the coordinates of every point are non-negative and sum to one - formally, the n-simplex can be defined as:

$$\Delta^n = \{z \in \mathbb{R}^{n+1} : \sum_{i=1}^{n+1} z_i = 1 \text{ and } z_i \ge 0 \;\forall i\} \qquad (4.6)$$

Several distributions can be defined on this simplex. Following Patrini et al. (2019), we can use the Dirichlet distribution, a continuous distribution parameterized by the concentration parameter α ∈ R^{n+1} (for an n-simplex):

$$\mathrm{Dir}(z \mid \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{n+1} z_i^{\alpha_i - 1} \qquad (4.7)$$

where B(α), a normalizing constant, is the multivariate beta function. Figure 4.3 illustrates the effect of choosing different concentrations.

Figure 4.3: Samples from Dirichlet distributions on ∆^2 with (a) α = [1, 1, 1], (b) α = [0.1, 0.1, 0.1] and (c) α = [0.01, 0.01, 0.01].

In addition to the Dirichlet distribution, we consider the Gumbel-Softmax distribution (Jang et al., 2016). This involves the addition of noise from the Gumbel distribution, yielding a continuous distribution on the simplex that can approximate categorical samples. It is parametrized by a temperature parameter τ. As τ approaches 0, samples from the Gumbel-Softmax distribution become 'one-hot' and the distribution becomes identical to the categorical distribution. This is illustrated in Figure 4.4.

Figure 4.4: Samples from ∆^2 (Gumbel-Softmax) distributions with different temperatures: (a) τ = 1, (b) τ = 0.1, (c) τ = 0.001. As τ → 0, the distribution approaches a categorical distribution.

Distance We use the squared Euclidean distance (4.3).

Constraining For the Dirichlet, the output of the encoder $\tilde{z}$ is fed through a softmax to ensure that it lies on the simplex:

$$z_i = \frac{\exp \tilde{z}_i}{\sum_j \exp \tilde{z}_j} \quad \forall i \in \{1, \dots, n+1\} \qquad (4.8)$$

For the Gumbel-Softmax distribution, the following function is used:

$$z_i = \frac{\exp((\tilde{z}_i + g_i)/\tau)}{\sum_j \exp((\tilde{z}_j + g_j)/\tau)} \quad \forall i \in \{1, \dots, n+1\} \qquad (4.9)$$

where $\{g_i\}_{i=1}^{n+1}$ are i.i.d. samples drawn from Gumbel(0, 1).
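A minimal PyTorch sketch of these constraints (helper names are ours; the actual implementation may differ) is given below:

```python
import torch
import torch.nn.functional as F

def sample_dirichlet_prior(batch_size: int, dim: int, alpha: float = 1.0) -> torch.Tensor:
    """Samples from Dir(alpha, ..., alpha) on the (dim-1)-simplex."""
    concentration = torch.full((dim,), alpha)
    return torch.distributions.Dirichlet(concentration).sample((batch_size,))

def constrain_softmax(encoder_output: torch.Tensor) -> torch.Tensor:
    """Eq. (4.8): map encoder outputs onto the probability simplex."""
    return F.softmax(encoder_output, dim=-1)

def constrain_gumbel_softmax(encoder_output: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Eq. (4.9): Gumbel-Softmax relaxation; samples approach one-hot as tau -> 0."""
    gumbel = torch.distributions.Gumbel(0.0, 1.0).sample(encoder_output.shape)
    return F.softmax((encoder_output + gumbel) / tau, dim=-1)

# Example usage
z_prior = sample_dirichlet_prior(64, 3, alpha=0.1)     # prior samples on the 2-simplex
z_soft = constrain_softmax(torch.randn(64, 3))         # softmax-constrained encoder outputs
z_hard = constrain_gumbel_softmax(torch.randn(64, 3))  # near-categorical samples
```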

4.1.4 Stochasticity

We additionally experimented with adding noise to the output of the encoder. This was done to regularize the model and to obtain a 'smooth' latent space, where points close in the data space are mapped to points close in the latent space (An, 1996; Sietsma and Dow, 1991; Ghosh et al., 2019). The addition of noise is treated as a hyperparameter. Care has to be taken to ensure that the 'noisy' z still lies on the geometry. We add noise per geometry:

Hypersphere For this geometry, we add Gaussian noise N(0, 1e-4) to the points after they are normalized (see (4.5)), and the resulting points are normalized again so that they lie on the hypersphere.
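A sketch of this noise injection for the hypersphere (PyTorch; interpreting the second argument of N(0, 1e-4) as the variance is our assumption):

```python
import torch

def add_hypersphere_noise(z: torch.Tensor, variance: float = 1e-4) -> torch.Tensor:
    """Add N(0, variance) noise to points on the hypersphere, then re-normalize
    so the perturbed points still lie on the sphere (cf. Section 4.1.4)."""
    noisy = z + (variance ** 0.5) * torch.randn_like(z)
    return noisy / noisy.norm(dim=-1, keepdim=True)
```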


4.1.5 Factorial Prior

To learn a representation for a non-trivial dataset, we have to consider an encoder which encodes into a geometry that is a combination of independent factors. For instance, S^1 × U^1 is a prior with two factors, an S^1 and a U^1. Sampling from this prior is straightforward: we sample each factor independently and concatenate the samples to form a single sample. We use the following notation for a factorial prior with K factors:

$$P_Z^{\mathrm{fac}} = \prod_{k=1}^{K} P_{Z_k} \qquad (4.10)$$

Each prior has a subspace of Z associated with it. Take, for instance, $P_Z^{\mathrm{fac}} = U^1 \times S^1$, which lies in $\mathbb{R}^3$: samples from this prior have the first dimension composed of samples from U^1, and the second and third dimensions composed of samples from S^1.

Employing a factorial prior to disentangle is common in the field. The β-VAE, for instance, encourages statistically independent latent factors by appropriately tuning the β factor (Higgins et al., 2017a). The FactorVAE (Kim and Mnih, 2018) similarly forces the (aggregated) posterior to be factorized. Other approaches are similar; see Section 3.
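As a sketch (PyTorch, with our own helper names), sampling from a factorial prior amounts to concatenating independent samples from its factors:

```python
import torch

def sample_factorial_prior(batch_size: int, factor_samplers) -> torch.Tensor:
    """Sample each factor independently and concatenate along the last dimension.
    `factor_samplers` is a list of callables mapping a batch size to samples."""
    parts = [sampler(batch_size) for sampler in factor_samplers]
    return torch.cat(parts, dim=-1)

# Example: U^1 x S^1, a prior in R^3 (first dim from U^1, last two dims from S^1)
u1 = lambda m: 2.0 * torch.rand(m, 1) - 1.0
s1 = lambda m: torch.nn.functional.normalize(torch.randn(m, 2), dim=-1)
z = sample_factorial_prior(64, [u1, s1])   # shape (64, 3)
```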

4.1.6 Suitability of priors

Part of our hypothesis involves investigating whether some geometries are better able to capture certain ground-truth latent variables. For instance, a natural choice for encoding categorical data is a distribution over the simplex. The appropriate distribution depends entirely on the nature of the data itself: if there is a natural transition between two objects, then a Dirichlet distribution, which places samples in between the vertices, could be a good choice, since the encoder can then map those intermediate objects to that region.

Similarly, the nature of the data itself can point to a choice. The dSprites dataset (see Section 4.4) has images with varying position X. Since the sprite does not 'wrap' around the edge of the image, there might be regions of the manifold to which the encoder never maps points if we use a geometry with a rotational symmetry, such as S^n.

4.2 The Wasserstein and Sinkhorn Autoencoder

The Sinkhorn Autoencoder (SAE) and the Wasserstein Autoencoder (WAE) were introduced briefly in Section 2.5. The key advantage of the SAE and the (deterministic) WAE is that we can use arbitrary priors, with little to no change in the algorithms used to estimate the divergences when the prior changes. In this section, we detail how these quantities are computed and used in the optimization procedure.

Both algorithms deal with samples, i.e. we take M samples from $Q_Z$ and $P_Z$ to obtain empirical distributions concentrated on M points: $\tilde{P}_Z = \frac{1}{M}\sum_{m=1}^{M}\delta_{z_m}$ and $\tilde{Q}_Z = \frac{1}{M}\sum_{m=1}^{M}\delta_{\tilde{z}_m}$.

WAE The WAE (Tolstikhin et al., 2017), which was discussed in Section 2.5, defines two costs on the latent space: a GAN-based $D_Z$ and an MMD-based $D_Z$. The MMD-based computation is detailed in Algorithm 1, and the GAN-based $D_Z$ in Algorithm 2.


Algorithm 1: WAE-MMD (deterministic)
Input: β, kernel k, cost in the data space c
Initialize the parameters of the encoder Q and the decoder G
while not converged:
    Sample $x_1, \dots, x_M \in X$
    Sample $z_1, \dots, z_M$ from the prior
    $\tilde{z}_i \leftarrow Q(Z|x_i) \quad \forall i \in \{1, \dots, M\}$
    Update the encoder / decoder by descending:
    $\frac{1}{M}\sum_{i=1}^{M} c(x_i, G(\tilde{z}_i)) + \frac{\beta}{M(M-1)}\sum_{l\neq j} k(z_l, z_j) + \frac{\beta}{M(M-1)}\sum_{l\neq j} k(\tilde{z}_l, \tilde{z}_j) - \frac{2\beta}{M^2}\sum_{l,j} k(z_l, \tilde{z}_j)$
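A minimal sketch of the MMD penalty from Algorithm 1 with an RBF kernel (function names and the kernel bandwidth are our assumptions, not necessarily the choices used in the experiments):

```python
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2)), shape (M, M)."""
    sq_dists = torch.cdist(a, b, p=2) ** 2
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_penalty(z_prior: torch.Tensor, z_post: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """MMD-based penalty between prior samples z and encoded samples z~,
    following the regularizer in Algorithm 1."""
    m = z_prior.size(0)
    k_pp = rbf_kernel(z_prior, z_prior)
    k_qq = rbf_kernel(z_post, z_post)
    k_pq = rbf_kernel(z_prior, z_post)
    off_diag = 1.0 - torch.eye(m, device=z_prior.device)   # drop diagonal terms
    term_pp = (k_pp * off_diag).sum() / (m * (m - 1))
    term_qq = (k_qq * off_diag).sum() / (m * (m - 1))
    term_pq = k_pq.sum() / (m * m)
    return beta * (term_pp + term_qq - 2.0 * term_pq)
```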

Algorithm 2: WAE-GAN (deterministic)
Input: β, cost in the data space c
Initialize the parameters of the encoder Q, the decoder G and the latent discriminator D
while not converged:
    Sample $x_1, \dots, x_M \in X$
    Sample $z_1, \dots, z_M$ from the prior
    $\tilde{z}_i \leftarrow Q(Z|x_i) \quad \forall i \in \{1, \dots, M\}$
    Update the parameters of the discriminator by ascending:
    $\frac{\beta}{M}\sum_{i=1}^{M} \log D(z_i) + \log(1 - D(\tilde{z}_i))$
    Update the parameters of the encoder / decoder by descending:
    $\frac{1}{M}\sum_{i=1}^{M} c(x_i, G(\tilde{z}_i)) - \beta \log D(\tilde{z}_i)$

Note that both algorithms were originally formulated with a random encoder, i.e. Q(Z|x) is a distribution from which we sample. We forgo this, since we want to retain the ability to use arbitrary priors.

SAE The SAE was introduced in Section 2.5. Here, we discuss the Sinkhorn algorithm, which provides the cost we use in the latent space to 'match' the aggregated posterior and the prior. Consider the empirical distributions $\tilde{Q}_Z$ and $\tilde{P}_Z$. The optimal entropy-regularized transport plan for these distributions, with $\epsilon \ge 0$, is given by the matrix:

$$R^* := \arg\min_{R \in S_M} \frac{1}{M}\langle R, \tilde{C}\rangle_F - \epsilon \cdot H(R) \qquad (4.11)$$

where $\tilde{C}_{ij} = \tilde{c}(\tilde{z}_i, z_j)$ is the matrix associated with the cost $\tilde{c}$ in the latent space, R is a doubly stochastic matrix in $S_M = \{R \in \mathbb{R}^{M \times M}_{\ge 0} \mid R\mathbf{1} = \mathbf{1},\ R^T\mathbf{1} = \mathbf{1}\}$ (i.e. each row and column sums to one), $\langle\cdot,\cdot\rangle_F$ is the Frobenius inner product, and $H(R) = -\sum_{i,j=1}^{M} R_{i,j}\log R_{i,j}$ is the entropy of R. This algorithm runs in nearly $O(M^2)$ time (Altschuler et al., 2017). For better differentiability, we deviate from (2.27) and use the unbiased sharp Sinkhorn loss (Luise et al., 2018; Genevay et al., 2017) by dropping the entropy terms (only) in the evaluations:

$$S_{\tilde{c},\epsilon} := \frac{1}{M}\langle R^*, \tilde{C}\rangle_F - \frac{1}{2M}\left(\langle R^*_{\hat{Q}_Z}, \tilde{C}_{\hat{Q}_Z}\rangle_F + \langle R^*_{\hat{P}_Z}, \tilde{C}_{\hat{P}_Z}\rangle_F\right) \qquad (4.12)$$

where $R^*_{\hat{Q}_Z}, \tilde{C}_{\hat{Q}_Z}$ are computed on $(\hat{Q}_Z, \hat{Q}_Z)$ and $R^*_{\hat{P}_Z}, \tilde{C}_{\hat{P}_Z}$ on $(\hat{P}_Z, \hat{P}_Z)$ as inputs to Algorithm 3. Algorithm 3 is a fixed-point algorithm which computes the cost using an inner loop that is unrolled during backpropagation.


Algorithm 3: Sinkhorn
Input: $\{z_1, \dots, z_M\} \sim \tilde{P}_Z$, $\{\tilde{z}_1, \dots, \tilde{z}_M\} \sim \tilde{Q}_Z$, $\epsilon$, L, cost matrix $\tilde{C} \in \mathbb{R}^{M \times M}_{\ge 0}$
$K = \exp(-\tilde{C}/\epsilon)$
Initialize: $u \leftarrow \mathbf{1}$
repeat until convergence, but at most L times:
    $v \leftarrow \mathbf{1}/(K^T u)$
    $u \leftarrow \mathbf{1}/(K v)$
$R^* \leftarrow \mathrm{Diag}(u)\, K\, \mathrm{Diag}(v)$
Result: $R^*$
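A sketch of Algorithm 3 in PyTorch follows (a plain implementation of the iterations above; in practice a log-domain variant is often preferred for numerical stability, and the helper names are ours):

```python
import torch

def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, max_iter: int = 100) -> torch.Tensor:
    """Entropy-regularized transport plan R* for an (M x M) cost matrix,
    following the fixed-point iterations of Algorithm 3."""
    m = cost.size(0)
    kernel = torch.exp(-cost / eps)                  # K = exp(-C / eps)
    u = torch.ones(m, device=cost.device)
    for _ in range(max_iter):                        # unrolled during backpropagation
        v = 1.0 / (kernel.t() @ u)
        u = 1.0 / (kernel @ v)
    return torch.diag(u) @ kernel @ torch.diag(v)    # R* = Diag(u) K Diag(v)

# Example usage: Sinkhorn cost between prior and encoded samples
z_prior, z_post = torch.randn(64, 3), torch.randn(64, 3)
cost = torch.cdist(z_prior, z_post, p=2) ** 2
plan = sinkhorn_plan(cost)
sinkhorn_cost = (plan * cost).sum() / z_prior.size(0)   # (1/M) <R*, C>_F
```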

Sinkhorn Cost for Factorial Priors Since we use factorial priors (4.10), there are multiple ways to compute the cost on the latent space. Consider a factorial prior $P_Z^{\mathrm{fac}}$ with K factors; the latent cost can be computed in the following ways.

Euclidean distance on samples The simplest method is to compute the (squared) Euclidean distance (4.3) on samples from $\tilde{P}_Z$ and $\tilde{Q}_Z$, without considering the ground costs of the individual factors.

Computing a single cost matrix This method involves first computing a cost matrix per factor, by considering only the subspace associated with the given prior and the corresponding ground cost. These cost matrices are then reduced to a single cost matrix using an aggregation operation, and then used in Algorithm 3. The algorithm itself is outlined in Algorithm 4. We consider the following aggregation operations:

• max: $\tilde{C}_{i,j} = \max(\tilde{C}^1_{i,j}, \dots, \tilde{C}^K_{i,j})$

• sum: $\tilde{C}_{i,j} = \sum_{k=1}^{K} \tilde{C}^k_{i,j}$

• abs-sum: $\tilde{C}_{i,j} = \sum_{k=1}^{K} |\tilde{C}^k_{i,j}|$

• l2: $\tilde{C}_{i,j} = \sqrt{\sum_{k=1}^{K} (\tilde{C}^k_{i,j})^2}$

For instance, if we have $P_Z^{\mathrm{fac}} = U^1 \times S^1$, the process to compute the matrix is as follows:

1. Compute $\tilde{C}^1_{i,j} = \tilde{c}_1(z^1_i, \tilde{z}^1_j)$, which results in the matrix $\tilde{C}^1$, computed on samples from the first prior $U^1$ using the corresponding ground cost $\tilde{c}_1$, where $z^1_i$ is the subspace of $P_Z^{\mathrm{fac}}$ associated with the $U^1$ (and $\tilde{z}^1_j$ is the corresponding subspace of $Q_Z$).

2. Compute $\tilde{C}^2_{i,j} = \tilde{c}_2(z^2_i, \tilde{z}^2_j)$, which results in the matrix $\tilde{C}^2$, computed on samples from the second prior $S^1$ using the corresponding ground cost $\tilde{c}_2$ (e.g. arc length), where $z^2_i$ is the subspace of $P_Z^{\mathrm{fac}}$ associated with the $S^1$ (and $\tilde{z}^2_j$ is the corresponding subspace of $Q_Z$).

3. $\forall i, j: \tilde{C}_{i,j} \leftarrow \mathrm{agg}(\tilde{C}^1_{i,j}, \tilde{C}^2_{i,j})$. The resulting matrix $\tilde{C}$ is used in Algorithm 3.
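A sketch of this per-factor aggregation (PyTorch; the slicing convention and function names are ours, and the example data is synthetic):

```python
import torch
import torch.nn.functional as F

def aggregated_cost_matrix(z_prior, z_post, factor_slices, factor_costs, agg="sum"):
    """Build one (M x M) cost matrix from per-factor cost matrices.
    `factor_slices` give each factor's latent dimensions, `factor_costs`
    the corresponding pairwise ground-cost functions."""
    mats = [cost(z_prior[:, sl], z_post[:, sl])        # one (M, M) matrix per factor
            for sl, cost in zip(factor_slices, factor_costs)]
    stacked = torch.stack(mats, dim=0)                 # shape (K, M, M)
    if agg == "max":
        return stacked.max(dim=0).values
    if agg == "sum":
        return stacked.sum(dim=0)
    if agg == "abs-sum":
        return stacked.abs().sum(dim=0)
    if agg == "l2":
        return (stacked ** 2).sum(dim=0).sqrt()
    raise ValueError(f"unknown aggregation: {agg}")

# Example for U^1 x S^1: dim 0 is U^1, dims 1-2 are S^1
u = 2.0 * torch.rand(64, 1) - 1.0
s = F.normalize(torch.randn(64, 2), dim=-1)
z_p = torch.cat([u, s], dim=-1)
z_q = z_p + 0.01 * torch.randn_like(z_p)               # stand-in for encoded samples
z_q[:, 1:] = F.normalize(z_q[:, 1:], dim=-1)           # keep the S^1 part on the circle
sq_euclid = lambda a, b: torch.cdist(a, b, p=2) ** 2
arc_len = lambda a, b: torch.acos((a @ b.t()).clamp(-1.0, 1.0))
C = aggregated_cost_matrix(z_p, z_q, [slice(0, 1), slice(1, 3)],
                           [sq_euclid, arc_len], agg="l2")
```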

Component-Wise Costs This method involves first computing the latent costs on the factors separately and then combining them into a single cost, as outlined in Algorithm 5. Similar to the previous aggregation methods, we experiment with multiple aggregation functions, i.e. max, l2 and sum.
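A hedged sketch of this variant, reusing the hypothetical `sinkhorn_plan` helper from above: a Sinkhorn cost is computed per factor on its own subspace and ground cost, and the scalar costs are then aggregated (this is our reading of the component-wise scheme; Algorithm 5 itself is not reproduced here):

```python
import torch

def componentwise_sinkhorn_cost(z_prior, z_post, factor_slices, factor_costs,
                                eps=0.1, agg="sum"):
    """Per-factor Sinkhorn costs, aggregated into a single scalar latent cost."""
    costs = []
    for sl, ground_cost in zip(factor_slices, factor_costs):
        cost_matrix = ground_cost(z_prior[:, sl], z_post[:, sl])   # (M, M)
        plan = sinkhorn_plan(cost_matrix, eps=eps)                  # from the sketch above
        costs.append((plan * cost_matrix).sum() / cost_matrix.size(0))
    costs = torch.stack(costs)                                      # shape (K,)
    if agg == "max":
        return costs.max()
    if agg == "l2":
        return costs.norm(p=2)
    return costs.sum()                                              # default: sum
```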
