
Master Thesis

The Shape of a Black Box:

A Closer Look at Structured Latent Spaces

by

Tim R. Davidson

11432721

August 2019

36 ECTS, September 2018 - August 2019
Supervisor: Jakub M. Tomczak, PhD
Assessors: Prof. Efstratios Gavves, Prof. Max Welling


They say it takes a village to raise a child; similarly, it takes a university to educate a student. During my time at the University of Amsterdam I had the great fortune to meet, collaborate with, and learn from some of the brightest minds I have ever encountered: ranging from classmates to PhD students, and from postdocs to professors, not a day went by without picking up something to stimulate new ideas or gently disregard old ones.

First I’d like to thank my various collaborators of the last two years: Luca Falorsi, Nicola de Cao, Jakub Tomczak, Pim de Haan, Taco Cohen, Patrick Forré, Thomas Kipf, and Maurice Weiler. I am proud of the research we did together, and thankful for the opportunity to share our work with the Machine Learning community around the world.

While there were numerous fellow students that contributed to making my time at the UvA extraordinary, some left a truly lasting impression. Luca Falorsi and Nicola De Cao, thank you for the special research collaboration we shared: you impressed me daily with your clever insights, humble attitude, and top-notch engineering. In fact, this thesis would most definitely not have shaped up the way it did without your ongoing support. Jonas Köhler and Pim de Haan, truly two of the brightest and most talented folks one can hope to call friends. Both excel in a way that few can follow, push you to be better, and challenge you to keep up.

Then there are my thesis advisors, Jakub Tomczak and Stratis Gavves, who stuck with me across borders, multiple topic changes, and a variety of other ups and downs. Thank you both for your continued support and structured feedback.

The last year was marked by many early mornings and long nights. It was a period that was not easy for me, or for the person closest to me. Thank you, my incredible girlfriend, for sticking with me nonetheless, for cheering me up when I was down, and for pushing me forward when I felt stuck.

Finally I would like to thank my parents. Looking back, I often have the impression that my life is a story of inches; a series of seemingly small pushes at the right time that lead to outsized outcomes, drastically changing the title of the next chapter. Without you the chapter ’Graduate School, Artificial Intelligence’ would likely never have been started - thank you for pushing me in the right direction.

Abstract

The central foundation underlying many advances in machine learning is that of representation: in what form should objects and concepts be represented to best capture their relative relationships? In the case of a narrow comparison, for example between simple objects of the same ‘class’, one can often attempt to analytically identify defining axes of comparison. This approach however becomes increasingly less feasible as the complexity of objects and concepts grows, necessitating more powerful tools to recognize obfuscated relations. An ideal model would therefore learn to automatically find a suitable representation when human intuition falls short.

We examine this problem through the framework of the Variational Auto-Encoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014), an influential unsupervised machine learning model that successfully combines the ideas of variational inference (VI) (Hinton and Van Camp, 1993; Jordan et al., 1999) and auto-encoding (AE) (Rumelhart et al., 1985; Ballard, 1987) to learn a generative model of observed data points. Building on the manifold hypothesis, a VAE generally strives to learn the probability distribution that best describes the generative process concentrated in the vicinity of some low-dimensional manifold. Since its inception, various extensions and critiques have been proposed, predominantly focused on the case of ‘flat’, Euclidean space.

Instead, our recent work explores the plethora of opportunities presented by bending our horizon to utilize curved space, and the challenges that come with it. We discuss the conflicts that can arise by restricting a VAE to flat space, most notably through the default Gaussian distribution, and show how a hyperspherical perspective can alleviate some of these (Davidson et al., 2018). A general theory is presented that extends the class of available spaces with non-trivial topologies to all Lie groups (Falorsi et al., 2019), carefully analyzing the specific instantiation of SO(3) and the possible homeomorphism considerations that arise (Falorsi et al., 2018). Finally, various novel extensions to the S-VAE (Davidson et al., 2018) are proposed, designed to counter its known limitations in higher dimensions.

Contents

Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 What Is A Representation?
1.2 The Abstract, The Real, and The Bridge
1.3 Thesis Hypothesis
1.4 Thesis Structure
2 VAE: Breaking Down the Parts
2.1 Building A Variational Auto-Encoder
2.1.1 Auto-Encoders
2.1.2 Variational Inference
2.1.3 Variational Auto-Encoders
2.2 Loss objective
2.2.1 Critiques on KL divergence
2.2.2 Disentanglement and β-VAE
2.3 Estimating Gradients
2.3.1 Score Function
2.3.2 Reparameterization Trick and Extensions in Flat Space
2.3.3 Reparameterizing Distributions on Lie Groups
3 Manifold Considerations
3.1 Manifold Matching
3.2 Issues with The Gaussian Distribution
3.3 Changing the Variational Family Q
3.3.1 Example 1: Von Mises-Fisher Distribution
3.3.2 Example 2: Group of Oriented 3D Rotations SO(3)
3.3.3 Additional Recent Examples Of Non-Trivial Manifold Reparameterizations
4 Removing the S-Bottleneck
4.1 Background: Learning on the Hypersphere
4.1.1 Understanding the Hyperspherical Surface
4.1.2 The Von Mises-Fisher Distribution
4.2 Models Extending the S^1-VAE
4.2.1 Global Radius Transformation
4.2.2 Stochastic Onion: Point-Wise Radius Scaling
4.2.3 Matryoshka: Dimensionality Decomposition
5 Experiments
5.1 Experimental Setup
5.2 Global Radius Scaling to Improve Numerical Stability
5.3 Gaussian Concentration
5.4 Stochastic Onion
5.4.1 Synthetic Onion Data
5.4.2 Radial Disentanglement
5.4.3 Link Prediction on Graphs
5.5 Matryoshka
5.5.1 Concentration: Fix Degrees of Freedom
5.5.2 Concentration: Fixed Ambient Space
6 Conclusion and Outlook
6.1 Conclusions
6.1.1 Contributions
6.1.2 Future work
Appendices
A Preliminaries
A.1 Kullback-Leibler (KL) Decomposition
A.2 Concrete / Gumbel-Softmax
A.3 Product Distributions
A.4 Mixed-Pair Entropies
A.5 Full Covariance VAE: Cholesky Decomposition and Reparameterization
B Bessel Fraction Approximations
B.2 (Oh et al., 2019a)
C Radius as Global Parameter
D Radius as Point-Wise Parameter
E Matryoshka: Dimensionality Decomposition
F Experiments: Supplementary Figures
F.1 Stochastic Onion
F.2 Matryoshka
G Experiments: Supplementary Tables
G.1 Radius Scaling
G.2 Stochastic Onion
G.3 Matryoshka
Bibliography

List of Figures

2.1.1 The two main frameworks combined to create the Variational Auto-Encoder: the Auto-Encoder and Variational Inference. In (a) x and x̂ represent the original data input and reconstruction respectively, and c the compressed or encoded representation; (b) represents the variational inference optimization objective, where the large circle Q indicates the variational family search space, and q* the closest member to the ‘true’ distribution p.
2.3.1 Illustration of the reparameterization trick on a Lie group v. the classic reparameterization trick.
2.3.2 Illustration of a LI-Flow.
2.3.3 Samples of the Variational Inference model and Markov Chain Monte Carlo of the VI experiments. Outputs are shifted in the z-dimension for clarity.
3.0.1 Visualization of a circle lying in curved space. Notice how properties like the circumference C change based on the manifold structure. (Reproduced from (Carroll and Ostlie, 2017))
3.2.1 Graphical illustration of the ‘Soap Bubble’ effect of the Gaussian distribution in high dimensions. Plotted is the probability density with respect to radius r for various values of the dimensionality D. (Reproduced from (Bishop, 2006))
3.3.1 Plots of the original latent space (a) and learned latent space representations in different settings, where β is a re-scaling factor for weighting the KL divergence. (Best viewed in color)
3.3.2 Latent space visualization of the 10 MNIST digits in 2 dimensions of both N-VAE (left) and S-VAE (right). (Best viewed in color)
3.3.3 Three interpolations of two VAE models. Discontinuities in the reconstructions of the Normal model (top) are highlighted by a dashed line.
4.1.1 The log(x + 1) surface area of a hypersphere with varying radius.
4.1.2 Samples of von Mises-Fisher distributions with the same directional parameter µ, but different values for the concentration parameter κ.
4.2.1 Two stochastic radius hyperspherical latent spaces, with µ, κ representing the mean direction and concentration parameters of a vMF. On the left (a) a continuous radial distribution, potentially using the entire volume of a hypersphere; on the right (b) a discrete radial distribution partitioning the latent space in k distinctly separate rings.
4.2.2 Graphical models of a standard VAE, Joint-VAE, Cascade-VAE, and our Onion-VAE, where x, z, z′, d denote the data, continuous latent code, intermediate continuous latent code, and the discrete latent code respectively. Solid lines denote the generative process and the dashed lines denote the inference process. (Extended from Jeong and Song (2019))
4.2.3 Structural interpolation of S^9 ⊂ R^10, where each corner represents a separate dimension such that the dimensionality of the cross-product of (b), (c) can be smoothly embedded in R^10. Each separate shape is equipped with an independent concentration parameter κ.
4.2.4 Graphical model of Mtr-VAE, where x denotes the data and z_0, z_j, z_k the separate continuous latent variables. Solid lines denote the generative process and the dashed lines denote the inference process.
5.4.1 Plots of the original latent space (a) and learned latent space representations in different settings. (Best viewed in color)
5.4.2 Plots of the original latent space (a) and learned Onion-VAE latent space representations in different settings. The β index indicates the parameter not set to 0. (Best viewed in color)
5.4.3 Latent embedding of MNIST data using high β3 pressure.
5.4.4 Radial interpolation of samples of Onion-VAE models with high β3, generated by sampling points on the unit hypersphere and scaling them by each model's radii. From top to bottom we subsequently show samples of a model trained with 2, 5, and 10 radii.
5.4.5 Radial interpolation of samples of Onion-VAE models with pressure on β2,3,4, generated by sampling points on the unit hypersphere and scaling them by all model radii. From top to bottom we subsequently show samples of a model trained with 2, 5, and 10 radii.
5.4.6 Latent embedding of MNIST data using low β2,3,4 pressure.
5.4.7 Latent embedding of MNIST data using higher β2,3,4 pressure.
5.4.8 Evolution of q(r) during training for VGAE models with different numbers of radii.
5.5.1 Value of the von Mises-Fisher Kullback-Leibler divergence, varying the concentration parameter κ on the y-axis and the dimensionality m on the x-axis. (Best viewed in color)
F.1 Square decoded sheet of MNIST, ranging from the origin to the maximum model radius.
F.2 Square decoded sheet of Fashion MNIST, ranging from the origin to the maximum model radius.
F.3 Histogram of node connectivity for the Cora and Citeseer citation network graphs, where the maximum bucket is capped at 30 edges for visibility. In reality, the max connected node in Cora reaches 168 edges, and the max connected node in Citeseer 99.
F.4 Evolution of q(r) during training for a VGAE model with 20 shells. While it appears that the higher radii are less favored here, with some not receiving any nodes as the model converges, closer inspection did not show any of the nodes with higher connectivity favoring the higher radii.
F.5 Reconstructions and samples of S^5 degree interpolation of product-spaces. For each product-space, left are reconstructions and right samples.
F.6 Reconstructions and samples of S^9 degree interpolation of product-spaces. For each product-space, left are reconstructions and right samples.
F.7 S^1 interpolations of broken up S^9. On top an example of an ‘ignored’ sub-space, leading to little to no semantic change when decoded. Bottom a semantically meaningful sub-space that consistently changes the stroke thickness.

List of Tables

3.1 Summary of results (mean and standard-deviation over 10 runs) of unsupervised model on MNIST. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO. Best results are highlighted only if they passed a student t-test with p < 0.01.
5.1 Summary of negative log-likelihood results of Gaussian VAE on Static MNIST and Fashion MNIST, where σ, diag(σ_i), Σ_ij represent a scalar, diagonal, and full covariance matrix respectively.
5.2 Dataset statistics for citation network datasets.
5.3 Summary of results of unsupervised S-VAE model on Static MNIST, Fashion MNIST, Caltech, and Omniglot. LL represents negative log-likelihood, L|q| the ELBO, and m the latent space dimensionality. The (*) results for Fashion-MNIST are due to the model's inability to reach stable convergence for m = 40.
5.4 Summary of results (mean and standard-deviation over 3 runs) of unsupervised Mtr-VAE on Static MNIST with base space S^5. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO.
5.5 Overview of best results (mean over 3 runs) of various S^40 product-space degree dimensionality interpolations compared to the best single S^k-VAE (k ≤ 40), indicated (*). Here a indicates the ambient space dimensionality, κ the number of concentration parameters, i.e. breaks, and [S^k] the product-space composition.
5.6 Summary of interpolation results of Mtr-VAE on Static MNIST.
5.7 Summary of interpolation results of Mtr-VAE on Static MNIST.
5.8 Overview of best results (mean over 3 runs) of various S^40 product-space ambient dimensionality interpolations compared to the best single S^k-VAE (k ≤ 40), indicated (*).
1 Summary of results of unsupervised S^m(r)-VAE on Static MNIST. RE and KL correspond to the reconstruction and the KL part of the ELBO.
2 Summary of results of unsupervised S^m(r)-VAE on Fashion MNIST. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO. (*) For m = 40 results on this dataset proved too unstable.
3 Summary of results of unsupervised S^m(r)-VAE on Caltech. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO.
4 Summary of results of unsupervised S^m(r)-VAE on Omniglot. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO.
5 Summary of results (mean over 15 runs) of Onion VGAE on Cora.
6 Summary of results (mean over 15 runs) of Onion VGAE on Citeseer.
7 Summary of best results (mean over 10 runs) of Onion VGAE on Citeseer, using a normalized inner-product decoder.
8 Summary of best results (mean over 10 runs) of Onion VGAE on Cora, using a normalized inner-product decoder.
9 Summary of best results (mean over 10 runs) of Stochastic Onion VGAE on Citeseer, using an unnormalized inner-product decoder.
10 Summary of best results (mean over 10 runs) of Stochastic Onion VGAE on Cora, using an unnormalized inner-product decoder.
11 Summary of results of unsupervised Stochastic Onion model on Static MNIST. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO.
12 Summary of results of unsupervised Stochastic Onion model on Static MNIST. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO.
13 Summary of results of unsupervised model on MNIST. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO. Mean and standard deviation are reported over 3 runs.
14 Summary of results (mean over 3 runs) of S^40 degree interpolations for unsupervised Mtr-VAE on Static MNIST. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO.
15 Summary of results of S^40 degree interpolations for unsupervised Mtr-VAE on Caltech.
16 Summary of results of S^40 degree interpolations for unsupervised Mtr-VAE on Fashion MNIST. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO.
17 Summary of results of S^40 degree interpolations for unsupervised Mtr-VAE on Omniglot.
18 Summary of results of S^40 ambient interpolations for unsupervised Mtr-VAE on Static MNIST. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO.
19 Summary of results of S^40 ambient interpolations for unsupervised Mtr-VAE on Fashion MNIST.
20 Summary of results of S^40 ambient interpolations for unsupervised Mtr-VAE on Caltech. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO.
21 Summary of results of S^40 ambient interpolations for

1 Introduction

"[Our world] as we experience it, is an illusion, a collection of mere appearances like reflections in a mirror or shadows on a wall"

Plato

In chess, every player is assigned an ‘Elo’ rating, named after its inventor Arpad Elo1, to estimate the player's skill level. As ‘skill level’ is not directly measurable, an approximate representation based on a player's historical performance relative to other players is used instead. This is done through a latent variable assumption for a player's winning chances, whose value is constantly reassessed through realized wins, losses, and draws. While in the case of a narrow comparison, for example between simple objects of the same ‘class’, e.g. chess players, one can often attempt to analytically identify suitable defining axes of comparison, this approach becomes increasingly less feasible as the complexity of objects and concepts grows, necessitating more powerful tools to recognize obfuscated relations. An ideal model would therefore learn to automatically find a suitable representation when human intuition falls short. Defining the best approach to designing such descriptions, or representations, has long been one of the central questions in the core disciplines of Philosophy, Mathematics, and Physics. Therefore, in order to guide our study into creating automated representation learning models, we will begin with a short exploration of views from these respective fields.
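As a concrete aside on the Elo example (this sketch is not part of the thesis; it uses the standard expected-score and update formulas with an assumed K-factor of 32), a single revision of the latent skill estimate looks as follows:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Update player A's rating after one game against player B.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss. The expected
    score encodes the model's current belief about A's winning chances; the
    realized result nudges the latent skill estimate up or down.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    return rating_a + k * (score_a - expected_a)

# A 1500-rated player beating a 1700-rated player gains roughly 24 points.
print(elo_update(1500, 1700, 1.0))
```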

1.1 What Is A Representation?

Plato’s Allegory of the Cave, closely related to his Allegory of the Divided Line, is one of the first and most well-known works to explicitly discuss the essence of objects and their representation.

1 https://en.wikipedia.org/wiki/Elo_rating_system


In it, subjects are chained to a wall in a cave, restricted to observing only shadows projected on the wall in front of them, cast by objects being held in front of a fire behind them. As one of the subjects escapes and is led out of the cave into the sun, he comes to realize his flawed perception of reality. The story is used as an example to explain Plato’s theory of Forms, stating that the physical objects we observe around us are mere ‘shadows’, or ‘incomplete representations’, of the true Form underlying them. This essential problem of separating essence from particulars has many different interpretations with different consequences. For example, where Plato assumed Forms to exist independently from their representations in a separate dimension, Aristotle in Metaphysics instead argues that the existence of Forms depends on the particular instantiations that represent them. In the latter view, observed representations are directly tied to the essence of an object, and thus ascribe a more important role to empirical analysis. More generally, discussions of this type can be categorized under the denominator of Epistemology, the branch of philosophy concerned with the theory of Knowledge. As we are interested in investigating the meaning and means of acquiring ‘good’ representations, the large body of work surrounding Acquaintance theory, a subset of epistemology, appears especially relevant. Here a fundamental distinction is made between knowledge by acquaintance and knowledge by description, i.e. when building a view on the essence of a concept or object, what types of experiences can we hold to reveal true knowledge about the concept or object in question? The first is thought to somehow represent an indisputable acquisition of knowledge by direct interaction with the object, whereas the second requires an indirect description, e.g. the difference between seeing a color and having a color described to you by a third party2. How exactly to define the difference between the two, or whether there even is a difference, is the subject of heated debate.

1.2 The Abstract, The Real, and The Bridge

While Philosophy is an important starting point to shape our framework of thinking about representations, what rapidly becomes clear is that an honest study of object representations quickly dissolves into a delicate discussion on the nature of knowledge from which practical inference appears unattainable. Thus, to avoid an impasse and facilitate a fruitful exploration, we believe it necessary to pose the following assumptions3:

2We recommend (Russell, 1905) for a detailed discussion.

3 In fact, the stated assumptions appear closely linked to the ideas of Bundle Theory


1. There exist multiple objects, each with properties.
2. Every property can be observed.4
3. To observe means to be able to measure.5
4. An object is defined to be the combination of its observable properties.

We urge the reader to critically examine the stated assumptions, and recognize the ambiguous meaning of the terminology used, which in and of itself already heavily depends on a supposed internalized idea about the words ‘object’ and ‘property’.

Assumption (2) is especially fragile, as it purposely avoids the question of the identity of the observing agent. For example, while most human agents are able to quickly distinguish a red object from a blue one, for a blind agent this property is hidden, leading to the logical conclusion that the property ‘color’ does not exist. Yet, it is easy to imagine a world in which this property is observable, e.g. by the first group of agents, allowing for the possibility of an indirect observation for the blind agents, and thus maintaining a consistent world. However, following this line of argumentation, we must acknowledge the potential existence of numerous (infinitely many) ‘potentially’ observable properties that are as of yet unknown to us, leading to a dangerously slippery slope.

Similarly, assumption (3) and its footnote caveats directly imply a probabilistic interpretation of observation, i.e. our created definition of an object is based on the outcomes of a random process, only allowing for weak or strong inductive arguments. Combined with assumption (4), this leads to ever-changing weak or strong definitions of the same object.

However, having defined a world in which objects can be defined through their properties, we have a minimal basis to attempt to differentiate one object from another. This logically results in a comparison based on the similarity or difference between certain properties, based on the currently best available measurements. Now, taking this set of assumptions as given, there are various ways in which one can attempt to make sense of reality (The Real). If we hold true that objects are defined by observations of their properties, but that these observations indeed can be wrong or imprecise, it becomes useful to argue about the abstract nature and consequences of hypothetical properties instead (The Abstract).

4 We can become aware of them either (i) directly, e.g. measuring height with a tapeline, or (ii) indirectly, e.g. representing chess skill by the Elo score. If a property can neither be observed directly nor indirectly, we must treat it as not existing.

5 Yet there (i) can be multiple ways of measuring the same property, and (ii)

Furthermore, if we follow assumption (4), it would appear only natural to focus attention on the effort of relating different properties to each other, i.e. to create maps to reach a quantifiable notion of similarity.

Mathematics The approach of operating exclusively in the Abstract is most closely favored by Mathematics. Instead of concerning oneself with finding a representation, i.e. a combination of properties for a given object in the Real, one generally is more interested in analyzing the properties of a given representation in a vacuum. As formulated by Roger Penrose (Penrose and Mermin, 1990):

Mathematical truth is absolute, external and eternal, and not based on man-made criteria ... mathematical objects have a timeless existence of their own.

As such, we could picture the Abstract as a giant map, consisting of many disconnected islands on which ever more complex provably true theories are constructed starting from basic property notions. Simultaneously, a large amount of effort is conducted in connecting these islands, i.e. to try and make the map continuous by relating different properties to each other.

Physics While following a similar justification initially, i.e. accepting the idea of imperfect measurements of object properties, the field of Physics appears to take the exact opposite route; instead of remaining in the Abstract, physicists actively seek to build bridges to relate the observed properties in the physical Real, with islands of known concepts and consequences in the Abstract. They investigate imperfect observations of properties predominantly to try to predict and explain the physical world around us. Indeed, with this goal in mind the notion of temporarily holding imperfect, possibly completely wrong hypotheses relating different property maps in the Abstract to build a representation of an object in the Real is to be deemed acceptable. One actively searches for representations of objects, and following assumption (3) iteratively updates beliefs about the ‘best’ representation by combining new findings in the Abstract and new measurements of an object’s properties in the Real as they become available. The idea of a representation is therefore constantly in flux.

Machine Learning So how does Machine Learning fit into this quest of finding and analysing representations? Like Physics, Machine Learning research exists most naturally in the Real: based on observations, we attempt to relate, differentiate, and even learn how to create new instantiations of objects.


However, unlike Physics, we desire to learn these mappings in an automated fashion. That is, we would like to break down the learning strategies so successfully employed for thousands of years by human scientists, but remove the human-time component from the research loop. As discussed before, this can for a large part be naively described as relating observed property patterns in the Real to known absolute patterns in the Abstract. However, the high-dimensional space of possible non-linear relations between observed properties seems too vast to investigate exhaustively, even with the rapid increase of computing power allowing us to explore many more of the known unknowns. A crucial component in this endeavor is therefore that of inductive bias, i.e. in the absence of a ground truth, how should an agent generalize learned representation tactics to observations of a ‘new’ object? What are the assumptions of truth one can use to best confine the search space to posit a new hypothesis?

Here we find the balancing act of working on both the left and right side of the bridge: we must strive to inject our learning models with mathematical first principles, while also being comfortable with approximation and hypothesis building in the absence of a complete theory of everything. Especially when it comes to relating objects through observations of their properties, accepting as given that we might not fully understand or even be capable of observing all properties, we must design our learning models at all times to benefit from knowledge of the Abstract. This last part has been the key focus of my research journey to form a better understanding of what automating the creation of ‘good’ representations might look like: utilizing known maps from the Abstract to provide a structured exploration, so that we are as aware as we can be of the assumptions that form our inductive biases. While this interpretation of object representation is far from sufficient to construct a general theory of learning representations, it does appear to be necessary to push forward an ever clearer and more comprehensive automated understanding of reality.

1.3 Thesis Hypothesis

In this work we question one of the most widespread inductive biases in unsupervised machine learning: the use of Euclidean space and the Gaussian distribution in latent variable modeling. While often plausible and mathematically convenient, we posit that various data are more naturally described by distributions defined on spaces with non-trivial topologies.

Specifically, we explore the possibilities of using hyperspherical parameterizations and distributions explicitly defined on Lie groups as alternatives.


For each, we develop the necessary tools to utilize them in an unsupervised deep learning setting, and perform various experiments designed to examine their potential to replace the default Gaussian assumption in popular use-cases.

We will predominantly investigate this problem of learning structured object representations through the lens of the Variational Auto-Encoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014), a powerful unsupervised generative model that presents a principled framework to perform variational inference in an auto-encoding setting at scale. Carefully examining the various parts and assumptions that make up a VAE, we discuss its power and limitations in learning a generative model of observed data, and analyze the obstacles to using non-Gaussian distributions.

1.4 Thesis Structure

We start in Section 2 with a short description of the main developments that were combined to construct the VAE, and discuss some of the critiques targeting the original loss function, the concept of disentanglement, and the reparameterization trick.

In Section 3, we discuss at a high level the considerations of working with the mathematical construct of manifolds in deep generative modeling and the consequences of choosing a certain dimensionality or distribution, and describe two model extensions that explore these considerations (Davidson et al., 2018; Falorsi et al., 2018).

Finally, in Section 4 we explore various ways of improving the performance of hyperspherical-based VAE models by trying to move beyond the hyperspherical bottleneck. We empirically evaluate our proposed model extensions both qualitatively and quantitatively on various synthetic and real datasets in Section 5, and close in Section 6 with conclusions and an outlook on where we could go from here.

2 VAE: Breaking Down the Parts

"Truth is much too complicated to allow anything but

approximations"

John von Neumann

The Variational Auto-Encoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) is one of the most popular unsupervised machine learning models. It gained its popularity through its ability to efficiently learn a generative model p(x|z)p(z) of (high-dimensional) observed data X at scale, enabling sampling from p(x), quality (low-dimensional) representations Z ∼ q(z|x), as well as approximate likelihood evaluation. In this section we will first discuss two of the most important frameworks that led to the creation of the VAE: Auto-Encoders and Variational Inference. We will follow this with a careful discussion of the different parts that make up a VAE, and the various critiques and extensions proposed since its initial formulation.

2.1 Building A Variational Auto-Encoder

2.1.1 Auto-Encoders

One of the earliest ideas in modern unsupervised machine learning is that of Auto-Encoding (AE). Although the first usage example is not clearly attributed to a single author, some of the earliest mentions occur in (Rumelhart et al., 1985; Ballard, 1987; Hinton and Salakhutdinov, 2006). Based on the manifold hypothesis, which states that many high-dimensional observed data are concentrated near a (much) lower-dimensional manifold, the approach (illustrated in Fig. 2.1.1(a)) is fairly straightforward: starting from a high-dimensional observation, we aim to find the maximum compression, or encoding, of the data that can be reconstructed, or decoded, back into the original observation with minimal quality loss.


Figure 2.1.1: The two main frameworks combined to create the Variational Auto-Encoder: (a) the standard auto-encoding architecture and (b) the optimization objective of Variational Inference. In (a), x and x̂ represent the original data input and reconstruction respectively, and c the compressed or encoded representation; (b) represents the variational inference optimization objective, where the large circle Q indicates the variational family search space, and q* the closest member to the ‘true’ distribution p.

In order to scale this idea to larger datasets, an encoder network is used to encode the data, i.e. a single ‘function’ for all data points in a dataset. Similarly, a decoder network can be used to decode the encoded representations back into the original observations. The loss function of a basic auto-encoder therefore becomes

L(ψ, φ) = argmin_{ψ, φ} l(x, φ(ψ(x)))    (2.1)

where ψ, φ represent the encoding and decoding networks respectively, and l(·) some loss function, e.g. the Mean Squared Error (MSE). While this simple form of self-supervision presents a powerful way to compress data at scale, the basic method has two clear limitations: (1) it is not robust to slight variations of the observed data, as observations are evaluated on a case-by-case basis, e.g. rotations or color changes of the same object will likely be regarded as completely different, and (2) we do not learn a generative probabilistic model of our dataset. Attempts to improve on the first came from (Vincent et al., 2008; Bengio et al., 2013b), whose extensions introduce noise into the input data, better allowing for the notion that our measurements of observations can be imprecise.
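To make the setup of equation (2.1) concrete, below is a minimal sketch of a plain auto-encoder, assuming PyTorch, MSE as the loss l(·), and layer sizes chosen purely for illustration; it is not the architecture used later in this thesis.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Plain auto-encoder: encode x to a low-dimensional code c, decode to x_hat."""
    def __init__(self, data_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))   # the network psi
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, data_dim))   # the network phi

    def forward(self, x):
        c = self.encoder(x)         # compressed representation psi(x)
        return self.decoder(c)      # reconstruction phi(psi(x))

model = AutoEncoder()
loss_fn = nn.MSELoss()              # l(x, phi(psi(x))) in equation (2.1)
x = torch.rand(16, 784)             # a dummy batch of flattened images
loss = loss_fn(model(x), x)
loss.backward()
```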


2.1.2 Variational Inference

A very different approach to uncovering lower-dimensional representations comes in the form of Variational Inference (VI) (Hinton and Van Camp, 1993; Jordan et al., 1999)1. In VI we employ a latent variable model Z → X to explain our observed (high-dimensional) data. Here x is a vector of D observed variables, z ∈ R^M denotes the latent variables, and p_φ(x, z) is a parameterized model of the joint distribution. The objective is to optimize the log-likelihood of the data, log p_φ(x) = log ∫ p_φ(x, z) dz. We posit a variational family Q, and utilize expectation-maximization in an attempt to find the member q* ∈ Q closest to the ‘true’ distribution p, where closeness is most often measured through the Kullback-Leibler (KL) divergence (see Fig. 2.1.1(b)):

q* = argmin_{q ∈ Q} KL(q || p) = ∫ q(x) log (q(x) / p(x)) dx    (2.2)

In order to manage the complexity of this optimization problem, care must be taken in choosing Q flexible enough to closely approximate p, yet narrow enough not to blow up the search space. However, marginalizing over the latent variables in p_φ(x, z) is generally intractable. One way of solving this issue is to maximize the Evidence Lower Bound (ELBO)

log ∫ p_φ(x, z) dz = log ∫ (p_φ(x, z) / q(z)) q(z) dz    (2.3)
                   = E_{q(z)}[log p_φ(x, z)] + H(q(z)) + KL(q(z) || p(z|x))
                   ≥ E_{q(z)}[log p_φ(x, z)] + H(q(z))    (2.4)
                   = E_{q(z)}[log p_φ(x|z)] − KL(q(z) || p(z)),    (2.5)

where in (2.3), q(z) is the approximate posterior distribution belonging to the chosen family Q, and in (2.4) we note that KL ≥ 0. The bound is tight if q(z) = p(z|x), meaning q(z) is optimized to approximate the true posterior. While this approach offers a principled probabilistic interpretation of our data, it too has some important limitations: (1) we are limited in our approximation of p by how far away the closest member of Q is, such that we require an a-priori intuition about the form of p, and (2) classic VI learns a q* for each data point, making it hard to scale to larger datasets. While the latter limitation can be alleviated using amortization, akin to the encoder network outlined in the auto-encoding setup and explained in the next section, the former is highly problem-specific and in theory allows for arbitrarily bad approximations.


2.1.3 Variational Auto-Encoders

The main contribution of the Variational Auto-Encoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) is to combine the frameworks of AE and VI to create a scalable method with a principled probabilistic interpretation to learn data representations. As in VI, we assume a latent variable model of our data, with x, z, and p_φ(x, z) as before, where p_φ is parameterized by a neural network. We again aim to optimize the ELBO as defined in (2.5), but while following VI theory q(z) should be optimized for every data point x, to make inference more scalable to larger datasets the VAE setting introduces an amortized inference network q_ψ(z|x; θ), parameterized by a neural network, that outputs a probability distribution for each data point x. The final objective is therefore to maximize

L(φ, ψ) = E_{q_ψ(z|x;θ)}[log p_φ(x|z)] − KL(q_ψ(z|x; θ) || p(z)),    (2.6)

where q_ψ(z|x; θ) is the approximate posterior or encoder, p_φ(x|z) the generative model or decoder, p(z) the prior distribution, θ the variational parameters, and ψ, φ the parameters of the encoder and decoder networks respectively.

We can further efficiently approximate the ELBO by Monte Carlo estimates, using the reparameterization trick. This is done by expressing a sample z ∼ q_ψ(z|x; θ) as z = T(θ, ε, x), where T(·) is a reparameterization transformation and ε ∼ s(ε) is some noise random variable independent from θ. In subsection 2.3 we will discuss the reparameterization trick in more detail.
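To illustrate how equation (2.6) and the reparameterization trick combine in practice, the following is a minimal sketch of a single ELBO evaluation, assuming PyTorch, a standard Gaussian prior, a diagonal-Gaussian encoder, a Bernoulli decoder, and hypothetical layer sizes; it is a generic example rather than the exact models evaluated in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianVAE(nn.Module):
    """Amortized encoder q_psi(z|x) = N(mu, diag(sigma^2)), Bernoulli decoder p_phi(x|z)."""
    def __init__(self, data_dim=784, latent_dim=20, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, latent_dim)
        self.enc_logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, data_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)             # reparameterization trick
        logits = self.dec(z)
        rec = -F.binary_cross_entropy_with_logits(        # E_q[log p_phi(x|z)], 1 sample
            logits, x, reduction="none").sum(-1)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)  # KL(q || N(0, I))
        return (rec - kl).mean()

vae = GaussianVAE()
x = torch.rand(16, 784)                                   # dummy batch in [0, 1]
loss = -vae.elbo(x)                                       # maximize the ELBO
loss.backward()
```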

2.2 Loss objective

The loss objective is a crucial part of any model, as it largely determines and steers the way in which a model learns. As derived in Section 2.1.3, the VAE objective is a combination of a reconstruction loss and the KL divergence between the approximate posterior and the latent space prior, together representing a lower bound on the data log-likelihood (ELBO)

L(φ, ψ) = E_{q_ψ(z|x;θ)}[log p_φ(x|z)] − KL(q_ψ(z|x; θ) || p(z)),    (2.7)

where the KL term is the part that has drawn the most critique. Although the loss objective L presents a theoretically sound lower bound, much research has been conducted into possible modifications to improve this objective from an information-theoretical perspective, and to encourage a disentangled latent space. We will discuss some of the most important of these in turn below.


2.2.1 Critiques on KL divergence

If we ignore for a moment the KL divergence introduced by the VAE formulation, we recover the auto-encoding formulation of equation (2.1). The properties of encoding an observed data input X to a compressed representation Z, and subsequently decoding it to some target Y,

X →_ψ Z →_φ Y,    (2.8)

have been studied extensively through the information-theoretic framework of the Information Bottleneck (Tishby et al., 1999; Shamir et al., 2010; Tishby and Zaslavsky, 2015). It separates the learning objective into the mutual information between the observed data and the learned representation, I(X; Z), and that of the learned representation and the target, I(Z; Y). By minimizing the first and maximizing the latter, a model can control the minimality of the learned representation Z.

In the context of information theory, this framework was extended in (Hoffman and Johnson, 2016; Alemi et al., 2017, 2018) to explicitly discuss the extra bottleneck considerations in the variational setting, i.e. when we add back the KL term. In (Hoffman and Johnson, 2016) it is shown how the KL term explicitly penalizes the mutual information between the latent representation and the reconstruction target (see equation (2.10) in the next subsection). This same observation is discussed in (Alemi et al., 2017, 2018) in more detail, where the latter presents a broad theoretical framework to interpret learning latent representations in terms of the rate-distortion trade-off. The authors carefully measure the interpolation between the auto-encoding formulation of equation (2.1) and the VAE formulation of equation (2.7) in terms of rate-distortion by varying a β coefficient on the KL. They further show that reducing β to be < 1 can alleviate the known problem of powerful decoders ignoring the latent code (Bowman et al., 2016), and advocate including the KL divergence (or ‘Rate’) when reporting a model's best log-likelihood and ELBO, to give a better intuition about the trade-offs made by the learned representations.

2.2.2 Disentanglement and β-VAE

The notion of ‘disentanglement’, first popularized in (Bengio, 2013; Bengio et al., 2013a), can roughly be understood as the desire to separate, or disentangle, underlying factors of variation in the latent space Z corresponding to semantically meaningful features in the observed space X. A successful disentanglement under such a notion would leave all semantic features of an observed object unchanged given a change in a single dimension of Z, i.e. q(z) = ∏_i q_i(z_i), except the feature corresponding to that specific z-dimension. While conceptually it is not hard to see why such a disentangled latent space would be highly desirable, practically such an objective of latent inter-dimensional independence seems too simplistic, as captured succinctly by Cohen (2013):

We do not believe that independence is ultimately the right way to formalize disentangling. The reason is that many distinct factors of variation (the natural properties we use to describe objects), are in fact correlated. We want to separate factors that can in principle be varied independently, even if in a particular data set (the totality of experience of an intelligent agent, for instance) these factors are correlated and hence not independent.

Indeed, taking as example an image of a human face, it is hard to picture enlarging the mouth while fixing the cheeks, and vice versa. That is not to say there do not exist instances in which such a separation is conceptually possible. For example, the task of separating ‘style’ and ‘content’ (Tenenbaum and Freeman, 2000; Elgammal and Lee, 2004; Gatys et al., 2016) has proven very successful. Another fruitful line of research comes in separating ‘content’ from group actions performed on this content, such as ‘rotations’ (Cohen and Welling, 2014, 2016; Worrall et al., 2017). However, for our purposes, the above research directions do not fall into the exploration of learning a generative model of observed data in an unsupervised manner2. Yet, even for semantic separation tasks that are only approximately feasible, striving to inject meaningful structure into our latent space appears useful.

A recent work aimed at encouraging a disentangled latent space through a simple modification of the original VAE formulation comes from (Higgins et al., 2017; Burgess et al., 2018), which reformulates equation (2.7) as a constrained optimization problem, arriving at the following Lagrangian under the KKT conditions (Kuhn and Tucker, 1951; Karush, 1939)

L(φ, ψ, β) = E_{q_ψ(z|x;θ)}[log p_φ(x|z)] − β (KL(q_ψ(z|x; θ) || p(z)) − ε),    (2.9)

where ε is the strength of the constraint, and β the regularisation coefficient that constrains the capacity of the latent information channel Z. Further, as ε, β ≥ 0, ε can be dropped, leaving just the new hyperparameter β.

2A first step towards separating content from 3D rotation through ‘manifold-valued’


Higgins et al. (2017) and Burgess et al. (2018) claim that increasing the β coefficient, and thus constraining the capacity of the latent information channel while adding implicit independence pressure on the learned posterior, encourages disentangled representations.
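In terms of implementation, the objective of equation (2.9) only changes the weighting of the KL term. A minimal sketch, assuming per-sample rec and kl quantities computed as in the generic VAE sketch given earlier:

```python
def beta_vae_loss(rec, kl, beta=4.0):
    """Negative beta-VAE objective of equation (2.9).

    rec:  per-sample Monte Carlo estimate of E_q[log p(x|z)]
    kl:   per-sample KL(q(z|x) || p(z))
    beta: values > 1 constrain the latent information channel; beta = 1
          recovers the plain ELBO of equation (2.7).
    """
    return -(rec - beta * kl).mean()
```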

Various works have built upon the β-VAE formulation, trying to explain and/or improve its disentanglement performance. Two of the most notable works attempt this through the use of Total Correlation (TC) (Watanabe, 1960), a generalization of mutual information to multiple random variables. In (Kim and Mnih, 2018), the KL term of equation (2.7) is broken down following (Hoffman and Johnson, 2016; Makhzani and Frey, 2017)

E_{p(x)}[KL(q(z|x) || p(z))] = I(x; z) + KL(q(z) || p(z)),    (2.10)

where q(z) = E_{p_data(x)}[q(z|x)] is known as the ‘marginal posterior’ or ‘aggregated posterior’, which can be obtained by marginalizing over the empirical distribution p_data. They argue that while increasing the pressure on the second term on the right-hand side pushes q(z) towards p(z), encouraging independence in the dimensions of z given a factorial prior p(z), penalizing the mutual information between x and z leads to poor reconstructions. A similar argument is made by (Chen et al., 2018), who decompose the KL term of equation (2.7) even further as

E_{p(x)}[KL(q(z|x) || p(z))] = KL(q(z, x) || q(z)p(x)) [index-code MI] + KL(q(z) || ∏_j q(z_j)) [total correlation, TC] + Σ_j KL(q(z_j) || p(z_j)) [dimension-wise KL],    (2.11)

and also argue that penalizing the mutual information between the observed and latent code would be undesirable. Both Kim and Mnih (2018) and Chen et al. (2018) proceed to outline a training regime to explicitly penalize the TC term, while leaving the mutual information untouched.
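For completeness, a short derivation of the decomposition in (2.10), writing q(z, x) = q(z|x) p(x) for the joint over data and codes (a standard identity, included here as an aside rather than quoted from the referenced works):

```latex
\begin{aligned}
\mathbb{E}_{p(x)}\big[\mathrm{KL}(q(z|x)\,\|\,p(z))\big]
  &= \mathbb{E}_{q(z,x)}\big[\log q(z|x) - \log p(z)\big] \\
  &= \mathbb{E}_{q(z,x)}\big[\log q(z|x) - \log q(z)\big]
   + \mathbb{E}_{q(z)}\big[\log q(z) - \log p(z)\big] \\
  &= I(x; z) + \mathrm{KL}\big(q(z)\,\|\,p(z)\big).
\end{aligned}
```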

Due to the inherent absence of labels in an unsupervised learning setting, all of the β-VAE related approaches propose new metrics to evaluate the level of disentanglement on a synthetic dataset where the true factors of variation are known. Yet while scores on the proposed metrics become increasingly better, a recent large-scale empirical study by Locatello et al. (2018) on the plethora of disentanglement claims and metrics sketches a gloomy picture. Not only do they prove that unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the considered learning approaches and the data sets, they also find that

1. While all considered methods prove effective at ensuring that the individual dimensions of the aggregated posterior (which is sampled) are not correlated, we observe that the dimensions of the representation (which is taken to be the mean) are correlated.

2. We do not find any evidence that the considered models can be used to reliably learn disentangled representations in an unsupervised manner as random seeds and hyperparameters seem to matter more than the model choice. Furthermore, good trained models seemingly cannot be identified without access to ground-truth labels even if we are allowed to transfer good hyperparameter values across data sets.

3. For the considered models and data sets, we cannot validate the assumption that disentanglement is useful for downstream tasks, for example through a decreased sample complexity of learning.

Based on their exhaustive empirical findings, the authors urge future research to put more emphasis on reproducibility, and challenge the research community to rethink the general approach to disentanglement.

2.3 Estimating Gradients

In order to directly optimize the variational parameters of our chosen distributions in a deep learning setting, we need a way to backpropagate gradients of both the inference and generative networks present in the loss function of equation (2.7):

∂/∂(φ, ψ) L(φ, ψ) = ∂/∂(φ, ψ) E_{q_ψ(z|x;θ)}[log p_φ(x|z)] [problematic for inference] − ∂/∂(φ, ψ) KL(q_ψ(z|x; θ) || p(z)) [fine for inference and generative]    (2.12)

The second term poses no problem, neither for optimizing the generative parameters, as the expression is constant w.r.t. φ, nor for the inference parameters, as it is often analytically available w.r.t. ψ. The first term however is slightly more problematic. While the gradients for the generative network can be computed using Monte Carlo samples,

∂/∂φ E_{q_ψ(z|x;θ)}[log p_φ(x|z)] = E_{q_ψ(z|x;θ)}[∂/∂φ log p_φ(x|z)] ≈ (1/K) Σ_k ∂/∂φ log p_φ(x|z_k),    (2.13)


as q_ψ(z|x) does not depend on φ, this is not the case for the inference network:

∂/∂ψ E_{q_ψ(z|x;θ)}[log p_φ(x|z)] = ∫_z (∂/∂ψ q_ψ(z|x; θ)) log p_φ(x|z) dz,    (2.14)

where the problematic factor is ∂/∂ψ q_ψ(z|x; θ). We cannot approximate this using Monte Carlo samples, as the right-hand side of (2.14) is no longer an expectation with respect to q_ψ. There are two main approaches to deal with this problem: the score function and the reparameterization trick, which we will discuss in the following two sections.

2.3.1 Score Function

The score function (Glynn, 1990; Paisley et al., 2012; Mnih and Gregor, 2014), also known as REINFORCE (Williams, 1992), is based on the log-derivative trick, which exploits the properties of the derivative of a logarithm:

∂/∂ψ log q_ψ(x) = (∂/∂ψ q_ψ(x)) / q_ψ(x)   →   q_ψ(x) ∂/∂ψ log q_ψ(x) = ∂/∂ψ q_ψ(x)    (2.15)

Applying this transformation to our expression in (2.14), we are again able to build an MC sampler:

∂/∂ψ E_{q_ψ(z|x;θ)}[log p_φ(x|z)] = ∫_z q_ψ(z|x; θ) (∂/∂ψ log q_ψ(z|x; θ)) log p_φ(x|z) dz
                                  = E_{q_ψ(z|x;θ)}[(∂/∂ψ log q_ψ(z|x; θ)) log p_φ(x|z)]
                                  ≈ (1/K) Σ_k (∂/∂ψ log q_ψ(z_k|x; θ)) log p_φ(x|z_k)    (2.16)

The great benefit of the score function is that it can be applied to any distribution in flat space. Unfortunately, the variance of the estimated gradients using expression (2.16) is often too large to be practically useful. Various variance reduction methods have therefore been proposed, the most effective of which exploit the fact that the expected value of the score function itself is zero to create control variates (Greensmith et al., 2004; Paisley et al., 2012; Ranganath et al., 2014). While in theory the use of such reduction techniques reduces the variance, in practice it is often quite burdensome and difficult to find the right control variate.
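As an aside, the following toy sketch (assuming NumPy, a unit-variance Gaussian q_θ = N(θ, 1) and f(z) = z², none of which appear in the thesis) shows the estimator of (2.16) in action; even on this one-dimensional problem a large number of samples is needed to tame its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_function_grad(theta, f, n_samples=100_000):
    """Score-function (REINFORCE) estimate of d/dtheta E_{z ~ N(theta, 1)}[f(z)].

    Uses f(z) * d/dtheta log q(z; theta); for a unit-variance Gaussian the
    score d/dtheta log q(z; theta) is simply (z - theta).
    """
    z = rng.normal(loc=theta, scale=1.0, size=n_samples)
    return np.mean(f(z) * (z - theta))

theta = 1.5
print(score_function_grad(theta, lambda z: z ** 2))  # noisy estimate of 2 * theta = 3.0
```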


2.3.2 Reparameterization Trick and Extensions in Flat Space

The reparameterization trick (Price, 1958; Bonnet, 1964; Salimans et al., 2013; Kingma and Welling, 2013; Rezende et al., 2014) is a technique to simulate samples z ∼ q_ψ(z; θ) as

z = T(ε; θ), where ε ∼ s(ε),    (2.17)

such that ε is independent from θ3, and the transformation T(ε; θ) is differentiable w.r.t. θ4. It has been shown that this generally results in lower-variance estimates than score function variants, thus leading to more efficient and better convergence performance (Titsias and Lázaro-Gredilla, 2014; Fan et al., 2015; Gal, 2016). The reparameterization of samples z allows expectations w.r.t. q_ψ(z; θ) to be rewritten as

E_{q_ψ(z|x;θ)}[log p_φ(x|z)] = E_{s(ε)}[log p_φ(x | T(ε; θ))],    (2.18)

thus making it possible again to directly optimize the parameters of a probability distribution through backpropagation. The canonical example is that of the univariate Gaussian distribution: if we assume z ∼ q_ψ(z; θ) = N(µ, σ), with auxiliary noise variable ε ∼ N(0, 1), then z = µ + σε is a valid reparameterization.
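Continuing the toy problem from the previous section, a pathwise estimate of the identical gradient, using z = µ + σε with σ fixed to 1, illustrates why reparameterized gradients typically behave better (again an illustrative sketch, assuming PyTorch):

```python
import torch

def reparam_grad(theta_value, f, n_samples=100_000):
    """Pathwise (reparameterization) estimate of d/dtheta E_{z ~ N(theta, 1)}[f(z)],
    obtained by writing z = theta + eps with eps ~ N(0, 1)."""
    theta = torch.tensor(theta_value, requires_grad=True)
    eps = torch.randn(n_samples)
    z = theta + eps                  # z = T(eps; theta), differentiable in theta
    f(z).mean().backward()           # gradient flows through the deterministic map
    return theta.grad.item()

print(reparam_grad(1.5, lambda z: z ** 2))  # about 3.0, with far lower variance than (2.16)
```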

Extensions Unfortunately, there exists no general approach to defining a reparameterization scheme for arbitrary distributions. However, various work has been done in extending the reparameterization trick to an ever-growing number of variational families. Figurnov et al. (2018) provide a detailed overview, classifying existing approaches into (1) finding surrogate distributions, which, in the absence of a reparameterization trick for the desired distribution, attempts to use an acceptable alternative distribution that can be reparameterized instead (Nalisnick and Smyth, 2017); (2) implicit reparameterization gradients, or pathwise gradients, introduced in machine learning by Salimans et al. (2013), extended by Graves (2016), and later generalized by Figurnov et al. (2018) and the related work of (Jankowiak and Obermeyer, 2018) using implicit differentiation; and (3) generalized reparameterizations, which try to generalize the standard approach as described in the preliminaries section.

3 At most weakly dependent (Ruiz et al., 2016).

4 We highlight the dependency on θ, the variational parameters, as opposed to ψ, the inference network parameters, to follow tradition in the VAE literature. Note however that θ is produced by the inference network, and hence for backpropagation we care about ∂/∂ψ.


Figure 2.3.1: Illustration of reparameterization trick of a Lie group v. the classic reparameterization trick.

Notable are (Ruiz et al., 2016), which relies on defining a suitable invertible standardization function to allow a weak dependence between the noise distribution and the parameters, and the closely related (Naesseth et al., 2017), which focuses on rejection sampling. Finally, to work with the discrete categorical distribution, concurrent research by Maddison et al. (2017) and Jang et al. (2017) was published.
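As one concrete illustration of implicit reparameterization gradients in a modern library (an aside, not an implementation from the thesis): PyTorch's torch.distributions exposes pathwise gradients through rsample for several non-Gaussian families, e.g. the Gamma distribution.

```python
import torch
from torch.distributions import Gamma

concentration = torch.tensor(2.0, requires_grad=True)
rate = torch.tensor(1.0, requires_grad=True)

# rsample draws reparameterized samples; gradients with respect to the
# distribution parameters are obtained via implicit differentiation.
z = Gamma(concentration, rate).rsample(sample_shape=(10_000,))
z.mean().backward()
print(concentration.grad, rate.grad)   # pathwise gradient estimates of dE[z]/dparam
```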

2.3.3 Reparameterizing Distributions on Lie Groups

As can be seen in the previous section, a large body of research already exists to extend the reparameterization trick. However, at the time of writing, to the best of our knowledge no such trick is defined for distributions on spaces with a non-trivial topology, such as Lie groups. In (Falorsi et al., 2019), we introduce a general framework to reparameterize distributions on arbitrary Lie groups. Additionally, we demonstrate how to create complex and multimodal reparameterizable densities on the Lie group of 3D oriented rotations, SO(3), using a novel Locally-Invertible normalizing flow (LI-Flow)5.

We demonstrate possible applications of our work in both a supervised and an unsupervised setting. Below we provide excerpts that outline the main reparameterization steps of our framework, as well as highlight one of the prototypical experiments using Variational Inference.


Figure 2.3.2: Illustration of a LI-Flow

For a background on Lie groups, our main theorem, and a careful discussion of our framework utilizing results from differential geometry and measure theory, we refer the interested reader to (Falorsi et al., 2019).

Reparameterization Steps We will explain our reparameterization trick for distributions on Lie groups (ReLie) by analogy to the classic Gaussian example described in (Kingma and Welling, 2013), as we can consider R^N under addition as a Lie group with Lie algebra R^N itself. The following reparameterization steps (a), (b), (c) are illustrated in Figure 2.3.1.

(a) We first sample from a reparameterizable distribution r(v|σ) on the Lie algebra g. Since the Lie algebra is a real vector space, if we fix a basis this is equivalent to sampling from a reparameterizable distribution on R^N. In fact, the basis induces an isomorphism between the Lie algebra and R^N.6

(b) Next we apply the exponential map to v to obtain an element g ∼ q̂(g|σ) of the group. If the distribution r(v|σ) is concentrated around the origin, then the distribution q̂(g|σ) will be concentrated around the group identity. In the Gaussian example on R^N this step corresponds to the identity operation, and r = q̂. As this transformation is in general not the identity operation, we have to account for a possible change in volume using the change of variables formula7. Additionally, the exponential map is not necessarily injective, such that multiple points in the algebra can map to the same element in the group.

(c) Finally, to change the location of the distribution q̂, we left-multiply g by another group element g_µ, applying the group-specific operation. In the classic case this corresponds to a translation by µ. If the exponential map is surjective (as in all compact and connected Lie groups), then g_µ can also be parameterized by the exponential map8.

6 See Appendix G of (Falorsi et al., 2019).

7 In a sense, this is similar to the idea underlying normalizing flows (Rezende and Mohamed, 2015).

8 Care must be taken however when gµ is predicted by a neural network, to avoid discontinuities.
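As a minimal sketch of the sampling path described in steps (a)-(c), restricted to SO(3) and assuming PyTorch: the helper names (hat, relie_sample_so3) are illustrative rather than taken from the ReLie code base, and the change-of-volume correction required to evaluate the resulting density is omitted, so the snippet only covers sampling.

```python
import math
import torch

def hat(v):
    """Map vectors v in R^3 to the corresponding skew-symmetric matrices in so(3)."""
    zero = torch.zeros_like(v[..., 0])
    return torch.stack([
        torch.stack([zero, -v[..., 2], v[..., 1]], dim=-1),
        torch.stack([v[..., 2], zero, -v[..., 0]], dim=-1),
        torch.stack([-v[..., 1], v[..., 0], zero], dim=-1),
    ], dim=-2)

def relie_sample_so3(mu_alg, sigma, n_samples=1):
    """Sample rotation matrices following steps (a)-(c) for the group SO(3).

    mu_alg: (3,) algebra coordinates of the location element g_mu = exp(hat(mu_alg));
    sigma:  (3,) scale of the reparameterizable base distribution on the algebra.
    """
    # (a) reparameterized sample from a base distribution r(v | sigma) on the algebra;
    #     a zero-mean Gaussian here, but any reparameterizable choice (e.g. a flow) works.
    v = torch.randn(n_samples, 3) * sigma
    # (b) push the algebra samples onto the group with the exponential map,
    #     producing rotations concentrated around the identity.
    g = torch.matrix_exp(hat(v))
    # (c) change the location of the distribution by left multiplication with g_mu.
    g_mu = torch.matrix_exp(hat(mu_alg))
    return g_mu @ g  # (n_samples, 3, 3)

# Example: rotations concentrated around a quarter turn about the z-axis.
R = relie_sample_so3(torch.tensor([0.0, 0.0, math.pi / 2]),
                     torch.tensor([0.05, 0.05, 0.05]), n_samples=8)
```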

Figure 2.3.3: Samples of the Variational Inference model and of Markov Chain Monte Carlo for the VI experiments; (a) line symmetry, (b) triangular symmetry. Outputs are shifted in the z-dimension for clarity.

It is important to note that the only restriction on the base distribution r(v|σ) used in step (a) is that it is reparameterizable. Hence, in order to create complex, multimodal distributions on a target Lie group, this opens up the possibility of using Normalizing Flows. An illustration of an LI-Flow is shown in Fig. 2.3.2, and the details of an implementation on SO(3) can be found in (Falorsi et al., 2019).

Experimental Results In a prototypical Variational Inference experiment we provide an intuitive example of the need for complex distributions in the difficult task of estimating which group actions of SO(3) leave a symmetrical object invariant. For didactic purposes we take two ordered points, x1, x2 ∈ R^3, and perform LI-Flow VI to learn the approximate posterior over rotations. The first set of points is chosen to have a rotational symmetry around one axis, and the second set to contain a threefold discrete rotational symmetry along one axis. We evaluate the learned distribution by comparing its samples to those of the true posterior obtained using the Metropolis-Hastings algorithm.

Results are shown in Fig. 2.3.3. As expected, the discovered distribution over SO(3) group actions is a rotational subgroup, S^1. Clearly, the learned approximate posterior almost perfectly matches the true posterior in both cases. Using a simple centered distribution such as the pushforward of a Gaussian as the variational family instead would make learning the observed topology problematic, as all probability mass would focus around a single rotation.


Latent Manifolds

"Do not try and bend the spoon, that’s impossible. Instead, only try to realize the truth...there is no spoon. Then you’ll see that it is not the spoon that bends, it is only yourself"

Spoon-Boy to Neo

Following the manifold hypothesis, we assume that our observed, high dimensional data is generated by a distribution close to a (much) lower dimensional (latent) manifold. If our goal is to learn such a latent distribution, there are a number of considerations we need to take into account when designing our model. Three of the most important ones can be roughly summarized as (1) latent space dimensionality, (2) assumed latent manifold structure, and lastly (3) homeomorphic encoding. We start by developing a general intuition for the issues presented by points (1) and (2), followed by a worked-out example demonstrating these issues through the popular Gaussian distribution in Section 3.2. Finally, we discuss two of our recent attempts to alleviate some of these problems (Davidson et al., 2018; Falorsi et al., 2018).

3.1 Manifold Matching

Let X of dim(D) represent the space of our high dimensional observed data, in which the ‘true’ data manifold M of dim(M), with D ≫ M, is embedded. Additionally, let Z of dim(L) represent our latent embedding space. In a VAE setting we attempt to recover M from X by using a continuous map (encoder), enc: X → Z, to encode our data from the observed space to the latent space, and then decoding the compressed representation to reconstruct the original data in the observed space. In doing so, we can encounter three main scenarios with respect to the choice of latent dimensionality, each with distinctly different consequences.

Figure 3.0.1: Visualization of a circle lying in curved space. Notice how properties like the circumference C change based on the manifold structure. (Reproduced from (Carroll and Ostlie, 2017))

M > L, Information Loss When our chosen latent space dimensionality L is smaller than that of the true latent manifold M, the result will inevitably be a loss of information. This is due to the fact that enc|M cannot be a homeomorphism, i.e. there exists no invertible and globally continuous mapping between the coordinates of M and those of Z.

M < L, Information Obfuscation If our chosen latent space dimensionality is sufficiently large1, M can be smoothly embedded in Z with enc|M a homeomorphism, i.e. no information is lost. Unfortunately, as L > M, a large part of the ambient space will be filled with points that do not lie in enc|M(M), making relevant information hard to find.

M = L, Topology (Mis-)Match In the ideal case, we have chosen our latent dimensionality to exactly match that of our true manifold. Unfortunately, even in this case we are far from certain to successfully recover the structure of M. This is due to the fact that for most real data, M is a manifold with a non-trivial topology, thus likely leading to the true data manifold not being diffeomorphic to the latent embedding manifold.

1 Any smooth real M-dimensional manifold can be smoothly embedded in R^{2M}, by the Whitney embedding theorem.

VAE Specific Consequences Although the consequences of Information Loss are easy to imagine, e.g. a picture of a cat is reconstructed in the wrong color, or an MNIST digit is decoded in the wrong style, the VAE specific problems that come with Information Obfuscation when M < L, and with the possible topology mismatch when M = L, are more subtle. As a VAE forces M to be mapped into an approximate posterior distribution that is supported on the entirety of Z, this approach is bound to fail in the absence of a diffeomorphism in two ways:

1. The original embedding enc|M(M) could leave most of Z empty. This leads to a high likelihood of generating bad samples, as most sampled points z ∈ Z will not actually lie in enc|M(M), which is of smaller dimensionality.

2. If the strength of the KL term between the learned approximate posterior and the latent prior is increased, the pressure to match the assumed variational family is heightened. When the assumed variational family is defined on R^N, the encoder is thus encouraged to occupy all of the latent space. This creates instability and discontinuity, affecting the convergence of the model.

3.2 Issues with The Gaussian Distribution

In order to provide a more tangible example of the potential issues described above, we will study the behaviour of the Gaussian distribution both in low and high dimensions. The Gaussian distribution represents a ‘blob-like’ density defined on R^N, smoothly covering the entire space, and is especially interesting given its prominent role as the default distribution for most VAE models proposed in the literature.

Low dimensions: origin gravity In low dimensions, the Gaussian density concentrates its probability mass around the origin, encouraging points to cluster in the center. This is particularly problematic when the data is divided into multiple clusters, or lacks a center altogether. For the former, an ideal latent space should separate the clusters of each class, while the normal prior will instead encourage all cluster centers towards the origin. An optimal prior would only stimulate the variance of the posterior, without forcing its mean to be close to the center, naturally encouraging separation between cluster centers. A prior satisfying these properties is a uniform over the entire space, from which a mean bias is absent. Such a uniform prior, however, is not well defined on the hyperplane.

Figure 3.2.1: Graphical illustration of the ‘Soap Bubble’ effect of the Gaussian distribution in high dimensions. Plotted is the probability density with respect to radius r for various values of the dimensionality D. (Reproduced from (Bishop, 2006))

High dimensions: soap bubble effect It is a well-known phenomenon that the standard Gaussian distribution in high dimensions tends to resemble a uniform distribution on the surface of a hypersphere, with the vast majority of its mass concentrated on the hyperspherical shell (see Fig. 3.2.1 for an illustration of this so-called ‘soap bubble’ effect). From a generative modeling point of view, this complicates tasks like drawing viable samples or naively interpolating between samples, as the probability of staying inside a high density area is small2. Hence it would appear interesting to compare the behavior of a Gaussian approximate posterior with an approximate posterior that is naturally defined on the hypersphere instead.

2 See the following blog post for a more in-depth explanation: www.inference.vc/
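This concentration is easy to verify numerically; the following is a small sketch (NumPy assumed, not tied to any particular experiment in this thesis) showing that the norms of standard Gaussian samples cluster around √D as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    x = rng.standard_normal((10_000, d))   # 10k samples from N(0, I_d)
    radii = np.linalg.norm(x, axis=1)
    # As d grows, the radii concentrate around sqrt(d) with a roughly constant spread:
    # almost no mass remains near the mode of the density at the origin.
    print(f"d={d:5d}  mean radius={radii.mean():8.3f}  std={radii.std():.3f}  sqrt(d)={np.sqrt(d):8.3f}")
```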

3.3 Changing the Variational Family Q

A logical leap, given the outlined limitations of the default Gaussian choice and the manifold-mismatch phenomenon examined above, is to explore different choices of variational family. We will proceed by first discussing various related work on increasing the approximate posterior's expressiveness, as well as approaches aimed at injecting prior beliefs to facilitate the discovery of certain data structures. Finally, we conclude by discussing two extensions we worked on to replace the underlying latent topology, by directly reparameterizing a VAE using respectively a von Mises-Fisher distribution to achieve a hyperspherical topology (Davidson et al., 2018), and distributions on SO(3) to effectively capture the 3D oriented rotations of a static object (Falorsi et al., 2018).

Increasing Approximate Posterior Expressiveness The majority of research in this direction has focused on increasing the flexibility of the approximate posterior. The most dominant line of work is categorized under Normalizing Flows, first made popular in (Dinh et al., 2014; Rezende and Mohamed, 2015). The core idea is to sequentially apply a class of simple, invertible transformations to an initially reparameterizable density, using the change of variable formula to correct for the potential change in volume, in order to create increasingly complex distributions. This line of work was later iterated upon to design even more flexible distributions using auto-regressive flows (Kingma et al., 2016; Dinh et al., 2017; Papamakarios et al., 2017; Kingma and Dhariwal, 2018), an approach that introduces dependencies between dimensions while still maintaining a tractable Jacobian. Another approach comes from (Tomczak and Welling, 2016), which designed a volume preserving flow using the Householder transformation. In van den Berg et al. (2018) the simple planar flows introduced in (Rezende and Mohamed, 2015) are generalized, removing their single-unit bottleneck and thus increasing their flexibility while not relying on the parameter-heavy auto-regressive principle.
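As an illustration of this change-of-variable bookkeeping, below is a minimal sketch of a single planar flow step from (Rezende and Mohamed, 2015), assuming PyTorch; the function name is illustrative and the constraint on u that guarantees invertibility is left out for brevity.

```python
import torch

def planar_flow_step(z, u, w, b):
    """Single planar flow f(z) = z + u * tanh(w·z + b) and its log|det Jacobian|.

    z: (batch, d); u, w: (d,); b: scalar. The constraint on u that guarantees
    invertibility (Rezende and Mohamed, 2015) is omitted here for brevity.
    """
    lin = z @ w + b                                          # (batch,)
    f_z = z + u * torch.tanh(lin).unsqueeze(-1)              # (batch, d)
    psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * w     # h'(w·z + b) * w, (batch, d)
    log_det = torch.log(torch.abs(1.0 + psi @ u) + 1e-8)     # (batch,)
    return f_z, log_det

# Chaining K such steps: log q_K(z_K) = log q_0(z_0) - sum_k log|det J_k|.
```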

However, while all of these flows allow for a more flexible posterior, they do not modify the default Normal prior assumption. A first attempt to extend normalizing flows to Riemannian manifolds was made in (Gemici et al., 2016). However, its use-cases are limited in practice, as the method relies on the existence of a diffeomorphism between R^N and the target manifold. In (Falorsi et al., 2019) we introduced the possibility of designing normalizing flows on Lie groups, as discussed in subsection 2.3.3. By taking advantage of the local diffeomorphism properties of the exponential map, we defined a Locally Invertible flow (LI-Flow) and gave an example for the group SO(3). Nonetheless, designing normalizing flows for arbitrary manifolds remains an open problem to be explored.

Injecting Structure Through Novel Priors Another way to better match the perceived structure of the data to be modeled is by introducing a different prior form. Three of the most popular relational assumptions revolve around (1) mixtures of factors, e.g. the existence of sub-populations in a general population; (2) concrete v. continuous factors of variation, e.g. some generative factors such as class assignment or binary outcomes might be better modeled using discrete latent variables; and (3) hierarchy, e.g. the defining axis of comparison might be hierarchical. The simplest instantiation of the first in the context of a VAE prior has been studied through Mixture of Gaussians (MoG) variants in (Nalisnick et al., 2016; Dilokthanakul et al., 2016; Jiang et al., 2017). A variation on this idea is the Variational Mixture of Posteriors prior (VampPrior) (Tomczak and Welling, 2018), which outlines several advantages over MoG approaches, proposing to directly tie the prior to the approximate posterior and learn it as a mixture over approximate posteriors. In (Graves et al., 2018) the mixture prior is taken even further, by storing and continuously updating latent representations of each data point in the training set, and picking the prior of a new data point as a mixture over approximate posteriors, using k-nearest neighbors on the latent codes to determine the cluster assignments. The most recent iteration on the mixture approach comes from (Paquet et al., 2018), which models the prior distribution as a mixture over MoG, representing a latent code as the Cartesian product over these sub-spaces. By modeling all the different mixture components as random variables, they propose a fully variational model aimed at capturing ‘structured compositionality’.
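To illustrate approach (1), the sketch below computes the log-density of a simple mixture-of-Gaussians prior, assuming PyTorch; the function and parameter names are illustrative, and in a VampPrior the components would be approximate posteriors evaluated at learned pseudo-inputs rather than free mean and variance parameters.

```python
import math
import torch

def mog_log_prob(z, means, log_vars, logits):
    """Log-density of z under a mixture of K diagonal Gaussians.

    z: (batch, d); means, log_vars: (K, d); logits: (K,) unnormalized mixture weights.
    In a VampPrior the K components would instead be approximate posteriors
    q(z | u_k) evaluated at learned pseudo-inputs u_k.
    """
    z = z.unsqueeze(1)                                              # (batch, 1, d)
    log_comp = -0.5 * ((z - means) ** 2 / log_vars.exp()
                       + log_vars + math.log(2 * math.pi)).sum(-1)  # (batch, K)
    log_w = torch.log_softmax(logits, dim=0)                        # (K,)
    return torch.logsumexp(log_w + log_comp, dim=1)                 # (batch,)
```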

A second line of work has instead been concerned with introducing the option of using discrete latent random variables, as in some cases it is most plausible to assume that an observed feature is modeled by a Categorical distribution, e.g. the number of dots on a die. The principal difficulty here was, for a long time, how to design a reparameterization trick that allows for differentiation in a neural network setup. A continuous approximation to achieve this was independently discovered by Jang et al. (2017) and Maddison et al. (2017), in which the Gumbel-Max trick is utilized to construct the Gumbel-Softmax or Concrete distribution3. This approach of creating approximately discrete latent variables was successfully used in (Dupont, 2018) to create a compositional latent space consisting of both discrete and continuous latent variables.

3 More details about this distribution are available in Appendix A.2, as this trick will
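A minimal sketch of this relaxation, assuming PyTorch (the function name is illustrative): Gumbel noise perturbs the logits, and a temperature-controlled softmax replaces the argmax of the Gumbel-Max trick, keeping the sample differentiable.

```python
import torch

def gumbel_softmax_sample(logits, temperature=0.5):
    """Relaxed one-hot sample from a categorical distribution parameterized by logits.

    Perturb the logits with Gumbel(0, 1) noise (the Gumbel-Max trick) and replace
    the argmax by a temperature-controlled softmax so the sample stays differentiable.
    """
    u = torch.rand_like(logits).clamp(1e-8, 1 - 1e-8)
    gumbel = -torch.log(-torch.log(u))
    return torch.softmax((logits + gumbel) / temperature, dim=-1)

# As the temperature approaches 0 the samples approach one-hot categorical draws;
# higher temperatures give smoother, more uniform relaxations.
```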
