
MSc Artificial Intelligence

Master Thesis

Few-shot Classification by Learning Disentangled Representations

by Emiel Hoogeboom 10831428

June, 2017

36 ECTS January – June, 2017

Supervisor:

Dr. E. Gavves

Daily Supervisor:

Dr. E. Gavves

Assessor:

Prof. Dr. M. Welling


Acknowledgements

I would like to thank Efstratios Gavves for his guidance and help the past half year. He could truly inspire me to approach a problem differently. He managed to spend a lot of time with me, despite his busy schedule.

I would also like to thank Jorn Peters, with whom I have had numerous discussions that led to significant insights. Jorn may be one of the smartest guys I know, and I predict that he will one day run his own research lab.

My gratitude goes out to my committee, consisting of Max Welling and Efstratios, who agreed to read my report on short notice.

Finally I would like to thank my parents, and everyone else, who helped me with their support and encouragement.


Abstract

Machine learning has improved state-of-the-art performance in numerous domains by using large amounts of data. In reality, labelled data is often not available for the task of interest. A fundamental problem of artificial intelligence is finding a representation that can generalize to never before seen classes. In this research, the power of generative models is combined with disentangled representations. The combination is leveraged to learn a representation for content, which generalizes to unseen classes. Potentially, disentangled representations can drastically reduce the number of required training examples and improve understanding of different factors of variation.

This is achieved by starting with a known procedure to disentangle representations. By exploring the structure of the content representation, a loss function is composed such that the model learns a few-shot class probability. A mathematical framework that includes this few-shot class probability is defined; the probability ensures that a disentangled representation is learned. A lowerbound of the log-likelihood is derived to obtain an objective function that optimizes the log-likelihood conditioned on the support set. The presented method achieved state-of-the-art performance on the Omniglot dataset at the time of writing.


Contents

Acknowledgements
Abstract
Contents
Introduction
1 Related Work
1.1 Few-shot learning
1.2 Generative Models
1.3 Disentangling Representation
1.4 Contribution
2 Preliminaries
2.1 Variational Autoencoders
2.2 Generative Adversarial Networks
2.3 Kullback-Leibler Divergence for Multivariate Normals
2.4 Squared Euclidean distance between Multivariate Normals
2.5 Disentangling Factors of Variation
3 Structure of Disentanglement
3.1 Understanding Disentanglement
3.2 Distance Penalty for Content
3.3 Disentangling with Distance Loss Exclusively
4 Model: Generative Few-shot Learning
4.1 Generative Model
4.2 Class probability
4.2.1 Embedding Distance in Literature
4.2.2 Model Class Probability
4.3 Support Set
4.3.1 The Support-Conditional Log-likelihood
4.3.2 Resolving the Posterior
4.4 Collecting All Components
4.5 Inference
4.6 Implementation
5 Datasets
5.1 Episodes
5.2 MNIST
5.3 Omniglot
5.4 miniImageNet
5.5 Quick, Draw
6 Experiments
6.1 Setup
6.1.1 Moving Average Batch Normalization
6.1.2 Architecture
6.1.3 Configuration
6.2 Evaluation
6.2.1 Omniglot
6.2.2 miniImageNet
6.2.3 Quick, Draw
6.3 Expectation of Support Set
6.4 Discussion
6.4.1 Architecture
6.4.2 Performance
6.4.3 Model Framework
Conclusion
A Derivation Lowerbound for Batches
A.1 Model definition
A.2 Log-likelihood Conditioned on Support Content (S_S)
A.3 Lowerbound Conditioned on Support Examples (X_S)
A.4 Intermezzo: Factorizing the Support Set KL Term
A.5 Collecting Terms


Introduction

“Much learning does not teach understanding.” Heraclitus of Ephesus

A deep learning model is a complex function approximator, based on a simple principle that is applied repeatedly. Its complexity makes it incredibly malleable, which allows it to perform tasks such as object classification and detection at high performance levels. This performance is possible, given vast amounts of data from the test domain. However, when inputs appear outside this domain, the words of Heraclitus make deep learning models look foolish.

The human brain is remarkably good at object recognition, especially because of object constancy: under different types of illumination, pose or other changes in viewpoint, an object is often easily recognized. Unlike machine learning models, humans can also easily generalize from very few examples. A picture of an apple covered in snow is still an apple. Most people have no problem with this decision, even if this is the first time that an apple has been observed in this exact condition.

“What I cannot create, I do not understand.” Richard Feynman

A promising direction for these problems is generative modelling. Generative modelling is elegantly motivated by the words of Feynman. Deep learning models may simply be cheating by recognizing the sky when they need to recognize birds. By learning to generate examples, a model is forced to represent the whole image, including the bird. From a practical viewpoint, modelling a generative process has the advantage of being an unsupervised learning problem, and many unlabelled examples are available. However, for a representation to be object constant, it needs to be disentangled for variations in the object and other factors. Disentangled representations are appealing, because a representation suitable for distinguishing cars from trucks should be disentangled from color. This concept is illustrated in Figure 1.


Status

In recent years, machine learning has improved state-of-the-art performance in numerous domains. Notably, deep learning has shown superhuman performance on multiple classification tasks, given extensive amounts of data [1, 2]. In reality, labelled examples may be scarce, which makes it difficult to learn a deep network directly. Moreover, sometimes large quantities of labelled data are available, but not for all classes of interest. The field that tries to classify images with either one or a handful of exemplars is called one-shot or few-shot learning.

In few-shot learning scenarios, a system is presented with only a few examples per class with known labels. The collection of these examples is called the support set. Another example is then presented to the system, which has to be classified by comparing it with the support set. Early attempts in the field of few-shot learning only inferred directly from the support set. In more recent studies, the field has shifted towards similarity metric learning. First a metric is learned from a subset of classes, and then few-shot classification is tested on another subset with different classes. The underlying assumption is that large quantities of labelled data are available, but not for some classes. For the classes of interest, only a few examples are labelled. The goal of few-shot learning is then to learn an embedding that generalizes to unseen classes.

Deep generative models have been shown to improve classification performance in semi-supervised learning settings. The aim is to have a model for the data generation process, because capturing this process means that the model has understood the data to some degree. For example, a discriminative model might classify a ship based on the surrounding water, but a generative model will learn an actual representation for a ship. The intuition is that learning the actual representation allows generalization and yields better performance. However, the application of generative models to few-shot learning has thus far been limited.

Disentanglement

We define disentanglement as a separation in the representation of the attributes of an object. Let us illustrate this concept with an example. Imagine taking a photo of a car. In the camera the photo is represented by a large number of pixels. These pixels are highly entangled, as changing the color of the car would change a large number of the pixels. A representation would be more disentangled if it generates the same image, but changing a subset of variables changes a subset of attributes, for example the color of the car. Suppose that an object is completely defined by a set of attributes. A valid representation of an object should represent the complete set. Furthermore, a disentangled representation is defined as a separable representation, such that each part represents an exclusive subset of attributes, and the union of all subsets is the complete set. The choice of attributes is arbitrary, and can be chosen to match meaningful human intuitions.

Direction

In this thesis, a mathematical framework to combine generative models, learning disentanglement, and few-shot learning is presented. The framework is designed to learn a disentanglement between two subsets of object attributes, content and style. The framework learns to represent


images in content and style variables, where the content variable is used for few-shot classification and reconstruction, while the style variable is only used for reconstruction. To enforce that content and style represent attributes of an example exclusively, priors are placed on these variables. These priors allow the framework to eliminate all redundant information in representations during optimization. In this formulation, the content is defined as all information helpful in classifying the image. The style is defined as all other possible sources of variation that are needed for reconstruction. The following hypotheses shall be addressed:

• Learning disentangled representations can be combined with few-shot classification.
• Few-shot classification accuracy is improved by using disentangled representations.

1 Related Work

The related work is organized into three different sections on sub-domains of deep learning. The domains few-shot learning, generative models and disentangling representations are discussed. The last section outlines what differentiates this thesis from existing literature.

1.1 Few-shot learning

Few-shot learning is a field where the number of examples is very limited. A key insight by Fei-Fei et al. is that knowledge of previously learned classes can be used, and hence learning does not start from scratch [3]. The union of few-shot learning with deep learning has shifted the field towards a metric learning approach, and this metric (or embedding) has been learned in various manners.

Siamese networks [4] use the contrastive loss function to learn an embedding on a dataset. Another key insight was provided by Vinyals et al., who showed that performance can significantly increase when the training procedure is adapted to match the test procedure closely. In their work, memory networks termed Matching Networks [5] are used to augment the embedding. Ravi et al. improved the meta-learning approach by proposing a recurrent meta-learner that models the updates for the few-shot model [6]. Prototypical Networks [7] use basic components of matching networks, showing that even higher performance can be attained with a relatively simple architecture and procedure, without the need for recurrent networks.

Instead of presuming a fixed distance metric to measure distance between examples, the metric can be learned directly. Competitive results on some datasets have been achieved by residual networks with skip connections [8]. Learning the distance metric allows the model to choose a suitable distance measure itself. This demonstrates that a distance function with parameters can be a more suitable choice on some problem instances.

1.2 Generative Models

An emerging field within machine learning is deep generative modelling. A common assumption in generative modelling, is that some lower-dimensional representation exists. One can propose some low-dimensional representation, and learn a transformation whose output resembles samples from the data distribution.

Variational Auto-Encoders (VAEs) [9] are derived from a generative process, by introducing a variational distribution that can be recognized as an encoder in traditional auto-encoders. They learn to reconstruct by encoding images into a lower-dimensional latent space, and decoding a reconstruction from that latent space.

Generative Adversarial Networks (GANs) [10] learn to model the data by defining two competing networks. The discriminator needs to classify which images are real and which are fake, while the generator tries to deceive the discriminator. The reconstruction loss of the VAE is defined explicitly, and is often modelled by a pixel-wise error. That means that a perceptually good reconstruction can still have a high error under small perturbations (e.g. a translation of one pixel). The loss of the generator in a GAN is defined implicitly, as the ability to mislead the discriminator. GANs therefore tend to produce sharper images.

1.3 Disentangling Representation

In [11] a combination of VAEs and GANs learns to disentangle variation, separating class information from style into two latent spaces. To disentangle content from style, the labels are chosen to represent content. The training procedure is then formulated such that all other information will be encoded by the style variable. In other work, an unsupervised disentangled representation is learned by maximizing the mutual information between a subset of the latent variables and the observation [12].

1.4 Contribution

This thesis combines few-shot learning, generative models and disentangled representations. To the best of the author's knowledge, disentangled representations have never before been used for few-shot classification.

2 Preliminaries

In this section, preliminary techniques are explained that will be used in subsequent sections. The techniques discussed relate to general deep learning models, mathematical derivations of distribution distances, and an application where a disentanglement is learned.

2.1 Variational Autoencoders

Auto-encoders are artificial neural networks used for unsupervised representation learning. The dimension of the input is equal to the dimension of the output, and the purpose of the network is to reconstruct the input. The representation that is learned at the bottleneck of the network is called the code or latent space. An auto-encoder can be divided into two distinct modules, the encoder and the decoder. The encoder is a function that maps an input x into some latent representation z, Enc : X → Z. The decoder maps the latent representation to the input space, Dec : Z → X. The objective of the auto-encoder is to minimize some distance loss, as defined in Equation 1.

\mathrm{Dec}^*, \mathrm{Enc}^* = \operatorname*{argmin}_{\mathrm{Dec},\,\mathrm{Enc}} \lVert x - \mathrm{Dec}(\mathrm{Enc}(x)) \rVert^2   (1)

Variational Auto-Encoders (VAEs) [9] assume some generative process from the latent space z to x (Depicted in Figure 2). Note that the latent variable z is treated as a random variable.


Figure 2: Generative process in a graphical model. This model is the basis for the Variational Autoencoder.

By introducing a variational distribution q_φ(z|x), a lower bound for log p_θ(x) can be derived with Jensen's inequality (Equation 2). In this equation, D_KL represents the Kullback-Leibler divergence, the model distributions are parametrized by θ, and the variational distribution is parametrized by φ. The decoder is now defined as the conditional distribution Dec := p_θ(x|z). The encoder is defined as the variational distribution Enc := q_φ(z|x). A common assumption is to let q_φ(z|x) be a multivariate normal distribution with diagonal variances. Thus, the encoder is a network that outputs the mean and the (log) variances of this distribution.

\begin{aligned}
\log p_\theta(x) &= \log \int p_\theta(x, z)\, dz = \log \int q_\phi(z|x) \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz \\
&\geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big)
\end{aligned}   (2)
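As an illustration only, the following minimal NumPy sketch evaluates a one-sample Monte Carlo estimate of this lower bound; the callables encode and decode are assumed placeholders (an encoder returning the mean and log-variance of q_φ(z|x), and a decoder returning Bernoulli pixel probabilities), not the implementation used in this thesis.

import numpy as np

def elbo(x, encode, decode, rng=np.random.default_rng()):
    # encode(x) -> (mu, logvar) of q(z|x); decode(z) -> Bernoulli pixel probabilities.
    mu, logvar = encode(x)
    # Reparametrized sample z ~ N(mu, diag(exp(logvar)))
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    x_hat = decode(z)
    # Reconstruction term E_q[log p(x|z)] with a Bernoulli likelihood
    rec = np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
    # Analytic KL(q(z|x) || N(0, I)) for diagonal Gaussians (cf. Equation 8)
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - logvar - 1.0)
    return rec - kl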

2.2 Generative Adversarial Networks

Generative Adversarial Networks (GANs) [10] are a different type of generative model. A GAN consists of two distinct modules: a generator, which maps some latent representation to an example, Gen : Z → X, and a discriminator, which maps an example to a confidence that signifies how real the example looks, Disc : X → [0, 1]. The two networks are trained as adversaries in a zero-sum game setting. The value function is depicted in Equation 3. The generator tries to minimize the value function, while the discriminator tries to maximize it, as depicted in Equation 4.

V(\mathrm{Gen}, \mathrm{Disc}) = \mathbb{E}_{x \sim P_{data}}\big[\log \mathrm{Disc}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - \mathrm{Disc}(\mathrm{Gen}(z))\big)\big]   (3)

\mathrm{Gen}^*, \mathrm{Disc}^* = \operatorname*{argmin}_{\mathrm{Gen}}\, \operatorname*{argmax}_{\mathrm{Disc}}\, V(\mathrm{Gen}, \mathrm{Disc})   (4)

Where an auto-encoder uses some defined distance metric to compare the reconstruction to the original, a GAN uses the certainty prediction of a discriminator. Loss functions for reconstruction of high dimensional data such as images are difficult to define such that sharp images are generated. Instead, a GAN architecture only implicitly defines a loss function, via the discriminator.

Optimization of the value function leads to a problem for the generator, since the strength of the gradient decreases when the discriminator is certain. Practically, instead of optimizing Equation 3, the generator optimizes a value function that has stronger gradients for a certain discriminator, defined in Equation 5. When the discriminator is more certain, i.e. Disc(Gen(z)) → 0, the gradient will be stronger, since \frac{d}{dx}\log f(x) = \frac{1}{f(x)}\frac{df(x)}{dx}.

V_G(\mathrm{Gen}, \mathrm{Disc}) = -\mathbb{E}_{z \sim p(z)}\big[\log \mathrm{Disc}(\mathrm{Gen}(z))\big]   (5)
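The contrast between the two generator objectives can be made concrete with a small sketch (added for illustration; disc and gen are assumed placeholder callables returning probabilities and generated samples, not the thesis implementation).

import numpy as np

def discriminator_value(disc, gen, x_real, z):
    # Value function V(Gen, Disc) of Equation 3; the discriminator ascends it.
    return np.mean(np.log(disc(x_real))) + np.mean(np.log(1.0 - disc(gen(z))))

def generator_loss_saturating(disc, gen, z):
    # Generator descending V directly: gradients vanish once Disc(Gen(z)) -> 0.
    return np.mean(np.log(1.0 - disc(gen(z))))

def generator_loss_nonsaturating(disc, gen, z):
    # Equation 5: -E[log Disc(Gen(z))], stronger gradients for a confident discriminator.
    return -np.mean(np.log(disc(gen(z))))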

2.3 Kullback-Leibler Divergence for Multivariate Normals

The Kullback-Leibler divergence is a distance measure between probability distributions. In a previous section, we already saw that the variational autoencoder optimizes a KL divergence between the variational distribution and the prior. By our choice of parametrization, we will only be faced with multivariate normal distributions. In Equation 6 the analytical solution for two arbitrary normal distributions, q = N(x|µ, Σ) and p = N(x|m, L), is derived.


\begin{aligned}
D_{KL}(q \,\|\, p) &= \mathbb{E}_{\mathcal{N}(x|\mu,\Sigma)}\big[\log \mathcal{N}(x|\mu,\Sigma) - \log \mathcal{N}(x|m,L)\big] \\
&= \tfrac{1}{2}\log\tfrac{|L|}{|\Sigma|} + \mathbb{E}_{\mathcal{N}(x|\mu,\Sigma)}\Big[ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) + \tfrac{1}{2}(x-m)^T L^{-1} (x-m) \Big] \\
&= \tfrac{1}{2}\log\tfrac{|L|}{|\Sigma|} + \tfrac{1}{2}\,\mathbb{E}_{\mathcal{N}(x|\mu,\Sigma)}\Big[ -\operatorname{Tr}\big((xx^T + \mu\mu^T - 2x\mu^T)\Sigma^{-1}\big) + \operatorname{Tr}\big((xx^T + mm^T - 2xm^T)L^{-1}\big) \Big] \\
&= \tfrac{1}{2}\log\tfrac{|L|}{|\Sigma|} + \tfrac{1}{2}\Big[ -\operatorname{Tr}(I) + \operatorname{Tr}\big((\mu\mu^T + \Sigma + mm^T - 2\mu m^T) L^{-1}\big) \Big] \\
&= \tfrac{1}{2}\Big[ \log\tfrac{|L|}{|\Sigma|} - D + \operatorname{Tr}(\Sigma L^{-1}) + (m-\mu)^T L^{-1} (m-\mu) \Big]
\end{aligned}   (6)

If we parametrize the multivariate normal distributions such that the covariance matrices only have diagonal entries, the solution can be further simplified. The normal distributions are redefined to q = N(x|µ, Iσ²) and p = N(x|m, Il²), where σ and l are vectors of standard deviations. The corresponding KL divergence between q and p is shown in Equation 7.

D_{KL}(q \,\|\, p) = \frac{1}{2}\left[ \sum_i^{D} \left( 2\log\frac{l_i}{\sigma_i} + \frac{\sigma_i^2}{l_i^2} + \frac{(m_i - \mu_i)^2}{l_i^2} \right) - D \right]   (7)

The equation can be further simplified if p is a prior with zero mean and unit variance. The distribution p is redefined such that p = N(x|0, I). The KL divergence between q and p is presented in Equation 8.

D_{KL}(q \,\|\, p) = \frac{1}{2}\left[ \sum_i^{D} \left( -2\log\sigma_i + \sigma_i^2 + \mu_i^2 \right) - D \right]   (8)
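For reference, the two diagonal cases translate directly into a few lines of NumPy (an illustrative sketch; mu, sigma, m and l are assumed to be vectors of means and standard deviations).

import numpy as np

def kl_diag_gaussians(mu, sigma, m, l):
    # Equation 7: KL(N(mu, diag(sigma^2)) || N(m, diag(l^2)))
    return 0.5 * np.sum(2 * np.log(l / sigma)
                        + sigma ** 2 / l ** 2
                        + (m - mu) ** 2 / l ** 2
                        - 1.0)

def kl_to_standard_normal(mu, sigma):
    # Equation 8: special case with a standard normal prior p = N(0, I).
    return 0.5 * np.sum(-2 * np.log(sigma) + sigma ** 2 + mu ** 2 - 1.0)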

2.4 Squared Euclidean distance between Multivariate Normals

A straightforward measure of distance is the squared euclidean distance. We define two arbitrary multivariate normal distributions p = N (x|µ, Σ) and q = N (y|m, L) where x and y have an equal number of dimensions. An analytical solution for the expectation of the squared euclidean distance between two multivariate normal distributions is presented in Equation 9.

\begin{aligned}
\mathbb{E}_{x \sim p,\, y \sim q}\big[\lVert x - y \rVert^2\big] &= \mathbb{E}_{x \sim p,\, y \sim q}\big[x^T x + y^T y - 2 x^T y\big] \\
&= \mathbb{E}_{x \sim p,\, y \sim q}\operatorname{Tr}\big(xx^T + yy^T - 2xy^T\big) \\
&= \operatorname{Tr}\big(\mu\mu^T + \Sigma + mm^T + L - 2\mu m^T\big) \\
&= \operatorname{Tr}(\Sigma + L) + (\mu - m)^T(\mu - m)
\end{aligned}   (9)
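For diagonal covariances this identity is easy to verify numerically; the following sketch (illustrative only, with arbitrary example parameters) compares the analytic expression against a Monte Carlo estimate.

import numpy as np

rng = np.random.default_rng(0)
d = 4
mu, sigma = rng.normal(size=d), rng.uniform(0.5, 1.5, size=d)   # p = N(mu, diag(sigma^2))
m, l = rng.normal(size=d), rng.uniform(0.5, 1.5, size=d)        # q = N(m, diag(l^2))

# Analytic expectation (Equation 9): Tr(Sigma + L) + ||mu - m||^2
analytic = np.sum(sigma ** 2 + l ** 2) + np.sum((mu - m) ** 2)

# Monte Carlo estimate of E[||x - y||^2]
x = mu + sigma * rng.standard_normal((100000, d))
y = m + l * rng.standard_normal((100000, d))
monte_carlo = np.mean(np.sum((x - y) ** 2, axis=1))

print(analytic, monte_carlo)  # the two values agree up to sampling noise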

2.5 Disentangling Factors of Variation

In 2016, Mathieu et al. learned a disentangled representation by combining a variational auto-encoder with a generative adversarial network [11]. They specify two latent variables, the content s and the style z. The function of the content is to contain all class information, and the style should contain any other information, such as how slanted a letter is written. Together, s and z provide sufficient information to reconstruct the original example x. An encoder (s, (µ_z, log σ_z)) = Enc(x) and a decoder x = Dec(s, z) are defined. Furthermore, a discriminator [0, 1] = Disc(x, id) is trained to distinguish real and fake examples. The variable id denotes the label of the example x. In Equations 10 and 11 the losses for the VAE and the GAN are specified. The complete loss can be formulated as in Equation 12, where λ is a scaling factor. Note that the authors chose to include a KL regularization on z, but s is not treated as a random variable.

\mathcal{L}^{(VAE)} = -\mathbb{E}_{z \sim q(z|x,s)}\big[\log p(x|z, s)\big] + D_{KL}\big(q(z|x, s) \,\|\, p(z)\big)   (10)

\mathcal{L}^{(GAN)} = \log \mathrm{Disc}(x, id) + \log\big(1 - \mathrm{Disc}(\mathrm{Gen}(z, s), id)\big)   (11)

\mathcal{L} = \mathcal{L}^{(VAE)} + \lambda \mathcal{L}^{(GAN)}   (12)

The authors propose a training procedure with multiple steps that swaps the latent variables. This training procedure, in combination with the model, ensures that a disentangled representation is learned. A summary of the procedure is described below; please refer to [11] for exact details.

• Two samples from the same class, x_1 and x_1', are drawn. The VAE is trained to maximize p(x_1|Dec(s_1, z_1)) and p(x_1|Dec(s_1', z_1)). Note that both should produce the same reconstruction. This ensures that only content information may flow through s.

• To avoid that the network ignores s, a sample from a different class, x_2, is drawn. The VAE is trained to minimize the generator GAN loss −log Disc(Gen(z_2, s_1), id(x_1)). This ensures that the content information must flow through s, and may not flow through z.

• Again sampling x_1, x_1' and x_2 in a similar fashion, the discriminator is trained to maximize log Disc(Gen(z_1, s_1), id(x_1)) + log(1 − Disc(Gen(z_2, s_1), id(x_1))). Thus the discriminator is trained to detect whether a reconstruction used a style z from an example of another class.

Since it is difficult to express disentanglement in numbers, we follow the procedure of the original authors to display interpolations in latent representations. Some of the reconstruction results are depicted in Figure 3. In the left image, a slanted seven is interpolated to an upright nine. Moving downwards from the top left, the seven gradually appears more upright. Going upwards from the bottom right, the nine becomes increasingly slanted. In the right image, a three is interpolated with a seven.


Figure 3: Interpolation between content and style. Left and right follow the same procedure with different examples. The top left image is a reconstruction of an image in the dataset. The bottom right image is also a reconstruction of an image in the dataset. Horizontally the content s is linearly interpolated. Vertically the style z is interpolated.

3 Structure of Disentanglement

Disentangled representations are representations where specific variables of the representation can be modified to change specific components. The method of Mathieu et al. [11] learns a disentanglement of content (class) and style (all other variations), but does not put any constraint on the structure of the disentangled representation. In this section, the structure of the representations is investigated, and modified with additional constraints. Ultimately, the goal is to perform inference for few-shot learning on the disentangled content representation.

3.1 Understanding Disentanglement

The work of [11] is taken as a starting point: a combination of a VAE and a GAN with the specified training procedure. With this model, a disentangled representation is learned on MNIST, and all visualizations are obtained with datapoints in the test set. The structure of the high dimensional content and style representations is visualized with stochastic neighbourhood embedding. Note that z is a distribution, and therefore only µ_z is visualized. In Figure 4 these embeddings are depicted. Notice that the content s is clustered more strongly, while clustering of the style z is less apparent. This is expected, since content variables from the same class should contain the same information, making clusters very distinct. In contrast, style is often more continuous (how slanted or bold a digit is), and the same style can be shared between different classes.

Figure 4: Visualization of the high dimensional latent variables of the model in [11]. All points represent test data. Left: t-SNE plot of content s. Right: t-SNE plot of style µ_z (z is a distribution). Different colors represent different classes.

The structure of the content s is not suited for few-shot classification, because multiple clusters exist for the same class. An example that maps to the necessary cluster, might be absent. In [11] this was not necessarily a problem, since two different points can be mapped to the same class by the decoder. However, for few-shot classification, ideally the content embedding would have one cluster for each class.

3.2 Distance Penalty for Content

In the previous section, visualizations showed that the clusters for content s were scattered. It can be advantageous to group examples more tightly when classification is based on the proximity of s.

Therefore, in addition to the VAE and GAN losses, a simple loss based on the distance between content variables of the same class (Equation 13) is used. In this equation, the subscript notation corresponds to the previously described training procedure: s_1 and s_1' belong to the same class. In essence, optimizing the distance penalty will attract the content s of examples with the same class. The objective that is optimized is presented in Equation 14.

\mathcal{L}^{(penalty)} = \lVert s_1 - s_{1'} \rVert^2   (13)

\mathcal{L} = \mathcal{L}^{(VAE)} + \lambda \mathcal{L}^{(GAN)} + \mathcal{L}^{(penalty)}   (14)

The content s and style µ_z are visualized in Figure 5. Clearly, classes are clustered more compactly in the embedding. Furthermore, no class has multiple clusters. As a proof of concept for few-shot learning, a single content s of each class is chosen as the support set. The test set is classified using a nearest neighbour approach on the support set. Classification based on a single example in the content domain has about 99% accuracy. For comparison, the model without a penalty evaluated with the same procedure has only about 90% accuracy. Although the model is not classifying examples of an unseen class, this illustrates two important points: firstly, a disentangled representation of content can be used for few-shot classification; and secondly, an additional restriction (such as a distance penalty) is effective to learn a useful few-shot embedding.

Figure 5: Visualization of the high dimensional latent variables of the model that also optimises a penalty on the distance between same-class content variables. All points represent test data. Left: t-SNE plot of content s. Right: t-SNE plot of style µ_z (z is a distribution). Different colors represent different classes.
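The nearest neighbour proof of concept described above amounts to only a few lines; the following sketch is added for illustration, assuming arrays of content codes produced by the encoder (the names are not from the thesis code).

import numpy as np

def one_shot_nearest_neighbour(content_support, labels_support, content_test):
    # content_support: (C, D) array, one content code s per support class.
    # labels_support:  (C,) array of class labels.
    # content_test:    (N, D) array of content codes for the queries.
    # Squared Euclidean distances between every query and every support code.
    dists = np.sum((content_test[:, None, :] - content_support[None, :, :]) ** 2, axis=-1)
    return labels_support[np.argmin(dists, axis=1)]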

3.3 Disentangling with Distance Loss Exclusively

Inspired by the results in the previous section, a new distance loss on s is proposed. We formulate a classification probability for the correct class, based on euclidean distance. The probability is normalized similarly to a softmax function (Equation 15). For examples from the same class, the content s should lie close together. For other classes, it should lie far apart. The first term of the loss contracts content variables of the same class, and the second term expands the distance between content variables of different classes.

Different from previous work, we also choose s to be a random variable, and let the encoder output ((µ_s, log σ_s), (µ_z, log σ_z)) = Enc(x). The objective function in previous models did not treat s as a random variable, and therefore it was not regularized. Because the new objective does constrain s, the variable is now modeled as a distribution. Experiments showed that without this modification, s encodes all information and z is ignored. The VAE loss is depicted in Equation 16, which now includes s as a random variable. Note that both latent variables are now regularized with their priors. In Equation 17 the objective to optimize is shown.

\mathcal{L}^{(distance)} = -\log\left[ \frac{\exp\big(-\lVert s_1 - s_{1'} \rVert^2\big)}{\sum_{i=1}^{C} \exp\big(-\lVert s_1 - s_{i'} \rVert^2\big)} \right] = \underbrace{\lVert s_1 - s_{1'} \rVert^2}_{\text{Contraction term}} + \underbrace{\log \sum_{i=1}^{C} \exp\big(-\lVert s_1 - s_{i'} \rVert^2\big)}_{\text{Expansion term}}   (15)

\mathcal{L}^{(VAE)} = -\mathbb{E}_{z \sim q(z|x),\, s \sim q(s|x)}\big[\log p(x|z, s)\big] + D_{KL}\big(q(z|x) \,\|\, p(z)\big) + D_{KL}\big(q(s|x) \,\|\, p(s)\big)   (16)

\mathcal{L} = \mathcal{L}^{(VAE)} + \lambda \mathcal{L}^{(distance)}   (17)
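To make the contraction and expansion terms of Equation 15 concrete, the following sketch (illustrative, not the thesis implementation) computes the loss for one example in NumPy.

import numpy as np

def distance_loss(s1, s_primes):
    # s1:       (D,) content code of the query x_1.
    # s_primes: (C, D) content codes s_1', ..., s_C'; row 0 has the same class as x_1.
    sq_dists = np.sum((s_primes - s1) ** 2, axis=1)
    contraction = sq_dists[0]                        # pull same-class codes together
    expansion = np.log(np.sum(np.exp(-sq_dists)))    # push other classes away
    return contraction + expansion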

Without the adversarial procedure, learning a disentanglement is less explicitly enforced. However, the intuition is that the distance loss will ensure that the classes cluster in the embedding s. To create a reconstruction, the decoder can obtain information through s and z. The encoder needs to send information through the latent space by changing the distribution away from the prior. Changing the distribution of the latent space incurs a penalty via the KL divergence. If class information is already available in s, the model will avoid putting the same information in z, because doing so would incur another penalty.

The model is trained with the following procedure. Draw two samples from the same class, x_1 and x_1'. Also draw samples from other classes: x_2', . . . , x_C'. All gradients for the decoder originate from L^(VAE) for x_1. The gradient signal for the encoder comes from both L^(VAE) and L^(distance), with s_1 and s_1' in the contraction term, and all s_1', . . . , s_C' in the expansion term.

Interpolations of style and content are depicted in Figure 6, by changing s and z linearly between two examples. Notice that content information is conveyed via s and style via z. Note how a slanted four becomes upright and shaky, in the style of the eight. Thus, the model is able to learn a disentanglement with a euclidean distance loss, instead of the adversarial procedure.


Figure 6: Interpolation between content and style, reconstructions created with a VAE trained with distance loss. Left and right follow the same procedure with different examples. The top left image is a reconstruction of an image in the dataset. The bottom right image is also a reconstruction of an image in the dataset. Horizontally the content s is linearly interpolated. Vertically the style z is interpolated.

In Figure 7, t-SNE visualizations of the content and style variables of test examples are depicted. The content variables are strongly clustered, and the style variables show less structure based on class. Notice that content grouping has become tight, and that style grouping has become less noticeable. Thus, only by restraining the distance of s for images of the same class, a disentanglement can be learned. Furthermore, a continuous representation for content is learned that is tightly clustered.


Figure 7: Visualization of the high dimensional latent variables of the VAE with MNIST, trained with distance loss. All points represent test data. Left: t-SNE plot of content µs. Right: t-SNE plot of style µz. Different colors represent different classes.

4 Model: Generative Few-shot Learning

The previous section described how a disentanglement can be learned, and hinted at how a few-shot learning loss may actually aid in learning a disentanglement. In this chapter, a generative model for few-shot learning is formally defined, inspired by disentangling representations. Generative models in semi-supervised learning can have x conditioned on some latent variable z and the class variable y. However, in few-shot learning scenarios the number of classes is large, and classes during test time have never been seen before. Therefore, conditioning directly on y is impractical. Instead, the example x is conditioned on content s and style z.

In this section a lowerbound of the conditional log-likelihood will be derived, for a simplified use case. The actual derivation involves a few more terms, which make it notation heavy. Therefore, the complete derivation is presented in appendix A.

4.1 Generative Model

A generative model for an example x and its class y in few-shot learning is defined. There are latent variables for content s ∼ p(s) and style z ∼ p(z). The observed class y ∼ p(y|s) is conditionally independent of x, and the observed example is conditioned on both content and style, x ∼ p(x|s, z). The corresponding graphical model is depicted in Figure 8. Class information is often encoded in discrete variables, but this formulation allows the content variable s to be continuous, which makes generalization for few-shot learning possible.


Figure 8: Graphical model that shows how content and style influence the example and its label. The example x is conditioned on s and z, and the class y is only conditioned on s.

Analogous to the derivation of [9], a lower bound for log p(y, x) can be obtained. As depicted in the graphical model, the priors for the content and style are independent, thus p(s, z) = p(s)p(z). In contrast, the posterior p(s, z|x) cannot be factorized. However, to simplify the model we impose that the variational distribution can be factorized, such that q(s, z|x) = q(s|x)q(z|x) (Equation 18).


\begin{aligned}
\log p(y, x) &= \int\!\!\int q(s, z|x) \log p(y, x)\, ds\, dz = \mathbb{E}_{s, z \sim q(s,z|x)}\big[\log p(y, x)\big] \\
&= \mathbb{E}_{s, z \sim q(s,z|x)}\Big[\log p(y, x|s, z) - \log\frac{q(s, z|x)}{p(s)p(z)} + \log\frac{q(s, z|x)}{p(s, z|y, x)}\Big] \\
&= \mathbb{E}_{s, z \sim q(s,z|x)}\big[\log p(y, x|s, z)\big] - D_{KL}\big(q(s, z|x) \,\|\, p(s)p(z)\big) + D_{KL}\big(q(s, z|x) \,\|\, p(s, z|y, x)\big) \\
&\geq \mathbb{E}_{s, z \sim q(s,z|x)}\big[\log p(y, x|s, z)\big] - D_{KL}\big(q(s, z|x) \,\|\, p(s)p(z)\big) \\
&= \mathbb{E}_{s \sim q(s|x),\, z \sim q(z|x)}\big[\log p(y|s)\, p(x|s, z)\big] - D_{KL}\big(q(s|x) \,\|\, p(s)\big) - D_{KL}\big(q(z|x) \,\|\, p(z)\big)
\end{aligned}   (18)

The terms in this equation can be interpreted as an autoencoder with a classification model. The term p(y|s) is a class probability, p(x|s, z) is a reconstruction probability, and q(s|x) and q(z|x) can be interpreted as encoders. For now, the term D_KL(q(s, z|x)||p(s, z|y, x)) is neglected, as it is non-negative.

4.2 Class probability

Thus far, we have obtained a lower bound to optimize log p(y, x). Inside the lower bound, the term p(y|s) refers to the class probability. Defining a class probability with some discriminator can be problematic, since classes will be different when tested. Instead, the class probability p(y|s) is defined relative to other examples. This definition is inspired by few-shot learning literature.

4.2.1 Embedding Distance in Literature

In the method proposed by Vinyals et al. [5], a modified softmax equation is used to compute the classification prediction (Equation 19). This equation can be modified to output a probability distribution p(y|x) = \prod_{c=1}^{C} \hat{y}_c^{\,y_c}, where C is the total number of classes. The variable S denotes the support set, which contains a few examples with labels. The function d can be an arbitrary distance metric, either a basic function such as Euclidean distance, or a complex function modeled by a deep network. f(x) is an embedding of an input vector x. The embedding function can be learned by a deep network. The equation is suitable for few-shot learning, because it makes a prediction for y, and is defined relative to the support set.

\hat{y} = \sum_{(x', y') \in S} \frac{\exp\big(-d(f(x), f(x'))\big)}{\sum_{(x'', y'') \in S} \exp\big(-d(f(x), f(x''))\big)}\, y'   (19)

There are two distinct challenges when this method is applied to the generative model. Firstly, instead of a point estimate, the encoder predicts distributions. Not every distance metric may lead to meaningful distribution distances. For instance, in variational autoencoders, the variational distribution is often modeled by a multivariate normal distribution. In few-shot literature, d is often modeled by cosine distance. However, two equally likely samples (1, 1) and (-1, -1) from a zero-mean normal distribution point in opposite directions, and therefore have maximal cosine distance. The effect of two commonly used distance metrics is illustrated in Figure 9. Secondly, the classification probability is actually conditioned on the support set S, which has thus far been ignored in the graphical model. The formula p(y|s) needs to be defined relative to the support set, and will therefore be conditioned on the support set. As a result, the class probability will be redefined from p(y|s) to include the support set, p(y|s, S_S, Y_S).

Figure 9: Left: Visualization of the probability landscape of a multivariate normal distribution with mean at (1, 1) and diagonal variances of one. Center: Visualization of the cosine distance between a sample and the point (1, 1). Right: Visualization of the negative squared euclidean distance between a sample and the point (1, 1).

4.2.2 Model Class Probability

The content s is chosen as the embedding for x, as the content variable should contain all necessary information to classify an example. Thus, the conditional probability distribution over y in the model is now defined as in Equation 20. The probability of a class increases when the distance between the content of an example and a support example of that class decreases. To avoid clutter, the normalization constant is written as Z. The variable C denotes the total number of classes in the support set. The subscript c is used to select the c'th component of a vector.

p(y|s, S_S, Y_S) = \prod_{c=1}^{C} \left[ \sum_{(s_s, y_s) \in (S_S, Y_S)} \frac{\exp\big(-d(s, s_s)\big)}{Z}\, y_{s,c} \right]^{y_c}   (20)

For the distance function d, a simple squared euclidean distance is chosen, d(a, b) = ||a − b||². There are two arguments to do so. Firstly, the expected euclidean distance for multivariate normals corresponds to a distance that is intuitively coherent, as depicted in Figure 9. But more importantly, we previously showed that by using euclidean distances, an actual disentanglement can be learned.

In the special case when the support set has only one example per class (i.e. 1-shot), the expectation of the numerator can be analyzed analytically. Assume that the objective function will include the form \mathbb{E}_{s, S_S}\big[\sum_i y_i \log p(y_i|s, S_S, Y_S)\big]. Note then, that for the matching class, the numerator \mathbb{E}_{s \sim \mathcal{N}(\mu,\Sigma),\, s' \sim \mathcal{N}(m,L)}\big[-\log \exp(-d(s, s'))\big] can be simplified into \mathbb{E}_{s \sim \mathcal{N}(\mu,\Sigma),\, s' \sim \mathcal{N}(m,L)}\big[d(s, s')\big], where two arbitrary normal distributions are assumed for s and s'.

Since d is the squared euclidean distance, optimizing the numerator in this special case will minimize the expected squared euclidean distance between two multivariate normal distributions, \operatorname{Tr}(\Sigma + L) + (\mu - m)^T(\mu - m) (Section 2). This term is the combination of the squared distance between means and the sum of all diagonal variances.
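For concreteness, the following NumPy sketch evaluates a 1-shot version of Equation 20 with the squared euclidean distance, plugging the analytic expected distance derived above into the softmax. Note that this exchanges the expectation and the softmax, so it illustrates the quantities involved rather than reproducing the exact objective; the array names are illustrative and not taken from the thesis code.

import numpy as np

def one_shot_class_log_probs(mu_q, sigma_q, mu_support, sigma_support):
    # mu_q, sigma_q:             (D,) mean and std of the query content q(s|x).
    # mu_support, sigma_support: (C, D) means and stds of the 1-shot support contents.
    # Expected distance E[d(s, s_s)] = Tr(Sigma + L) + ||mu - m||^2 (Section 2).
    expected_d = (np.sum(sigma_q ** 2) + np.sum(sigma_support ** 2, axis=1)
                  + np.sum((mu_support - mu_q) ** 2, axis=1))          # shape (C,)
    # Softmax over negative expected distances; Z in Equation 20 is the normalizer.
    logits = -expected_d
    shifted = logits - np.max(logits)                                   # numerical stability
    return shifted - np.log(np.sum(np.exp(shifted)))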

4.3 Support Set

By specifying the class probability, a new concept was introduced: the support set. To be accurate, the graphical model is adapted to include the support set.

The model with support set is depicted in Figure 10. Every training example is now connected to its own support set S_n. A support set example has the same generative process as a normal example. The only difference is that the support set is used to classify the example. Note that to optimize the complete likelihood log p(x, y, X_S, Y_S), all probabilities represented by arrows in the graphical model would need to be defined. This includes an absolute class probability p(y_s|s_s) for the support examples, which is exactly the kind of term the support set was introduced to avoid. Alternatively, the conditional distribution log p(x, y|X_S, Y_S) can be optimized, which does not require a definition for p(y_s|s_s).

Figure 10: Graphical model that includes a support set S. Every example in the dataset is connected to its support set, that defines relatively what class the example belongs to.

4.3.1 The Support-Conditional Log-likelihood

Instead of optimizing log p(x, y, X_S, Y_S), which would maximize the likelihood of all observed data, it is possible to optimize log p(x, y|X_S, Y_S), where the example is conditioned on the support set. Intuitively, this corresponds with the few-shot learning context, where an example is classified given a support set. To integrate the support set with the previously derived model, first the term p(y|s) from Equation 18 is redefined as p(y|s, S_S, Y_S), which changes the left-hand side of the log-likelihood to also condition on S_S and Y_S, as shown in Equation 21.

\log p(y, x|S_S, Y_S) \geq \mathbb{E}_{s \sim q(s|x),\, z \sim q(z|x)}\big[\log p(y|s, S_S, Y_S)\, p(x|s, z)\big] - D_{KL}\big(q(s|x) \,\|\, p(s)\big) - D_{KL}\big(q(z|x) \,\|\, p(z)\big)   (21)

This equation is conditioned on the support content S_S. To obtain the log-likelihood conditioned on the support examples X_S, p(y, x|S_S, Y_S) can be marginalized over the support content S_S with the variational distribution q(S_S|X_S) (Equation 22). This introduces a KL divergence with the intractable posterior distribution, which will be resolved in the next section. Also note that the variational distribution q(S_S|X_S) can be factorized as \prod_{x_s} q(s_s|x_s) and shares the same parameters as for an ordinary example x.

\begin{aligned}
\log p(y, x|X_S, Y_S) &= \log \int p(S_S|X_S, Y_S)\, p(y, x|S_S, Y_S)\, dS_S \\
&= \log \int q(S_S|X_S)\, \frac{p(S_S|X_S, Y_S)}{q(S_S|X_S)}\, p(y, x|S_S, Y_S)\, dS_S \\
&\geq \mathbb{E}_{S_S \sim q(S_S|X_S)}\Big[\log p(y, x|S_S, Y_S) - \log\frac{q(S_S|X_S)}{p(S_S|X_S, Y_S)}\Big] \\
&= \mathbb{E}_{S_S \sim q(S_S|X_S)}\big[\log p(y, x|S_S, Y_S)\big] - \underbrace{D_{KL}\big(q(S_S|X_S) \,\|\, p(S_S|X_S, Y_S)\big)}_{\text{intractable}}
\end{aligned}   (22)

4.3.2 Resolving the Posterior

At first glance, the KL divergence with the posterior seems problematic. Recall that, during the derivation of the probability of an example, the term D_KL(q(s, z|x)||p(s, z|y, x)) was neglected.

In a moment the neglected term will be reintroduced, to cancel the intractable term. To match the neglected term, first the posterior for the support set is redefined so that it includes the style Z_S, p(S_S, Z_S|X_S, Y_S). Note that the first term of the lower bound does not need to include an expectation over Z_S because the term is independent of Z_S, leaving the first term unchanged (Equation 23).

\begin{aligned}
\log p(y, x|X_S, Y_S) &= \log \int\!\!\int p(S_S, Z_S|X_S, Y_S)\, p(y, x|S_S, Y_S)\, dS_S\, dZ_S \\
&\geq \mathbb{E}_{S_S \sim q(S_S|X_S)}\big[\log p(y, x|S_S, Y_S)\big] - \underbrace{D_{KL}\big(q(S_S, Z_S|X_S) \,\|\, p(S_S, Z_S|X_S, Y_S)\big)}_{\text{intractable}}
\end{aligned}   (23)

The intractable term was encountered before in Equation 18, albeit for an ordinary example. Realize that x and X_S are identically distributed, as they come from the dataset. The posteriors for an example and a support set should not differ, as these are also identically distributed. And thus, the expected value of the difference between the terms will become zero, as portrayed in Equation 24. For now, the support set has been assumed to consist of only one example.

\mathbb{E}_{(x, y), (x_s, y_s) \sim P_{data}}\Big[ \underbrace{D_{KL}\big(q(s, z|x) \,\|\, p(s, z|x, y)\big)}_{\text{neglected term}} - \underbrace{D_{KL}\big(q(s_s, z_s|x_s) \,\|\, p(s_s, z_s|x_s, y_s)\big)}_{\text{intractable term}} \Big] = 0   (24)

The expected value is zero when the support set has only one example. In reality, however, the support set always has multiple examples. The training procedure of few-shot learning uses batches of queries. Therefore, the expected value of the difference will be greater than or equal to zero, under the condition that |B| ≥ |S| (the size of the batch is greater than or equal to the size of the support set). Because the procedure to derive the lowerbound is repetitive and notation heavy, the complete derivation is shown in Appendix A.

4.4 Collecting All Components

Collecting components from Equations 18, 23 and 24, the formulation for the model is displayed in Equation 25. The approximation is valid in the expectation over data when the number of samples for support set and queries are balanced, or the number of queries is greater. Although the final term is simplified for a single support set example, the same principle applies for larger support sets, as long as the batch is greater.

\begin{aligned}
\log p(x, y|X_S, Y_S) \geq\; & \mathbb{E}_{S_S \sim q(S_S|X_S)}\Big[\mathbb{E}_{s \sim q(s|x),\, z \sim q(z|x)}\big[\log p(y|s, S_S, Y_S)\, p(x|s, z)\big]\Big] \\
& - D_{KL}\big(q(s|x) \,\|\, p(s)\big) - D_{KL}\big(q(z|x) \,\|\, p(z)\big) \\
& + D_{KL}\big(q(s, z|x) \,\|\, p(s, z|x, y)\big) - D_{KL}\big(q(S_S, Z_S|X_S) \,\|\, p(S_S, Z_S|X_S, Y_S)\big) \\[4pt]
\mathbb{E}_{P_{data}}\big[\log p(x, y|x_s, y_s)\big] \geq\; & \mathbb{E}_{P_{data}}\bigg[ \mathbb{E}_{s_s \sim q(s_s|x_s),\, s \sim q(s|x),\, z \sim q(z|x)}\big[\log p(y|s, s_s, y_s)\, p(x|s, z)\big] \\
& \quad - D_{KL}\big(q(s|x) \,\|\, p(s)\big) - D_{KL}\big(q(z|x) \,\|\, p(z)\big) \bigg]
\end{aligned}   (25)

In summary, the log-likelihood started from a generative process for x. To define class probability, a support set was included. The log-likelihood was updated to condition on the support set, by marginalizing over the posterior. By rewriting the posterior, the expected difference between two intractable terms will be positive. In the last step components were collected, and all terms of the equation can be computed.

4.5 Inference

During classification the term p(y|x, X_S, Y_S) is maximized. Maximizing this term is equivalent to maximizing the joint probability, as depicted in Equation 26. Since all other terms in the joint probability are independent of y, practically only the class probability term needs to be maximized. Technically, the expected value should be computed; however, approximating samples with the mean of the distribution did not affect performance significantly.

\operatorname*{argmax}_{y}\big[\log p(y|x, X_S, Y_S)\big] = \operatorname*{argmax}_{y}\big[\log p(y, x|X_S, Y_S) - \log p(x|X_S, Y_S)\big] = \operatorname*{argmax}_{y}\big[\log p(y, x|X_S, Y_S)\big]   (26)
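A minimal sketch of this inference rule, assuming the expectations are replaced by the means of the content distributions as described in Section 4.6 (the names are illustrative, not the thesis code):

import numpy as np

def classify(mu_query, mu_support, labels_support):
    # mu_query:       (D,) content mean of the query.
    # mu_support:     (K, D) content means of the support examples.
    # labels_support: (K,) non-negative integer class labels of the support examples.
    kernel = np.exp(-np.sum((mu_support - mu_query) ** 2, axis=1))
    # Per-class score proportional to the summed kernel (Equation 20); the normalizer Z
    # and log p(x|X_S, Y_S) do not depend on y and can therefore be dropped.
    scores = np.bincount(labels_support, weights=kernel)
    return np.argmax(scores)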

4.6 Implementation

A variational autoencoder is defined with two latent variables, s and z, with n_latent values each. The content code s is an input for the reconstruction and the class probability. The style code z is only used to reconstruct an example. The term p(x|s, z) is modeled by the decoder and represents the reconstruction error. This term is modeled with a Bernoulli loss, such that log p(x|s, z) = \sum_i x_i \log \hat{x}_i, where \hat{x} = Dec(s, z) and the summation is over pixel values. The terms q(s|x) and q(z|x) are modeled by the encoder with multivariate normal distributions that have diagonal variances, such that s ∼ N(µ_s(x), Iσ_s²(x)) and z ∼ N(µ_z(x), Iσ_z²(x)). The D_KL terms limit the divergence between the variational and the prior distributions and can be obtained analytically (Section 2). Unless mentioned otherwise, expectations for ordinary examples are approximated with a single sample. Expectations for the support set are approximated with the mean of the distribution.
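To make the combination of terms concrete, the following sketch (an illustration under the modeling choices just described, not the thesis code) evaluates the negative of the bound in Equation 25 for a single query; class_log_probs is assumed to be the precomputed vector log p(y|s, S_S, Y_S) over classes.

import numpy as np

def training_loss(x, y, mu_s, sigma_s, mu_z, sigma_z, x_hat, class_log_probs):
    # x, x_hat:        flattened pixel values and decoder output Dec(s, z).
    # y:               index of the correct class within the support set.
    # (mu_s, sigma_s): parameters of q(s|x); (mu_z, sigma_z): parameters of q(z|x).
    reconstruction = np.sum(x * np.log(x_hat))          # log p(x|s, z), Bernoulli term
    few_shot = class_log_probs[y]                        # log p(y|s, S_S, Y_S)
    kl_s = 0.5 * np.sum(-2 * np.log(sigma_s) + sigma_s ** 2 + mu_s ** 2 - 1.0)
    kl_z = 0.5 * np.sum(-2 * np.log(sigma_z) + sigma_z ** 2 + mu_z ** 2 - 1.0)
    return -(reconstruction + few_shot) + kl_s + kl_z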

In principle, the model is not constrained to learn two completely separate embeddings. The model could use s only for classification and z only for reconstruction. However, making use of the latent variables incurs a small penalty via the Kullback-Leibler divergence between the prior and the variational distribution. This term can be seen as a regularizer on the latent codes. Since the classification pushes content codes s apart, useful information for reconstruction is also available in s. Although the model can theoretically choose to put duplicate information in z, this would incur an additional penalty on the KL divergence of z. And thus, in optimal conditions, the model saves content information in s and style information in z.

5 Datasets

Few-shot learning scenarios differ from conventional classification tasks in machine learning. The concept of few-shot learning is that a few examples with label information are presented, known as the support set. Also, a different example is presented without the label. The task is to classify the example by using the support set. In general, the support set contains the same number of examples per class, and the class of the example is always in the support set. We define two variables that describe the few-shot learning setting: n_way denotes the number of classes in the support set, and n_shot denotes how many examples per class are in the support set. For instance, when n_shot is 1 and n_way is 5, this is a 5-way 1-shot classification problem.

Importantly, during evaluation the classes in the support set have never been seen before. To evaluate models related to few-shot classification, we use four different datasets, differing in size and complexity. In this section we will first define the few-shot learning episode. Subsequent sections present the details of four different datasets.

5.1 Episodes

Suppose we have some pool of training examples D_train and test examples D_test. Each training example belongs to a class. We choose D_train and D_test such that a class is exclusively present in only one of the sets.

Few-shot learning is comprised of episodes: a few examples are given with label information, and a new example needs to be classified. We describe an episode following the procedure of [5]. For each episode, we take n_way different classes L ∼ D. For each class, we sample n_shot examples as the support set S ∼ L. Also, we sample a batch B ∼ L with n_queries examples per class, for the n_way different classes. We make sure that S and B are disjoint, i.e. they contain different samples. The task now is to classify B with the information in S. An example of a session is shown in Figure 11.

The classes in D_train are different from the classes in D_test. Therefore, a successful model will have to effectively use the limited information in the support set to make the correct prediction.
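A short sketch of the episode construction just described, assuming the dataset is given as a mapping from class label to a list of examples with at least n_shot + n_queries examples per class (names are illustrative):

import random

def sample_episode(data_by_class, n_way, n_shot, n_queries):
    # Sample one few-shot episode: a support set S and a disjoint query batch B.
    classes = random.sample(list(data_by_class), n_way)
    support, queries = [], []
    for c in classes:
        examples = random.sample(data_by_class[c], n_shot + n_queries)
        support += [(x, c) for x in examples[:n_shot]]
        queries += [(x, c) for x in examples[n_shot:]]
    return support, queries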


Figure 11: Configuration of a few-shot learning session. On the left side the support set is shown. For every example the correct label is known. The right side shows the image that needs to be classified.

5.2 MNIST

The MNIST dataset of handwritten digits is a practical dataset to test new algorithms and architectures. The training set contains 60000 examples and the test set 10000 examples, with 10 different classes. Images are 28 by 28 pixels. In Figure 12 a random sample of the dataset is shown.

Figure 12: Random samples from the MNIST dataset.

MNIST is not particularly well suited to evaluate few-shot learning performance. There is only a limited number of classes, with many examples per class. The MNIST dataset is mainly used to show disentanglements and examine latent variables. Actual few-shot learning evaluation results are presented on more complicated and better suited datasets.

5.3 Omniglot

The Omniglot dataset was created by Lake et al. to test algorithms when only a handful of labelled examples are available [13]. The dataset contains 1623 different characters from 50 different alphabets. For every character, there are only 20 different examples available. In Figure 13, 20 samples from the dataset are presented. To preprocess the data, the procedure from [5] is followed. Images are resized to 28 by 28 pixels. The first 1200 characters are training data, which is augmented with rotations of 0, 90, 180 and 270 degrees. The remaining 423 characters are used for evaluation. In contrast with [7], we do not augment test data unless specified otherwise.

Figure 13: Random samples from the Omniglot dataset. Images are resized to 28 by 28 pixels and colors are inverted.

5.4 miniImageNet

The miniImageNet dataset was created by Ravi et al. to have a more difficult baseline for few-shot classification. Derived from the original ImageNet, the dataset contains only 100 different classes, with 600 examples per class. The dataset is split up in 64 train classes, 16 validation classes and 20 test classes. To preprocess the data, pixel values are rescaled to the range [0, 1], by dividing by 255. Images are resized to 84 by 84 pixels. The train images are rotated by 0, 90, 180 and 270 degrees to create more image classes. In Figure 14 a few random examples from the miniImageNet test set are presented.


Figure 14: Random samples from the miniImageNet test dataset.

5.5 Quick, Draw

The Quick, Draw dataset has been collected by Google Creative Lab. Users were asked to draw a concept, based on a textual description, such as "airplane" or "Eiffel Tower". This can lead to very different drawings of the same concept. For instance, the description "clock" led some users to draw an analogue clock, and others a digital one.

While users were drawing a concept, a recurrent neural network was guessing what the user was trying to draw. Also, users were limited to 20 seconds within which they had to draw the concept. A session finished either after 20 seconds, or when the network guessed the concept correctly. As a consequence, some images might be incomplete. Both Omniglot and Quick, Draw use a process where users are asked to draw a concept. The difference is that Omniglot users were presented with a visual example, while Quick, Draw users were presented with a concept, which can be expressed in many different ways. Therefore, the intra-class variation in Quick, Draw is not only caused by drawing style, but also by the interpretation of the concepts to draw.

In total there are 345 classes, which we split into 275 train, 35 validation and 35 test classes. Each class contains numerous images, but we use only the first 100 images per class. Each image is 28 by 28 pixels in gray scale. A few samples from the dataset are depicted in Figure 15.

6 Experiments

In this section the experiments with the model are discussed. The first part gives an overview of the techniques that were used. The second part presents and analyzes the results.

6.1 Setup

In this section the experimental setup is detailed. First a non-standard batch normalization layer is introduced, because samples from the data are not identically distributed. Then the network architectures and hyperparameter configurations are discussed.

6.1.1 Moving Average Batch Normalization

Batch normalization [14] has significantly improved deep learning optimization in some instances. However, using batch normalization may be problematic when samples in a batch are not independent and identically distributed. Since a batch is skewed with only n_way different classes, we may experience high variance in the first and second moments over different batches. Instead, we propose a simple method resembling [15], where moving averages are used at train and test time. In the pseudocode below the exact mechanism is specified. Note that x is an input and y is the corresponding output. Furthermore, beta and gamma are parameters trained with backpropagation. The moment variables are not trained, but updated as specified.

# Pseudocode. beta and gamma are trained with backpropagation;
# mu and sigma are moving-average moment estimates and are not trained.
mu, sigma, beta, gamma = init()

def moving_average_norm(x, is_training, decay):
    # Normalize with the moving-average moments, then shift and scale.
    y = (beta + (x - mu) / sigma) * gamma
    if is_training:
        # Move the running moments towards the moments of the current batch.
        mu_b, sigma_b = compute_moments(x)
        mu = mu + decay * (mu_b - mu)
        sigma = sigma + decay * (sigma_b - sigma)
    return y

6.1.2 Architecture

The model can be separated into two distinct parts: an encoder for content and style, and a decoder for reconstructions. For computational efficiency, we use only one encoder that outputs both content and style. This means that weights are shared for content and style. The encoder architecture is largely inspired by [7], because their network achieved state-of-the-art performance at the time of writing.

Different from [7], we choose to have only 3 max pooling layers. Furthermore, we pad feature maps during pooling so that no information is discarded. The last layer is a fully connected layer that outputs µ_s, log σ_s, µ_z and log σ_z, corresponding to the distributions s ∼ N(s|µ_s, Iσ_s²) and z ∼ N(z|µ_z, Iσ_z²).

Table 1: Encoder architecture. The input is an image with 1 channel and 28 by 28 pixels; the outputs are µ_s, log σ_s, µ_z, log σ_z.

Name       Feature maps      Output size
input      1                 28, 28
conv1      1 · n_filters     28, 28
max pool   1 · n_filters     14, 14
conv2      2 · n_filters     14, 14
max pool   1 · n_filters     7, 7
conv3      4 · n_filters     7, 7
max pool   1 · n_filters     4, 4
conv4      8 · n_filters     4, 4
fc         4 · n_latent      1, 1
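For illustration, a possible PyTorch rendering of Table 1 is sketched below. The kernel sizes, activations and padded-pooling details are assumptions (the table does not specify them), pooling preserves the channel count of the preceding convolution, and this is not the implementation used in the thesis.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_filters=128, n_latent=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, n_filters, 3, padding=1), nn.ReLU(),               # 28x28
            nn.MaxPool2d(2),                                                 # 14x14
            nn.Conv2d(n_filters, 2 * n_filters, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                                 # 7x7
            nn.Conv2d(2 * n_filters, 4 * n_filters, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),                                 # 4x4 (padded pooling)
            nn.Conv2d(4 * n_filters, 8 * n_filters, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(8 * n_filters * 4 * 4, 4 * n_latent)

    def forward(self, x):
        h = self.features(x).flatten(start_dim=1)
        mu_s, log_sigma_s, mu_z, log_sigma_z = self.fc(h).chunk(4, dim=1)
        return mu_s, log_sigma_s, mu_z, log_sigma_z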

The decoder takes s and z as inputs and transforms them with a fully connected layer into a feature map with shape (4, 4, 8 · n_filters). In subsequent layers, the resolution is increased by a resizing followed by a convolution operation. We choose to increase the resolution by factors of 2, and therefore the final 32 by 32 output needs to be cropped such that we obtain a 28 by 28 output. In Table 2 the specifics of the decoder are presented.

Table 2: Decoder architecture. The inputs are s and z; the output is an image with 28 by 28 pixels.

Name       Feature maps      Output size
input      2 · n_latent      1, 1
fc         8 · n_filters     4, 4
upsample   8 · n_filters     8, 8
conv1      8 · n_filters     8, 8
upsample   4 · n_filters     16, 16
conv2      4 · n_filters     16, 16
upsample   2 · n_filters     32, 32
conv3      2 · n_filters     32, 32
conv4      1                 32, 32
slice      1                 28, 28

6.1.3 Configuration

6.1.3 Configuration

During training, Adam [16] is used to optimize the network. The model is trained for 50000 iterations with a learning rate of 1e-4. Since the reconstruction loss heavily outweighs the few-shot loss, the few-shot term log p(y | s, S^S, Y^S) is magnified. To match training procedures from the literature closely, all other terms in the loss are divided by a factor λ, instead of multiplying the few-shot term by λ. The factor λ was set to 1000 experimentally.
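A minimal sketch of this weighting; the grouping of the generative terms into a single value is purely illustrative.

def total_loss(generative_terms, few_shot_loss, lam=1000.0):
    # Dividing the generative terms by lambda is equivalent, up to an overall
    # scale, to multiplying the few-shot term by lambda.
    return generative_terms / lam + few_shot_loss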

Every iteration, a support set of n_way · n_shot samples is drawn. Additionally, n_way · n_queries samples are drawn to be classified. The authors of [7] generally observed increased performance when training with a higher n_way. Therefore, n_way is set to 30. An overview of all additional parameters can be found in Table 3.


Table 3: Configuration for experiments during training.

    Name             Value
    n_filters        128
    n_latent         32
    learning rate    1e-4
    λ                1000
    n_way            30
    n_shot           1
    n_queries        15
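As an illustration of the episodic sampling described above, a minimal sketch is given below. The function itself and the detail that classes are drawn without replacement are assumptions.

import numpy as np

def sample_episode(images, labels, n_way=30, n_shot=1, n_queries=15):
    # Draw n_way classes, then split their examples into support and query sets.
    classes = np.random.choice(np.unique(labels), size=n_way, replace=False)
    support, support_y, queries, queries_y = [], [], [], []
    for i, c in enumerate(classes):
        idx = np.random.permutation(np.where(labels == c)[0])
        support.append(images[idx[:n_shot]])
        queries.append(images[idx[n_shot:n_shot + n_queries]])
        support_y += [i] * n_shot
        queries_y += [i] * n_queries
    return (np.concatenate(support), np.array(support_y),
            np.concatenate(queries), np.array(queries_y))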

6.2 Evaluation

The model is evaluated on the Omniglot, miniImageNet and Quick, Draw datasets. This section tests the two hypotheses that were introduced in the introduction: (1) the model actually learns a disentanglement, which can be tested by visualizing reconstructions under perturbations of the content and style variables; and (2) if the model successfully learns a disentanglement, few-shot classification performance improves. The effect of disentanglement is tested by comparing the model to a deep network with the same architecture, but without the generative loss.

6.2.1 Omniglot

The model is trained on the Omniglot dataset with the hyperparameter settings described in section 6.1. The disentanglement of the model is visualized as in previous sections: indirectly, by reconstructing images from perturbed content and style variables, and directly, by visualizing the structure of the high-dimensional content and style variables with stochastic neighbourhood embedding.

Interpolations of the content and style variables are depicted in Figure 16. These pictures show to what extent content and style have been disentangled. The interpolation results show that the network has successfully learned to encode the stylistic translation and scale attributes in z, since these tend to change vertically. Also, the content of an image tends to change horizontally, which confirms that content is encoded in s.


Figure 16: Interpolations between the latent variables of two images. The upper left and the bottom right are reconstructions from the test set. All other images are linear interpolations over content s and style z. The content variable s changes horizontally, the style z changes vertically.
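A minimal sketch of how such an interpolation grid can be constructed; decode stands in for the trained decoder and is hypothetical.

def interpolation_grid(decode, s_a, s_b, z_a, z_b, steps=8):
    rows = []
    for j in range(steps):                  # style varies along the vertical axis
        row = []
        for i in range(steps):              # content varies along the horizontal axis
            s = s_a + (s_b - s_a) * i / (steps - 1)
            z = z_a + (z_b - z_a) * j / (steps - 1)
            row.append(decode(s, z))
        rows.append(row)
    return rows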

Reconstructions of examples with interchanged variables are shown in Figure 17. Note that all characters in a column still have the general shape of the original example, which again demonstrates that content is modeled by s. In addition, note that the actual location, size and roundness are modeled by z, as one would expect of style.

The pictures also show the limitations of the model: the reconstructions can be blurry, and sometimes lack certain strokes. Recall that Omniglot has only 20 examples per class, and the model has never actually seen the classes in the test set before. Another reason for blurriness is the reconstruction loss, which is formulated pixel-wise. This penalizes the model heavily for reconstructions that are translated slightly.

Figure 17: Interchanging the content and style of examples. Reconstructions are generated by taking s from the column and z from the row. Images from the dataset that are used as input are depicted at the sides.


The structure of the representation is visualized in Figure 18. The latent variables for content and style are embedded into a 2D manifold. Each test example is represented by a dot, where the color denotes the class. Since the examples are ordered by alphabet, similar colors often correspond to letters in the same alphabet. For every class, precisely one example is shown as an image. Some cluster annotations are provided to make the diagrams more understandable; note that these are subjective and not necessarily complete.

The top embedding shows the structure of s, representing content. Images with similar classes lie close together. For instance, a group of ‘o’-shaped characters clusters together. Furthermore, a few box-shaped characters visibly form a cluster. In general, characters from the same alphabet are grouped together more than others. There seems to be no real pattern for other factors: for instance, the characters in the ‘o’-group have very different sizes, and still their s variables lie close together.

The bottom embedding depicts the style z. In contrast with the previous embedding, now grouping based on location, scale and other factors is expected. Some clusters are annotated intuitively to demonstrate the different styles. For instance, the top-group contains images that are drawn relatively high in the image. Thus, the diagrams illustrate that content is indeed modeled by s, and style is modeled by z.
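The two-dimensional embeddings can, for instance, be computed with t-SNE. The snippet below is a sketch assuming scikit-learn; it is not necessarily the tool used for Figure 18.

from sklearn.manifold import TSNE

def embed_latents(latents, perplexity=30):
    # Project the content (or style) variables of the test examples to 2D.
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(latents)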

Figure 18: Embeddings of the latent variables of test examples into two dimensions. Top: the content variable s. Bottom: the style variable z. Each dot is a test example, colored by class.

The classification performance is presented in Table 4. Since the authors of [7] evaluated on augmented test data, performance is reported on both normal and augmented test data. It is worth mentioning that augmenting the test data increases accuracy, but may not be a realistic problem setting. Furthermore, the disentangled VAE outperforms the baseline models consistently in every setting. The last row denotes the performance of the same architecture without the generative loss, and therefore without disentangled representations. The performance drops consistently for all tasks, and thus learning disentangled representations significantly improves few-shot classification.

Table 4: 20-way classification performance on the Omniglot dataset. The ‘+’ sign denotes that performance is measured on an augmented test dataset (90 degree rotations).

    Method                       1-shot    1-shot+    5-shot    5-shot+
    Matching Networks [5]        93.8%     -          98.7%     -
    Prototypical Networks [7]    -         96.0%      -         98.9%
    Disentangled VAE             95.9%     97.0%      98.8%     99.1%
    Only few-shot loss           94.8%     96.0%      98.4%     98.9%

6.2.2 miniImageNet

The miniImageNet images were resized to 84 by 84 pixels, matching [5, 7]. To handle the increased resolution, an additional max pooling layer is added after conv4 in the encoder. The reconstruction task is simplified by re-sizing the target to 32 by 32 pixels, and thus the final slicing layer is removed from the decoder.

Experiments showed that the model performed worse than the baseline models. In limited data settings, the network relies on heavy regularization. With Omniglot, this problem did not really occur, but on a more complex dataset such as miniImageNet the network quickly overfits. However, tests on Omniglot revealed that removing the fully connected layer significantly degrades the quality of the learned disentanglement. Visualizations of the reconstructions also reveal that the network does not disentangle anything, as the decoder relies only on z (Figure 19). Furthermore, the network learns to create vague reconstructions, showing that the generative model itself is limited.


Figure 19: Interchanging the content and style of examples. Reconstructions are generated by taking s from the column and z from the row. Images from the dataset that are used as input are depicted at the sides.

The miniImageNet dataset is complex compared to Omniglot, and the classes exhibit large variation within each class. The capacity of the VAE is limited for ImageNet-scale images, and the architecture changes needed for disentangling remove important regularization. To achieve better performance, the model needs to be revised such that these problems are addressed.

6.2.3 Quick, Draw

The Quick, Draw dataset is in the same format as Omniglot, and thus the same settings as described in section 6.1 are used. The few-shot classification performance on Quick, Draw is difficult to report, since the validation accuracy of both the baselines and the disentangled VAE did not converge. Even when the model converged during training, the validation error fluctuated rapidly. Overall, the performance of the baseline is better, because the same problem as with miniImageNet occurs: the network overfits because of the fully connected layer.

Interestingly, although somewhat vague, a disentanglement is actually learned. Analogous to the previous visualizations, the interpolation and interchanging of s and z are depicted in Figure 20. The pictures show that stylistic attributes such as rotation are modeled by z, while the general shape of an object is modeled by s.

Quick, Draw resembles Omniglot in certain ways, but there are also important differences. In Omniglot, the variation is caused by the drawing style of the user, while the images in Quick, Draw also vary because of differences in interpretation. For instance, disentangling the variation in clocks would require the model to encode “analogue” or “digital” in the style z. However, for another class such as airplane, this attribute might not have any meaning. Thus, to some degree, disentangled representations are learned on Quick, Draw, but they may be impeded because the stylistic variations are often limited to a single class.


Figure 20: Left: Interpolations between the latent variables of two images. The upper left and the bottom right are reconstructions from the test set. All other images are linear interpolations over content s and style z. The content variable s changes horizontally, the style z changes vertically. Right: Interchanging the content and style of examples. Reconstructions are generated by taking s from the column and z from the row.

6.3 Expectation of Support Set

The loss function that is optimized, defined in section 4, consists of multiple expectations: one for the example that needs to be classified, and one for the support set. Since optimization is performed in batches, the queries are approximated with a single sample. For the support set, we test two options:

1. E_{S^S ∼ q(S^S | X^S)}[·] is approximated with a single sample for each support set item, for each query.

2. E_{S^S ∼ q(S^S | X^S)}[·] is approximated by the overconfident estimate µ_s.

In the end, no significant difference in classification performance or learning behaviour was encountered. Also, the learned disentanglement did not look visually different. All presented results are reported on models trained with the second method, as it is the most straightforward to implement.
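A minimal sketch of the two options, with illustrative names:

import numpy as np

def support_content(mu_s, log_sigma_s, use_sample=False):
    if use_sample:
        # Option 1: a single reparameterized sample per support item.
        return mu_s + np.exp(log_sigma_s) * np.random.randn(*mu_s.shape)
    # Option 2: the overconfident estimate mu_s.
    return mu_s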

6.4 Discussion

In this section, the results and conclusions of the experiments are summarized. Furthermore, some intuitive insights from empirical observations are provided. The first part discusses results related to architecture changes; the second part discusses performance on the different datasets.


6.4.1 Architecture

The architecture for the encoder is designed so that it matches Prototypical Networks [7] closely, since they achieved state-of-the-art performance at that time on both Omniglot and miniImageNet. However, to make the model suitable for disentangled representation learning, some aspects have to be modified.

Experiments with different network architectures on Omniglot showed that disentangled representations are learned when a fully connected layer is used in the encoder, between the last convolutional layer and the latent representation. A downside to fully connected layers is that in some instances they easily overfit, which caused the model to perform worse on more complex datasets.

Prototypical Networks used max pooling layers to reduce the resolution. Max pooling layers are insensitive to small translation perturbations, which improves regularization but reduces precision. With generative modelling, ideally the model would be more sensitive to these perturbations. However, strided convolutions impeded performance drastically. Instead, the fourth max pooling layer is removed to retain some sensitivity. In contrast with Prototypical Networks, the max pooling operations are padded, because this allows them to retain more information.

6.4.2 Performance

The Disentangled VAE showed a large performance increase over existing methods on the Omniglot dataset. However, it had more difficulty with miniImageNet and Quick, Draw.

The hypothesis that learning disentangled representations can be combined with few-shot classification is confirmed by the direct and indirect visualizations of s and z. The visualizations confirm that s models content and z models style on Omniglot. Moreover, few-shot classification performance on Omniglot is improved by utilizing disentangled representations, confirming the second hypothesis.

The miniImageNet and Quick, Draw datasets are inherently more difficult. The reconstructions from a VAE tend to be vague and without much detail. Generative modelling combined with deep learning is a relatively new area of research, and future developments could play a crucial role in making this method work on more complex datasets. In addition, disentangling representations might be less helpful when style attributes are not shared between classes: being able to disentangle the property analogue or digital does not improve airplane classification.

6.4.3 Model Framework

The mathematically derived model optimizes a lower bound of the conditional log-likelihood. Furthermore, the variational distribution is constrained to be a multivariate normal distribution with a diagonal covariance. Nonetheless, the model has learned to disentangle representations, reconstruct images, and perform few-shot classification. Potentially, less restrictive approximations can improve the quality of the reconstructions and the classification performance.
