
Artificial Intelligence

Master Thesis

A Unified Framework for Conditional

Image Generation and Manipulation

by

David Stap

10608516

October 11, 2019

36 EC, February 2019 - October 2019

Supervisors:

Maartje ter Hoeve MSc

Maurits Bleeker MSc

Assessor:

Prof. dr. Cees G.M. Snoek


Abstract

In recent years, Generative Adversarial Networks have improved steadily towards generating increasingly impressive real-world images. However, most of this work concerns the unconditioned case, where there is no control over the data being generated. By conditioning the model on additional information, such as binary attributes or a textual description, it is possible to steer the image generation process. Despite this conditioning, there still exists a large set of images that adheres to a particular set of attributes or description. This makes it unlikely that, when providing the model with conditioning, the output is exactly as envisioned by a user. To close this gap, we introduce a unified framework for conditional image generation and manipulation. We propose a novel conditional Generative Adversarial Network, building on StyleGAN. The model is conditioned either on a large number of attributes or on a textual description. Results from the attribute based model on CelebA-HQ are similar to state-of-the-art unsupervised results, and results for the textual model on CUB and COCO are comparable to the state of the art for text-to-image synthesis. We introduce a dataset consisting of faces and corresponding textual descriptions. Finally, we introduce and compare two approaches for semantic image manipulation. The approaches work by finding semantic relations in latent space. We show that the weights of our conditional model can be used for facial manipulation for a wide range of attributes, effectively unifying conditional image generation and semantic manipulation in a single model.


Acknowledgments

Writing this thesis has (mostly) been a joy, and I am thankful for the people who were involved in the process. First, I would like to thank Maartje for the freedom she gave me to pursue my interests, and for encouraging me to apply for academic positions. Second, Maurits, who got involved at a later stage and provided valuable feedback and insights. I also want to thank the TROI team at the Amsterdam police force for a lot of random talks, much needed lunch breaks, and an overall great workplace.

I am also grateful for the support of my family and friends. Thanks for always being there, and providing much needed distractions.


Contents

List of Acronyms
List of Tables
List of Figures
Notation

1 Introduction
   1.1 Research questions
   1.2 Contributions

2 Background
   2.1 Generative Adversarial Networks
   2.2 Conditional Generative Adversarial Networks
   2.3 Evaluating (conditional) Generative Adversarial Networks
      2.3.1 Inception Score
      2.3.2 Fréchet Inception Distance
      2.3.3 Perceptual path length
   2.4 Neural Style transfer
   2.5 Mixed precision training

3 Related work
   3.1 Unconditional image synthesis with Generative Adversarial Networks
   3.2 Relationship between images and natural language
   3.3 Conditional image synthesis with Generative Adversarial Networks
      3.3.1 Conditioned on text
      3.3.2 Our position
   3.4 Image manipulation
      3.4.1 Semantic image manipulation
      3.4.2 Our position

4 Experimental setup
   4.1 Evaluation
      4.1.1 R-precision
   4.2 Description of our models
      4.2.1 Conditional image generation
      4.2.2 Image manipulation models

5 Datasets
   5.1 Caltech-UCSD Birds-200-2011
   5.2 Large-scale CelebFaces Attributes High Quality
   5.3 Microsoft Common Objects in Context
   5.4 IMPA-FACE3D
   5.5 Flickr-Faces High Quality

6 Conditional image generation
   6.1 Binary attributes
      6.1.1 Results
   6.2 Textual description
      6.2.1 Results
   6.3 Takeaway

7 Semantic manipulation of generated images
   7.1 Image to latent StyleGAN representation
      7.1.1 Embedding algorithm
      7.1.2 Results
   7.2 Latent space manipulation
      7.2.1 Vector arithmetic
      7.2.2 Latent sample classification
      7.2.3 Results

8 Discussion and conclusion
   8.1 Answers to research questions
   8.2 Contributions
   8.3 Limitations and future work


Acronyms

AdaIN  Adaptive Instance Normalization
AttnGAN  Attentional Generative Adversarial Network
CA  Conditioning Augmentation
CelebA  Large-scale CelebFaces Attributes dataset
CelebA-HQ  Large-scale CelebFaces Attributes High Quality dataset
CelebTD-HQ  Large-scale CelebFaces Textual Descriptions High Quality dataset
cGAN  conditional Generative Adversarial Network
CMPC  Cross-Modal Projection Classification
CMPM  Cross-Modal Projection Matching
CNN  Convolutional Neural Network
COCO  Microsoft Common Objects in Context dataset
CUB  Caltech-UCSD Birds-200-2011 dataset
DAMSM  Deep Attentional Multimodal Similarity Model
DM-GAN  Dynamic Memory Generative Adversarial Network
FFHQ  Flickr Faces High Quality
FID  Fréchet Inception Distance
GAN  Generative Adversarial Network
IS  Inception Score
KL divergence  Kullback-Leibler divergence
leakyReLU  leaky Rectified Linear Unit
MLE  Maximum Likelihood Estimation
PCFG  Probabilistic Context-Free Grammar
ReLU  Rectified Linear Unit
RNN  Recurrent Neural Network
SD-GAN  Semantics Disentangling Generative Adversarial Network
VAE  Variational Autoencoder


List of Tables

6.1 Fréchet Inception Distance for the attrStyleGAN model on CelebA-HQ.
6.2 Attribute classification scores for images generated with attrStyleGAN, textStyleGAN and for the original CelebA-HQ dataset.
6.3 Inception Score for various text-to-image models on the CUB and COCO datasets.
6.4 Fréchet Inception Distance for our text-to-image models on CelebTD-HQ.
6.5 R-precision scores for various text-to-image models on CUB and COCO.
6.6 Perceptual path length scores for StyleGAN trained on FFHQ and textStyleGAN trained on CelebTD-HQ.
7.1 Emotion manipulation scores for our vector arithmetic and sample classification methods.
7.2 Percentage of correct attribute changes after manipulation for our sample


List of Figures

1.1 Mugshots and corresponding forensic and composite sketches.
2.1 Generative Adversarial Network architecture.
2.2 conditional Generative Adversarial Network architecture.
3.1 StyleGAN Generator architecture.
3.2 Conditioning Augmentation module.
3.3 Attentional Generative Adversarial Network Generator architecture.
5.1 Probabilistic Context-Free Grammar used to generate captions for our CelebTD-HQ dataset.
6.1 attrStyleGAN Generator architecture.
6.2 Qualitative results for our attrStyleGAN model trained on the Large-scale CelebFaces Attributes High Quality dataset (CelebA-HQ).
6.3 Qualitative results for our textStyleGAN model trained on CelebTD-HQ.
6.4 Qualitative results for our textStyleGAN model trained on CUB.
6.5 Qualitative results for our textStyleGAN model trained on COCO.
7.1 Original images and their recovered latent representations.
7.2 Interpolations in latent space.
7.3 Occluded images and their recovered latent representations.
7.4 Rotated images and their recovered latent representations.
7.5 Example of disentanglement.
7.6 Qualitative results for our sample classification method, smile direction.
7.7 Qualitative results for our sample classification method, gender direction.
7.8 Qualitative results for our sample classification method, age direction.
7.9 Qualitative results for our sample classification method, circular artifact direction.


A note on notation

Conventions

The following conventions are used throughout this thesis: we use standard weight letters ($x$) for scalars, bold lower case letters ($\mathbf{x}$) to denote vectors and bold upper case letters ($\mathbf{X}$) to denote matrices. Subscripts with bold letters ($\mathbf{x}_i$) will be used to refer to entire rows and subscripts with standard weight letters ($x_i$) for specific elements. Subscripts are also used to denote different variables ($\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{h}_{\text{fwd}}$, $\mathbf{h}_{\text{bwd}}$). To avoid ambiguity, we sometimes use bracketed superscripts for the former purpose instead ($\mathbf{X}_1^{(i)}$, $X_1^{(i)}$, $\mathbf{h}_{\text{fwd}}^{(i)}$, $\mathbf{h}_{\text{bwd}}^{(i)}$).

Architecture figures

When drawing architecture figures, we use the following color scheme: green blocks are trainable, purple blocks represent a sampling operation and grey blocks are fixed.


Chapter 1

Introduction

“I do not fear computers. I fear the lack of them.”

- Isaac Asimov

Facial sketches are commonly used in law enforcement to assist in identifying suspects involved in a crime when no facial image of the person is available. Sketches are based on a description provided by the witness. Forensic sketches (drawn by forensic artists) have been used in criminal investigations dating as far back as the 19th century [McQuiston-Surrett et al., 2006]. Typically, forensic sketch artists require a few years of training to become proficient in drawing sketches [Klum et al., 2013]. In recent years, composite sketches (created with computer software) have become a popular and more affordable alternative. This method only requires several hours of training [Klum et al., 2013], and relies on choosing from a set of facial components (e.g. type of nose, mouth, hair, etc.) based on the witness description. Criminal psychology research by Frowd et al. [2005] shows that recall of facial features is significantly reduced after a delay of two days when compared to a relatively short time difference (3-5 hours) [Koehn and Fisher, 1997; Bruce et al., 2002]. Both sketching methods are time consuming and produce unrealistic results (relative to photo-realism, see Figure 1.1 for examples), which impedes efficient face recognition, i.e. the matching of composites with existing mugshot databases maintained by law enforcement agencies [Wang et al., 2018]. Thus, there is a need for a method that produces photo-realistic facial images given a fine-grained description in a timely manner.

The advent of new deep learning techniques for generative modeling has led to a resurgence of interest in the topic within the artificial intelligence community. In generative modeling, we are given a dataset consisting of samples x drawn from some probability distribution p_data(x). These samples x are then used to derive the unknown real data distribution p_data(x). The generative model encodes a distribution over new samples, p_generative(x; θ), with the aim of finding a generative distribution such that p_generative(x) ≈ p_data(x) according to some metric. Recently, generative models have been applied successfully to a wide variety of tasks, most notably the generation of photo-realistic images [Karras et al., 2018b], music generation [Dong et al., 2018], natural language generation [Fedus et al., 2018] and medical data generation [Esteban et al., 2017]. While being best known for applications like these, generative models hold the potential for unsupervised representation learning: to learn the natural features of a dataset without supervision, i.e. without labels [Dumoulin et al., 2016; Donahue et al., 2016; Donahue and Simonyan, 2019]. To see this, observe that the neural networks that are used as generative models have a number of parameters significantly smaller than the amount of data we train them on, so the models are forced to discover and efficiently internalize the essence of the data in order to generate it. The intuition behind the approach of generating data similar to some (large) dataset follows a quote from physicist Richard Feynman: "What I cannot create, I do not understand."

Most generative models provide no way to generate data based on user input. While some work on attribute based generation [Yan et al., 2016; Di and Patel, 2017] has been done, these models are conditioned on a small number of attributes and thus lack fine-grained control over the outputs.


Figure 1.1: Ten examples of (a) mugshots, (b) corresponding forensic sketches and (c) composite sketches. The composite sketches are drawn using FACES 4.0 software. Image courtesy Klum et al. [2013].

Arguably, natural language offers the most general and flexible interface for describing objects in any space of visual categories. For this reason, other researchers have used textual descriptions to generate images that conform to the description [Reed et al., 2016b,a; Zhang et al., 2017; Xu et al., 2018], but since no datasets of faces with corresponding textual descriptions exist, this method cannot be used to generate facial images. (We further discuss conditional generative models in Chapters 2 and 3.)

While a textual description or list of attributes can usually describe most of the relevant details of a face (or other type of image), it is likely that a large number of faces adhere to a given description, i.e. the resulting image is unlikely to correspond exactly to what the user has in mind. The problem, then, is twofold: creating a model that 1) can generate images given a fine-grained description and 2) can modify these images given user feedback. This is analogous to the back and forth conversation between sketch artists and witnesses, where the witness provides feedback and the sketch is iteratively improved.

Motivated by the above, in this work we aim to develop a generative model that allows for generating realistic images, in particular facial images, according to a textual description or a set of attributes. Furthermore, we investigate how we can manipulate conditionally generated facial images in a meaningful way. For this manipulation, it turns out to be useful to find a latent representation of a real image (effectively loading the image into a pre-trained model), which is why we propose a method to do this.

1.1 Research questions

We now summarize our research questions, and specify where we address them. Our main question is:

Can we create a single model for conditional image generation and semantic manipulation?

The research is split into two parts; in the first part we deal with subquestions related to conditional image generation (Chapter 6), and in the second part we investigate the feasibility of image manipulation (Chapter 7).

(13)

Conditional Image Generation

How can we condition a generative model on a large number of facial attributes? This question is answered in Section 6.1.1.

How does the image quality of these conditioned models compare to an unsupervised model? This question is answered in Section 6.1.1.

Can we condition a generative model on a textual description? This question is answered in Section 6.2.1.

How important is the quality of text representation when conditioning on a textual description? This question is answered in Section 6.2.

What type of conditioning is best suited for facial images? This question is answered in Section 6.2.1.

Semantic Image Manipulation

Can we find a latent representation for a given facial image that is sufficiently similar to the original, making use of a pre-trained conditional model? This question is answered in Section 7.1.2.

What type of attributes can we modify in facial images, making use of a pre-trained conditional model? This question is answered in Section 7.2.3.

1.2 Contributions

With this work we contribute the following:

I) A novel generative model that is conditioned on a large number of attributes.

II) A novel text-to-image synthesis model that meets or beats the current state of the art.

III) Two novel methods for modifying (conditionally) generated facial images semantically, such as adding a smile or changing gender.

IV) A novel method for creating a latent representation of real images.

V) An extension to the CelebA-HQ dataset in the form of descriptions in language form based on facial attributes.


Chapter 2

Background

“I would rather have questions that can’t be answered than answers that can’t be questioned.”

- Richard Feynman

In this chapter we delve into the technical background required for this thesis. Generative Adversarial Networks (GANs) are introduced in Section 2.1, and their conditional variant in Section 2.2. We discuss common evaluation metrics for (conditional) GANs in Section 2.3. We highlight important style transfer work in Section 2.4 and explain a technique for decreasing neural network size in Section 2.5. Familiarity with machine learning and neural network architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) is assumed.

2.1 Generative Adversarial Networks

GANs, introduced by Goodfellow et al. [2014], are (deep) neural network architectures comprised of two nets: the Generator (G) and the Discriminator (D). The Generator models the data distribution p_g over data X_real using a prior input noise distribution p_z(z) and a mapping to data space G(z; θ_g). The Discriminator estimates the probability D(X) that a sample came from the original distribution rather than from p_g. See Figure 2.1 for an illustration of the GAN architecture. We train the Discriminator to maximize the probability of correctly assigning labels to X_real ~ p_data and X_fake ~ p_g. The training objective for the Generator is to maximize the probability of fooling the Discriminator, or equivalently to minimize log(1 - D(G(z))). This can be formalized in a game-theoretic minimax optimization setting, leading to the following value function V(G, D):

\min_G \max_D V(D, G) = \mathbb{E}_{X_{\text{real}} \sim p_{\text{data}}(X)}[\log D(X)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (2.1)

Since both G and D are implemented as neural networks and are thus differentiable, optimization can be performed using backpropagation.
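To make the alternating optimization of Eq. 2.1 concrete, the sketch below shows one Discriminator and one Generator update in PyTorch. It is a minimal illustration rather than the training code used in this thesis; the networks G and D, their optimizers, and the assumption that D outputs a single real/fake logit per image are placeholders, and the Generator update uses the common non-saturating variant of its loss.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, x_real, z_dim, device):
    """One alternating GAN update (illustrative sketch, not the thesis implementation)."""
    batch_size = x_real.size(0)
    ones = torch.ones(batch_size, 1, device=device)
    zeros = torch.zeros(batch_size, 1, device=device)

    # Discriminator: maximize log D(X_real) + log(1 - D(G(z)))
    z = torch.randn(batch_size, z_dim, device=device)
    x_fake = G(z).detach()                               # block gradients into G
    loss_D = F.binary_cross_entropy_with_logits(D(x_real), ones) + \
             F.binary_cross_entropy_with_logits(D(x_fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: non-saturating loss, maximize log D(G(z))
    z = torch.randn(batch_size, z_dim, device=device)
    loss_G = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    return loss_D.item(), loss_G.item()
```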

2.2 Conditional Generative Adversarial Networks

In an unconditioned generative model such as a GAN, there is no control over the data being generated. However, by conditioning the model on additional information it is possible to steer the data generation process. In the conditional Generative Adversarial Network (cGAN) setup, first introduced by Mirza and Osindero [2014], both the Generator and the Discriminator receive additional information in the form of a conditioning vector c. The conditioning can be any kind of auxiliary information, such as class labels, binary attributes or data from other modalities such as language. The addition of c leads to a slightly different architecture (Figure 2.2) and value function:


Figure 2.1: GAN architecture. The dashed line indicates that X can be either X_real or X_fake. (The Generator and Discriminator architectures are abstracted away.)

\min_G \max_D V(D, G) = \mathbb{E}_{X_{\text{real}} \sim p_{\text{data}}(X)}[\log D(X \mid c)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid c)))]    (2.2)

Note that in theory, the Generator could discard this conditional information. However, if the Discriminator can pick up any relationship between the real images and their conditioning, which it likely can, it instantly provides a way to distinguish real images from their fake counterparts (since the fakes will then be uncorrelated with their conditioning).
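A minimal sketch of how the conditioning vector c can enter both networks is shown below; here c is simply concatenated to the Generator input and to the (flattened) Discriminator input. The layer sizes and module names are illustrative assumptions, not the architecture used later in this thesis.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """cGAN Generator sketch: z and the conditioning vector c are concatenated."""
    def __init__(self, z_dim, c_dim, img_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + c_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),
        )

    def forward(self, z, c):
        return self.net(torch.cat([z, c], dim=1))

class ConditionalDiscriminator(nn.Module):
    """cGAN Discriminator sketch: it scores (x, c) pairs with a single logit."""
    def __init__(self, img_dim, c_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + c_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x, c):
        return self.net(torch.cat([x, c], dim=1))
```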

2.3 Evaluating (conditional) Generative Adversarial Networks

In this section we describe several common evaluation metrics for (conditional) GANs that are used in this work.

2.3.1 Inception Score

The Inception Score (IS) is a common metric for automatically evaluating the quality of generative models [Salimans et al., 2016]. As explained by Barratt and Sharma [2018], the IS calculates a statistic of the outputs of the Inception v3 network [Szegedy et al., 2016], pre-trained on ImageNet [Deng et al., 2009], when applied to a generated image X_fake ~ G(z) where z ~ p_z(z):

\text{IS}(G) = \exp\left( \mathbb{E}_{X_{\text{fake}} \sim p_g}\, D_{KL}\big( p(y \mid X_{\text{fake}}) \,\|\, p(y) \big) \right)    (2.3)

where p(y | X_fake) is the conditional class distribution, and p(y) = \int_{X_{\text{fake}}} p(y \mid X_{\text{fake}})\, p_g(X_{\text{fake}}) is the marginal class distribution.

The aim of Salimans et al. [2016] was to codify two desiderata into a metric:

1. The generated images should contain meaningful objects (i.e. the images are sharp rather than blurry), so that the conditional distribution p(y| Xfake) is low entropy.

2. The generator should output diverse images, so that the marginal class distribution p(y) is high entropy.

Satisfying these traits leads to a large Kullback-Leibler divergence (KL divergence) between distributions p(y) and p(y| Xfake), resulting in a high IS.
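Given the class probabilities p(y | X_fake) produced by a pre-trained Inception v3 classifier, Eq. 2.3 reduces to a few lines of NumPy. The sketch below assumes the classifier outputs have already been collected into a matrix; computing them (and the usual practice of averaging over several splits) is omitted.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from a (num_images, num_classes) matrix of p(y | x_fake) rows (sketch)."""
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal class distribution p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))  # KL(p(y|x) || p(y)) per image
    return float(np.exp(kl.sum(axis=1).mean()))             # exponentiated mean KL divergence
```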


Figure 2.2: cGAN architecture. In this case, both the Generator and the Discriminator receive the conditioning vector c. The dashed line indicates that X can be either X_real or X_fake. (The Generator and Discriminator architectures are abstracted away.)

2.3.2 Fréchet Inception Distance

The Fréchet Inception Distance (FID) also uses the Inception network to extract features [Heusel et al., 2017], but instead of evaluating generated samples in a vacuum, FID compares the mean μ and covariance Σ of generated samples X_fake ~ G(z), where z ~ p_z(z), with those of real samples X_real ~ p_data(X):

\text{FID}(G) = \|\mu_{\text{real}} - \mu_{\text{fake}}\|^2 + \text{Tr}\left( \Sigma_{\text{real}} + \Sigma_{\text{fake}} - 2(\Sigma_{\text{real}}\Sigma_{\text{fake}})^{\frac{1}{2}} \right)    (2.4)

where X_real ~ N(μ_real, Σ_real) and X_fake ~ N(μ_fake, Σ_fake) are the 2048-dimensional activations of the Inception-v3 [Szegedy et al., 2016] pool3 layer for real and generated samples respectively.
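Assuming the 2048-dimensional pool3 activations have already been extracted for both sets of images, Eq. 2.4 can be evaluated as in the sketch below; feature extraction itself is not shown.

```python
import numpy as np
from scipy import linalg

def fid(act_real, act_fake):
    """Fréchet Inception Distance from two (num_samples, 2048) activation matrices (sketch)."""
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    sigma_r = np.cov(act_real, rowvar=False)
    sigma_f = np.cov(act_fake, rowvar=False)

    covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)  # matrix square root of Sigma_r Sigma_f
    if np.iscomplexobj(covmean):                              # discard tiny imaginary parts from numerics
        covmean = covmean.real

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```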

Both IS and FID have known issues, and a common criticism is that these scores measure sample quality and fail to capture sample diversity [Barratt and Sharma, 2018; Gulrajani et al., 2018]. Evaluating GANs is an open research problem [Odena, 2019]: there are many proposals but little consensus. However, since most work reports IS or FID, we do the same so that we can quantitatively compare to previous work.

2.3.3 Perceptual path length

Karras et al. [2018b] devised perceptual path length to quantify the disentanglement of a latent space, by measuring how drastically the image changes as interpolation is performed. Interpolation of latent space vectors may yield surprisingly non-linear changes in the image [Laine, 2018], which is an indicator for an entangled latent space; the factors of variation are not properly separated. Intuitively, a more disentangled latent space should result in a perceptually smoother transition. To determine the perceptual similarity between two images, Karras et al. [2018b] use a perceptually-based pairwise image distance [Zhang et al., 2018b] that is calculated as a difference between two VGG-16 [Simonyan and Zisserman, 2014] embeddings.

The average perceptual path length in latent space \mathcal{W} can be formalized as

l_{\mathcal{W}} = \mathbb{E}\left[ \frac{1}{\epsilon^2}\, d\big( G(\text{lerp}(w_1, w_2; \lambda)),\, G(\text{lerp}(w_1, w_2; \lambda + \epsilon)) \big) \right],    (2.5)

where

\text{lerp}(w_1, w_2; \lambda) = \lambda w_1 + (1 - \lambda) w_2 \quad \text{with} \quad \lambda \in (0, 1)    (2.6)

is the linear interpolation operation between latent vectors w_1 and w_2, with w_1, w_2 ~ f(P(z)) and λ ~ U(0, 1); G is the Generator and d(·, ·) evaluates the perceptual distance between two images. In a natural definition ε would be infinitely small, but in practice it is approximated by setting ε = 10^{-4}. (Because the similarity metric d is quadratic in nature [Zhang et al., 2018b], division is done using ε^2 instead of ε.)

2.4 Neural Style transfer

Neural style transfer is a research direction in computer vision. The goal in neural style transfer is to manipulate an image in such a way that the visual style of another image is adopted, while holding constant other characteristics of the image. In order to do this, image style needs to be separated from image content [Jing et al., 2019].

Seminal work by Gatys et al. [2015b] opened up the neural style transfer field. The authors studied how to use a CNN to reproduce painting styles on natural images. The content of a photo is modeled as features from a pre-trained classification network, and the style of a painting is further modeled as the summary feature statistics. They demonstrated for the first time that a CNN is capable of extracting content information from a natural image, and style information from a painting. The key idea is to iteratively optimise an image to match desired feature distributions, which involves both the content information and the artwork style information. Below we highlight a more recent work on neural style transfer that forms part of the basis for StyleGAN [Karras et al., 2018b] (described in the next chapter, Section 3.1), a GAN on which we build our conditional models.

Arbitrary style transfer in real-time with adaptive instance normalization

This work by Huang and Belongie [2017] provides a theoretical analysis of the effect of batch norm, instance norm and conditional instance norm on style transfer modules. The authors argue that the style of an image can be normalized by normalizing the channel-wise mean and variance of the image features. Taking this new insight into account, the authors propose Adaptive Instance Normalization (AdaIN), a technique that aligns channel statistics of a content and a style image, producing impressive style transfer results. The idea is to shift and scale the content image features to align with the style image during training, i.e.

\text{AdaIN}(v_1, v_2) = \sigma(v_2)\, \frac{v_1 - \mu(v_1)}{\sigma(v_1)} + \mu(v_2)    (2.7)

where v_1 and v_2 are the feature representations of the input content image and style image respectively, and μ and σ are the (channel-wise) mean and standard deviation. Crucially, AdaIN runs in real time and, unlike some previous style transfer research, works with an unlimited number of styles.
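Eq. 2.7 translates directly into a few lines that operate on convolutional feature maps; the sketch below assumes inputs of shape (N, C, H, W) and computes the statistics per sample and channel.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive Instance Normalization (Eq. 2.7) on (N, C, H, W) feature maps (sketch)."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)       # mu(v1)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps   # sigma(v1)
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)         # mu(v2)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)           # sigma(v2)
    return s_std * (content_feat - c_mean) / c_std + s_mean
```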

2.5 Mixed precision training

Increasing the size of a neural network typically improves performance but also increases the memory and compute requirements for training and inference. Micikevicius et al. [2017] introduced mixed precision training, a methodology for training neural networks using half-precision (FP16) floating point numbers, instead of the single-precision (FP32) floating point numbers used in modern deep learning libraries such as PyTorch [Paszke et al., 2017] and TensorFlow [Abadi et al., 2015], without the need for modifying hyperparameters and without incurring performance loss when compared to FP32 baselines. As a result, memory requirements are nearly halved and (on recent GPUs) arithmetic is faster. Since FP16 has a narrower range than FP32, Micikevicius et al. [2017] propose three techniques for preventing the loss of critical information.

Firstly, in order to match the performance of the FP32 networks, an FP32 master copy of the weights is saved and updated with the weight gradient during the optimization step. In each iteration an FP16 copy of the master weights is used in the forward and backward pass, halving the storage and bandwidth needed by FP32 training. While maintaining an additional copy of weights increases the memory requirements for the weights by 50% compared with FP16 training, the impact on total memory usage is much smaller. To see this, consider that the training memory consumption is dominated by activations, due to large batch sizes and layer activations being saved for use in the back-propagation step. These activations are stored in FP16 format, which roughly halves the total memory consumption.

Secondly, loss scaling is performed to preserve gradient values with small magnitudes. In most architectures, much of the FP16 representable gradient value range is unused, while many values are below the minimum representable FP16 range: any value with magnitude smaller than 2^{-24} becomes zero in FP16. Scaling up the gradients will shift them to occupy more of the representable range and preserve values that are otherwise zero. The authors find that scaling by a factor of 8 (i.e., increasing the exponent by 3) is sufficient to match FP32 performance.

Thirdly, they use FP16 arithmetic that accumulates into FP32 outputs, which are converted to FP16 before storing to memory. Without this accumulation in FP32, some FP16 models did not match the accuracy of the baseline models. NVIDIA Volta GPUs introduce Tensor Cores that multiply FP16 matrices resulting in either FP16 or FP32 outputs [Jia et al., 2018].
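The first two techniques (an FP32 master copy and static loss scaling) can be sketched as below. Modern frameworks offer this out of the box (for example automatic mixed precision in PyTorch), so the explicit version here is purely illustrative; the optimizer is assumed to have been constructed over the FP32 master parameters.

```python
import torch

def scaled_step(model_fp16, master_params_fp32, optimizer, loss, scale=8.0):
    """One update with static loss scaling and an FP32 master copy of the weights (sketch)."""
    (loss * scale).backward()                          # scale the loss so small gradients survive FP16

    for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
        if p16.grad is not None:
            p32.grad = p16.grad.float() / scale        # unscale and cast the gradients to FP32

    optimizer.step()                                   # update the FP32 master weights
    optimizer.zero_grad()
    model_fp16.zero_grad()

    for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
        p16.data.copy_(p32.data)                       # refresh the FP16 working copy
```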


Chapter 3

Related work

“Generative Adversarial Networks are the most interesting idea in the last 10 years in machine learning.”

- Yann LeCun

This chapter provides an overview of the work that has been done on GANs and cGANs so far that is relevant for this study, in Sections 3.1 and 3.3. Conditioning on text is discussed in Subsection 3.3.1. We highlight important work about the relationship between images and natural language in Section 3.2 and discuss relevant work on image manipulation in Section 3.4.

3.1 Unconditional image synthesis with Generative Adversarial Networks

In recent years, GANs have improved steadily towards generating increasingly impressive real-world images [Goodfellow et al., 2014; Radford et al., 2015; Brock et al., 2018; Karras et al., 2018a,b]. The original GAN [Goodfellow et al., 2014] used fully connected layers as building blocks, but work by Radford et al. [2015] showed that better results can be obtained when using transposed convolution, batch-normalization [Ioffe and Szegedy, 2015] and Rectified Linear Unit (ReLU) activations [Nair and Hinton, 2010] for the Generator, and convolution, batch-normalization and leaky Rectified Linear Unit (leakyReLU) activations [Maas et al.] for the Discriminator. Since then, transposed convolution and convolution have become the core components in many GAN models.

Work by Brock et al. [2018] showed for the first time that it is possible to successfully generate high-resolution, diverse samples from complex datasets such as ImageNet [Deng et al., 2009]. Their results were obtained by severely upscaling the compute infrastructure (512 Google TPU v3 Pods are used). The authors conclude that prevalent GAN training techniques are sufficient to enable scaling to large models and distributed training using a large batch size. Finally, the truncation trick is introduced. Instead of sampling the noise vector z from a normal distribution, z is sampled from a truncated normal distribution, where values with a magnitude above a certain threshold are resampled. As the threshold is reduced, and elements of z are truncated towards zero (the mode of the latent distribution), individual samples approach the mode of the Generator's output distribution. This leads to an improvement in individual sample quality at the cost of a reduction in overall sample variety.
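The truncation trick itself is straightforward to sketch: out-of-range elements of z are simply resampled until all values lie within the chosen threshold, which trades sample variety for sample quality. The threshold value below is an arbitrary illustration.

```python
import torch

def truncated_normal_z(batch_size, z_dim, threshold=0.5):
    """Sample z from a truncated standard normal by resampling out-of-range entries (sketch)."""
    z = torch.randn(batch_size, z_dim)
    while True:
        mask = z.abs() > threshold
        if not mask.any():
            return z
        z[mask] = torch.randn(int(mask.sum()))   # resample only the values above the threshold
```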

Karras et al. [2018a] proposed a new training methodology for GANs, where both the Generator and Discriminator are grown progressively. Training is started from a low resolution, and new layers are added as training progresses. This incremental nature allows the Generator to first discover large-scale structure before shifting attention to increasingly finer details.


Figure 3.1: StyleGAN Generator architecture. Illustration based on Karras et al. [2018b].

A Style-Based Generator Architecture for Generative Adversarial Networks

In follow-up work, Karras et al. [2018b] propose a new architecture for the Generator inspired by the style transfer literature. The resulting architecture is called StyleGAN, and combines progressive training, a mapping network, AdaIN and stochastic variation, all explained below. This model forms the basis for our conditional model described in Chapter 6.

Instead of feeding the Generator a noise vector z directly, first a non-linear mapping is applied to z, resulting in w. The authors show empirically that this mapping better disentangles the latent factors of variation. This implies that attributes, such as gender, are easily separable in the resulting latent space.

Learned affine transformations then specialize w to styles that control AdaIN (see Section 2.4) [Huang and Belongie, 2017] operations.

Finally, the authors provide the Generator with explicit noise inputs in the form of single-channel images consisting of Gaussian noise. Karras et al. [2018b] argue that for traditional Generator architectures, the network is tasked with inventing a way to generate pseudo-randomness from earlier activations when it is needed. The explicit noise inputs alleviate the Generator of this task, freeing network capacity for generating realistic images. See Figure 3.1 for a graphical depiction of the StyleGAN Generator.
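The two ingredients described above, the mapping network and the style-driven AdaIN operation with per-layer noise, can be sketched as follows. This is a strongly simplified illustration of the StyleGAN Generator (omitting, among other things, progressive growing and equalized learning rates); the dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Sketch of the mapping network f: z -> w (a stack of fully connected layers)."""
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers, dims = [], [z_dim] + [w_dim] * num_layers
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()  # normalize the input latent
        return self.net(z)

class StyleLayer(nn.Module):
    """Sketch of one synthesis step: noise injection followed by AdaIN driven by w."""
    def __init__(self, channels, w_dim=512):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * channels)                      # learned affine "A"
        self.noise_scale = nn.Parameter(torch.zeros(1, channels, 1, 1))   # per-channel noise scale "B"

    def forward(self, x, w):
        x = x + self.noise_scale * torch.randn_like(x[:, :1])   # single-channel Gaussian noise input
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-8
        scale, bias = self.affine(w).chunk(2, dim=1)
        return scale[:, :, None, None] * (x - mean) / std + bias[:, :, None, None]
```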


Why Generative Adversarial Networks?

GANs are implicit generative models, since they do not require a likelihood function to be specified, only a generating procedure. In contrast, likelihood-based methods optimize the (log) likelihood explicitly using Maximum Likelihood Estimation (MLE). Likelihood-based models can be divided into three categories:

• Autoregressive models [Van den Oord et al., 2016b,a,c; Salimans et al., 2017] factorize the distribution over observations into conditional distributions and process one component at a time. In the case of images, these models process the data one pixel at a time. A disadvantage is limited parallelizability, since the computational length of synthesis is proportional to the dimensionality of the data. In the case of high-resolution images this is particularly troublesome.

• Variational Autoencoders (VAEs) [Kingma and Welling, 2013] optimize a lower bound on the log-likelihood of the data distribution. The resulting samples tend to be unspecific and/or blurry [Hou et al., 2017; Pu et al., 2017].

• Flow-based generative models [Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018] apply invertible transformations to a sample from a prior such that exact log-likelihoods of observations can be computed. A comparison by Odena [2019] shows that there seems to be a substantial gap between the computational cost of training GAN and flow models. (It might be the case that maximum likelihood training is computationally harder than adversarial training.)

Our choice for GANs is motivated by the above-listed shortcomings of likelihood-based methods, as well as the desirable properties of GANs: the possibility to create sharp and compelling samples at high resolutions while remaining distribution agnostic (i.e., no assumptions about the underlying data distribution need be made). We conclude by stating that the fundamental trade-offs between GANs and other generative models are an open research problem [Odena, 2019].

3.2 Relationship between images and natural language

Exploring the relationship between images and natural language has in recent years attracted great interest from the research community, due to its importance in a wide range of applications such as bi-directional image and text retrieval [Yan and Mikolajczyk, 2015; Ma et al., 2015], natural language object retrieval [Hu et al., 2016], visual question answering [Antol et al., 2015], image captioning [Xu et al., 2015; Vinyals et al., 2016; Karpathy and Fei-Fei, 2015] and text-to-image synthesis [Reed et al., 2016b; Zhang et al., 2017; Xu et al., 2018]. These applications have in common that it is critical to measure the similarity between visual data and textual descriptions, which we henceforth refer to as visual-semantic similarity. All this work can be categorized as multimodal machine learning [Baltrušaitis et al., 2019]: a branch of machine learning that can process and relate information from multiple modalities.

Generally, the learning framework for image-text matching attempts to learn a joint embedding for visual signals and natural language. In this framework, a two-branch architecture is used where one branch extracts image features (typically using CNNs) and the other encodes text representations (typically using RNNs). Finally, the joint embedding is learned with a designed objective function.

We highlight two works that are relevant to our study below.

Learning Deep Representations of Fine-Grained Visual Descriptions

Work by Reed et al. [2016a] demonstrated for the first time that the discrimination ability of the learned image-text embeddings can be greatly enhanced by introducing a category classification loss as an auxiliary task, by optimizing a function that is related to the 0-1 classification loss but is continuous and convex:

\frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_v(v_n, t_n, y_n) + \mathcal{L}_t(v_n, t_n, y_n),    (3.1)

where v_n are visual features, t_n are text representations and y_n are class labels. The visual (\mathcal{L}_v) and textual (\mathcal{L}_t) misclassification losses are formalized as

\mathcal{L}_v(v_n, t_n, y_n) = \max_{y \in \mathcal{Y}} \left( 0,\ \Delta(y_n, y) + \mathbb{E}_{t \sim \mathcal{T}(y)}[F(v_n, t) - F(v_n, t_n)] \right)    (3.2)

and

\mathcal{L}_t(v_n, t_n, y_n) = \max_{y \in \mathcal{Y}} \left( 0,\ \Delta(y_n, y) + \mathbb{E}_{v \sim \mathcal{V}(y)}[F(v, t_n) - F(v_n, t_n)] \right),    (3.3)

where F(·, ·) is a compatibility function that uses features from learnable encoder functions θ(·) for images and φ(·) for text:

F(v, t) = \theta(v)^\top \varphi(t).    (3.4)

Their model achieved strong performance on zero-shot text-based image retrieval due to the auxiliary classification task.
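A simplified, batch-level sketch of this symmetric objective is shown below. The compatibility function follows Eq. 3.4 on already-encoded features; the expectation over descriptions of a class and the maximum over classes are approximated by averaging hinge terms over all pairs in the batch, so this is illustrative rather than the exact loss of Reed et al. [2016a].

```python
import torch

def compatibility(img_emb, txt_emb):
    """F(v, t) = theta(v)^T phi(t) for batches of encoded images (N, D) and texts (N, D)."""
    return img_emb @ txt_emb.t()                          # (N, N) pairwise scores

def joint_embedding_loss(img_emb, txt_emb, labels, mismatch_cost=1.0):
    """Batch-level approximation of the symmetric structured loss in Eqs. 3.1-3.3 (sketch)."""
    scores = compatibility(img_emb, txt_emb)              # F(v_n, t_m) for all pairs
    diag = scores.diag().unsqueeze(1)                     # F(v_n, t_n), the matched pairs
    delta = (labels.unsqueeze(1) != labels.unsqueeze(0)).float() * mismatch_cost  # Delta(y_n, y)

    loss_v = (delta + scores - diag).clamp(min=0).mean()      # image misclassification term
    loss_t = (delta + scores.t() - diag).clamp(min=0).mean()  # text misclassification term
    return loss_v + loss_t
```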

Deep Cross-Modal Projection Learning for Image-Text Matching

Zhang and Lu [2018] propose a Cross-Modal Projection Matching (CMPM) loss, which minimizes the KL divergence between the projection compatibility and the matching distributions, and a Cross-Modal Projection Classification (CMPC) loss, which attempts to further enhance feature compactness by categorizing the vector projection of representations from one modality onto the other with a norm-softmax loss.

CMPM loss  Given a mini-batch with n image-text feature pairs \{(v_i, t_j), y_{i,j}\}_{j=1}^{n}, where y_{i,j} = 1 indicates a matched image-text pair and y_{i,j} = 0 indicates an unmatched pair, the probability of matching v_i to t_j is defined as

p_{i,j} = \frac{\exp(v_i^\top \bar{t}_j)}{\sum_{k=1}^{n} \exp(v_i^\top \bar{t}_k)} \quad \text{such that} \quad \bar{t}_j = \frac{t_j}{\|t_j\|}.    (3.5)

Geometrically, v_i^\top \bar{t}_j represents the scalar projection of image feature v_i onto text feature t_j, and p_{i,j} can be viewed as the percentage of the scalar projection of a particular pair (v_i, t_j) among all pairs in a mini-batch.

Since there might exist more than one matched text sample for an image in a mini-batch, the true matching probability is normalized as

q_{i,j} = \frac{y_{i,j}}{\sum_{k=1}^{n} y_{i,k}}.    (3.6)

The matching loss of associating v_i with correctly matched text samples is defined as the KL divergence from q_i to p_i:

\mathcal{L}_i = \sum_{j=1}^{N} p_{i,j} \log \frac{p_{i,j}}{q_{i,j} + \epsilon}.    (3.7)

Note that minimizing D_{KL}(p_i \| q_i) attempts to select a p_i that has low probability where q_i has low probability. The mini-batch matching loss from image to text is computed by

\mathcal{L}_{i2t} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i.    (3.8)

Similarly, the text-to-image mini-batch loss \mathcal{L}_{t2i} can be formulated by exchanging v and t in Eqs. 3.5-3.8, and the resulting CMPM loss is obtained by \mathcal{L}_{cmpm} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}.

CMPC loss  This novel classification function, in which the cross-modal projection is integrated into the norm-softmax loss, is used to further enhance the compactness of the matched embeddings. To improve the discriminative ability of image features v_i and text features t_i, weight normalization is imposed on the softmax loss. Furthermore, the authors project image and text features to integrate them into a classification framework. (Also, the bias b is omitted by the authors.) This results in the following definition:

\mathcal{L}_{ipt} = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(W_{y_i}^\top \hat{v}_i)}{\sum_j \exp(W_j^\top \hat{v}_i)} \quad \text{such that} \quad \|W_j\| = 1,\ \hat{v}_i = v_i^\top \bar{t}_i \cdot \bar{t}_i.    (3.9)

To calculate the projection of text features onto image features, \mathcal{L}_{tpi}, replace \hat{v}_i with \hat{t}_i = t_i^\top \bar{v}_i \cdot \bar{v}_i.

We know that, for the original softmax, the (binary) classification results depend on \|W_k\| \|x\| \cos(\theta_k) with k \in \{1, 2\}, where \theta_k is the angle between x and W_k. For the norm-softmax, all the weight vectors are normalized, so the classification result depends only on \|x\| \cos(\theta_k). Intuitively, this restriction encourages the features x to distribute more compactly along the weight vector in order to be correctly classified.

The final CMPC loss can be calculated as \mathcal{L}_{cmpc} = \mathcal{L}_{ipt} + \mathcal{L}_{tpi}, and the overall objective function is obtained by combining the CMPM and CMPC losses, i.e. \mathcal{L} = \mathcal{L}_{cmpm} + \mathcal{L}_{cmpc}. Extensive analysis and experiments on multiple datasets demonstrate the superiority of the proposed loss functions, achieving state-of-the-art R-precision scores on several datasets.

3.3 Conditional image synthesis with Generative Adversarial Networks

We discuss previous text conditioning work in this section.

3.3.1 Conditioned on text

Text-to-image synthesis aims to exploit the generality of natural language descriptions to generate images. The goal is, given a textual description, to generate an image that is visually realistic and semantically consistent with the description. This is a multimodal problem, complicated by the fact that there typically exist very many pixel configurations that correctly depict a textual description. Moreover, this task sheds light on the sort of work that can be done at the intersection of natural language processing and computer vision.

Early work

Pioneering work by Reed et al. [2016b] resulted in the GAN-INT-CLS model, which is capable of generating plausible low-resolution images from a textual description. Follow-up work [Reed et al., 2016a] conditioned the Generator on both text and a location, resulting in GAWWN, which generates better quality images at a higher resolution. Zhang et al. [2017] created StackGAN, which decomposed the problem into two steps: a Stage-I GAN generates a low-resolution image conditioned on text, and a Stage-II GAN generates a high-resolution image conditioned on the Stage-I result and the text. Their improved StackGAN++ model [Zhang et al., 2018a] consists of three Generators and Discriminators that handle increasing resolutions. At each resolution, the Generator captures the image distribution at that scale, and the Discriminator processes real and fake samples from the same scale.


Figure 3.2: Conditioning Augmentation module.

Conditioning Augmentation

Zhang et al. [2017] additionally introduced Conditioning Augmentation (CA) to encourage smoothness in the latent conditioning manifold. As shown in Figure 3.2, a textual description t is encoded, which results in an embedding φ_t. Typically this embedding is used directly for conditioning. However, since the latent space of the embedding is usually high dimensional and labeled training data is limited, this causes discontinuity in the latent data manifold, which can be troublesome when training the Generator. Instead of using this embedding directly, Zhang et al. [2017] sample the resulting conditioning variable c from a Gaussian distribution N(μ(φ_t), Σ(φ_t)), where the mean μ(φ_t) and covariance matrix Σ(φ_t) are functions of the embedding φ_t. Using the reparameterization trick [Kingma and Welling, 2013], both μ(φ_t) and Σ(φ_t) are learned jointly with the rest of the network. Effectively this results in more training pairs and thus encourages robustness to small perturbations in the manifold. To enforce smoothness and prevent overfitting, the following KL divergence regularization term is used during training:

D_{KL}\big( \mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I) \big)    (3.10)

This term encourages the conditioning distribution to be close to the standard Gaussian distribution. By fixing the noise vector z and varying only the conditioning c, the authors show that small perturbations of a text embedding usually correspond to objects with various poses and appearances.
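A minimal CA module then amounts to predicting a mean and a (diagonal) log-variance from the text embedding, sampling with the reparameterization trick, and returning the KL term of Eq. 3.10 as a regularizer; the layer sizes below are assumptions.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Conditioning Augmentation sketch: c ~ N(mu(phi_t), Sigma(phi_t)) with a diagonal Sigma."""
    def __init__(self, embed_dim, c_dim):
        super().__init__()
        self.fc = nn.Linear(embed_dim, 2 * c_dim)      # predicts mean and log-variance

    def forward(self, phi_t):
        mu, logvar = self.fc(phi_t).chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        c = mu + std * torch.randn_like(std)           # reparameterization trick
        # KL(N(mu, Sigma) || N(0, I)) for a diagonal Gaussian, the regularizer of Eq. 3.10
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return c, kl
```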

AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks

Recently, Xu et al. [2018] proposed the Attentional Generative Adversarial Network (AttnGAN), which makes use of an attentional generative network (allowing the Generator to draw different image sub-regions by focusing on relevant words) and a deep attentional multimodal similarity model (which computes the similarity between the generated image and the textual description). Evaluation scores of previous work are significantly outperformed by AttnGAN. The attention mechanism is claimed to make the black-box GAN model more interpretable by highlighting attended words and their corresponding locations in the generated image. In this work we treat AttnGAN as a strong baseline for text-to-image synthesis, which is why we describe it extensively.

The Generator, or attention module as the authors name it, takes as input word features $e$ (pre-trained according to the method in Reed et al. [2016a]) and image features $h$. The word features are sampled using CA (Section 3.3.1) and converted to the dimensionality of the image features by multiplying with a (learnable) matrix, i.e. $e' = Ue$. Then, a word-context vector $c_i$ is computed for each sub-region of the image based on its hidden features $h$ and the word features. (Each column of $h$ is a feature vector of a sub-region of the image.) For the $i$th sub-region,

$$c_i = \sum_{j=1}^{T} \alpha_{ij} e'_j. \tag{3.11}$$

The attention weight $\alpha_{ij}$, which indicates how much the model attends to the $j$th word when generating the $i$th sub-region of the image, is computed as

$$\alpha_{ij} = \frac{\exp\big(\text{score}(h_i, e_j)\big)}{\sum_{k=1}^{T} \exp\big(\text{score}(h_i, e_k)\big)}, \tag{3.12}$$

where

$$\text{score}(h_i, e_j) = h_i^\top e'_j \tag{3.13}$$

is the dot score function [Luong et al., 2015], which can be thought of as a similarity score between the image sub-region and the word. See Figure 3.3 for the architecture of the AttnGAN Generator.
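For clarity, the attention step of Equations 3.11–3.13 can be expressed in a few lines of PyTorch. The tensor layout and the function name below are our own illustrative choices, not those of the reference implementation.

```python
import torch
import torch.nn.functional as F


def word_context(h: torch.Tensor, e: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """Compute a word-context vector c_i for every image sub-region.

    h: (N, D, R)  image features, one column per sub-region
    e: (N, E, T)  word features for T words
    U: (D, E)     learnable projection, e' = U e
    """
    e_proj = torch.einsum("de,net->ndt", U, e)           # e' = U e              (N, D, T)
    scores = torch.einsum("ndr,ndt->nrt", h, e_proj)     # score(h_i, e_j), Eq. 3.13
    alpha = F.softmax(scores, dim=-1)                    # normalize over words, Eq. 3.12
    return torch.einsum("nrt,ndt->ndr", alpha, e_proj)   # c_i = sum_j alpha_ij e'_j, Eq. 3.11
```

The resulting word-context vectors are combined with the image features in the joining blocks of Figure 3.3 before further upsampling.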

The loss is calculated by a Deep Attentional Multimodal Similarity Model (DAMSM). This module calculates the similarity between the generated image and the sentence by again mapping both to a shared vector space. As a first step, a similarity matrix for all possible pairs of words $e$ and image regions $v$ is calculated:

$$s = e^\top v, \tag{3.14}$$

where $s_{i,j}$ is the dot-product similarity between word $i$ and image region $j$. After normalizing $s$, the authors compute a region-context vector $c_i$ for each word, which can be thought of as a representation of the image sub-regions related to word $i$, and is computed as a weighted sum over all regional visual vectors, i.e.:

$$c_i = \sum_{j=0}^{N_{\text{regions}}} \alpha_j v_j \quad \text{where} \quad \alpha_j = \frac{\exp(\gamma_1 s_{i,j})}{\sum_{k=0}^{N_{\text{regions}}} \exp(\gamma_1 s_{i,k})}. \tag{3.15}$$

Here $\gamma_1$ is a hyperparameter that determines how much weight is put on relevant sub-regions. The relevance between word $i$ and the image is then calculated as the cosine similarity between $c_i$ and $e_i$. The attention-driven image-text matching score between the complete text description $D$ and the entire image $Q$ is defined as

$$R(Q, D) = \log\Big(\sum_{i=0}^{T} \exp\big(\gamma_2 R(c_i, e_i)\big)\Big)^{\frac{1}{\gamma_2}}, \tag{3.16}$$

where $R(c_i, e_i)$ is the cosine similarity. The DAMSM loss utilizes this matching score by calculating the posterior probability of sentence $D_i$ being matched with image $Q_i$, which is formalized as

$$P(D_i \mid Q_i) = \frac{\exp\big(\gamma_3 R(Q_i, D_i)\big)}{\sum_{j=1}^{M} \exp\big(\gamma_3 R(Q_i, D_j)\big)}, \tag{3.17}$$

where $\gamma_3$ is a smoothing factor and $M$ is the batch size. Finally, the authors minimize the negative log posterior probability that images are matched with their text descriptions,

$$\mathcal{L}_w^{(1)} = -\sum_{i=1}^{M} \log P(D_i \mid Q_i), \tag{3.18}$$

where $w$ stands for word. The inverse, i.e. the probability that sentences are matched with their corresponding images, is also minimized:

$$\mathcal{L}_w^{(2)} = -\sum_{i=1}^{M} \log P(Q_i \mid D_i). \tag{3.19}$$
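Combining Equations 3.14–3.18, a deliberately naive PyTorch sketch of the word-level DAMSM loss for a batch of $M$ matched image-sentence pairs is given below. All names are illustrative, the default values of $\gamma_1$, $\gamma_2$ and $\gamma_3$ are placeholders, and the batch sum of Equation 3.18 is replaced by a mean for readability.

```python
import torch
import torch.nn.functional as F


def damsm_word_loss(e, v, gamma1=5.0, gamma2=5.0, gamma3=10.0):
    """e: (M, E, T) word embeddings; v: (M, E, R) image-region features."""
    M = e.size(0)
    scores = torch.empty(M, M, device=e.device)        # scores[j, i] = R(Q_j, D_i)
    for i in range(M):                                  # sentence D_i
        for j in range(M):                              # image Q_j
            s = torch.einsum("et,er->tr", e[i], v[j])   # word-region similarities, Eq. 3.14
            alpha = F.softmax(gamma1 * s, dim=-1)       # attend over regions, Eq. 3.15
            c = alpha @ v[j].t()                        # region-context vectors, (T, E)
            r = F.cosine_similarity(c, e[i].t(), dim=-1)             # R(c_i, e_i) per word
            scores[j, i] = torch.logsumexp(gamma2 * r, 0) / gamma2   # Eq. 3.16
    # Posterior P(D_i | Q_i): softmax over the sentences for each image, Eq. 3.17;
    # the negative log-likelihood of the matched pairs gives L_w^(1), Eq. 3.18.
    log_p = F.log_softmax(gamma3 * scores, dim=-1)
    return -log_p.diagonal().mean()
```

The mirrored loss $\mathcal{L}_w^{(2)}$ is obtained by applying the softmax over images instead of sentences, and the sentence-level losses follow the same pattern with sentence embeddings and global image features in place of $e$ and $v$.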

[Figure omitted: the text encoder with Conditioning Augmentation and a fully connected layer on $z \sim \mathcal{N}(0, I)$ feed a stack of upsampling, joining, residual, and attention blocks, producing fake images $X_{\text{fake}}$ at 64×64, 128×128, and 256×256 resolution.]

Figure 3.3: Attentional Generative Adversarial Network Generator architecture.

The sentence loss functions $\mathcal{L}_s^{(1)}$ and $\mathcal{L}_s^{(2)}$ are calculated analogously, resulting in the final DAMSM loss function

$$\mathcal{L}_{\text{DAMSM}} = \mathcal{L}_w^{(1)} + \mathcal{L}_w^{(2)} + \mathcal{L}_s^{(1)} + \mathcal{L}_s^{(2)}. \tag{3.20}$$

At the time of publication, AttnGAN significantly outperformed earlier work on several datasets.

Recent work

Here we describe text-to-image research performed in parallel to ours. These papers all build on the AttnGAN framework.

MirrorGAN Although significant progress has been made on generating visually realistic images, guaranteeing semantic consistency between the text description and the generated image remains challenging. As a solution, Qiao et al. [2019] introduce a semantic text regeneration and alignment (STREAM) module, which regenerates the text description for the generated image. This idea is similar to CycleGAN [Zhu et al., 2017], which performs image-to-image translation using a cycle consistency loss, but Qiao et al. [2019] tackle the text-to-image case, which is arguably harder because the cross-media domain gap between text and images is larger than between images with different attributes (e.g., styles). Text-to-image synthesis can be regarded as the inverse problem of image captioning, where the goal is to generate a caption given an image. The authors unify these tasks in a single framework, where the caption of the generated image should have the same semantics as the original text description. In other words, the generated image ideally acts like a mirror, precisely reflecting the underlying text semantics, hence the model name.

SD-GAN Yin et al. [2019] focus on the semantics of the text descriptions used as input for their Semantics Disentangling Generative Adversarial Network (SD-GAN). Their contributions are: 1) a Siamese mechanism in the Discriminator, and 2) Semantic-Conditioned Batch Normalization (SCBN). The Siamese mechanism distills the semantic commonalities while retaining the semantic diversity between different text descriptions of a single image, using a contrastive loss that minimizes the distance between images generated from two descriptions of the same ground-truth image, while maximizing the distance for descriptions of different ground-truth images. As a result, the generated images are not only based on the input description at one Siamese branch, but are also influenced by the description at the other branch. SCBN complements the Siamese structure by reinforcing the visual-semantic embedding in the feature maps of the Generators. This module enables the textual embedding to manipulate the visual feature maps, for example by scaling, negating, or ignoring them.

DM-GAN Zhu et al. [2019] proposed the Dynamic Memory Generative Adversarial Network (DM-GAN), which tackles two prevalent problems in text-to-image synthesis models: 1) when the quality of the initial (low-resolution) image is poor, a satisfactory final quality cannot be obtained, and 2) identical text representations are used at all resolutions, although individual words have varying levels of importance for different resolutions. The authors propose a dynamic memory module to refine poor initial images. The module consists of a memory writing gate, which selects important text information based on the initial image content, and a response gate, which fuses the memories and image features. As a result, their model can more accurately generate images that adhere to a textual description.

Text-to-image for faces

The subtask of text-to-image synthesis for faces is an underexplored area, likely due to the unavailability of a high-quality dataset. Chen et al. [2019a] are the first to report on this task. They introduce a dataset consisting of 1000 images from the Large-scale CelebFaces Attributes dataset (CelebA), annotated by crowd workers. However, the size of the dataset is insufficient for data-hungry generative models, and the resolution (256×256) limits the quality of the resulting samples. As for the model, a simple modification of AttnGAN [Xu et al., 2018] is proposed, in which the text encoder and image decoder are trained at the same time. The authors claim this end-to-end approach leads to superior results, but the claim is not substantiated with evidence. While language is arguably the most natural way for humans to describe objects, the images underlying a natural language description are essentially a group of facts or visual attributes extracted from the sentence. This is particularly true for faces; a list of attributes corresponding to a face might make for a better description than a sentence, which may be another reason why text-to-image synthesis for faces is underexplored.

3.3.2 Our position

Our text-to-image research differs from the works described in the previous section in the following ways:

(i) We evaluate on common text-to-image datasets as well as our own facial dataset with textual descriptions.

(ii) Compared to recent works in parallel to ours, we do not build on the AttnGAN framework, but instead introduce a conditional variant of StyleGAN.

(iii) Our model scales to higher resolutions (1024 × 1024) than other models, allowing for more fine-grained details.

(iv) We use recently introduced visual-semantic embeddings [Zhang and Lu, 2018], which differ from the embeddings used in other works.

(v) Our model unifies text-to-image synthesis with image manipulation, in the sense that our design choices allow the model weights to be reused for manipulating an image after it has been conditionally generated.

3.4 Image manipulation

Image manipulation can be performed at different levels of abstraction and complexity. Common operations are augmenting the contrast, converting to grayscale, changing color properties, and warping. These straightforward operations do not require object comprehension. Experts can achieve impressive results with them, but when these methods fail they produce results that do not resemble real images. The reason is their reliance on low-level principles, such as color similarity or gradients, which is why these methods fail to capture higher-level information. A more complex type of modification is changing (facial) attributes (e.g., changing expression or gender), which we call semantic image manipulation henceforth. In this case, the modification should preserve realism, and other factors or attributes should be left unchanged. Recent solutions to this challenging task rely on generative models [Zhu et al., 2016; Perarnau et al., 2016; Shen and Liu, 2017; Geng et al., 2019] or encoder-decoder architectures [Chen et al., 2018, 2019b; Qian et al., 2019]. We discuss these approaches in the next section.
