
An Empirical Comparison of Deep Conditional Generative Models with Applications to

Attribute-Guided Face Synthesis

Submitted in partial fulfillment for the degree of Master of Science

Kewin Dereniewicz
11815000

Master Information Studies: Data Science
Faculty of Science, University of Amsterdam

2018-07-06

Internal Supervisor: Dr Maarten Marx (UvA, FNWI, IvI)
Internal Supervisor: Wenling Shang (UvA)
External Supervisor: Benjamin Timmermans (IBM)


An Empirical Comparison of Deep Conditional Generative

Models with Applications to Attribute-Guided Face

Synthesis

Kewin Dereniewicz

University of Amsterdam

Wenling Shang

University of Amsterdam

Benjamin Timmermans

IBM

ABSTRACT

When the police are looking for a suspect or a missing person, they often rely on a textual description of the individual provided by a witness or create facial composites, which are proven to have a low success rate. This study proposes a new solution to this problem. Our goal is to create a tool which generates realistic face images based on attributes describing a person. To this end, we investigate several state-of-the-art deep conditional generative models and perform a thorough analysis of their performance, both quantitatively and via human evaluation. Finally, we design and implement an easy-to-use and efficient web application that can assist the police in such scenarios.

KEYWORDS

generative models, face synthesis, conditional generation, suspect sketch

ACKNOWLEDGEMENT

I would like to thank Kihyuk Sohn for the script to train the model used for the inception score, which provided great insight into model evaluation.

I would also like to thank Wenling Shang for providing Figure 3 for the cond-crVAE model, which also served as the base for the other model figures (2, 4).

1 INTRODUCTION

1.1 Background

Burglaries, thefts, assaults and kidnappings are not uncommon events. According to the Netherlands 2018 Crime & Safety Report [1], most of these crimes happen in big cities and tourists are often the main target. In many cases, when there is an incident and the police are trying to find the suspect or missing person, they only hold a verbal description of that person provided by a witness.

One commonly used technique to help identify the individual is creating a forensic sketch, which is simply a drawing of the suspect. To do so, a forensic expert has to be invited to create a sketch based on an interview with the witness. These sketches are then distributed through the media or sent directly to the public to help identify the suspect. This process usually takes a few days and has a very low success rate [2].

Another technique, composite sketches, relies on special software which allows an operator to semi-automatically create a drawing of the suspect. While proven to be superior to forensic sketches, this method still requires a trained expert and has relatively low accuracy [2].

One of the crucial factors in the effectiveness of suspect identification is the delay between the incident and the witness providing a description. A study by Frowd et al. [3] has shown that a witness's recall of a suspect's facial features is significantly reduced after just 2 days of delay. Recall has been shown to be much more accurate and detailed right after the incident, that is, up to 3-4 hours from the event.

1.2 Research problem

In this work we aim to create a tool to assist the Dutch police in such scenarios. Specifically, we build a web application which provides a simple, fast and efficient way to describe a person and then instantly generate realistic face images which can be used to identify the suspect, hopefully more effectively than the existing techniques.

To do so, we use deep conditional generative models. Such a model takes an attribute vector as input and produces a set of diverse face images reflecting those attributes.

Great progress has been made in generative models in recent years. Generative Adversarial Networks (GAN) [4], Variational Autoencoders (VAE) [5] and Autoregressive models [6] are currently the most popular approaches. These models aim at producing an output space that resembles the distribution of the input image space. Conditional GAN (cGAN) [7] and conditional VAE (cVAE) [8] introduce the possibility of altering the generated data by using additional input.

In this work we examine such conditional generative models and use them to produce realistic face images described by given attributes. We perform a thorough analysis of the conditionally generated face images, based on both a separately trained attribute classifier and human evaluation. The final models are used in the web application for fast and efficient suspect identification.

We state the research question of this thesis as follows: How can deep conditional generative models be used efficiently to generate representative images of people for finding suspects and missing persons?

1.3 Overview

The remainder of this thesis is organized as follows: first, we discuss relevant literature and how it relates to this work. Second, we analyze the different generative models, their architectures and implementation details. Thereafter, we evaluate these models in two different ways. Next, we present the web application. Lastly, we briefly summarize what has been done, draw final conclusions and outline future work.


2 RELATED WORKS

2.1 Deep Generative Models

Generative models have been rapidly evolving in recent years. In this section we review several important deep generative model frameworks.

2.1.1 Variational Autoencoder.

The Variational Autoencoder (VAE) [5] approximates the intractable posterior of a directed graphical model by optimizing the variational lower bound using deep neural networks. VAE encodes the input data into a probabilistic latent space and regularizes the hidden representation with a standard Gaussian prior. VAE is relatively easy to implement as well as stable to train, and it allows for efficient inference. However, it lacks the ability to express complex input spaces and the output images are often blurry.

The channel-recurrent Variational Autoencoder (crVAE) [9] introduces a major improvement to VAE. It enhances the generation quality of VAE by realizing a progressive transformation and generation of the latent space via an LSTM [10] connecting blocks of latent subspaces. This model produces more realistic-looking images while maintaining the same level of computational efficiency.

2.1.2 Generative Adversarial Network.

The concept of the Generative Adversarial Network (GAN) [4] is based on a game between two neural networks. The first one, called the generator, captures the data distribution and generates new samples. The other one, called the discriminator, tries to estimate the probability that a given sample came from the true distribution rather than the model distribution. Whenever the discriminator notices a difference between the two distributions, the generator adjusts its parameters to produce a more indistinguishable sample space. The end goal of the generator is to make the discriminator guess randomly.

GANs are capable of generating very sharp images but they are limited to modeling high-density regions of a data distribution [11]. They are also unstable to optimize. Recent works [12–15] have attempted to improve GAN stability by introducing new metrics, semi-supervised objectives, normalization techniques and new gradient penalty schemes, but none of them completely resolves this issue.

2.1.3 Autoregressive Model.

Autoregressive models learn an explicit distribution at the pixel level. One of the prime examples is PixelRNN [16], which sequentially predicts the pixels in both the horizontal and vertical directions. It uses a softmax loss, optimizes stably and generates realistic images. The sampling process, however, is relatively inefficient and training such a model requires significant computational power.

2.1.4 Combinations.

There have been many attempts to combine the above models to achieve better results. In PixelVAE [17] the authors incorporate VAE into PixelCNN [18], but the model still requires significant computational resources. VAE-GAN [19] proposes to combine VAE and GAN to improve the quality of output images while preserving the comprehensive data coverage of the latent space. Adversarial Variational Bayes [20] introduces an auxiliary discriminative network that allows the maximum-likelihood problem to be rephrased as a two-player game, establishing a connection between VAE and GAN. This model, however, still suffers from blurry outputs.

2.2 Deep Conditional Generative Models

Conditional models allow the generation process to be controlled through additional input. This gives rise to numerous new applications, such as text-to-image generation or 3D modeling. In what follows, we discuss a few examples of conditional models and then state our focus in this thesis.

2.2.1 Conditional Variational Autoencoder.

Conditional VAE was introduced by Kingma et al. [8] and is able to disentangle content from certain styles by constructing a directed graphical model from both the content latent space and the style information to the input space. In [21], the authors attempt to apply conditional VAE to attribute-guided face generation, yet the results appear blurry and of low visual quality. Their more advanced model, which outputs improved generations, requires expensive background/foreground labeling.

2.2.2 Conditional Generative Adversarial Network.

The Conditional Generative Adversarial Network (cGAN) [7] extends GAN to a conditional setting. In [22] the authors used cGAN to generate human faces with specific attributes. The output images reflect the specified attributes but they contain a high number of artifacts.

2.2.3 Conditional Autoregressive Models.

A work by Reed et al. [23] extends PixelCNN to condition on, e.g., text information and trains in a heavily parallelized system. The model achieves a major improvement in computational efficiency compared to PixelCNN, mostly by removing a great number of spatial dependencies. However, their parallelization scheme requires significant engineering and computational resources.

2.2.4 Combinations.

Conditional VAE-GAN (cVAE-GAN) [24] transfers VAE to an adversarial setting with additional class label information. The adversarial loss is additionally regularized by feature matching. This model also takes advantage of statistic and pairwise feature matching to reduce training time and make training more stable. cVAE-GAN is capable of generating high-quality images while efficiently utilizing conditional information. However, the proposed model is very sensitive to engineering details and, without an open-sourced code base, it is extremely difficult to reproduce its results.

2.2.5 Our focus.

Due to computational and time constraints, this study focuses on efficient models which do not require substantial resources. We explore cVAE and cGAN, and extend crVAE to a conditional setting.


3 METHODOLOGY

In what follows, we present the different deep conditional generative models compared in this thesis and provide descriptions of the dataset, model architectures and implementation details.

3.1 Deep Conditional Generative Models

This section describes the theory behind each model, their architectures and how they are extended to a conditional setting.

3.1.1 cVAE.

This model is based on the concept of an autoencoder [25]. It consists of two neural networks: the encoder and the decoder. The former takes an input and converts it to a latent representation, while the latter tries to convert this representation back to the original input. The loss function, called the reconstruction loss, penalizes the model for generating outputs that differ from the inputs. The main problem with such a network is that its main functionality is replication, not generation: the latent space is not probabilistic, hence there is no principled way to sample from it.

Variational Autoencoders [5] solve this problem by regularizing the latent space to follow a pre-defined prior. In this case, we choose the commonly used standard normal distribution as our prior. Specifically, VAE achieves this by inferring the mean and standard deviation, (µ_i, σ_i), of the approximated posterior of a given input X_i. The prior regularization of VAE further introduces a Kullback-Leibler (KL) divergence [26] loss, which measures how much the approximated posteriors diverge from the prior. The VAE architecture is shown in Figure 1.

Figure 1: VAE architecture

The overall objective, i.e. the reconstruction loss and the KL divergence, corresponds to the following variational lower bound, where x comes from the input space and z from the latent space:

\log p_\theta(x) \geq -D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] = \mathcal{L}(\theta, \phi; x)    (1)

where θ and ϕ are the generative and variational parameters, respectively.

To extend VAE to its conditional version, cVAE [8], we additionally condition both networks on a new attribute component y, as can be seen in Figure 2. The conditional variational lower bound objective can now be expressed in the following form:

\log p_\theta(x|y) \geq -D_{KL}(q_\phi(z|x, y) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z|x, y)}[\log p_\theta(x|z, y)] = \mathcal{L}(\theta, \phi; x, y)    (2)

Figure 2: cVAE architecture
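To make the objective in Eq. (2) concrete, the snippet below sketches the corresponding loss in PyTorch rather than the Torch/Lua of our actual implementation; the `decoder` module, the L2 reconstruction term and the KL weight `alpha` are illustrative placeholders for the architecture and weighting described in Sections 3.3 and 3.4.

```python
import torch
import torch.nn.functional as F

def cvae_loss(decoder, mu, logvar, x, y, alpha=1.0):
    """Negative conditional ELBO (Eq. 2): reconstruction + weighted KL.

    mu, logvar parameterize q(z|x, y); the decoder models p(x|z, y).
    """
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)

    # Reconstruction term E_q[log p(x|z, y)], here an L2 loss summed over pixels
    x_hat = decoder(z, y)
    recon = F.mse_loss(x_hat, x, reduction="sum")

    # KL(q(z|x, y) || N(0, I)) in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon + alpha * kl
```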

3.1.2 cond-crVAE.

VAE suffers from a few fundamental problems, one of the most significant being the difficulty of capturing more complex image spaces, partially due to the fully-connected layers in VAE, which cannot extract local details well despite conveying global information. This issue has been alleviated by crVAE [9]. By including recurrent connections between channels for both inference and generation via an LSTM module, it increases its capability of capturing both global abstract information and local details. In this work, we extend the unsupervised crVAE to a conditional version, named cond-crVAE, following the same graphical model as the cVAE introduced above. Hence cond-crVAE is optimized via the same variational lower bound (2) as our cVAE model. The only difference in training objective between cond-crVAE and cVAE is the weighted KL: the first half of the latent space bears a larger KL penalty to carry more overall information, while the second half has its KL penalty reduced to allow more granular details.

Figure 3: cond-crVAE architecture
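A minimal sketch of this weighted-KL idea follows, assuming the latent blocks are stacked along a time-step dimension; the tensor layout and split point are illustrative (Section 3.4 reports weights 0.0003 for the first 3 of 8 steps and 0.0002 for the rest).

```python
import torch

def weighted_kl(mu, logvar, split=3, w_early=3e-4, w_late=2e-4):
    """Weighted KL penalty over blocks of the latent space.

    mu, logvar: (batch, steps, dim) Gaussian parameters, one block per LSTM
    time step. Earlier blocks get a larger KL weight so they carry coarse,
    global information; later blocks get a smaller weight to keep detail.
    """
    # Closed-form KL to N(0, I) per block, summed over the latent dimension
    kl_per_step = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=2)
    return (w_early * kl_per_step[:, :split].sum()
            + w_late * kl_per_step[:, split:].sum())
```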

3.1.3 cGAN.

The Generative Adversarial Network [4] consists of two neural networks: the generator G and the discriminator D. The term adversarial comes from the fact that those two networks compete against each other. The generator G generates new data instances, while the discriminator D tries to tell apart real from generated images and returns the probability of an image being real, where conventionally 0 means fake and 1 means real. The generator G wants to maximize the chance of the discriminator making a mistake, that is, to make it guess completely randomly.

Simply put, the generator G and the discriminator D play a two-player minimax game, forming the following min-max objective V(D, G):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (3)

Both networks are trained simultaneously.

GAN can be extended to a conditional version (cGAN [7]) by feeding additional input y to both the generator and the discriminator.

The objective function is hence:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))]    (4)

In this case for the discriminator to output 1, not only does the generated data have to look realistic but it also has to reflect the conditional information provided by y.

Figure 4: cGAN architecture
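The alternating updates implied by Eq. (4) can be sketched as below. This is a hedged PyTorch illustration, not our Torch implementation: `G`, `D` and the optimizers are placeholders, uniform noise follows the layout in Appendix B, and the generator update uses the common non-saturating variant of the objective.

```python
import torch
import torch.nn.functional as F

def cgan_step(G, D, opt_g, opt_d, x_real, y, z_dim=1024):
    """One alternating update of Eq. (4). G(z, y) -> image, D(x, y) -> probability."""
    batch = x_real.size(0)
    real_label = torch.ones(batch, 1)
    fake_label = torch.zeros(batch, 1)

    # --- Discriminator: push D(x|y) towards 1 and D(G(z|y)) towards 0 ---
    z = torch.rand(batch, z_dim)            # uniform noise, as in Appendix B
    x_fake = G(z, y).detach()
    d_loss = (F.binary_cross_entropy(D(x_real, y), real_label)
              + F.binary_cross_entropy(D(x_fake, y), fake_label))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator: push D(G(z|y)) towards 1 (non-saturating form) ---
    z = torch.rand(batch, z_dim)
    g_loss = F.binary_cross_entropy(D(G(z, y), y), real_label)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    return d_loss.item(), g_loss.item()
```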

3.2 Data description

The CelebA [27] dataset is used to train the models in this work. It consists of over 200K face images of celebrities. In addition, each image has 40 binary attribute annotations describing the person. The attributes vary from general ones, such as gender or hair color, to more specific ones, like arched eyebrows or a pointy nose. This dataset was chosen for this study thanks to the large quantity of available images as well as the attribute annotations, which are essential for effective training of a conditional generative model.

The dataset comes with a predefined split between training (160K images), validation (20K images) and test (20K images) sets. Models are trained using the training set and then the validation set is used to track the progress and select the best model. Attributes from the test set are used to generate images in order to evaluate model performance.

To make computations feasible in terms of time and memory, all images have been resized to 64×64 pixels. The original dataset contains images showing the whole upper body of a person, but this study focuses on faces in particular. Hence we use a cropped version of those images which shows only the face. Furthermore, we perform data augmentation by randomly flipping images horizontally during training.
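A data pipeline along these lines can be sketched with torchvision; the exact crop size and the use of `datasets.CelebA` are assumptions, since our original preprocessing used pre-cropped images.

```python
import torch
from torchvision import datasets, transforms

# Crop the aligned CelebA images to the face region, resize to 64x64, and
# apply random horizontal flips for data augmentation (training only).
train_tf = transforms.Compose([
    transforms.CenterCrop(148),          # assumed crop size for the face region
    transforms.Resize(64),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.CelebA(root="data", split="train",
                            target_type="attr", transform=train_tf,
                            download=True)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=4)

images, attrs = next(iter(train_loader))   # attrs: (128, 40) binary attributes
```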

3.3 Model architecture

All three conditional generative models share the same decoder (i.e. image generation network) architecture. It consists of a series of deconvolutional layers, each followed by batch normalization and a LeakyReLU activation.

The encoder for cVAE and cond-crVAE uses multiple convolutional layers. Similarly to the decoder, each one is followed by batch normalization and an activation function. However, before the input is fed into the activation layer, it is concatenated with its negation to prevent information loss in later stages, forming the CReLU non-linearity [28].
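The CReLU construction can be illustrated with a short PyTorch sketch; the convolution parameters are only an example of one encoder block, not the exact layout in Appendix B.

```python
import torch
import torch.nn as nn

class CReLU(nn.Module):
    """Concatenated ReLU [28]: concatenates ReLU(x) and ReLU(-x) along the
    channel axis, preserving both positive and negative phase information
    (the output has twice as many channels as the input)."""
    def forward(self, x):
        return torch.cat([torch.relu(x), torch.relu(-x)], dim=1)

# Example encoder block: convolution -> batch normalization -> CReLU
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2),
    nn.BatchNorm2d(32),
    CReLU(),                    # output now has 64 channels
)
```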

The discriminator in cGAN uses a few designs specific to adversarial training. It makes use of batch discrimination layers [12], which look at multiple examples combined instead of single ones to help prevent generator collapse. To add further regularization and prevent the discriminator from becoming too powerful too quickly, dropout [29] and max-pooling layers are included.

Model layouts can be found in Appendix B.

3.4 Implementation Details

Training is done on a server equipped with a Tesla K80 graphics card with 12 GB of memory. The models use a fixed batch size of 128 and are trained for 150 epochs. Model snapshots and generation outputs were saved frequently to track progress.

Our implementation uses Torch [30], a scientific computing framework with wide support for machine learning algorithms. It allows the use of GPUs to improve performance and contains a large ecosystem of community-driven packages.

Training each model takes 3-4 days in most cases. Each batch update takes approximately 1.6-1.8 seconds for cVAE and approximately 1.0-1.2 seconds for the other models. cVAE, cGAN and cond-crVAE use initial learning rates of 0.0001, 0.0002 and 0.003 respectively with the Adam optimizer [31]. In cGAN, beta1 for Adam is set to 0.5 and beta2 to 0.9. For cVAE, we do not average the reconstruction loss over the output dimensions and hence weight the KL term with alpha set to 0.0003 during training; for cond-crVAE, similarly, the recurrent module is composed of 8 time steps and we weight the KL terms corresponding to the first 3 time steps by 0.0003 and the rest by 0.0002.
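For reference, the sketch below mirrors these optimizer settings in PyTorch; the network objects are dummy placeholders and the PyTorch Adam defaults stand in for any unstated Torch options.

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the actual networks (illustration only).
cvae, cond_crvae, generator, discriminator = (nn.Linear(8, 8) for _ in range(4))

# Initial learning rates as reported above; cGAN additionally uses betas (0.5, 0.9).
opt_cvae = torch.optim.Adam(cvae.parameters(), lr=1e-4)
opt_crvae = torch.optim.Adam(cond_crvae.parameters(), lr=3e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.9))
```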

4 RESULTS

In this chapter we take a closer look at the generated images from each model and analyze them using two different methods: quantitative analysis and human evaluation.

It remains an open question what the optimal way is to evaluate generative models besides human evaluation. In this thesis, inspired by related works, we employ two evaluation metrics to quantitatively assess the three deep conditional generative models: first, we use an independent attribute classifier trained from scratch, as done in [8], to judge whether the generations are well-conditioned; second, we calculate the inception score, as done in [12], to test visual fidelity.

4.1 Attribute Quality

An automatic way to quantitatively evaluate how faithful our generations are to the assigned attributes is to separately train an attribute prediction network and evaluate its performance on the generated samples.

To train the attribute prediction network, we use CelebA's training split as the training set and its validation split to select the best model. The attribute prediction network retains the same architecture as the encoder for cVAE but with additional FC layers leading to binary classification for the 40 attributes. It is optimized for 150 epochs with the Adam optimizer [31], an initial learning rate of 0.0002, and a mini-batch size of 128; based on the validation accuracy, we select the model trained after 52 epochs. We summarize the network's performance on the CelebA testing set in Tables 2 and 3 in Appendix A. As a reference, it achieves an average accuracy of 0.90 across the 40 CelebA attributes.
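Evaluation with the attribute prediction network then reduces to thresholding its outputs and comparing them against the conditioning attributes; a minimal sketch, assuming the classifier outputs one logit per attribute, is shown below.

```python
import torch

@torch.no_grad()
def attribute_accuracy(classifier, images, attrs, threshold=0.5):
    """Per-attribute accuracy of the independent attribute classifier.

    images: (N, 3, 64, 64) generated samples; attrs: (N, 40) binary targets.
    Returns a tensor of 40 accuracies, one per CelebA attribute.
    """
    probs = torch.sigmoid(classifier(images))      # (N, 40) attribute probabilities
    preds = (probs > threshold).float()
    return (preds == attrs).float().mean(dim=0)
```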


Figure 5: Comparison of generated images based on the same set of attributes using different models: (a) Ground Truth, (b) cVAE, (c) cGAN, (d) cond-crVAE

Model          Inception score
cVAE           10.689 ± 0.207
cGAN           17.083 ± 0.259
cond-crVAE      8.062 ± 0.103

Table 1: Inception scores for each model

Next, we generate samples with cVAE, cGAN and cond-crVAE using the testing-set attribute vectors from CelebA, feed the generated samples from the three models to the attribute prediction network and summarize the results in Tables 2 and 3 in Appendix A. The average accuracy across the 40 attributes is 0.87 for cVAE, 0.89 for cGAN and 0.87 for cond-crVAE. The generated samples from all the models result in similar attribute prediction accuracies, which are also very close to that of the ground truth images (0.90). Overall, we cannot draw a decisive conclusion regarding which model performs best at attribute modulation based on this suite of results: not only are the quantitative differences too small to be statistically meaningful, but we also later find that the prediction accuracy measured with the attribute prediction network in some cases diverges significantly from the human evaluation results (see Section 4.3). Nonetheless, these quantitative results still signify that the generated samples capture the assigned attributes to a certain degree.

4.2 Inception Score

In [12], the authors propose to use the inception score as a numerical measurement of the generated images' visual quality, which is suggested to positively correlate with human preference. Mathematically, given a model trained to classify categories of objects and a generative model trained to generate from the pool of objects, the inception score is defined as

I = \exp\left(\mathbb{E}_x\left[D_{KL}(p(c|x) \,\|\, p(c))\right]\right)

where x is a generated image and c is the class label. Intuitively, the inception score quantifies both the diversity and the meaningfulness of the generated images.
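A minimal NumPy sketch of this computation, including the 10-fold averaging described below, is given here; it assumes the classifier's softmax outputs for all generated images have already been collected in a single array.

```python
import numpy as np

def inception_score(probs, n_folds=10, eps=1e-12):
    """Inception score from class posteriors p(c|x).

    probs: (N, C) softmax outputs of the face-identity classifier on the
    N generated images. Scores are computed per fold and then averaged.
    """
    scores = []
    for fold in np.array_split(probs, n_folds):
        p_c = fold.mean(axis=0, keepdims=True)                # marginal p(c)
        kl = fold * (np.log(fold + eps) - np.log(p_c + eps))  # KL(p(c|x) || p(c))
        scores.append(np.exp(kl.sum(axis=1).mean()))
    return float(np.mean(scores)), float(np.std(scores))
```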

The classification model used to calculate the inception score in our case is a face recognition model trained on the CASIA dataset [32], which is composed of 494,414 face images from 10,575 different identities. Specifically, we first train a convolutional neural network with 8 convolution layers and spatial max-pooling, followed by three FC layers, on ImageNet [33]. The network is equipped with the CReLU [28] nonlinearity for the convolution layers and the ReLU nonlinearity for the FC layers. Then, we fine-tune the ImageNet-pretrained CNN on the CASIA dataset by resetting the last FC layer to have 10,575 output classes. The learning rate is set to 0.0001 for fine-tuning, while the last fully-connected layer has a 10 times larger learning rate as its weights are re-initialized.
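This fine-tuning setup (a re-initialized final layer with a 10x larger learning rate) can be expressed with per-parameter-group learning rates; the sketch below is a PyTorch analogue in which the head attribute name `fc_out` and the choice of optimizer are assumptions.

```python
import torch
import torch.nn as nn

def build_finetune_optimizer(model, base_lr=1e-4):
    """Reset the classifier head to the 10,575 CASIA identities and give the
    re-initialized layer a 10x larger learning rate than the backbone."""
    model.fc_out = nn.Linear(model.fc_out.in_features, 10575)  # fresh head
    head_ids = {id(p) for p in model.fc_out.parameters()}
    backbone = [p for p in model.parameters() if id(p) not in head_ids]
    return torch.optim.SGD([
        {"params": backbone, "lr": base_lr},
        {"params": model.fc_out.parameters(), "lr": 10 * base_lr},
    ], momentum=0.9)
```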

During evaluation, we again generate images based on the CelebA testing set attributes with cVAE, cGAN and cond-crVAE and then calculate the inception score for each model. We randomly split the 20K testing images into 10 folds and calculate the inception score for each fold, reporting the mean and standard deviation. The results are summarized in Table 1.

Surprisingly, cGAN achieves a significantly better inception score than the VAE-based models. However, in Figure 5, the generations from cGAN are visually inferior to the others. Such a contradiction calls into question the validity of using inception scores to measure generation quality, agreeing with observations from recent works [34]. Indeed, later on, we show that the human evaluation results again diverge from the quantitative assessment, this time from the inception score.

4.3 Human evaluation

For human evaluation, we created a crowdsourcing task shared with a group of people. The objective is to analyze model performance based on human perception of generated images. Once more, we use attributes from the testing set to gather the data for this experiment from each deep conditional generative model.

4.3.1 Survey Setup.

Each survey task consists of three images, one generated by each model with the same set of attributes. The images are shuffled every time to prevent bias towards any model. The user is presented with those three images and asked two questions.

Figure 6: Crowdsourcing task

The first question is to compare the images and select the one that looks the most realistic. Here we focus on evaluating the generated image quality.

The second question focuses on how well the attributes are represented in each image. The user is given a list of positive (present) attributes used to generate the given images, along with male/female, has_beard or not, and young/old. For each attribute, the user selects the images which represent it best.

An example of a survey task can be seen in Figure 6. There were 16 participants and 270 unique sets of images evaluated. The full list of attribute scores is presented in Tables 2 and 3 in Appendix A.

4.3.2 Survey Results.

In the first question the most popular answer was cond-crVAE, which was selected 177 times, followed by cVAE (73) and cGAN (20). These results show that in most cases cond-crVAE generates sharper, more realistic images and is superior to the other models, which is the opposite of the conclusion drawn from the inception scores.

The second question evaluated how well the attributes used to generate the images are represented. For each model, an overall accuracy over all attributes was calculated. The total average accuracy was highest for cond-crVAE (0.78), followed by cVAE (0.70) and lastly cGAN (0.68).

Interestingly, the performance of the models can vary significantly from attribute to attribute. cVAE achieved its highest scores for goatee (1.00), blurry (1.00), male (0.97), female (0.94), no beard (0.92) and eyeglasses (0.90). cGAN achieved its highest scores for bald (1.00), pale skin (1.00), wearing necklace (1.00), wavy hair (0.96), no beard (0.91) and blond hair (0.90). The top-performing attributes for cond-crVAE were goatee (1.00), mustache (1.00), no beard (1.00), female (0.99), male (0.97), smiling (0.96) and attractive (0.92).

On the other hand, cVAE performed poorly on wearing necklace (0.00), wearing earrings (0.13) and wavy hair (0.19); cGAN performed poorly on chubby (0.09), attractive (0.18) and oval face (0.26); cond-crVAE performed poorly on wearing necklace (0.00), gray hair (0.00), wavy hair (0.23) and wearing earrings (0.25). It is worth noting that the poor performance on some attributes might be due to the low resolution of the images, which makes it difficult to detect small objects such as earrings.

cVAE and cond-crVAE performed similarly on most attributes because they are in the same family of generative models. For example, both models were good with facial-hair-related attributes thanks to their clear generation of the face region, but they generally lacked the ability to reflect the correct hair attributes, likely due to blurriness around the boundary. In most cases they were more accurate than cGAN, even for some of the most basic attributes such as gender, where both cVAE and cond-crVAE had an accuracy close to 1.00 while cGAN had only 0.79 for female and 0.88 for male.

However, there were certain areas where cGAN was superior to the other models. One prime example is wavy_hair, where cGAN achieved an accuracy of 0.96 whereas cVAE scored only 0.19 and cond-crVAE 0.23. Interestingly, cGAN performed better than the other models especially for attributes related to hair, probably due to the aforementioned flaw of the VAE-based models, which are often blurry in hair regions.

cond-crVAE tends to perform the best on average and there are many attributes for which it had a much higher accuracy than both cVAE and cGAN. For instance, attractive was well represented by cond-crVAE (0.92) while the other models performed poorly (0.63 for cVAE and 0.18 for cGAN). Similarly, for young, cond-crVAE greatly outperformed the other models.

Meanwhile, no single model appears to be significantly superior to the rest on all attributes. cond-crVAE is the best on average but ineffective in some areas where the other models excel.

Even though the models investigated in this thesis still lack accuracy and efficiency in some areas, the survey confirmed that all of the deep conditional generative models are capable of representing attributes in generated images to a substantial degree, which shows great potential for our desired attribute-guided face synthesis applications.

5 WEB APPLICATION

This chapter describes the web application, the final product of this work. We design a clear and easy-to-use website which utilizes the generative models on the server side to allow selection of the desired facial features and instant generation of the images.

5.1 Design

The user is presented with an interface that allows them to describe what a person looks like. Related attributes, such as hair colors, are grouped together for easier browsing. Some groups allow multiple selections (the mouth can be both smiling and wearing_lipstick) while others are limited to just one, for example gender (male or female) or age (young or old).

After the attributes are selected, the user can choose one or more models to use. Upon clicking Generate, a request with the selected attributes is sent to the server for each chosen model. The server runs the model and, once the images are generated, sends them back to the front end for display. The user can then select an image and save it, or choose to regenerate, receiving a set of new images within seconds.
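This request flow can be illustrated with a small server-side sketch. It is not the actual application code: the Flask framework, the `/generate` route, the JSON payload shape and the `run_model` stub are all assumptions made for illustration.

```python
import base64
import io

from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

def run_model(model_name, attributes):
    # Placeholder for loading the chosen generative model and sampling
    # 64x64 face images conditioned on the selected attributes.
    return [Image.new("RGB", (64, 64)) for _ in range(3)]

@app.route("/generate", methods=["POST"])
def generate():
    payload = request.get_json()
    images = run_model(payload["model"], payload["attributes"])

    # Encode the generated images as base64 PNGs for the front end.
    encoded = []
    for img in images:
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        encoded.append(base64.b64encode(buf.getvalue()).decode("ascii"))
    return jsonify({"images": encoded})
```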


Figure 7: Web Application

5.2 Advantages

There are many ways in which such a web application is superior to current solutions for suspect identification.

The first one is speed. As discussed in Section 1.1, it is of major importance to interview the witness as quickly as possible after the incident. Current techniques may take up to a few days for that. Our proposed solution solves this problem, as it only requires a laptop or a smartphone and an Internet connection, and the witness can immediately try to identify the suspect.

Another major advantage is the cost. Hiring a forensic sketch expert or training employees to use sophisticated software is an expensive investment. By using our web application, those costs can be minimized. Once the application is built, the only cost is hosting a server or a workstation, which is considerably lower than the previously mentioned expenses.

Furthermore, this application can potentially have a higher success rate than conventional methods. Based on our analysis and results, the generated images have a high level of realism and represent the selected attributes with high accuracy. Even if the currently generated images do not resemble the suspect, it takes only a few seconds to generate new ones, and this process can be repeated as many times as necessary.

6 CONCLUSION

The overall goal of this thesis was to investigate a new method of suspect identification which would be faster, cheaper and more accurate than current solutions. To achieve this goal, we examined three important deep conditional generative models, namely the conditional Variational Autoencoder, the conditional Generative Adversarial Network and the conditional channel-recurrent Variational Autoencoder. A thorough performance evaluation of those models was conducted, including quantitative analysis and human evaluation. The analysis confirmed the quality of the outputs both in terms of image quality and the presence of the attributes specified in the generation process. We then designed a novel web application, where a user can select facial features describing a person and generate realistic face images using the deep generative models presented in this thesis.

There are many aspects which leave room for improvement in future research. Generated images from cVAE and cond-crVAE, despite their overall high quality and realism, are often blurry at the boundaries, especially in hair regions. The cGAN model looks promising in preserving attribute assignments but currently generates poor-quality images. Furthermore, given that cVAE and cond-crVAE perform better on one set of attributes and cGAN outperforms them on the others, it is worth investigating a combination of such models.

Another improvement would be to use a higher number of attributes and experiment with different forms of input, such as natural language descriptions. The results could also potentially be better if the models were trained on higher-resolution images.

REFERENCES

[1] Overseas Security Advisory Council, “Netherlands 2018 crime & safety report.” https://www.osac.gov/Pages/ContentReportDetails.aspx?cid=23530, Feb 2018. Accessed on 2018-05-28.

[2] Scott Klum, Hu Han, Anil K. Jain, Brendan Klare, “Sketch based face recognition: Forensic vs. composite sketches,” 2013.

[3] C. Frowd, D. Carson, H. Ness, D. McQuiston, J. Richardson, H. Baldwin, P. Hancock, “Contemporary composite techniques: The impact of a forensically-relevant target delay,” Legal and Criminological Psychology, pp. 2–5, 2005.

[4] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative adversarial nets,” arXiv, 2014.

[5] Diederik P. Kingma, Max Welling, “Auto-encoding variational bayes,” ICLR, 2014.

[6] Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Dan Belov, Nando de Freitas, “Parallel multiscale autoregressive density estimation,” ICML, 2017.

[7] Mehdi Mirza, Simon Osindero, “Conditional generative adversarial nets,” arXiv, 2014.

[8] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling, “Semi-supervised learning with deep generative models,” NIPS, 2014.

[9] Wenling Shang, Kihyuk Sohn, Yuandong Tian, “Channel-recurrent autoencoding for image modeling,” WACV, 2018.

[10] Sepp Hochreiter, Jürgen Schmidhuber, “Long short-term memory,” 1997.

[11] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, Yi Zhang, “Generalization and equilibrium in generative adversarial nets (gans),” ICML, 2017.

[12] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, “Improved techniques for training gans,” 2016.

[13] Duhyeon Bang, Hyunjung Shim, “Improved training of generative adversarial networks using representative features,” 2018.

[14] Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” 2018.

[15] Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira, “On convergence and stability of gans,” 2017.

[16] Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, “Pixel recurrent neural networks,” 2016.

[17] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville, “Pixelvae: A latent variable model for natural images,” arXiv, 2016.

[18] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu, “Conditional image generation with pixelcnn decoders,” 2016.

[19] A. B. L. Larsen, S. K. Sonderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” ICML, 2016.

[20] L. Mescheder, S. Nowozin, and A. Geiger, “Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks,” ICML, 2017.

[21] Xinchen Yan, Jimei Yang, Kihyuk Sohn, Honglak Lee, “Attribute2image: Conditional image generation from visual attributes,” ECCV, 2016.

[22] Jon Gauthier, “Conditional generative adversarial nets for convolutional face generation,” Technical Report, 2015.

[23] Scott Reed, Aaron van den Oord, Nal Kalchbrenner, Sergio Gomez Colmenarejo, Ziyu Wang, Dan Belov, Nando de Freitas, “Parallel multiscale autoregressive density estimation,” ICML, 2017.

[24] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, Gang Hua, “Cvae-gan: Fine-grained image generation through asymmetric training,” 2017.

[25] D.E. Rumelhart, G.E. Hinton, R.J. Williams, “Parallel distributed processing. Vol 1: Foundations,” MIT Press, 1986.

[26] Solomon Kullback, Richard A. Leibler, “On information and sufficiency,” 1951.

[27] Ziwei Liu, Ping Luo, Xiaogang Wang, Xiaoou Tang, “Deep learning face attributes in the wild,” ICCV, 2015.

[28] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and improving convolutional neural networks via concatenated rectified linear units,” in International Conference on Machine Learning, 2016.

[29] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” 2012.

[30] Ronan Collobert, Samy Bengio, Johnny Marithoz, “Torch: A modular machine learning software library,” 2002.

[31] Diederik P. Kingma, Jimmy Ba, “Adam: A method for stochastic optimization,” 2014.

[32] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.

[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[34] S. Barratt and R. Sharma, “A note on the inception score,” arXiv preprint arXiv:1801.01973, 2018.


Appendices

A ATTRIBUTE SCORES TABLES

                        Human evaluation             Classifier
attribute               cVAE  cGAN  cond-crVAE   cVAE  cGAN  cond-crVAE  real
no_beard                0.92  0.91  1.00         0.93  0.93  0.95        0.95
male                    0.97  0.88  0.97         0.98  0.97  0.99        0.98
female                  0.94  0.79  0.98         -     -     -           -
smiling                 0.90  0.87  0.96         0.95  0.95  0.90        0.92
goatee                  1.00  0.50  1.00         0.97  0.97  0.97        0.97
mouth_slightly_open     0.87  0.70  0.89         0.91  0.93  0.93        0.93
eyeglasses              0.90  0.80  0.70         0.97  0.99  0.97        0.99
black_hair              0.73  0.78  0.84         0.85  0.89  0.78        0.89
rosy_cheeks             0.60  0.80  0.90         0.93  0.94  0.94        0.95
straight_hair           0.77  0.62  0.90         0.73  0.79  0.69        0.81
pale_skin               0.75  1.00  0.50         0.97  0.97  0.92        0.97
mustache                0.75  0.50  1.00         0.97  0.97  0.97        0.97
bald                    0.75  1.00  0.50         0.98  0.98  0.98        0.99
blurry                  1.00  0.38  0.88         0.75  0.95  0.92        0.96
arched_eyebrows         0.63  0.73  0.85         0.79  0.80  0.80        0.82
narrow_eyes             0.88  0.44  0.81         0.89  0.85  0.89        0.87
brown_hair              0.65  0.68  0.79         0.82  0.84  0.82        0.88
pointy_nose             0.58  0.81  0.69         0.74  0.75  0.74        0.76
blond_hair              0.68  0.90  0.47         0.90  0.95  0.88        0.96
big_nose                0.74  0.61  0.65         0.80  0.79  0.81        0.83
young                   0.69  0.47  0.83         0.83  0.82  0.80        0.88
wearing_lipstick        0.52  0.73  0.73         0.87  0.94  0.93        0.94
high_cheekbones         0.59  0.57  0.79         0.88  0.87  0.87        0.87
old                     0.67  0.67  0.47         -     -     -           -

Table 2: Human evaluation attribute accuracy


                        Human evaluation             Classifier
attribute               cVAE  cGAN  cond-crVAE   cVAE  cGAN  cond-crVAE  real
bangs                   0.64  0.72  0.40         0.89  0.97  0.84        0.95
double_chin             0.50  0.50  0.75         0.96  0.96  0.96        0.96
attractive              0.63  0.18  0.92         0.79  0.79  0.67        0.82
wearing_hat             0.43  0.71  0.57         0.97  0.99  0.96        0.99
receding_hairline       0.65  0.76  0.29         0.93  0.93  0.92        0.93
sideburns               0.67  0.67  0.33         0.96  0.96  0.96        0.97
heavy_makeup            0.33  0.72  0.58         0.85  0.90  0.89        0.91
beard                   0.20  0.60  0.80         -     -     -           -
oval_face               0.64  0.26  0.69         0.73  0.71  0.69        0.75
bushy_eyebrows          0.45  0.45  0.64         0.90  0.91  0.91        0.92
chubby                  0.82  0.09  0.64         0.95  0.95  0.95        0.95
big_lips                0.46  0.43  0.64         0.68  0.67  0.70        0.72
bags_under_eyes         0.37  0.74  0.42         0.80  0.83  0.81        0.84
wavy_hair               0.19  0.96  0.23         0.64  0.76  0.64        0.80
5_o_clock_shadow        0.33  0.75  0.25         0.92  0.90  0.92        0.94
gray_hair               0.67  0.67  0.00         0.97  0.97  0.97        0.98
wearing_earrings        0.13  0.88  0.25         0.79  0.84  0.80        0.87
wearing_necklace        0.00  1.00  0.00         0.86  0.86  0.86        0.86
average                 0.70  0.68  0.78         0.87  0.89  0.87        0.90

Table 3: Human evaluation attribute accuracy

B MODEL ARCHITECTURE

Layer explanation:

• Spatial Convolution (nInputPlane, nOutputPlane, kW, kH, dW, dH) - performs a 2D convolution over an input image composed of several input planes. Params: nInputPlane - number of input planes; nOutputPlane - number of output planes; kW - kernel width; kH - kernel height; dW - width step of the convolution; dH - height step of the convolution

• Spatial Batch Normalization (N, eps, mom) - performs batch normalization. Params: N - dimensionality of the input; eps - added to the standard deviation, prevents division by zero; mom - momentum value

• Linear(in, out) - fully connected layer. Params: in - size of input; out - size of output


VAE Encoder Attribute Encoder cGAN encoder

CRELU Linear (40, 1024) Spatial Convolution (3, 32, 5, 5, 1, 1, 2, 2) Spatial Convolution (6, 32, 5, 5, 2, 2, 2, 2) LeakyReLU(0.1) Max Pooling (2,2)

Spatial Batch Normalization (32, 1e-6, 0.9) ReLU

CReLU Dropout (0.5 )

Spatial Convolution (64, 64, 3, 3, 2, 2, 1, 1) Spatial Convolution (32, 64, 5, 5, 1, 1, 2, 2) Spatial Batch Normalization (64, 1e-6, 0.9) Max Pooling (2,2)

CReLU ReLU

Spatial Convolution (128, 128, 3, 3, 2, 2, 1, 1) Spatial Convolution (64, 128, 3, 3, 1, 1, 1,1) Spatial Batch Normalization (128, 1e-6, 0.9) Max Pooling (2,2)

CReLU

Spatial Convolution (256, 256, 3, 3, 1, 1, 1, 1) ReLU

Spatial Batch Normalization (256, 1e-6, 0.9) Spatial Convolution (128, 256, 3, 3, 1, 1, 1,1)

LeakyReLU(0.1) Max Pooling (2,2)

Spatial Convolution (256, 256, 3, 3, 2, 2, 1, 1) ReLU Spatial Batch Normalization (128, 1e-6, 0.9) Linear(4096, 512)

CReLU ReLU

Spatial Convolution (512, 256, 3, 3, 1, 1, 1, 1) Spatial Batch Normalization (256, 1e-6, 0.9)

LeakyReLU(0.1)

Table 4: Encoder for the two VAE models, the attribute encoder shared by all three models, and the cGAN encoder

Image Decoder

SpatialFullConvolution(256, 256, 4, 4, 2, 2, 1, 1) Spatial Batch Normalization(256, 1e-6, 0.9)

LeakyReLU(0.1)

SpatialConvolution(256, 256, 3, 3, 1, 1, 1, 1) Spatial Batch Normalization(256, 1e-6, 0.9)

LeakyReLU(0.1)

SpatialFullConvolution(256, 128, 4, 4, 2, 2, 1, 1) Spatial Batch Normalization(128, 1e-6, 0.9)

LeakyReLU(0.1)

SpatialFullConvolution(128, 64, 4, 4, 2, 2, 1, 1) Spatial Batch Normalization(64, 1e-6, 0.9)

LeakyReLU(0.1)

SpatialConvolution(64, 64, 3, 3, 1, 1, 1, 1) Spatial Batch Normalization(64, 1e-6, 0.9)

LeakyReLU(0.1)

Spatial Convolution(64, 3, 3, 3, 1, 1, 1, 1) Tanh

Table 5: Decoder shared by all three models

cVAE Inference crVAE Inference

Mean: Mean:

Linear (1024,1024) Spatial Convolution (320, 64, 3, 3, 1, 1, 1, 1) Add (64,4,4)

Log Variance: Log Variance: Linear (1024, 1024) 8 x LSTM (640, 512)

Batch Normalization (512, 1e-6, 0.9) 8 x Linear (512, 128)

Table 6: Inference Components for cVAE and cond-crVAE, note that the input to the inference part is the concatenation of the encoded image and attribute.


cVAE Connection crVAE Connection Concate with Encoded Attribute 8 x LSTM (128, 512)

Linear (2048, 4096) 8 x Linear (512, 128) Reshape (256,4,4) Add (64, 4, 4)

Linear (1024, 1024) Concate with Encoded Attribute Spatial Convolution (128, 256, 3, 3, 1, 1, 1, 1)

Spatial Batch Normalization(256, 1e-6, 0.9)

Table 7: Connecting Components for cVAE and cond-crVAE between latent space and the decoder.

Discriminator Generator Concate with Encoded Attribute Uniform Noise (1024)

Linear (5120, 4096) Concate with Encoded Attribute Normalize Linear ( 2048, 4096) Dropout (0.5) Reshape(256,4,4) ReLU Batch Discriminator(100,5) Linear(4196,1) Sigmoid

Table 8: Combination of Image/Feature with Attribute for cGAN.
