
MSc Artificial Intelligence

Master Thesis

Pose Robust Age Estimation with the use of

Pix2pix to Generate the Frontal Face

by

Tycho Koster

10667687

October 31, 2019

36 EC 07/02/2019 - 31/07/2019

Supervisor:

Dr. S. Karaoglu

Co-Supervisor:

Prof. Dr. T. Gevers

Assessor:

Dhr. P. Das


Contents

1 Introduction
2 Related Work
  2.1 Age estimation
    2.1.1 Age estimation for different poses
  2.2 UV Texture Mapping and 3DMM
  2.3 Generative Adversarial Network
    2.3.1 Pix2pix
    2.3.2 UV-GAN
3 Method
  3.1 UV texture map extraction
  3.2 Data augmentation
  3.3 UV texture map completion
4 Datasets
  4.1 WIKI dataset
  4.2 APPA-REAL dataset
  4.3 AgeDB dataset
5 Experiments
  5.1 Implementation details
  5.2 Evaluation metrics
    5.2.1 Quantitative research
    5.2.2 Qualitative results
6 Results
  6.1 Quantitative results
    6.1.1 UV Completion
    6.1.2 Age Estimation
  6.2 Qualitative Results
7 Conclusion


Abstract

In this paper, a new approach to pose robust age estimation is proposed. A problem in current age estimation is that most state-of-the-art models have difficulty accurately predicting ages under extreme poses and occlusions. This thesis aims to create a model that is capable of generating a realistic frontal face for a large variation of poses, so that an accurate age can be estimated. This is done by building upon an image-to-image translation model, called pix2pix, where three discriminators are added to the original model to generate detailed frontal faces. The results show that the model is robust to poses and achieves reasonable results on different datasets with the limited data available for training. With a larger dataset that is more specific to the problem at hand, the results could improve significantly.


1 Introduction

The human face contains significant information about age, expression, gender, race, etc., which has made automatic age estimation an interesting research area. Age estimation has many potential applications in surveillance monitoring [10], Human-Computer Interaction, age-oriented advertisement systems [14] and content access [20].

There have been multiple papers on age estimation based on a single image of a face. In recent years, many experimental results have shown that deep learning methods achieve the best results [8, 11]. Zhang et al. [38] propose a model named recurrent age estimation (RAE), which uses personalized aging patterns and appearance features to accurately predict age. They achieve state-of-the-art performance on different datasets, but the datasets used for evaluation contain limited pose variation and mostly frontal or near-frontal facial images. Liu et al. [22] propose a label-sensitive deep metric learning (LSDML) approach. Their model outperforms state-of-the-art models on multiple datasets containing a high variation of poses. However, they still observe that their performance on frontal face images is better than on images of faces taken from a different angle.

Most existing age estimation methods are based on the assumption that face images are frontal or near frontal [2, 19, 33, 37]. This is because most datasets used for training do not contain enough images for every possible pose. Another reason is that a frontal face contains more information about the age of a person, which results in a more accurate prediction. The main problem that arises from these papers is that the models have difficulty predicting the age when the dataset contains a large pose variation. There is relatively little work concerning pose robust age estimation. However, there are many applications where non-frontal images appear more often than frontal images, for example security camera footage, where people rarely look straight at the camera. Other papers have applied frontalization methods for age estimation, which normalize pose, size, and alignment [6, 34]. They show that frontalizing a face does improve the age estimation results, but frontalization can lead to a significant amount of data being lost, especially for faces with extreme poses or occlusions [3]. Therefore, it is desirable to have a pose robust age estimator [32].

The main research goal of this paper is to create a new pose robust model that estimates the age from a single image of a face, by generating the frontal face with the use of an image-to-image translation model. This model extracts a UV texture map from a face, which is the 2D projection of a 3D model. This texture map contains self-occlusions due to different pose variations. Hence, a full frontal face is generated from this incomplete UV texture map, which is called the complete UV texture map. From this generated complete UV texture map, the age prediction is made. To train such a model, full frontal faces are used as ground truth, and a method is implemented that imitates the process of creating different poses. By generating a full frontal face, the problems of pose variation and self-occlusion that arise in the previously mentioned models disappear. Subsequently, the model has more information about the face to make a more precise age prediction. This results in the following research question:

”Is it possible to create a pose robust age estimation model, with the use of an image-to-image translation model to generate a complete UV texture map?”

The hypothesis for this research is that the model will be able to outperform the same age estimation model trained on the original images, and that it is capable of giving an accurate estimation of the age for every pose, resulting in a pose robust model. To summarize, the key contributions of this paper are:

• An age estimation model that is robust to poses, is capable of accurate age prediction on datasets containing pose variations, and outperforms the same age estimation model trained on the original images.

• A method that replicates the data needed to train the model to generate frontal faces from different viewpoints, by removing different parts of a face to resemble different poses or self-occlusions.

• A model that is capable of generating full frontal faces for a wide variation of poses, so that the generated image contains features useful for age estimation.


2 Related Work

2.1 Age estimation

In the past years, the focus of age estimation has shifted from a manual process to an automated one. Multiple researchers have tried to create a model that automatically predicts the age of a person solely on the basis of an image of the face. In this section, related work on different implementations of automatic age estimation networks is discussed, followed by research on automatic age estimation that focuses on different poses.

In earlier work by Yoo et al. [37], a linear Support Vector Machine (SVM) and Adaboost classifiers are used to train a network for age and gender estimation. They detect and extract faces from video data and select the faces that are taken from a frontal view with the use of a head pose estimation method. Following this, they extract features from the frontal faces with Local Binary Patterns (LBP) and Gabor filters. Finally, they use these features to estimate age and gender with the previously mentioned SVM and Adaboost classifiers.

Eventually the work shifted from linear classifiers to neural networks. Quinn and Lech [27] combined a neural network and an SVM to create an automatic age estimation model. They extract features with orthogonal locality preserving projections (OLPP), which preserve the locality characteristics of images, and a Sobel filter, which detects edges in a face and thereby reveals the presence of wrinkles and other features that give a good indication of a person's age. These features are fed into a multistage binary age estimation system, where every layer makes a binary decision on whether the face is closer to one age or the other, finally converging to a narrow choice between two ages.

Afterwards, automatic age estimation models moved from shallow neural networks to deep learning. Levi and Hassner [21] propose a small and simple CNN model that is trained on unconstrained images of faces. Their model is trained on the Adience benchmark, a dataset with age and gender labels in which the images were automatically uploaded to Flickr from smartphone devices. This results in a dataset of highly unconstrained images, which reflect the real-world challenges of faces on the internet. Their model consists of 3 convolutional layers and 2 fully connected layers; they use a small network to avoid over-fitting. The input is an unconstrained image of a face and the output is the predicted age group and gender. They noticed that most mistakes made by their model were caused by images being of low resolution or containing occlusions. Their model also predicts age groups, whereas the proposed model focuses on precise age estimation, which is more difficult.

Another paper that focuses on automatic age estimation is by Qawaqneh et al. [26]. They propose a model that fine-tunes the pre-trained VGGFace model, which is trained on a large dataset for face recognition tasks, to perform age classification by replacing the last fully connected layer with an age classifier. Using the pre-trained model avoids over-fitting on a small dataset. This model is tested on the same dataset as the aforementioned work and achieves a better accuracy, but it still performs poorly on images that do not have a frontal or near-frontal view of the face.

Building on the idea of using a pre-trained model to increase the performance of age estimation, Rothe et al. [29] propose a model called DEX, which uses VGG-16 pre-trained on the ImageNet dataset and fine-tunes it on the IMDB-WIKI dataset. This dataset contains over 500K images of faces with age labels. They achieve state-of-the-art results on different benchmarks, which is why DEX is compared to the proposed model in this paper.

Another paper that improves upon the DEX model is residual DEX by Agustsson et al. [1]. This paper adds a second model that specializes in the difference between the predicted age and the real age. This model predicts the residual between the predicted age and the real age, so that this information can be taken into account when computing the final prediction. Their paper shows that this slightly improves the age estimation results, which is why it is also used as a comparison for the proposed model.

2.1.1 Age estimation for different poses

There have been few papers that address the problem of high pose variation in age estimation. Bukar and Ugail [5] propose a model that estimates ages from side-view images of a face. The authors use a pre-trained deep residual neural network to extract features and use a partial least-squares regression technique to estimate the ages. Their model shows promising results while having less information than frontal-view images. They do state that their performance degrades when they evaluate on unconstrained datasets, meaning it does not perform well for every pose.

In another work, Song et al. [32] propose a multi-view age estimator based on video images. They train an age estimation model by combining similar faces in different poses and their spatial information with a small set of faces with labeled ages. This framework unifies supervised learning, multi-view learning and transfer learning into a multi-view age estimator. The small set of faces with labeled ages is used to enforce the face-to-age relation. Subsequently, the images of the same face in different poses are used to enforce the age-equivalence relation. Finally, the spatial relation of the models can be indicated by matching the appearance of faces across poses. This age estimator shows promising results in predicting ages from different viewing poses, making it very universal. However, the mean standard error of their model is generally higher than that of other models, because not all pose variations are present in the video data they used.

All of these methods have issues with accurately predicting the ages of faces that contain occlusions, as well as faces that differ strongly in pose, because most datasets used for training do not contain every possible pose or every occlusion that can occur in a face. Images with many occlusions also tend to contain less information about the age of a person. This results in poor performance on images in the wild. Therefore, in this paper a model is proposed that predicts ages from a single image of a face by generating the full frontal face, removing any occlusion from the face and making it robust to poses.

2.2 UV Texture Mapping and 3DMM

UV texture mapping is a technique that maps a 2D image texture onto a 3D mesh, where ”U” and ”V” are the names of the axes of the 2D plane. These UV coordinates are mapped to the ”X”, ”Y” and ”Z” coordinates of the 3D space. A UV texture map is thus a flat representation of a 3D model, which means that a 3D model of an image has to be created before a UV texture map can be extracted. A technique called 3D Morphable Model (3DMM) fitting is capable of creating such a 3D model from a 2D image [16]. The 3DMM is a model created from 3D face data using Principal Component Analysis (PCA) and contains prior knowledge about the human face. The 3DMM fitting technique estimates the 3D shape, texture, pose and illumination from a single 2D image. There have been multiple papers on 3DMM fitting techniques, which are discussed below.

The 3DMM was first proposed by Blanz and Vetter, who estimate the 3D shape and texture from single face images by fitting a 3D morphable model of faces. They iteratively fit the input faces with textured 3D scans of heads and reconstruct the 3D texture and shape based on the matched 3D scan. They show that a 3DMM is a powerful and versatile representation of human faces. However, they only show the use of a 3DMM on faces without much complexity (i.e. faces that do not contain glasses, beards or different expressions). They also perform the fitting process on all facial pixels, which results in a high computational load. Recently, many methods have been proposed to improve the efficiency and accuracy of the 3DMM fitting process. Rara et al. [28] propose a method that only needs 2D feature points from a single input image to reconstruct the 3D shape. This results in a much more efficient way of fitting the 3DMM onto a 2D image, lowering the computational load and increasing the accuracy of the process. Yang et al. [36] increase the accuracy of the 3DMM fitting process by not treating the landmarks equally, but assigning them weights based on the estimated errors between the corresponding 2D and 3D landmarks. To significantly decrease the computational load of the 3DMM fitting process, a CNN model is proposed by Feng et al. [13]. This model is capable of predicting the UV position map, a 2D representation that records the 3D shape of a complete face, from a single image. This removes the need for a model that contains prior knowledge of 3D faces. Their model surpasses state-of-the-art methods on both reconstruction and alignment tasks. Therefore, this model is used in this paper to extract the UV texture map from a single 2D image of a face.

2.3 Generative Adversarial Network

Generative Adversarial Networks (GANs) have been widely used in different image-related tasks, for example image generation, image editing and image-to-image translation. GANs are frameworks in which two models are trained simultaneously [15]. One model is the generator G : Z → Y, which maps random noise z ∈ Z to an output image y ∈ Y. The other model is the discriminator D : Y → [0, 1], which determines whether an example comes from the training data or from the generator; its output is the probability that the input image is ’real’ rather than ’fake’. The loss function used for training GANs is called the adversarial loss,


Source: https://www.freecodecamp.org/news/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394/

Figure 1: An illustration of the architecture of standard GANs. The generator takes random noise and generates a fake image that resembles the images in the training set. The discriminator tries to classify the real images as real and the fake images as fake.

which is defined as:

\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{y}[\log D(y)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]   (1)

The intuition behind this loss function is that the generator and the discriminator play a minimax game. The generator tries to fool the discriminator by generating images that are realistic enough, while the discriminator is continuously trained to accurately distinguish the real images from the fake images. The equation for this game is as follows:

G^* = \arg\min_G \max_D \mathcal{L}_{GAN}(G, D)   (2)

At test time, only the generator is used to generate fake images that closely resemble the training images. The input of the generator or discriminator can also be conditioned, which leads to Conditional Generative Adversarial Networks (CGANs). CGANs are almost identical to GANs, the difference being that the generator and the discriminator use additional conditional information combined with the original input [23]. Figure 1 illustrates the architecture of a standard GAN. The next sections explain some applications of GANs.
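To make the adversarial objective of Equations (1) and (2) concrete, the sketch below shows one alternating update step using the binary cross-entropy formulation. This is a minimal illustration, not code from the thesis: the generator `G`, discriminator `D` (assumed to end in a sigmoid) and their optimizers are placeholders.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z):
    """One alternating update of Eq. (1)-(2): D is pushed up, G is pushed down."""
    # --- Discriminator update: push D(real) -> 1 and D(G(z)) -> 0 ---
    d_real = D(real)
    d_fake = D(G(z).detach())                 # detach: no gradient flows into G here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- Generator update: try to fool D, i.e. push D(G(z)) -> 1 ---
    d_fake = D(G(z))
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```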

2.3.1 Pix2pix

Pix2pix is a CGAN proposed by Isola et al. [17] as a general approach to image-to-image translation problems. It is the first model that combines the adversarial loss with the standard pixel-level L1 loss, which addresses a problem of traditional image-to-image translation: the standard losses typically do not take the correlations between pixels into consideration [40], because they treat pixels as independent values. The L1 distance loss is defined as:

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\, \| y - G(x, z) \|_1 \,]   (3)

They use CGANs to map the input image x ∈ X, combined with random noise z, to the output image y ∈ Y, which is called image-to-image translation. The generator G : X × Z → Y generates the output image given an input image and random noise, and the discriminator determines whether the generated image is fake or real. The final objective that the model tries to optimize is:

G^* = \arg\min_G \max_D \mathcal{L}_{GAN}(G, D) + \lambda \mathcal{L}_{L1}(G)   (4)

The random noise is added to the generator to avoid a deterministic output. In their experiments, they found that the generator learned to ignore the random noise when mapping X to Y. To address this issue they added dropout to several layers of their generator, for both training and testing. The problem was not fully solved, but they noticed that the output of the generator became partially stochastic. In their paper, they use a ”U-Net” architecture, which adds skip connections between mirrored layers of the encoder and decoder of the generator. The reason for this is that the structure of the input is aligned with the structure of the output, which would get lost in a normal encoder-decoder architecture. A normal encoder-decoder architecture down-samples the information until a bottleneck layer is reached, at which point the process is reversed. By using the U-Net architecture the bottleneck is bypassed and the important low-level information that would otherwise get lost is maintained. Figure 2 illustrates the workings of a U-Net.


Figure 2: Difference between a standard encoder-decoder and the U-net where skip connections are made between different layers [17].
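As a concrete illustration of these skip connections, the following is a minimal U-Net-style encoder-decoder sketch in PyTorch. It is not the exact pix2pix (or thesis) generator; layer counts and channel widths are illustrative. Each decoder stage is concatenated with the activation of its mirrored encoder stage, so low-level spatial detail bypasses the bottleneck.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style generator: 3 encoder and 3 decoder stages with skips."""
    def __init__(self, ch=3, nf=64):
        super().__init__()
        conv = lambda i, o: nn.Sequential(nn.Conv2d(i, o, 4, 2, 1), nn.LeakyReLU(0.2))
        deconv = lambda i, o: nn.Sequential(nn.ConvTranspose2d(i, o, 4, 2, 1), nn.ReLU())
        self.e1, self.e2, self.e3 = conv(ch, nf), conv(nf, nf * 2), conv(nf * 2, nf * 4)
        self.d3 = deconv(nf * 4, nf * 2)
        self.d2 = deconv(nf * 4, nf)                        # input = d3 output + skip from e2
        self.d1 = nn.ConvTranspose2d(nf * 2, ch, 4, 2, 1)   # input = d2 output + skip from e1

    def forward(self, x):
        s1 = self.e1(x)                                     # 1/2 resolution
        s2 = self.e2(s1)                                    # 1/4 resolution
        s3 = self.e3(s2)                                    # 1/8 resolution (bottleneck)
        y = self.d3(s3)
        y = self.d2(torch.cat([y, s2], dim=1))              # skip from mirrored encoder layer
        return torch.tanh(self.d1(torch.cat([y, s1], dim=1)))
```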

They show that their algorithm is applicable to a wide variety of settings, because the loss function adapts to the problem at hand. Figure 3 shows a few examples of those applications. Since this model is applicable to a wide variety of image-to-image translation settings, it serves as the basis for the proposed model, which generates complete UV texture maps from a given UV texture map that contains occlusions.

Source: https://github.com/phillipi/pix2pix/blob/master/imgs/examples.jpg

Figure 3: Some examples of applications that use pix2pix.

2.3.2 UV-GAN

In a work by Deng et al. [9], an adversarial UV completion framework called UV-GAN is proposed. Their framework enables pose-invariant face recognition. They fit a 3DMM onto a 2D image of a face to extract the incomplete UV texture map, which contains self-occluded regions. Three parametric models need to be solved: shape S, texture T and camera W, defined as:

S(p) = \bar{s} + U_s\, p,
T(\lambda) = \bar{t} + U_t\, \lambda,   (5)
\mathcal{W}(p, c) = P(S(p), c),

where p, λ and c are the parameters that need to be optimized for the shape, texture and camera models, U_s and U_t are the eigenbases of the shape and texture models, and \bar{s} and \bar{t} are the means of the shape and texture models. The function P is a perspective camera transformation. The overall cost function for 3DMM fitting is then defined as follows:

\arg\min_{p,\lambda,c} \| F(\mathcal{W}(p, c)) - T(\lambda) \|^2 + \alpha_s \|p\|^2_{\sigma_s^{-1}} + \alpha_t \|\lambda\|^2_{\sigma_t^{-1}}   (6)

The terms \|p\|^2_{\sigma_s^{-1}} and \|\lambda\|^2_{\sigma_t^{-1}} are regularisation terms that prevent over-fitting of the model. Here \sigma_s^{-1} and \sigma_t^{-1} are diagonal matrices whose diagonals contain the eigenvalues of the shape and texture models, and the constants \alpha_s and \alpha_t balance the regularisation terms. The term F(\mathcal{W}(p, c)) stands for the operation of sampling the feature-based input image at the projected 2D locations. To accelerate the 3DMM fitting process, they start the fitting with predicted 3D landmarks and add a 2D landmark term to the final objective function. This changes the complete objective function to:

\arg\min_{p,\lambda,c} \| F(\mathcal{W}(p, c)) - T(\lambda) \|^2 + \alpha_l \| \mathcal{W}(p, c) - s_l \|^2 + \alpha_s \|p\|^2_{\sigma_s^{-1}} + \alpha_t \|\lambda\|^2_{\sigma_t^{-1}}   (7)

In this equation, s_l is the 2D landmark shape and \alpha_l, like the other \alpha values, balances the landmark term. This function is optimized with their so-called Gauss-Newton optimisation framework.
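To make the linear 3DMM of Equation (5) concrete, the snippet below evaluates the shape and texture models for given parameter vectors. The basis matrices, means and their dimensions are hypothetical placeholders for illustration only; they are not actual 3DMM data.

```python
import numpy as np

n_vertices, n_shape, n_tex = 53215, 199, 199    # hypothetical model sizes
s_mean = np.zeros(3 * n_vertices)               # mean shape (x, y, z per vertex)
t_mean = np.zeros(3 * n_vertices)               # mean texture (r, g, b per vertex)
U_s = np.random.randn(3 * n_vertices, n_shape)  # shape eigenbasis (placeholder values)
U_t = np.random.randn(3 * n_vertices, n_tex)    # texture eigenbasis (placeholder values)

def shape_model(p):
    """S(p) = s_mean + U_s p, cf. Eq. (5)."""
    return s_mean + U_s @ p

def texture_model(lam):
    """T(lambda) = t_mean + U_t lambda, cf. Eq. (5)."""
    return t_mean + U_t @ lam

# zero parameters give the mean face; reshape to (n_vertices, 3) vertex coordinates
mean_face_vertices = shape_model(np.zeros(n_shape)).reshape(-1, 3)
```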

Their model for UV texture completion consists of 1 generator, 2 discriminators and an identity model. The generator tries to generate the complete UV texture map based on an incomplete UV texture map of a face. They use the pixel-wise reconstruction loss for the generator, which is the average of the sum of the absolute differences between the generated image and the real image:

\mathcal{L}_{gen} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} |Y_{i,j} - Y^*_{i,j}|   (8)

The final objective that the discriminators need to optimize is defined as:

\mathcal{L}_{adv} = \min_G \max_D \; \mathbb{E}_{x \sim p_d(x),\, y \sim p_d(y)}[\log D(x, y)] + \mathbb{E}_{x \sim p_d(x),\, z \sim p_d(z)}[\log(1 - D(x, G(x, z)))]   (9)

To complete the UV texture map, local and global adversarial networks are combined as the discriminators of their model, which learn to judge the generated complete UV texture map. They use a global discriminator to determine the faithfulness of the entire UV texture map. A local discriminator is used because the inner part of a face contains much more information about the identity. They also use an identity model, a ResNet-27 that is pre-trained on the CASIA dataset with the softmax loss to classify 10k identities. They exploit the centre loss to improve the ability of the model to preserve the identity, which is defined as follows:

\mathcal{L}_{id} = \frac{1}{m} \sum_{i=1}^{m} \| x_i - c_{y_i} \|_2^2   (10)

Here, m is the batch size, x_i \in \mathbb{R}^{512} are the embedding features and c_{y_i} \in \mathbb{R}^{512} is the class feature centre. The final objective function for the UV-GAN model is defined as:

\mathcal{L} = \mathcal{L}_{gen} + \lambda_1 \mathcal{L}_{adv,g} + \lambda_2 \mathcal{L}_{adv,l} + \lambda_3 \mathcal{L}_{id},   (11)

where \mathcal{L}_{adv,g} is the loss of the global discriminator, \mathcal{L}_{adv,l} is the loss of the local discriminator, and \lambda_1, \lambda_2 and \lambda_3 are the weights that balance the losses. This results in preserving the most prominent features present in a face. They then combine the completed UV texture map with the fitted mesh and generate synthetic data with arbitrary poses; examples of their results are shown in Figure 4. In this paper, the ideas of the global and local adversarial discriminators and an identity preserving model are applied to generate complete UV texture maps that are detailed enough for age estimation.
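The centre loss of Equation (10) can be written in a few lines of PyTorch. This is a generic sketch rather than the UV-GAN authors' code; the table of class centres is assumed to be a learnable parameter trained alongside the network.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """L_id = (1/m) * sum_i ||x_i - c_{y_i}||^2, cf. Eq. (10)."""
    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        # one learnable 512-dimensional centre per identity class
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # features: (m, feat_dim) embeddings; labels: (m,) class indices
        centers_batch = self.centers[labels]               # gather c_{y_i} per sample
        return ((features - centers_batch) ** 2).sum(dim=1).mean()
```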


3 Method

In this section, the different steps needed to create the proposed model are discussed. First, the process that extracts the UV texture map from a single 2D image of a face is explained. Secondly, the way the data is augmented to create images from the UV texture map that resemble different poses of a face is described. Finally, the proposed model that completes the different UV texture maps into a full frontal face is described in detail.

3.1 UV texture map extraction

In order to generate a frontal view of a face from an image of a face in a non-frontal view, a UV texture map has to be extracted. This way, the model can learn to fill in the occlusions present in the image to create a complete UV texture map, which corresponds to a frontal view of the face. It is important to fill in the missing parts of the UV texture map, because this increases the amount of information about the age of a face, which can result in a better age prediction. The UV texture map extraction is performed with a small part of the implementation of PRNet [13]. This model builds on 3D Morphable Model (3DMM) fitting, which reconstructs a 3D facial model from a single 2D image, including information such as pose and illumination [36]. The 3DMM parameters are used to compute the correspondence between the 3D and 2D landmarks, and the face is reconstructed with an analysis-by-synthesis framework. The landmark detection is shown in Figure 5. Subsequently, to minimize the difference between the input image and the synthesized image, the Gauss-Newton optimization technique is applied. Some examples of these reconstructed 3D models are shown in Figure 6.

Figure 5: Visualization of landmark detection on different faces [13]

Figure 6: Examples of the reconstructed 3D model [13]

They train an encoder-decoder model that is capable of generating the UV position map from an RGB image. This UV position map records the position information of a 3D face and provides a dense correspondence to the semantic meaning of each point in the UV space [13]. The UV position map can be used to extract the UV texture map with a technique called z-buffering. Through z-buffering, the visible vertices can be computed, which are used to generate the UV texture map [9]. Examples of the UV texture map and UV position map are shown in Figure 7. To train this model they use a dataset called 300W-LP, which contains 60K unconstrained images of faces with their corresponding fitted 3DMM parameters. They apply the 3DMM fitting process mentioned previously, using these 3DMM parameters to generate the corresponding 3D position points and render these points into UV space to create the ground truth UV position map. In this way, the model learns to predict the UV position map without needing the 3DMM parameters, thereby bypassing 3DMM fitting, which is a very time-consuming process. The loss function optimized by their model is the Mean Squared Error (MSE). However, this function treats every pixel equally; therefore, they add segmentation information to focus on the most important parts of the face, which are the eye, nose and mouth regions. Their final loss function used to train their model is as follows:

\mathcal{L} = \sum \| Pos(u, v) - Pos^*(u, v) \| \cdot W(u, v)   (12)

Here, Pos(u, v) is the ground truth position map, Pos^*(u, v) is the predicted position map, W(u, v) is the weight mask shown in Figure 7, and u and v are the UV coordinates.

Figure 7: From left to right: UV texture map, UV position map, UV texture map including the segmentation information, weight mask used for the loss function [13].

The trained model is used for the extraction of the UV texture map. The extraction process works as follows. First, the input image is re-scaled to 256x256, because their model is trained on cropped images of this format. Subsequently, the method checks whether a face is present with the dlib CNN face detection model. This CNN is trained on a large dataset of faces and is applied as a 50x50 sliding window over the complete image to decide whether an object of interest, in this case a face, is present. This face detection model has been shown to accurately detect faces in images taken at odd angles. After the face detection, the PRNet model predicts the UV position map. Finally, this UV position map is used to construct the UV texture map from the aforementioned visible vertices. In this UV texture map, occlusions that are present in the image show up as black regions. Some examples of extracted UV texture maps are shown in Figure 8.

Figure 8: Examples of extracted UV texture maps (right) with their corresponding input image (left)
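The extraction pipeline described above could be sketched roughly as follows. The dlib detector and cv2.remap calls are real library functions, but the PRNet interface (`predict_uv_position_map`) is a hypothetical placeholder for the position-map predictor, and the sampling step is a simplified stand-in for the z-buffered texture extraction.

```python
import cv2
import numpy as np
import dlib

# dlib's CNN face detector (model file ships with dlib's example models)
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def extract_uv_texture(image_bgr, predict_uv_position_map):
    """Resize -> detect face -> predict UV position map -> sample the image."""
    img = cv2.resize(image_bgr, (256, 256))            # PRNet expects 256x256 crops
    faces = detector(cv2.cvtColor(img, cv2.COLOR_BGR2RGB), 1)
    if len(faces) == 0:
        return None                                    # no face found, skip image

    pos = predict_uv_position_map(img)                 # assumed (256, 256, 3) float map
    map_x = pos[:, :, 0].astype(np.float32)            # projected 2D x locations
    map_y = pos[:, :, 1].astype(np.float32)            # projected 2D y locations
    # Sample the input image at the projected locations of every UV point;
    # regions that are not visible end up black in this simplified sketch.
    texture = cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR,
                        borderMode=cv2.BORDER_CONSTANT, borderValue=0)
    return texture
```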

3.2 Data augmentation

To create a model that is capable of completing the incomplete UV texture map, data has to be available that contains an image of a full frontal face without occlusions, an image of the same face from a different angle, and a label with the age of the person. An example of the required data is shown in Figure 9. Unfortunately, no dataset containing all these attributes was available online, mostly because ages were not available for images showing different view angles of the same face. Therefore, a method has been implemented that imitates the process of creating different angles of a face on datasets that contain age labels and frontal faces.

Source: https://www.shutterstock.com/nl/search/face+angle

Figure 9: Example of the visual data that is needed for the proposed model, without the age of the person.

First, the UV texture maps are extracted from a dataset of face images with the previously mentioned PRNet model, selecting images taken from a full frontal view of the face. This selection uses the pose information available in the UV position map generated by the PRNet model, keeping faces with a pose between 0 and 10 degrees. Secondly, any image that contains some form of occlusion in the texture map is removed; occlusions are detected by searching for black spots in the mouth, eye and nose regions of the face. If such black spots are present, the image is removed from the training dataset. Thirdly, these complete UV texture maps are used as ground truth during the training phase, so that the model can learn to generate the complete UV texture map, which is the full frontal face. Finally, one half of the face is chosen at random and black spots of random sizes are placed onto the UV texture map; the number of black spots is also picked at random. This results in a wide variety of occlusions, so that the model can learn to fill in the face for a wide range of poses and is capable of filling in any occlusion that could be present in a face. Some examples of the results of this method are shown in Figure 10. The resulting texture maps are similar to the texture maps shown in Figure 8, where different occlusions are present in a face, ranging from a total side view to a small occlusion caused by a slight change in pose.

Figure 10: Examples of creating the data. The face on the left is the ground truth and the right face is the input for the generator.
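A small NumPy sketch of this augmentation step is given below. The number and size ranges of the black spots are illustrative values chosen for the sketch, not the exact parameters used in the thesis.

```python
import numpy as np

rng = np.random.default_rng()

def augment_uv_texture(uv_texture):
    """Imitate pose-induced self-occlusion: black out random patches on one half.

    uv_texture: (H, W, 3) uint8 complete UV texture map (the ground truth).
    Returns the occluded copy that is used as input for the generator.
    """
    occluded = uv_texture.copy()
    h, w = occluded.shape[:2]
    half = rng.integers(0, 2)                       # 0 = left half, 1 = right half
    x_lo, x_hi = (0, w // 2) if half == 0 else (w // 2, w)
    for _ in range(rng.integers(3, 12)):            # random number of black spots
        size = int(rng.integers(8, 48))             # random spot size
        x = int(rng.integers(x_lo, max(x_lo + 1, x_hi - size)))
        y = int(rng.integers(0, max(1, h - size)))
        occluded[y:y + size, x:x + size] = 0        # black spot = simulated occlusion
    return occluded
```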

3.3 UV texture map completion

To create the proposed pose robust age estimator, a frontal face has to be recreated. To achieve this, the image-to-image translation model pix2pix has been implemented and modified. To recap, pix2pix is a CGAN that is capable of accurately mapping one image to another. The original pix2pix model consists of 1 generator and 1 discriminator.


Figure 11: Network architecture. It consists of 1 generator, 3 discriminators and an identity preserving module. The generator takes the concatenated flipped input filled with random noise and returns a generated complete UV texture map. The generated image is compared with the ground truth on a global and a local level to validate the authenticity of the generated UV texture map. The fixed identity model is used to preserve the identity of the face. The age is predicted on the generated face. Only the generator and the age discriminator are used for testing.

In this model, the generator tries to create an image similar to the ground truth, and the discriminator tries to estimate whether the generated image is real or fake. The proposed model is based on the idea of Deng et al. [9] and consists of 1 generator, 3 discriminators and an identity preserving module. Figure 11 shows a visualization of the design of the proposed model.

Generator: The black spots in the incomplete UV texture map are filled with random noise, so that the generator can make a clear distinction between the parts that need to be filled in and the parts that should stay the same. Subsequently, the mirrored image of the face is concatenated to the original image, to provide more information to the generator. The generator takes this incomplete UV texture map as input and tries to generate a complete UV texture map that is similar to the complete UV texture map used as ground truth. To improve the quality of the results, skip connections are made between the encoder and decoder, which is called a U-Net [18]; these skip connections allow more information to be passed on between different layers. The loss that needs to be optimized by the generator is the pixel-wise reconstruction loss:

\mathcal{L}_{gen} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} |I_{i,j} - I^*_{i,j}|,   (13)

where W and H are the width and height of the complete UV texture map, I_{i,j} is the ground truth complete UV texture map, and I^*_{i,j} is the complete UV texture map generated by the generator model. The pixel-wise reconstruction loss is used to obtain a detailed comparison between the generated image and the ground truth. This causes the generator to create an image that is as close to the ground truth as possible, which is the desired outcome.

Global discriminator: This discriminator predicts whether the image given as input is the complete UV texture map from the training data or the generated complete UV texture map. The discriminator makes this prediction for both the complete UV texture map used as ground truth and the generated complete UV texture map. It is called the global discriminator because it uses the full 256x256 image as its input; therefore, the faithfulness of the entire UV texture map is determined. The loss that has to be optimized by this discriminator is computed as follows:

\mathcal{L}_{D_{real/fake}}(x, y) = -\frac{1}{N} \sum_{n=1}^{N} [\, y_n \cdot \log x_n + (1 - y_n) \cdot \log(1 - x_n) \,]
\mathcal{L}_{D_{global}} = \tfrac{1}{2}\mathcal{L}_{D_{real}} + \tfrac{1}{2}\mathcal{L}_{D_{fake}},   (14)

where \mathcal{L}_{D_{real}} and \mathcal{L}_{D_{fake}} are computed as the Binary Cross Entropy loss between the label y and the output x of the discriminator, for the N samples present in a batch. The label y is 0 for the generated complete UV map (fake) and 1 for the real complete UV map (real). The output x of the discriminator is a value between 0 and 1 that represents the probability that the input UV texture map is the real complete UV texture map. \mathcal{L}_{D_{global}} is the weighted sum of the losses on the real and the generated complete UV texture map. The Binary Cross Entropy is used because it is proven to work well on classification problems that use an activation function that models probabilities (i.e. a sigmoid) in the output layer.

Local discriminator: This discriminator is called a local discriminator because it only determines the faithfulness of the eye, mouth and nose regions of the complete UV texture maps. A local discriminator is used because the central face region contains more identity information than the other parts of the face. This results in a more detailed generated complete UV texture map. The loss of this discriminator is computed in the same manner as for the global discriminator:

\mathcal{L}_{D_{local}} = \tfrac{1}{2}\mathcal{L}_{D_{real}} + \tfrac{1}{2}\mathcal{L}_{D_{fake}},   (15)

where \mathcal{L}_{D_{real}} and \mathcal{L}_{D_{fake}} are computed as in Equation 14, i.e. with the Binary Cross Entropy loss.

Age discriminator: The age discriminator determines the age of the generated image. The loss is only computed for the generated image, to force the network to learn to generate images that are suitable for age estimation. The age discriminator consists of a pre-trained VGG-16 model [31], in which the last fully connected layer is used for classification of the 101 classes (0-100) and the remaining layers are used for feature extraction [25]. VGG-16 is a Convolutional Neural Network (CNN) that is trained for image classification on the ImageNet dataset. In this way, the model benefits from the representation that is learned to discriminate objects in images. The age discriminator only starts training after 20 epochs, so that it learns to predict the age from generated images that already contain more valuable information; at the start of the training process the generated images are very blurry and contain a lot of noise. The loss of this discriminator is the Cross Entropy loss between the output matrix of the discriminator and the real age, called \mathcal{L}_{age}, and is computed as follows:

\mathcal{L}_{age}(x, y) = \frac{1}{N} \sum_{n=1}^{N} -\log\left( \frac{\exp(x_n[y_n])}{\sum_{j=0}^{J} \exp(x_n[j])} \right) = \frac{1}{N} \sum_{n=1}^{N} \left( -x_n[y_n] + \log \sum_{j=0}^{J} \exp(x_n[j]) \right),   (16)

where x is the probability output matrix of the discriminator and y is the real age of the face. The output matrix has a size of N x 101, where N is the size of the batch. This matrix contains the probabilities of all the ages. The variable J indexes the possible classes, which are the age labels, and takes values in [0, 100]. To increase the accuracy of the prediction of the age discriminator, a sliding window of size 5x1 is moved over the 101 classes to find the group with the highest probability. Finally, the centre of the window is used as the age prediction [39].
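The 5-wide sliding-window decoding described above can be implemented as a moving sum over the 101 class probabilities. A small NumPy sketch, assuming `probs` is the softmax output for one face:

```python
import numpy as np

def predict_age(probs, window=5):
    """Slide a `window`-wide sum over the age probabilities and return the
    centre of the highest-scoring window (cf. the 5x1 slider described above)."""
    probs = np.asarray(probs, dtype=np.float64)                  # shape (101,)
    scores = np.convolve(probs, np.ones(window), mode="valid")   # sum of each window
    start = int(np.argmax(scores))                               # best window start index
    return start + window // 2                                   # centre of that window

# usage: a probability peak around age 30 yields a prediction of 30
p = np.zeros(101)
p[28:33] = [0.1, 0.2, 0.4, 0.2, 0.1]
print(predict_age(p))   # -> 30
```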

Identity module: An identity module (VGG-16 [31]) is used, where the last fully connected layer is removed and the output of the second-to-last fully connected layer is used as a feature extractor. This ensures that the frontal results are as close to the ground truth as possible on the feature level [7]. It is called the identity module because its purpose is to preserve the identity of the face in the generated complete UV texture map. The loss for the identity module is computed as follows:

\mathcal{L}_{id} = \| \phi(G_{fake}) - \phi(G_{real}) \|_2^2   (17)

This loss function is used because it recognizes whether the same features are present in both images, meaning that it is capable of preserving the identity of a face. This particular model is pre-trained and is only used in the final loss, meaning it is not trained any further during the training phase.

Total loss: To compute the total loss that has to be optimized by the complete model, the weighted sum of all the losses is taken, resulting in the following equation:

\mathcal{L}_{total} = \mathcal{L}_{gen} + \lambda_1 \mathcal{L}_{D_{global}} + \lambda_2 \mathcal{L}_{D_{local}} + \lambda_3 \mathcal{L}_{age} + \lambda_4 \mathcal{L}_{id}   (18)
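A minimal sketch of how the weighted total loss of Equation (18) could be assembled for one generator step is shown below. The module and function names (`D_global`, `D_local`, `age_net`, `phi`, `crop_inner`) are illustrative placeholders, not the thesis code, and the adversarial terms are written from the generator's point of view (pushing both discriminators' outputs on the generated map towards "real"). The identity term uses a mean-squared version of Equation (17).

```python
import torch
import torch.nn.functional as F

# loss weights lambda_1..lambda_4 as reported in Section 5.1
LAM_GLOBAL, LAM_LOCAL, LAM_AGE, LAM_ID = 0.01, 0.04, 0.01, 0.001

def generator_total_loss(fake_uv, real_uv, age_labels,
                         D_global, D_local, age_net, phi, crop_inner):
    """Assemble Eq. (18); age_net is assumed to return logits over 101 classes,
    phi to return identity features from the fixed identity module."""
    l_gen = F.l1_loss(fake_uv, real_uv)                          # Eq. (13)

    d_g = D_global(fake_uv)                                      # global realism
    l_d_global = F.binary_cross_entropy(d_g, torch.ones_like(d_g))
    d_l = D_local(crop_inner(fake_uv))                           # eye/nose/mouth crop
    l_d_local = F.binary_cross_entropy(d_l, torch.ones_like(d_l))

    l_age = F.cross_entropy(age_net(fake_uv), age_labels)        # Eq. (16)
    l_id = F.mse_loss(phi(fake_uv), phi(real_uv).detach())       # Eq. (17), fixed phi

    return (l_gen + LAM_GLOBAL * l_d_global + LAM_LOCAL * l_d_local
            + LAM_AGE * l_age + LAM_ID * l_id)
```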


4 Datasets

In this section, three different datasets are described. Firstly, the WIKI dataset, which is used for the training and testing of the proposed model. Secondly, the APPA-REAL dataset, which is used solely for testing purposes. Lastly, the AgeDB dataset, which is also only used for testing.

4.1 WIKI dataset

The WIKI dataset is the largest publicly available dataset of face images with gender and age labels [30]. However, since the original dataset contains many images without faces or with wrong labels, with ages ranging from ”-33” to ”2014”, a cleaned dataset is used. All wrongly labeled images are removed automatically, so that the ages range from 0 to 100 years. The images that do not contain faces are removed with the previously mentioned face detection algorithm provided by dlib. This results in a clean dataset that only contains faces with labels between 0 and 100 years, around 35K images in total. Within these 35K images, only the images for which a complete UV texture map can be extracted are used for training the pix2pix model, including the age discriminator. In total, this results in around 10K images for training and 25K images for testing. Because the number of collected images is low, the age distribution of the dataset is unbalanced, as can be seen in Figure 13. To prevent the model from over-fitting, a class balancing method was implemented, in which the data is split into the following age groups:

• 0-25

• 25-30

• 30-40

• 40-55

• 55+

After the class balancing the distribution is more even, as shown in Figure 14; every age group contains around 2K images. The data for each group is shuffled, and before a batch example is selected, an age group is chosen by a cyclic iterator. The example is then selected within the chosen age group by means of another cyclic iterator. The pose distribution of the dataset is shown in the first row of Table 1. As can be seen, this dataset contains a high variation of poses.

Figure 12: Examples of WIKI images [30]
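The class-balancing scheme described above (a cyclic iterator over age groups and another within each group) could be sketched as follows; `groups` is assumed to be a mapping from each age-group label to a list of sample indices.

```python
import itertools
import random

def balanced_index_stream(groups):
    """Yield sample indices by cycling over the age groups and cycling within
    each group, so that every group is drawn from equally often."""
    for indices in groups.values():
        random.shuffle(indices)                                   # shuffle each group once
    group_cycle = itertools.cycle(list(groups.keys()))            # cyclic iterator over groups
    within = {k: itertools.cycle(v) for k, v in groups.items()}   # cyclic iterator per group
    while True:
        yield next(within[next(group_cycle)])

# usage: build one balanced batch of 8 sample indices
groups = {"0-25": [0, 1, 2], "25-30": [3, 4], "30-40": [5], "40-55": [6, 7], "55+": [8]}
stream = balanced_index_stream(groups)
batch = [next(stream) for _ in range(8)]
```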

4.2 APPA-REAL dataset

The APPA-REAL dataset contains 7,591 images of faces with their real and apparent ages [12]. This dataset is used for the evaluation of the proposed model. Only the real age is used, because the proposed model is trained on the real ages of the faces and not the apparent ages. Both models are fine-tuned on a small set of images from the APPA-REAL dataset. However, only images with a frontal view were selected for fine-tuning, since it would have been impossible to fine-tune the proposed model otherwise. The test set contains 2000 images, which are used for the evaluation. This dataset is used as a benchmark for the proposed model. Most of the faces in this dataset have a frontal or near-frontal view, but it does contain some small pose variations, as can be seen in Table 1.

Figure 13: Age distribution of the WIKI dataset.

Figure 14: Age distribution of the WIKI dataset after class balancing.

Figure 15: Examples of APPA-REAL images [12]

4.3 AgeDB dataset

The AgeDB dataset is the first dataset of manually collected in-the-wild facial images [24]. The drawback of other datasets is that they are semi-automatically retrieved and annotated, which introduces a large amount of noise. By manually collecting the images, the dataset contains faces that are noise free and have accurate age labels. The dataset contains around 25,000 facial images and is used as a benchmark for in-the-wild age estimation algorithms. For 21K of these images it was possible to extract a UV texture map; 5K of them are used for fine-tuning the model on this dataset and the other 16K are used for evaluation. This dataset contains a significant amount of pose variation, as can be seen from Table 1.


Figure 16: Examples of images in the AgeDB dataset [24]

Dataset          0-20    20-40   40-60   60+
WIKI (25K)       16K     8K      1.5K    160
AgeDB (16K)      7K      2.8K    250     25
APPA-REAL (2K)   1.5K    330     60      10

Table 1: Pose distribution (in degrees) of the images in the test sets.


5 Experiments

In this section, the experimental setup is explained. First, the implementation details of the evaluated models are described. Secondly, the metrics used to evaluate these models are discussed.

5.1 Implementation details

UV Completion: To properly evaluate the quality of the complete UV texture maps generated by the proposed model, an ablation study has been performed. In this ablation study, the loss components shown in Equation 18 are added one after another, so that the total losses of the different models are defined as follows:

• \mathcal{L}_{total} = \mathcal{L}_{gen}

• \mathcal{L}_{total} = \mathcal{L}_{gen} + \lambda_3 \mathcal{L}_{age}

• \mathcal{L}_{total} = \mathcal{L}_{gen} + \lambda_3 \mathcal{L}_{age} + \lambda_1 \mathcal{L}_{D_{global}}

• \mathcal{L}_{total} = \mathcal{L}_{gen} + \lambda_3 \mathcal{L}_{age} + \lambda_1 \mathcal{L}_{D_{global}} + \lambda_2 \mathcal{L}_{D_{local}}

• \mathcal{L}_{total} = \mathcal{L}_{gen} + \lambda_3 \mathcal{L}_{age} + \lambda_1 \mathcal{L}_{D_{global}} + \lambda_2 \mathcal{L}_{D_{local}} + \lambda_4 \mathcal{L}_{id}

The models are trained on 10,000 face images from the WIKI dataset that are cut in half, and are tested on 5,000 images of the AgeDB dataset that are also cut in half. Examples of the input and ground truth for the different models are shown in Figure 17. The parameters for the models were the same: all networks were trained for 200 epochs, and the values for λ1, λ2, λ3 and λ4 are 0.01, 0.04, 0.01 and 0.001 respectively. Furthermore, the batch size is 8, the learning rate is 0.0002, the number of iterations over which the learning rate is linearly decayed to zero is 100, the number of generator and discriminator filters is 64, and the input size is 256x256. The models are trained on an NVIDIA TitanX GPU.
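The hyperparameters listed above can be collected in a small configuration. The linear decay schedule is written here in the usual pix2pix style (constant learning rate first, then a linear decay over the stated number of steps); this interpretation and the helper itself are a sketch, not code from the thesis.

```python
# hyperparameters from Section 5.1
config = dict(
    epochs=200, batch_size=8, lr=2e-4, decay_start=100,   # decay linearly to 0 after step 100
    ngf=64, ndf=64, input_size=256,
    lambda_global=0.01, lambda_local=0.04, lambda_age=0.01, lambda_id=0.001,
)

def lr_at(step, cfg=config):
    """Constant lr up to `decay_start`, then linear decay to zero at `epochs`."""
    if step < cfg["decay_start"]:
        return cfg["lr"]
    frac = (step - cfg["decay_start"]) / (cfg["epochs"] - cfg["decay_start"])
    return cfg["lr"] * max(0.0, 1.0 - frac)
```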

Age Estimation: In order to assess whether the performance of the proposed model is sufficient, the same model that is used as the age discriminator in the proposed model is trained on the original input images for age prediction, instead of on the UV texture maps. Examples of the original input images are shown in Figure 8. Both models are trained on 10K images of the WIKI dataset and are tested on the three datasets mentioned previously: WIKI, AgeDB and APPA-REAL. All images in the different test sets are exclusively unseen samples. The parameters of the models are the same as mentioned above.

Figure 17: Examples of the cropped images used for the ablation study. Left is the ground truth and right is the input.

5.2 Evaluation metrics

5.2.1 Quantitative research

UV Completion: To determine the quality of the generated images, two metrics are employed. The first metric is the peak signal-to-noise ratio (PSNR), which measures the difference in pixel values. PSNR takes values between 0 and 100, where a higher value indicates better quality. The equation for PSNR is defined as follows:

\mathrm{PSNR} = 10 \log_{10} \frac{(2^d - 1)^2 \, W H}{\sum_{i=1}^{W} \sum_{j=1}^{H} (p[i, j] - p'[i, j])^2}   (19)

Here, d is the bit depth of a pixel, W the image width, H the image height, and p[i, j] and p'[i, j] are the pixels at the i-th row and j-th column of the two compared images. The second metric is the structural similarity index (SSIM) [35]. This metric estimates the holistic similarity between two images. SSIM is a value between 0 and 1, where a higher value indicates a higher similarity between the two images. The equation for SSIM is defined as follows:

\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \quad
\mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i, \quad
\sigma_x = \left( \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \mu_x)^2 \right)^{\frac{1}{2}}, \quad
C_1 = (K_1 L)^2, \quad C_2 = (K_2 L)^2   (20)

Here, K_1 = 0.01, K_2 = 0.03, N is the total number of pixels, and L is the dynamic range of the pixel values.

Age Estimation: For both models, the mean absolute error (MAE) is computed on the three different datasets, as well as the MAE for different poses. The MAE in age estimation is the average of the absolute error between the estimated age and the real age:

\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - x_i|   (21)

Here, y_i is the predicted age, x_i is the real age, and N is the number of faces in the dataset. This metric gives a reliable comparison between the two models. If the performance of the proposed model is close to or better than the performance of the age estimation model trained on the original images, it can be assumed that it is possible to estimate age based on the generated frontal face of a 2D image. This metric is evaluated on different datasets, to evaluate how well the proposed model works on images in the wild.
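These three metrics are straightforward to compute with standard libraries; a small sketch is given below. It assumes uint8 images and a recent scikit-image (which exposes the `channel_axis` argument), and is meant as an illustration of Equations (19)-(21) rather than the exact evaluation code of the thesis.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(real, generated):
    """PSNR (Eq. 19) and SSIM (Eq. 20) for one pair of uint8 RGB images."""
    psnr = peak_signal_noise_ratio(real, generated, data_range=255)
    ssim = structural_similarity(real, generated, channel_axis=-1, data_range=255)
    return psnr, ssim

def mean_absolute_error(predicted_ages, real_ages):
    """MAE (Eq. 21) between predicted and ground-truth ages."""
    diff = np.abs(np.asarray(predicted_ages) - np.asarray(real_ages))
    return float(diff.mean())
```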

5.2.2 Qualitative results

For the qualitative evaluation of the generated results, an analytic approach is chosen. The results generated by the proposed model are evaluated manually, by looking at the worst and best cases of the model. The worst cases are defined as images where the difference between the predicted age and the real age is at least 15 years. The best cases are defined as images where the difference between the predicted age and the real age is at most 2 years. The pose distributions of the worst and best cases are also presented, which show whether the performance of the proposed model depends on the viewpoint of a face. This qualitative evaluation showcases the strengths and weaknesses of the proposed model. The analysis is done by the author.


6 Results

In this section, the results of the different experiments are shown and discussed. First, the quantitative results are presented, for both the UV texture completion part of the model and the age estimation part. Secondly, the qualitative results of the completed UV texture maps are described in detail.

6.1 Quantitative results

6.1.1 UV Completion

To evaluate the contribution of the different components of the model, an ablation study over the different losses is performed. The PSNR and SSIM are computed for all the different models in order to evaluate the difference in quality of the generated images. These results are shown in Table 2, which shows that the PSNR and SSIM decrease when more losses are added. This is because these two metrics favor smooth and blurry results; a higher value therefore does not imply that the image looks more realistic. The MAE does decrease when the global and local discriminators are added, which could mean that some extra information is added that helps the age prediction process. However, it increases when the identity module is added. The reason for this could be that the age model works well with blurrier images, because it extracts more generic features that way.

The last row in Table 2 shows the final PSNR and SSIM results of the proposed model: 61.46 and 0.73 respectively. This shows that there is room for improvement, but the results are not insignificant. This means that there is detailed enough information present in the generated complete UV map to be helpful for the age estimation process.

Losses                                           PSNR     SSIM    MAE
L_gen                                            62.003   0.751   8.4
L_gen + L_age                                    62.000   0.752   8.8
L_gen + L_age + L_D_global                       61.742   0.741   8.1
L_gen + L_age + L_D_global + L_D_local           61.553   0.734   8.2
L_gen + L_age + L_D_global + L_D_local + L_id    61.462   0.731   8.4

Table 2: Ablation study of the PSNR, SSIM and MAE values for the different losses of the proposed model.

6.1.2 Age Estimation

The proposed model is compared to the same age estimation model that is used as the age discriminator, but trained on the real input images instead of the UV texture maps. This comparison between the two models shows whether the proposed model actually adds extra information for the age estimation predictions. The model is also compared to other state-of-the-art models: DEX [29] and Residual DEX [1]. A problem with this comparison is that most of these implementations are not available online, which is why their MAE results were taken from their papers. This makes the comparison slightly unfair, because their models are trained on the full IMDB-WIKI dataset, which contains around 530K images, whereas the proposed model is trained on only 10K images. However, it still gives an indication of how the proposed model compares to the state of the art. Table 3 shows the MAE on the complete datasets, where the baseline is the state-of-the-art DEX model trained on the 10K ground truth images, and AGE-GAN is the proposed model.

Model APPA-REAL (1.9K) AgeDB (11K) WIKI (25K)

Baseline 12.8 8.93 9.43

AGE-GAN 8.73 8.16 7.51

DEX [29] 5.468 13.7 n/a

Res-DEX [1] 5.352 n/a n/a

Table 3: MAE of the models on different datasets.

As can be seen from the MAE results, the proposed method outperforms the age estimation model trained on the real images by a significant margin on both the WIKI dataset and the APPA-REAL dataset.


Figure 18: Examples of extracted UV texture maps (left) with their corresponding generated UV texture maps (right), for cases where the difference between the real age and the predicted age is at least 15 years.

Figure 19: Examples of extracted UV texture maps (left) with their corresponding generated UV texture maps (right), for cases where the difference between the real age and the predicted age is at most 2 years.


The MAE values of the baseline and the proposed model on the AgeDB dataset are fairly close to one another. This is probably because the AgeDB dataset contains mostly frontal-view images of the face compared to the other two datasets, as can be seen from Table 1. The difference between the proposed model and the two other state-of-the-art models is significant on the APPA-REAL dataset, but the proposed model is still fairly close to the Baseline model, which is essentially the DEX model trained on fewer images. This suggests that increasing the training set of the proposed model could significantly improve the MAE results. The same holds for the Residual DEX model: if the proposed model used the residual model as well, the results could improve slightly.

Age estimation on different poses: In order to assess how well the model performs on different poses, the two models are compared on the following pose ranges (in degrees):

• 0-20

• 20-40

• 40-60

• 60+

By comparing the proposed model to the baseline age estimation model on various poses, it becomes apparent whether the model is robust to poses. The evaluation on different poses is carried out on all three test sets to get a general view of the performance of the model. Tables 4, 5 and 6 show the Mean Absolute Error (MAE) per pose for the baseline and the proposed model on the three datasets.

As can be seen from these results, the difference between the MAE of the two models increases when more occlusion is present in the face. This means that the proposed model works significantly better for faces with more occlusion than the baseline age estimation model, which is to be expected. In the pose results for APPA-REAL and WIKI, it is notable that the MAE does not differ between the pose classes 0-20 and 20-40, while there is a significant change on the AgeDB dataset. This could be due to the fact that the poses of the faces were all close to 20 degrees for the WIKI and APPA-REAL datasets, which was not the case for AgeDB. For the proposed model, the MAE values for the different poses are fairly close to each other, which could indicate that the proposed model is close to being pose invariant. The larger differences in the APPA-REAL results could be caused by the small number of images for the higher-degree poses, seeing as the difference decreases when the number of images increases. It could also be that the faces at higher degrees belonged to very old or very young people, resulting in a large difference between the predicted age and the real age. Figures 20 and 21 show the pose distributions of the worst and best cases. There is no significant difference between the distributions, which could mean that the performance of the model does not depend on the pose, i.e. that the proposed model is robust to poses.

APPA-REAL (2K)

Model      0-20 (1.5K)   20-40 (330)   40-60 (60)   60+ (10)
Baseline   12.60         12.52         17.41        22.00
AGE-GAN    8.63          8.08          9.33         17.44

Table 4: MAE of the models on the APPA-REAL dataset for different poses

AgeDB (11K)

Model      0-20 (7K)     20-40 (2.8K)  40-60 (250)  60+ (25)
Baseline   8.78          9.03          11.92        14.30
AGE-GAN    7.96          8.48          9.68         9.42

Table 5: MAE of the models on the AgeDB dataset for different poses

5-fold Cross-validation: Another form of quantitative evaluation performed on the model is 5-fold cross-validation. This approach is carried out to assess how well the proposed model generalizes to the problem at hand. The WIKI dataset is randomly split into five partitions; the model is tested on one partition and trained on the other four. This process of training and testing is repeated until the model has been tested on all five partitions.


WIKI (25K)

Model      0-20 (16K)    20-40 (8K)    40-60 (1.5K)  60+ (160)
Baseline   9.25          9.38          10.24         9.00
AGE-GAN    7.65          7.29          7.43          6.92

Table 6: MAE of the models on the WIKI dataset for different poses

Figure 20: The pose distribution of the worst cases from the test results

Figure 21: The pose distribution of the best cases from the test results

The MAE is computed for every fold, and the average and standard deviation over the five folds are reported. With a higher number of partitions, the evaluation is less prone to selection bias and the certainty about the robustness of the model increases. The models for the five folds are referred to as P1-P5.
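A hedged sketch of this protocol is shown below; train_age_gan and predict_ages are placeholders standing in for the actual training and inference code of the proposed model, not functions defined in this thesis:

import numpy as np
from sklearn.model_selection import KFold

def cross_validate(images, ages, n_splits=5, seed=0):
    # Split the data into n_splits folds, train on four folds and test on the fifth.
    fold_mae = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(images):
        model = train_age_gan(images[train_idx], ages[train_idx])   # placeholder
        preds = predict_ages(model, images[test_idx])               # placeholder
        fold_mae.append(np.abs(preds - ages[test_idx]).mean())
    return float(np.mean(fold_mae)), float(np.std(fold_mae))

The returned mean and standard deviation correspond to the Average and Std rows reported in Table 7.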

The results in Table 7 show that the model generalizes quite well to unseen data. The MAE results are fairly close to each other for every partition, which means that the model gives an evenly accurate prediction on unseen data and did not overfit to the given data. When compared to the results in Table 3, it is visible that the results improve when the training data is in the same format as the test data: the cross-validation is carried out on a subset of the original training data, meaning that the test data is augmented in the same way as the training data.

Model     MAE
P1        6.73
P2        6.55
P3        6.53
P4        6.78
P5        6.44
Average   6.61
Std       0.13

Table 7: MAE results for the 5-fold cross-validation

Age group evaluation: To evaluate the performance of the model on different age groups, the MAE for every age group is computed. The age groups cover the ages between 0 and 100 years, where every group spans 5 years. The results of this experiment are shown in Table 8. The age distributions of the worst and best cases are also computed, which show for which age groups the proposed model performs well and for which it performs poorly. These results are shown in Figures 22 and 23.
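A short sketch of how these statistics could be produced is given below, reusing the mae_per_bin helper and the illustrative arrays from the earlier sketch; the 10% fraction used to define the best and worst cases is an assumption, not the exact setting of this thesis:

import numpy as np

abs_err = np.abs(pred_ages - real_ages)
order = np.argsort(abs_err)                 # indices from smallest to largest error
k = max(1, int(0.1 * len(abs_err)))         # assumed fraction of best/worst cases
best_cases, worst_cases = order[:k], order[-k:]

# Per-age-group MAE with 5-year bins between 0 and 100, binned on the real age.
age_edges = list(range(0, 105, 5))
mae_per_age_group = mae_per_bin(real_ages, pred_ages, real_ages, age_edges)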

In Figure 22, the age distribution of the worst cases in the test results is shown. Most of the faces that were wrongly labeled were very old or very young. This is likely a data problem: the unbalanced age distribution of the training dataset, shown in Figure 13, indicates that there were few images of old or young faces. As a result, the model had little reference material for faces that look very old or very young, making it hard to predict the age in those cases. In Figure 23, the age distribution of the best cases is shown. Most of the best cases come from middle-aged faces, which can be explained by the same data imbalance: since most of the data contains middle-aged faces, the model has enough reference material to make an accurate prediction for them. Table 8 supports the data-problem explanation, because the MAE increases for very young and very old faces for both the baseline and the proposed model. The proposed model does outperform the baseline for almost every age group. To address this dataset issue, a more balanced dataset is needed, with a large number of facial images for every age between 0 and 100. This could result in a model that gives an accurate prediction for every age.

Figure 22: The age distribution of the worst cases from the test results

Figure 23: The age distribution of the best cases from the test results

Age group      AGE-GAN   Baseline
0-5 (2)        34.7      23.0
5-10 (23)      15.76     19.6
10-15 (140)    13.2      14.0
15-20 (1.3K)   8.4       8.9
20-25 (3.6K)   5.2       5.8
25-30 (5.1K)   4.4       4.4
30-35 (3K)     5.1       4.6
35-40 (2.1K)   6.6       6.0
40-45 (1.9K)   7.4       7.7
45-50 (1.7K)   8.0       9.3
50-55 (1.5K)   7.4       11.0
55-60 (1.3K)   8.6       13.6
60-65 (1.1K)   10.5      16.9
65-70 (901)    10.8      19.6
70-75 (630)    12.1      23.4
75-80 (446)    15.6      28.06
80-85 (288)    19.0      30.0
85-90 (162)    24.2      36.6
90-95 (86)     26.1      38.1
95-100 (23)    39.0      50.9

Table 8: The MAE results of the proposed model (AGE-GAN) and the baseline model (VGG16) for different age groups

6.2 Qualitative Results

Figures 18 and 19 show examples of the best and worst cases from the test results. Figure 18 shows the worst cases, where the predicted age is very inaccurate. Most of the worst cases are images with a significant difference between the generated and non-generated parts of the face. This difference could introduce features that make the person appear younger or older. The results also contain a significant number of blurry images, which could lead to less accurate feature extraction by the age discriminator and, in turn, to an inaccurate age prediction. Because the training data is augmented to replicate different poses of a face, the test and training images differ slightly. One issue that arises from this difference is that the model sometimes creates artifacts at the bottom of the image, which does not happen during training. This problem could be fixed by using video data of faces, so that a full frontal face and different poses of the same face become available; this would result in a more accurate generation of the complete UV map. However, these artifacts are not a problem for practical use, as they can easily be blackened out with a mask.

Figure 24 shows examples of the generated results of the ablation study for the proposed model. The generated image starts off very blurry and increasingly becomes more similar to the ground truth. Adding the global and local discriminator losses to the final loss makes the faces look more realistic. The final addition of the identity module results in a bit more detail in the face, especially around the eye region, which gives an even more realistic face. The final face closely resembles the ground truth, but differences are still noticeable. To further improve the quality of the results, additional loss components could be added that train the model to resemble the ground truth even more closely, such as components that focus on specific parts of the face, for example a discriminator that only compares the eye region or the mouth region. Furthermore, the visible difference between the generated and real parts of the complete UV map could potentially be reduced by adding a post-processing technique such as smoothing, which would make the boundary between the generated and real parts less distinguishable.
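As an illustration of such a post-processing step, the sketch below blends the generated UV texture with the originally visible texture using a Gaussian-feathered visibility mask. This is only one possible smoothing approach and was not part of the experiments; the function and parameter names are chosen for illustration.

import cv2
import numpy as np

def feather_blend(generated_uv, visible_uv, visibility_mask, blur_ksize=31):
    # visibility_mask: HxW array with 1 where the original texture is visible, 0 elsewhere.
    # Soften the binary mask so the transition between the visible and the
    # generated part of the UV map becomes gradual instead of abrupt.
    alpha = cv2.GaussianBlur(visibility_mask.astype(np.float32), (blur_ksize, blur_ksize), 0)
    alpha = np.clip(alpha, 0.0, 1.0)[..., None]
    blended = alpha * visible_uv.astype(np.float32) + (1.0 - alpha) * generated_uv.astype(np.float32)
    return blended.astype(np.uint8)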

Figure 24: Generation results of the ablation study for different loss components. A uses only L_gen. B uses L_gen + L_age. C uses L_gen + L_age + L_D_global. D uses L_gen + L_age + L_D_global + L_D_local. Finally, E uses the total loss, L_gen + L_age + L_D_global + L_D_local + L_id.
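For reference, the five ablation configurations A-E in the caption above are simply cumulative sums of the individual loss terms; the sketch below only illustrates this composition and assumes the individual terms have already been computed.

def ablation_losses(l_gen, l_age, l_d_global, l_d_local, l_id):
    # A = L_gen, B = A + L_age, C = B + L_D_global, D = C + L_D_local, E = D + L_id.
    terms = [l_gen, l_age, l_d_global, l_d_local, l_id]
    return {name: sum(terms[: i + 1]) for i, name in enumerate("ABCDE")}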


7 Conclusion

The aim of this research was to create a pose robust age estimation model that is capable of accurately predicting the age from images in the wild by generating a full frontal face. The hypothesis was that the model would outperform the same age estimation model trained directly on the input images, and that it would give an accurate age estimate for every pose, resulting in a pose robust model. Looking back at the MAE results in Tables 4, 5, and 6, a slight difference in MAE between the different poses can be seen. However, this difference is significantly smaller than for the model trained on the input images. This means that the proposed model is pose robust, and that it performs significantly better than the model trained on the input images.

Unfortunately, only 10k images were used for training, which resulted in an unbalanced dataset with significantly fewer young and old faces. Increasing the number of training images could potentially improve the performance of the model. Another issue is that the data used in this paper is augmented to imitate different poses of a frontal face, instead of using real poses. Had more video data been available containing different poses of the same face, an accurate representation of the full frontal face for different poses could have been learned by the model. In turn, this could result in more successfully generated complete UV maps, which could lead to better age predictions and eventually to a model that is pose invariant.


