
Deep Set Prediction Networks for Facial Landmark Detection Tasks


Layout: typeset by the author using LaTeX.


Deep Set Prediction Networks for Facial Landmark Detection Tasks

Florian E. Schroevers 11334266

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Mr W.D. (David) Zhang MSc
Informatics Institute, Faculty of Science
University of Amsterdam
Science Park 904, 1098 GH Amsterdam

June 26th, 2020


Abstract

Set prediction is inherently difficult in machine learning, because standard models depend on the order and size of the samples in the dataset. Recently developed Deep Set Prediction Networks have made set prediction feasible, although they have not yet been tested thoroughly, nor has their range of applications been explored. The model also suffers from some drawbacks regarding performance. In the following, a possible method of increasing performance is explored. Furthermore, the network is applied to facial landmark detection tasks on datasets containing cat faces and human faces, to explore the suitability of the Deep Set Prediction Network for this kind of task. The results show that the network is capable of predicting sets of facial landmarks, but not on par with state-of-the-art models developed for this task.


Contents

1 Introduction

2 Background
  2.1 Facial landmark detection
  2.2 Predicting sets
    2.2.1 Sets
    2.2.2 Auto-Encoder
    2.2.3 Loss
  2.3 Deep Set Prediction Network

3 Method
  3.1 Implementation
    3.1.1 Data preprocessing
    3.1.2 Model
  3.2 Experiments
    3.2.1 Loss function optimization
    3.2.2 Facial landmark detection

4 Results
  4.1 Loss function optimization
  4.2 Facial landmark detection
    4.2.1 Comparison of models
    4.2.2 Performance of models for specific situations

5 Conclusion
  Acknowledgment

A Hyper-parameters of models

B Sample of results
  B.1 DSPN
  B.2 HRNet

C Reproduction guide


Chapter 1

Introduction

Contemporary neural networks perform well on standard classification tasks like recognizing objects in an image [Ren et al., 2016]. Neural networks are, however, limited by their dependence on the output structure: it needs to be a vector. Due to the design of a neural network, all target vectors in the dataset need to be of the same size, and the vector's i-th element must correspond to the i-th feature, so that the order of the output is consistent; the index of a feature in the output vector determines which output it describes. For example, in a multi-class classification model, the output is usually a vector with the same length as the number of classes, where each index corresponds to one of the classes. In this case, the order of the output vector is important. If this is not the case, the network fails to find meaningful structure in the data [Zhang et al., 2019b]. However, not all prediction tasks are structured like this. When the output size is variable, or when there is no inherent structure (such as an ordering) in the output, typical neural networks are not feasible. Many tasks suffer from this problem, ranging from population statistics [Póczos et al., 2013] to predicting point clouds [Achlioptas et al., 2017] and finding bounding boxes of objects in an image [Rezatofighi et al., 2018]. Vector-to-set models overcome this limitation of vector-to-vector models. [Zhang et al., 2019a] propose the Deep Set Prediction Network (DSPN), a method for set prediction using deep learning. DSPN accepts a variable output size (up to some finite bound), and is permutation-equivariant.

Predicting sets introduces a problem regarding the evaluation metric: in a vector-to-vector model a low-complexity loss function such as Maximum Likelihood or Cross-Entropy can be used, but because ordering is not relevant to sets, the problem arises of finding, for each predicted element, the corresponding element in the target set. An assignment mechanism can address this [Zhang et al., 2019b]. For example, the Chamfer loss function (O(n²)) matches each element from one set to the closest element in the other set. Better results might be achieved using the assignment cost produced by the Hungarian algorithm, known as the earth mover's distance or the Wasserstein metric [Rubner et al., 2000]. DSPN uses either of those functions as loss function. The Hungarian algorithm is more computationally expensive than the Chamfer loss, with a time complexity of O(n³). We adapt the code from [Zhang et al., 2019a] by implementing a parallelized method of calculating the earth mover's distance, which might greatly reduce the training time of the model.

Figure 1.1: Training set image with annotated facial landmarks

DSPNs have been applied to machine learning problems where the output size is variable, such as point clouds and object detection in images (including datasets with multiple objects in one image) [Zhang et al., 2019a]. However, the suitability of DSPNs for facial landmark detection has not been explored. The following describes an implementation of DSPNs adapted for the detection of facial landmarks on cat faces, using the Cat Head Detection Dataset [Zhang et al., 2008]. The dataset consists of ±8500 images of cats with annotated facial landmarks (see Figure 1.1). This implementation aims to localize these landmarks in a test image. The facial landmarks constitute the target set; the model learns a function that decodes a latent representation of the input into that set.

The suitability of this model for this task is evaluated by comparison against a state-of-the-art baseline model for finding facial landmarks in cat faces, using the same loss function to score both models. This leads to an assessment of the non-quantifiable structural deficiencies in the model. The discussion addresses applications of this model in facial landmark detection tasks where state-of-the-art models struggle, such as multiple faces in one image, or a dataset of faces from multiple species, for example both cats and humans. For this reason, both the DSPN and the baseline will also be tested on a dataset containing human faces (with facial landmarks).


Chapter 2

Background

2.1 Facial landmark detection

Over the past few decades, facial landmark detection algorithms have seen significant developments [Wu and Ji, 2019]. There are multiple types of algorithms that accomplish this task with varying levels of success, of which deep learning is one. Convolutional neural networks have been shown to be able to learn structures in images, and this task is no different. A problem that arises is that most of these models rely on datasets that contain images taken in controlled conditions, meaning the images are taken en face, with good lighting and no occlusion. A more real-world application would be able to learn and accurately predict so-called in-the-wild images: images that were not taken in controlled conditions, and might include some if not all of the aforementioned difficulties. Several in-the-wild datasets have been released, and models using these datasets show good performance on localizing facial keypoints [Koestinger et al., 2011], [Burgos-Artizzu et al., 2013], [Sagonas et al., 2016], [Wu et al., 2018]. [Wang et al., 2020] introduce a model showing significantly higher performance on various visual recognition tasks, called HRNet. Various state-of-the-art models use the same design rule in which the resolution of the representations goes down through the convolutional layers. HRNet is a model that keeps high-resolution representations throughout the network, and is therefore spatially more precise. [Sun et al., 2019] apply this model to four in-the-wild facial landmark detection datasets, and achieve better than state-of-the-art results. One of the datasets this model has been tested on is the WFLW dataset [Wu et al., 2018], on which the DSPN model will also be tested.


2.2 Predicting sets

2.2.1 Sets

In machine learning, working with sets is inherently difficult, since most models are developed for fixed-size and fixed-order vectors. A lot of real-world problems suffer from this limitation, for example when the size of the output varies within the dataset, or when the order of the input is variable. In other words, the output of a model often has to be a set. Since a set is a datatype that is structurally very different from a vector, the model needs to account for its properties. The differences between sets and vectors are: in a set, the order (or permutation) of elements is irrelevant, unlike in a vector; also, the size (or cardinality) of a set is not interpreted the same way as the size (or dimensionality) of a vector. Since vector operations are so natural for a computer, a model operating on sets needs some alterations. The practice of extending typical machine learning models to work with sets is only a recent development [Zaheer et al., 2017]. A few types of such models can be identified, namely set-to-vector models, vector-to-set models and even set-to-set models. [Zaheer et al., 2017] introduced Deep Sets, a set-to-vector model. They showed that for an unsupervised set-prediction task, performance is greatly increased when the model is permutation-invariant, compared to order-sensitive models. They also show that such a model in a supervised setting performs better when it is permutation-equivariant. Permutation invariance is a property of a function or transformation where the output does not change under permutation of the input, that is:

∀π ∈ Π : f(π(x)) = f(x),

where Π is the set of all possible permutation functions (functions whose output is a permutation of the input). Permutation-equivariance is the property where the input and output behave in the same way under permutation, that is:

∀π ∈ Π : f(π(x)) = π(f(x))

[Guttenberg et al., 2016]. A simple example of a permutation-invariant function is the summation function f(x) = Σᵢ xᵢ: any permutation of the elements of x leads to the same result. An example of a permutation-equivariant function is the ReLU function ReLU(x) = max(0, x) applied element-wise, or any other element-wise function.
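The two properties can be checked numerically. The sketch below is illustrative and not part of the thesis code: it verifies invariance of sum pooling and equivariance of ReLU under a small random permutation.

    import torch

    def sum_pool(x):
        # Permutation-invariant: summing over the set dimension ignores element order.
        return x.sum(dim=0)

    def relu(x):
        # Permutation-equivariant: an element-wise function commutes with any reordering.
        return torch.clamp(x, min=0.0)

    x = torch.tensor([[-1.0, 2.0], [3.0, -4.0], [0.5, 0.0]])  # a "set" of 3 elements in R^2
    perm = torch.tensor([2, 0, 1])                            # an arbitrary permutation

    # Invariance: f(pi(x)) == f(x)
    assert torch.allclose(sum_pool(x[perm]), sum_pool(x))

    # Equivariance: f(pi(x)) == pi(f(x))
    assert torch.allclose(relu(x[perm]), relu(x)[perm])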

Besides inherently designing the model in such a way that it is permutation-invariant or permutation-equivariant, other methods could be explored. For example, one might think of sorting the input set, making the input equal under any permutation. Another method could be extending the dataset by adding random permutations of all the samples. Both of those methods have been shown to be infeasible [Qi et al., 2017]. [Rezatofighi et al., 2018] propose a supervised set-to-set model capable of object recognition that outperforms state-of-the-art models. This model learns not only the targets, but also the permutation and the cardinality of the input, factoring out the permutation. To find the ground truths of the permutations, since these are unknown from the data, a linear assignment algorithm, namely the Hungarian algorithm, is used.

2.2.2 Auto-Encoder

An auto-encoder is an unsupervised method to learn a data encoding for a specific dataset. The model encodes data into a representation, or latent space, and it also learns a decoder to reconstruct an approximation of the input. These encoders can be as simple as a multi-layer perceptron (MLP). This technique is used for dimensionality reduction, since the dimension of the representation can be lower than the dimension of the input data, and for generative models, since the decoder can reconstruct data that resembles the input data. In practice, auto-encoders are used for tasks like facial recognition [Gao et al., 2015] and natural language generation [Bowman et al., 2015].

The use of auto-encoders in combination with a Gaussian mixture model has been shown to produce reliable results as a vector-to-set model, in the case of point cloud generation [Achlioptas et al., 2017]. However, [Zhang et al., 2019a] point out that this solution does not take the responsibility problem [Zhang et al., 2019b] into account. This problem arises when the model must learn some discontinuous jump when assigning an output to specific neurons. The result is that the decoder will not reproduce the input set in the same order, which is a problem since the input is a vector, not a set. Feature-wise Sort Pooling (FSPool) [Zhang et al., 2019b] is a method of avoiding the responsibility problem. When an FSPool encoder encodes the input set into a vector, it stores the permutation in the encoder. Later this permutation can be used to decode the vector back into the input. The method works by sorting the input and then performing a weighted sum. The weights of this weighted sum are learnable, so these are the parameters of the encoder. To account for variable set sizes, the weights are not a discrete vector, but a piecewise linear function f : [0, 1] → R. The function is then evaluated at evenly spaced points based on the size of the set. For example, to obtain the weights for a set of size 4, the function is evaluated at 0, 0.33..., 0.66... and 1. This function is chosen such that the parameters are learnable. To maintain permutation-equivariance, the sorting is done by the recently developed sorting network [Grover et al., 2019], which learns a permutation matrix. In the decoder, the permutation matrix from the encoder is simply inverted to obtain the original order of the input. With this innovation, a permutation-equivariant auto-encoder can be constructed.
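To make the pooling step concrete, the sketch below shows a simplified feature-wise sort pooling layer. It is a sketch under stated assumptions, not the reference FSPool implementation: it omits the learned permutation matrix of [Grover et al., 2019] and the unpooling step, and the class and parameter names are placeholders.

    import torch
    import torch.nn as nn

    class SimpleFSPool(nn.Module):
        """Simplified sketch of feature-wise sort pooling.

        Each feature channel is sorted across the set dimension and reduced with a
        weighted sum. The weights come from a piecewise linear function on [0, 1]
        (stored as n_pieces + 1 learnable control points per channel), evaluated at
        n evenly spaced positions so that sets of any size n can be pooled.
        """

        def __init__(self, n_channels: int, n_pieces: int = 4):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(n_channels, n_pieces + 1) * 0.1)
            self.n_pieces = n_pieces

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (n, n_channels) -- one set of n elements.
            n, _ = x.shape
            # Evaluate the piecewise linear weight function at n evenly spaced points in [0, 1].
            pos = torch.linspace(0, 1, n) * self.n_pieces
            idx = pos.floor().long().clamp(max=self.n_pieces - 1)
            frac = (pos - idx.float()).unsqueeze(1)                       # (n, 1)
            w = (1 - frac) * self.weight[:, idx].t() + frac * self.weight[:, idx + 1].t()
            # Sort each channel independently over the set dimension, then weight and sum.
            x_sorted, _ = x.sort(dim=0, descending=True)
            return (x_sorted * w).sum(dim=0)                              # (n_channels,)

    pool = SimpleFSPool(n_channels=2)
    pooled = pool(torch.randn(9, 2))   # permutation-invariant representation of a 9-element set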

2.2.3 Loss

Besides the problem of permutation-equivariance, another issue arises when the output of a model is a set. During training, the model needs to determine some error based on its current predictions in order to perform gradient descent; a loss function is needed. Such a loss function needs to conform to a certain criterion: it needs to be a differentiable function (in most places) in order to determine a gradient. Two such functions have been shown to both give a loss metric between sets of points and to be differentiable almost everywhere: the Chamfer distance, and the earth mover's distance (EMD) [Fan et al., 2017]. The Chamfer distance works by assigning, to each point in one set, the nearest neighbor in the other set, and then squaring and summing these distances. The Chamfer distance between two sets of points S₁, S₂ ⊂ ℝᵈ, where d ≥ 2, is given by:

d_CD(S₁, S₂) = Σ_{x₁∈S₁} min_{x₂∈S₂} ||x₁ − x₂||² + Σ_{x₂∈S₂} min_{x₁∈S₁} ||x₁ − x₂||²

This metric has some shortcomings, mainly that a point in one set can be matched to multiple points in the other. The EMD is a distance obtained by taking the assignment cost of a linear assignment algorithm, namely the Hungarian algorithm. This algorithm finds a bijection between two sets that minimizes the total distance between the matched pairs. For the EMD, we are not interested in the bijection it finds, but only in the distances between the pairs. The earth mover's distance between two sets of points S₁, S₂ ⊂ ℝᵈ, where d ≥ 2, is given by:

d_EMD(S₁, S₂) = min_{f : S₁→S₂} Σ_{x∈S₁} ||x − f(x)||₂

A benefit of the Chamfer distance over the EMD is its time complexity. The time complexity of calculating the Chamfer distance is O(n²), whereas the time complexity of the Hungarian algorithm is O(n³).
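Both distances are easy to sketch for small point sets. The snippet below is an illustrative implementation, not the GPU version used later in the thesis: the Chamfer distance is computed from the pairwise distance matrix, and the EMD uses SciPy's Hungarian solver (linear_sum_assignment) to find the optimal bijection.

    import torch
    from scipy.optimize import linear_sum_assignment

    def chamfer_distance(s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
        """Chamfer distance between point sets of shape (n, d) and (m, d)."""
        d = torch.cdist(s1, s2) ** 2                   # pairwise squared distances, (n, m)
        # For every point, the squared distance to its nearest neighbour in the other set.
        return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()

    def earth_movers_distance(s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
        """EMD between two equally sized point sets, via the Hungarian algorithm (O(n^3))."""
        d = torch.cdist(s1, s2)                        # pairwise Euclidean distances
        row, col = linear_sum_assignment(d.detach().cpu().numpy())   # optimal bijection
        # The matching is treated as fixed; gradients flow through the selected distances.
        return d[row, col].sum()

    a = torch.rand(9, 2)   # e.g. 9 predicted landmarks in [0, 1]^2
    b = torch.rand(9, 2)   # 9 target landmarks
    print(chamfer_distance(a, b).item(), earth_movers_distance(a, b).item())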

2.3 Deep Set Prediction Network

Some of the innovations described above are used to develop the Deep Set Prediction Network, a vector-to-set model. The model is based on the observation that to decode a feature vector into a set, it is possible to use gradient descent to find a set that encodes to that feature vector. The proof of this statement is provided by [Zhang et al., 2019a]. Conceptually, this means that the model is trying to find a decoder that decodes a latent representation of the input into the target set. The model is based on the permutation-equivariant auto-encoder using FSPool, but that model lacks some key features: first, its input needs to be a set, and second, the size of the output set needs to be constant. The DSPN model addresses these shortcomings.

The following describes the steps to construct the DSPN. First of all, an input encoder is used to obtain a latent representation of the input. The goal is to find some set that approximately encodes into this latent representation. If a permutation-equivariant encoder is used for the set (the set encoder), a simple representation loss can be used to evaluate the estimated set. The representation loss is defined as follows:

L_repr(Ŷ, z) = ||g_enc(Ŷ) − z||²

Where Ŷ is the estimated set, z is the latent representation of the input, and g_enc is a set encoder. The goal is to optimize Ŷ such that the representation loss is minimized. To do this, gradient descent is used. For some number of iterations T, and some initial set Ŷ⁽⁰⁾:

Ŷ⁽ᵗ⁺¹⁾ = Ŷ⁽ᵗ⁾ − η · ∂L_repr(Ŷ⁽ᵗ⁾, z) / ∂Ŷ⁽ᵗ⁾

where t < T, and η is the learning rate. This is what is called the inner optimization loop. After T iterations, the goal is to have a set Ŷ⁽ᵀ⁾ ≈ Y; this inner optimization plays the role of the decoder g_dec.

Missing from this is the training of the weights of the encoder g_enc. To obtain a metric for this encoder, we use a set loss function such as the Chamfer distance (CD) or the earth mover's distance (EMD):

L_set(Ŷ, Y) = L_CD(Ŷ, Y)

or

L_set(Ŷ, Y) = L_EMD(Ŷ, Y)

As seen before, these functions are differentiable, therefore we can use gradient descent to minimize this loss. This is what is called the outer optimization loop.

In the case of a vector-to-set model, the input encoder is not a set encoder. This loses the guarantee that the optimized set will approximate Y. To fix this, the loss for the encoder is extended to make sure the target set also encodes to approximately z. To do that, the representation loss of the target set Y is added (with some weight λ):

L_total = L_set(Ŷ⁽ᵀ⁾, Y) + λ · L_repr(Y, z)

Thus far the network is engineered to properly predict sets of constant size. To make sure it can handle variable-sized outputs, an extension is necessary. First, all target sets are padded to some (a priori chosen) maximum size. Then, every element is concatenated with a mask feature, indicating whether the element is relevant or not (1 for relevant, 0 for irrelevant). These mask features are then also learned by the model, and their values are clamped to [0, 1] to make sure they form a valid mask. For the input encoder, any encoder that is relevant for the input type can be used. For example, for images, a convolutional neural network such as ResNet [He et al., 2016] can be used. For the set encoder, an encoder using FSPool is useful, since it maintains the permutation-equivariance [Zhang et al., 2019b].
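A minimal sketch of the two optimization loops is given below. It assumes a permutation-invariant set_encoder module and an input_encoder that maps an image to z; the names, the schematic outer step in the comments, and the omission of the mask handling are simplifications of this sketch rather than the reference implementation.

    import torch
    import torch.nn as nn

    def dspn_decode(set_encoder: nn.Module, z: torch.Tensor, init_set: torch.Tensor,
                    inner_lr: float = 800.0, inner_iters: int = 20) -> torch.Tensor:
        """Inner loop: optimize a candidate set so that it encodes to (approximately) z."""
        y_hat = init_set.clone().requires_grad_(True)
        for _ in range(inner_iters):
            repr_loss = ((set_encoder(y_hat) - z) ** 2).sum()
            # create_graph=True keeps the inner steps differentiable, so the outer
            # loss can be backpropagated through the whole inner loop.
            (grad,) = torch.autograd.grad(repr_loss, y_hat, create_graph=True)
            y_hat = y_hat - inner_lr * grad        # gradient step on the set itself
        return y_hat

    # Outer loop, one training step (schematic):
    #   z = input_encoder(image)                    # e.g. a ResNet feature vector
    #   y_hat = dspn_decode(set_encoder, z, init_set)
    #   loss = set_loss(y_hat, y_target)            # Chamfer or EMD
    #   loss = loss + lam * ((set_encoder(y_target) - z) ** 2).sum()
    #   loss.backward(); optimizer.step()           # updates both encoders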

[Zhang et al., 2019a] describe the RNFS-Encoder, a 2-layer relation network (RN) [Santoro et al., 2017] with FSPool as pooling function. A relation network is a building block for a deep learning model that is capable of learning relationships within a set of objects. It works by learning a function that infers the relation between all pairs of objects. [Santoro et al., 2017] use it for super-human performance on the CLEVR QA test. The CLEVR QA test [Johnson et al., 2017] is a test related to the CLEVR dataset in which questions about a scene have to be answered (questions like: what color is the cube to the right of the yellow sphere?). They mention that the RN could be used for learning relations in sets, since it is invariant to the order of the inputs. [Zhang et al., 2019a] use the RNFS-Encoder for the CLEVR QA test as well.


Chapter 3

Method

3.1 Implementation

The code for the Deep Set Prediction Networks is provided by Zhang et al. (Cyanogenoid/dspn on GitHub). For the experiments described next, this code has been adapted to support two new datasets. Additionally, a method of combining datasets without altering the file structure or even the data loading methods has been implemented by means of a wrapper class. This method could be used to easily extend the model to accept more datasets without having to alter the structure of the code and/or the data.
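The wrapper class itself is not shown in the thesis; a minimal sketch of how such a wrapper could look is given below (class and variable names are placeholders). PyTorch's torch.utils.data.ConcatDataset provides essentially the same behaviour out of the box.

    from torch.utils.data import Dataset

    class UnionDataset(Dataset):
        """Expose several landmark datasets as one, without touching their file
        structure or their individual data-loading code."""

        def __init__(self, *datasets: Dataset):
            self.datasets = datasets
            self.lengths = [len(d) for d in datasets]

        def __len__(self) -> int:
            return sum(self.lengths)

        def __getitem__(self, index: int):
            # Walk through the wrapped datasets until the index falls inside one of them.
            for dataset, length in zip(self.datasets, self.lengths):
                if index < length:
                    return dataset[index]
                index -= length
            raise IndexError(index)

    # combined = UnionDataset(cat_faces_dataset, wflw_dataset)   # hypothetical dataset objects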

The two loss functions able to find a distance measure between two point sets are the Chamfer distance and the EMD. Theoretically, the EMD should perform better than the Chamfer distance [Fan et al., 2017]. The complexity issue prevents reasonable usage of the EMD in practical settings when the algorithm runs sequentially [Zhang et al., 2019a]. Since the Hungarian algorithm is parallelizable on GPU [Fan et al., 2017], it could be usable for practical applications. The code for running this algorithm on GPU, written by Fan et al. (daerduoCarey/PyTorchEMD on GitHub), has been adapted for Python in the neuralnet-pytorch package (Neuralnet-Pytorch on readthedocs).

3.1.1 Data preprocessing

The cat faces dataset contains 8605 RGB images of cats. Each image contains exactly one cat, and the cat is (mostly) not occluded. The images are all of different sizes. Each image has an associated file describing the (manually marked) pixel coordinates of the following 9 facial landmarks: the nose, the eyes, the tips of the ears, and the two points at the base of the ears. Some of these keypoints may lie outside the borders of the image, when the cat's face is not entirely inside the image. The WFLW dataset contains 10000 RGB images of human faces. The faces are in-the-wild, meaning they occur under different conditions of lighting, scale and orientation of the face in the image, and occlusion. The images are all of different sizes. There are 98 facial landmarks annotated in the images. To keep the scale roughly similar to the cat dataset, a selection of the following 7 landmarks is made: the centers of the eyes, the tip of the nose, the corners of the mouth, and the points where the ears connect to the face (see figure 3.1). The dataset also contains the following 6 extra features: blur, expression, illumination, pose, makeup and occlusion. These are binary values, simply denoting whether the feature is present in the image or not.

Figure 3.1: Example of marked keypoints for an image from the cat faces dataset, and for an image from the WFLW dataset

The images from both datasets have been resized to 128x128 pixels. All the pixel coordinates of the landmarks have been divided by the image dimensions, so the network has to predict values between 0 and 1. Both datasets have a train/test split: the cat dataset has 7284 train images and 1321 test images (15.4%). The WFLW dataset has 7500 train images and 2500 test images (25%). For the combined dataset, this results in a split of 14784 train images and 3821 test images (20.5%).
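A minimal sketch of this preprocessing step is shown below; the function name and the exact resizing call are placeholders, and the adapted data-loading code may differ in detail. Note that landmarks lying outside the image simply map to values outside [0, 1].

    import numpy as np
    from PIL import Image

    def preprocess(image_path: str, landmarks: np.ndarray, size: int = 128):
        """Resize an image to size x size and scale landmark coordinates to [0, 1].

        landmarks is an (n, 2) array of (x, y) pixel coordinates in the original image.
        """
        image = Image.open(image_path).convert("RGB")
        w, h = image.size
        image = image.resize((size, size))
        # Divide by the original dimensions so the network predicts values in [0, 1].
        normalized = landmarks / np.array([w, h], dtype=np.float32)
        return np.asarray(image, dtype=np.float32) / 255.0, normalized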

3.1.2 Model

As set encoder, the model uses the RNFS-Encoder. We choose this encoder because it learns relations in the set, and in the case of facial keypoints we can say these relations exist, e.g. the eyes mostly lie on the same plane (we see a structural relation between the keypoints). Since the inputs of this model are images, an input encoder is also necessary. In this case, a ResNet34-based image encoder is used. The loss function the model uses is the EMD.
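The sketch below shows how such a configuration could be assembled; the ResNet34 backbone and the latent dimensionality follow the description above, but the stand-in set encoder is only a placeholder for the RNFS-Encoder, and the whole block is a sketch rather than the thesis implementation.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet34

    latent_dim = 512    # dimensionality of the latent space (see appendix A)

    # Input encoder: a ResNet34 backbone whose classification head maps an image
    # to a latent vector z instead of class scores.
    input_encoder = resnet34(num_classes=latent_dim)

    class StandInSetEncoder(nn.Module):
        """Placeholder for the RNFS-Encoder (relation network + FSPool)."""

        def __init__(self, element_dim: int = 3, hidden: int = 512, out: int = latent_dim):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(element_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, out))

        def forward(self, y: torch.Tensor) -> torch.Tensor:  # y: (n, element_dim), e.g. (x, y, mask)
            # Sum pooling keeps this stand-in encoder permutation-invariant.
            return self.mlp(y).sum(dim=0)

    set_encoder = StandInSetEncoder()
    z = input_encoder(torch.randn(1, 3, 128, 128)).squeeze(0)  # latent representation of one image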

3.2 Experiments

3.2.1 Loss function optimization

To determine the difference in run time of the loss functions, each of the three variants (the Chamfer distance, the EMD on CPU, and the EMD on GPU) is run on a set of dummy inputs of varying sizes (powers of 2 up to 2²⁰). For all experiments, an Intel i5-6200U @ 2.30GHz (4 cores) and a GeForce GTX 950M are used. This gives a direct measure of the time taken by the different loss functions, but it is not necessarily a good measure of the speedup of the training time of the model. The following experiment has been set up in order to determine the difference in training time between the loss functions. The DSPN model is trained three times: once using the Chamfer distance, once using the EMD on CPU (as used in the original implementation), and once using the parallelized version of the EMD on GPU. Each of the runs is trained for 100 epochs. The batch size has an impact on the training time when running the program on GPU, since there is overhead from copying the data to VRAM, caching, and other low-level optimizations; therefore the batch size is set to a constant value of 12 for all training runs. Since the performance on predicting facial landmarks is not important for this test, all other hyper-parameters are set to the same values as described for the CLEVR bounding box prediction task in Zhang et al.. The total training time and the average time per epoch are recorded.
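A simple way to collect such timings is sketched below; the sizes, the repeat count, and the use of the chamfer_distance sketch from section 2.2.3 are illustrative choices, not the benchmark script used for the thesis.

    import time
    import torch

    def time_loss(loss_fn, sizes, device: str = "cuda", repeats: int = 5):
        """Average wall-clock time of a set loss on random inputs of growing size."""
        results = {}
        for n in sizes:
            a = torch.rand(n, 2, device=device)
            b = torch.rand(n, 2, device=device)
            if device == "cuda":
                torch.cuda.synchronize()           # make sure pending kernels are done
            start = time.perf_counter()
            for _ in range(repeats):
                loss_fn(a, b)
            if device == "cuda":
                torch.cuda.synchronize()           # wait for the GPU before reading the clock
            results[n] = (time.perf_counter() - start) / repeats
        return results

    # timings = time_loss(chamfer_distance, sizes=[2 ** k for k in range(4, 14)])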

3.2.2 Facial landmark detection

To determine the performance of the DSPN model described above for facial landmark detection, three training runs are performed: one on the cat faces dataset, one on the WFLW dataset, and one on the union of the two datasets.

Training is done in batches of 12, since this is the maximum number of inputs that can be stored in the VRAM (2 gigabytes) of the GPU that is used. Since the facial landmark detection task is somewhat similar to the CLEVR bounding box prediction task, most hyper-parameters used to train the network for facial landmarks are kept the same as for the CLEVR-box task described in [Zhang et al., 2019a]. It is mentioned that increasing the number of inner iterations might increase performance, so the number of inner iterations for this experiment is doubled to 20. All other hyper-parameters are kept the same (see appendix A).

The predictions on the test set are saved. To determine the performance of the model, a baseline comparison is needed. For this task, the state-of-the-art model introduced by [Wang et al., 2019], named HRNet, is used. This model has not been applied to landmark detection in cat faces before, but since the task is very similar, it can also be seen as state-of-the-art for this specific dataset. The model is trained using almost all the same hyper-parameters described by [Wang et al., 2019] (see appendix A). The only difference is the scale and rotation parameters when training on the cat faces dataset: HRNet augments the dataset by adding rotated and scaled versions of the training images, which caused problems where some of the augmented images were zoomed in on the cat face in such a way that most keypoints fall outside the image. The trained model for the WFLW dataset is provided by [Sun et al., 2019], and it is used to obtain predictions for the WFLW dataset. This model does not support variable output, so it is not used on the union of the datasets. Since this model is not a set prediction model, the metric it normally uses to determine performance cannot be applied to the DSPN predictions. Therefore, the predictions of both models are compared using the Chamfer distance and the EMD.

For the WFLW dataset, the test set is split further into subsets in which one of the 6 extra features is present. That is, there is a test set containing only images with blur, a test set containing only images with occlusion, etc. Both the DSPN model and the HRNet model are tested separately on these sets to compare their performance in those specific conditions.
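Building these per-condition subsets amounts to filtering the test annotations on the binary attribute flags. The sketch below is purely illustrative: the attribute order, the array shapes and the random stand-in data are assumptions, not the actual WFLW annotation format as loaded in the thesis code.

    import numpy as np

    attributes = ["pose", "expression", "illumination", "makeup", "occlusion", "blur"]
    flags = np.random.randint(0, 2, size=(2500, 6))      # stand-in for the real attribute columns
    test_indices = np.arange(len(flags))

    # One subset of test indices per attribute; an image can appear in several subsets.
    subsets = {name: test_indices[flags[:, i] == 1] for i, name in enumerate(attributes)}
    # e.g. evaluate both models on subsets["blur"], subsets["occlusion"], ...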


Chapter 4

Results

4.1 Loss function optimization

Figure 4.1: Plot showing the processing time of two different loss functions on arrays of different sizes.

In figure 4.1 we can see the processing times of the Chamfer loss function and the EMD-GPU loss function. A few things are immediately apparent. The first thing that is clear is that for lower input sizes (less than 10⁵), the Chamfer distance performs better. This is to be expected, since the theoretical time complexity of the Chamfer distance is better than the time complexity of the EMD. Both of these functions run on GPU and are entirely parallelized, which explains the almost linear time complexity when the entire array can be evaluated in parallel. The sharp increase in processing time after about 10⁵ is most likely due to the number of computations required being larger than the amount of computation the GPU can run in parallel. Another observation is the bump the Chamfer distance makes in processing time when the size of the input array is between ±10³ and ±10⁴.


                          Chamfer    EMD - CPU    EMD - GPU
Avg time per epoch (s)    327.3      332.4        322.4
Relative speedup          1.0        1.016        0.985

Table 4.1: Loss function effect on training time

Figure 4.2: Plot showing the processing time of two different loss functions on arrays of different sizes.

In figure 4.2 it is quite obvious that the GPU implementation of the EMD is far better than the CPU implementation. Not only is the processing time of the CPU implementation far greater for any input size (notice the logarithmic scale of the Y-axis), its high-complexity nature also becomes apparent at much smaller input sizes due to the lack of parallelization.

However, as can be seen in table 4.1, there is barely any difference in training time between the runs. In fact, they only differ by about 1 percent. The input sizes for the loss functions are proportional to the batch size. Since the images themselves are encoded using a convolutional network, they are also loaded into the memory of the GPU, which greatly limits the maximum batch size before the GPU runs out of memory. In practice the batch size could only be up to a few dozen, depending on the machine. This order of magnitude is so low that the difference in processing time between the loss functions is only up to hundreds of milliseconds, as can be seen in figure 4.2. In short, the input size of the array in practice was so small that the full effect of parallelization could not be utilized. On machines that have more GPU memory, or for tasks that do not encode images on the GPU, the speedup of the parallelized EMD function could be more pronounced.


Figure 4.3: Sample of DSPN predictions on cat faces (top row), and HRNet predictions (bottom row)

4.2 Facial landmark detection

4.2.1 Comparison of models

In table 4.2 the EMD and the Chamfer loss of the sets of predictions made by the different models are shown. We can see that for both datasets, the HRNet model performs better than the DSPN model. A sample of the predictions is shown in figure 4.3. We can see from the images that the HRNet predictions are generally closer. The first thing that stands out is that the HRNet predictions look a lot better than the DSPN predictions when the cat face is rotated with respect to the image. This is an indication that the HRNet model is much more robust under rotation. We can also see that HRNet is not capable of detecting landmarks outside of the image, while the DSPN model is. The difference in loss is a lot larger on the WFLW dataset. Since this HRNet model is designed for human facial landmark detection, it makes sense that the losses for the WFLW dataset are so much lower. For this training run, no hyper-parameter optimization has been done, but since we are using a pre-trained model, we can assume the hyper-parameters were better optimized. Another significant observation is the difference in DSPN performance between the datasets: it performs better on the cat faces dataset. This is most likely due to the simplicity of this dataset; the images are not in-the-wild, making it an easier detection task altogether. The performance of the DSPN model on the union of the datasets is quite a bit worse than on the individual datasets.


Dataset   Model    Chamfer loss     EMD
Cats      DSPN     1.710 · 10⁻³     7.440 · 10⁻²
Cats      HRNet    5.355 · 10⁻⁴     4.346 · 10⁻²
WFLW      DSPN     3.437 · 10⁻⁴     7.385 · 10⁻³
WFLW      HRNet    1.490 · 10⁻⁵     3.734 · 10⁻³
Union     DSPN     3.405 · 10⁻³     5.478 · 10⁻¹

Table 4.2: Losses on the validation set for the different models (lower is better)

4.2.2 Performance of models for specific situations

Qualitatively speaking, it is apparent that the model performs fairly well at predicting facial landmarks on cat faces. After inspecting some of the resulting images as seen in figure 4.3, some preliminary conclusions can be drawn. First of all, it seems the model performs better on images where the portion of the image occupied by the face is high, and on images where the angle of the face relative to the orientation of the image is low. Examples of these cases are shown in figure 4.4. To assess quantitatively whether this observation is valid, some analysis of the scale and orientation of the faces in the validation set is necessary.

First, the scale of a cat in an image is obtained by simply taking the distance between the eyes of the cat in pixels, divided by the width of the image. This gives a rough estimate of the portion of the image the cat face occupies. To obtain the rotation, the angle between the horizontal axis of the image and the line through the two eyes of the cat is calculated. The absolute value of this angle is used, since we are only interested in the magnitude of the rotation, not the direction. We are interested in the correlation between the performance of the model on a specific image and the scale and rotation of the cat face in that image. In figure 4.5 the rotation and scale measurements are shown (also note the Pearson correlation coefficient (r) and the corresponding p-values in the caption). As expected, the loss generally increases as the angle increases, which the very low p-value confirms. This is most likely due to there simply being more low-rotation images in the dataset, as we can see from the distribution of points in figure 4.5, which causes the model to have a bias towards those images. It also seems the loss decreases as the scale of the face increases. Even though this relationship seems to be quite weak, the relatively low p-value suggests it is relevant nonetheless. This is likely due to the way the loss functions work: they are based on the Euclidean distance. This means that if the scale of the face is very low, and the entire cluster of predictions is close to the actual landmarks, the loss will be as low as, or even lower than, when the scale is high (see figure 4.4: the absolute distance between the predictions and targets in the high-scale faces is very similar to the absolute distance in the low-scale images, even though the structure seems a lot more accurate in the high-scale image).

Figure 4.4: Selected predictions for different scales and rotations (panels: high scale/low rotation, low scale/low rotation, high scale/high rotation, low scale/high rotation)
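The two quantities are straightforward to compute from the eye landmarks; the sketch below is an illustrative reconstruction of the described measurement, with placeholder coordinates, not the analysis script itself.

    import numpy as np

    def scale_and_rotation(left_eye: np.ndarray, right_eye: np.ndarray, image_width: float):
        """Scale and rotation of a face from its two eye landmarks (pixel coordinates).

        Scale: eye distance divided by the image width.
        Rotation: absolute angle (degrees) between the eye line and the horizontal axis.
        """
        delta = right_eye - left_eye
        scale = np.linalg.norm(delta) / image_width
        angle = abs(np.degrees(np.arctan2(delta[1], delta[0])))
        return scale, angle

    # scale, angle = scale_and_rotation(np.array([175.0, 160.0]), np.array([250.0, 200.0]),
    #                                   image_width=500)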


Figure 4.5: Scale and rotation measurements versus loss of the DSPN model on cat face images. (a) Angle of the cat face relative to the horizontal axis of the image versus the EMD loss of the model on that image: r = 0.349, p = 0.000. (b) Scale (distance between the eyes relative to the width of the image) versus the EMD loss: r = −0.063, p = 0.023.

We are also interested in the correlation between the color in the images and the loss. It might be that some color information helps with detecting the landmarks. To investigate this, the images are converted from RGB to HSV (Hue, Saturation, Value). This is done in order to separate color attributes in a way that is easily interpretable: for example, the average saturation of an image is more understandable than its average greenness. For all images, the average hue, saturation and value are calculated and plotted against the loss. As figure 4.6 and the corresponding Pearson correlation coefficients and p-values suggest, there is effectively no correlation between these average color values and the loss.
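Computing these statistics and the corresponding correlations can be sketched as follows; the paths and the list of per-image losses are placeholders, and the HSV conversion via Pillow is an assumed rather than thesis-specified choice.

    import numpy as np
    from PIL import Image
    from scipy.stats import pearsonr

    def mean_hsv(image_path: str) -> np.ndarray:
        """Average hue, saturation and value of an image, each scaled to [0, 1]."""
        hsv = np.asarray(Image.open(image_path).convert("HSV"), dtype=np.float32) / 255.0
        return hsv.reshape(-1, 3).mean(axis=0)

    # Correlate one colour statistic with the per-image EMD losses:
    # hues = np.array([mean_hsv(p)[0] for p in test_image_paths])   # hypothetical list of paths
    # r, p = pearsonr(hues, emd_losses)                             # emd_losses: one value per image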

In figure 4.7 we see the same measurements for the prediction losses of the HRNet model on the cat dataset. All the Pearson correlation coefficients and p-values suggest there is no significant correlation between any of the measurements and the losses (except for the average hue, but this correlation is quite small). HRNet was developed to be able to detect facial landmarks on faces in-the-wild. In general, faces in-the-wild show far more variation in rotation and scale than the relatively simple cat faces dataset. This could explain why these correlations are so much less present in the predictions of the HRNet model.

The WFLW dataset also contains other features besides the keypoints. The test set is split into multiple subsets, in each of which one of those features is present. This allows us to gather more information about how the models perform on this dataset under specific conditions. We can measure this for both the DSPN and the HRNet model. In table 4.3 the EMD losses for these different test sets are shown. As we can see, the HRNet model still performs better in all conditions.


Figure 4.6: Color measurements versus loss of the DSPN model on cat face images. (a) Average hue of an image versus the EMD loss of the model on that image: r = 0.036, p = 0.197. (b) Average saturation versus the EMD loss: r = 0.030, p = 0.272. (c) Average value (brightness) versus the EMD loss: r = 0.036, p = 0.191.

Subset          DSPN                     HRNet
blur            9.592 · 10⁻³ (×1.30)     2.281 · 10⁻³ (×0.61)
expression      7.177 · 10⁻³ (×0.97)     5.673 · 10⁻³ (×1.52)
illumination    7.664 · 10⁻³ (×1.04)     3.744 · 10⁻³ (×1.00)
largepose       1.442 · 10⁻² (×1.95)     3.529 · 10⁻³ (×0.95)
makeup          8.080 · 10⁻³ (×1.09)     5.490 · 10⁻³ (×1.47)
occlusion       9.613 · 10⁻³ (×1.30)     4.510 · 10⁻³ (×1.21)

Table 4.3: EMD losses for different test sets for the DSPN model and the HRNet model, with the relative increase of loss compared to the loss on the entire test set

What is more interesting is looking at how much better or worse a model performs under a certain condition compared to its loss on the whole test set, and then comparing this between models. As we can see in table 4.3, the losses on these splits are still lower for HRNet, and when comparing the relative increase or decrease in loss, it seems the HRNet model is more robust under these splits than the DSPN model. The only instance where the DSPN loss decreases and the HRNet loss increases is the expression test set.


Figure 4.7: Scale, rotation and color measurements versus loss of the HRNet model on cat face images. (a) Angle of the cat face relative to the horizontal axis of the image versus the EMD loss of the model on that image: r = 0.040, p = 0.149. (b) Scale (distance between the eyes relative to the width of the image) versus the EMD loss: r = −0.026, p = 0.338. (c) Average hue versus the EMD loss: r = −0.056, p = 0.044. (d) Average saturation versus the EMD loss: r = −0.033, p = 0.228. (e) Average value (brightness) versus the EMD loss: r = −0.034, p = 0.222.


Chapter 5

Conclusion

First of all, in our experiments we have seen that the DSPN model could be further improved by parallelizing the loss function, but only when working on machines with more GPU memory or on tasks where the inputs of the model are not images; to determine the true value of this, further experimentation is required. The DSPN model performs well for facial landmark detection tasks, but does not outperform a state-of-the-art model specifically designed for this task, HRNet. We showed that the DSPN model's performance decreases when the inputs show some form of angular transformation. As the original authors of the DSPN mentioned, the set encoder might be improved by expanding it to a rotation-equivariant encoder. We have also seen that the model performs worse on images where the scale of the face in the image is low, even though this is not reflected in the loss. This is due to the loss not taking the structural similarity between the predictions and the targets into account. For tasks like this, the model might be improved by expanding the loss function to incorporate some measure of structural similarity between the predictions and the targets. Another way to improve the model could be to perform bounding box prediction first: we can see that if the cat face takes up a larger part of the image, the predictions look a lot better, so if the image can be cropped to contain only the face of the cat, the results might be better.

Acknowledgment

Special thanks to David Zhang MSc. for his excellent supervision. Thanks to Oscar McGinty and Callum McLean for their helpful feedback on the writing.


Bibliography

[Achlioptas et al., 2017] Achlioptas, P., Diamanti, O., Mitliagkas, I., and Guibas, L. (2017). Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392.

[Bowman et al., 2015] Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2015). Generating sentences from a continuous space.

[Burgos-Artizzu et al., 2013] Burgos-Artizzu, X. P., Perona, P., and Dollár, P. (2013). Robust face landmark estimation under occlusion. In Proceedings of the IEEE international conference on computer vision, pages 1513–1520.

[Fan et al., 2017] Fan, H., Su, H., and Guibas, L. J. (2017). A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 605–613.

[Gao et al., 2015] Gao, S., Zhang, Y., Jia, K., Lu, J., and Zhang, Y. (2015). Single sample face recognition via learning deep supervised autoencoders. IEEE Transactions on Information Forensics and Security, 10(10):2108–2118.

[Grover et al., 2019] Grover, A., Wang, E., Zweig, A., and Ermon, S. (2019). Stochastic optimization of sorting networks via continuous relaxations. arXiv preprint arXiv:1903.08850.

[Guttenberg et al., 2016] Guttenberg, N., Virgo, N., Witkowski, O., Aoki, H., and Kanai, R. (2016). Permutation-equivariant neural networks applied to dynamics prediction. arXiv preprint arXiv:1612.04530.

[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

[Johnson et al., 2017] Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910.

[Koestinger et al., 2011] Koestinger, M., Wohlhart, P., Roth, P. M., and Bischof, H. (2011). Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE international conference on computer vision workshops (ICCV workshops), pages 2144–2151. IEEE.

[Póczos et al., 2013] Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression.

[Qi et al., 2017] Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017). Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652– 660.

[Ren et al., 2016] Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.

[Rezatofighi et al., 2018] Rezatofighi, S. H., Kaskman, R., Motlagh, F. T., Shi, Q., Cremers, D., Leal-Taixé, L., and Reid, I. (2018). Deep perm-set net: Learn to predict sets with unknown permutation and cardinality using deep neural networks. arXiv preprint arXiv:1805.00613.

[Rubner et al., 2000] Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121.

[Sagonas et al., 2016] Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. (2016). 300 faces in-the-wild challenge: Database and results. Image and vision computing, 47:3–18.

[Santoro et al., 2017] Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. (2017). A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976.

[Sun et al., 2019] Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., and Wang, J. (2019). High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514.


[Wang et al., 2020] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence.

[Wang et al., 2019] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., and Xiao, B. (2019). Deep high-resolution representation learning for visual recognition. TPAMI.

[Wu et al., 2018] Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., and Zhou, Q. (2018). Look at boundary: A boundary-aware face alignment algorithm. In CVPR.

[Wu and Ji, 2019] Wu, Y. and Ji, Q. (2019). Facial landmark detection: A literature survey. International Journal of Computer Vision, 127(2):115–142.

[Zaheer et al., 2017] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. (2017). Deep sets. In Advances in neural information processing systems, pages 3391–3401.

[Zhang et al., 2008] Zhang, W., Sun, J., and Tang, X. (2008). Cat head detection - how to effectively exploit shape and texture features. In European Conference on Computer Vision, pages 802–816. Springer.

[Zhang et al., 2019a] Zhang, Y., Hare, J., and Prügel-Bennett, A. (2019a). Deep set prediction networks.

[Zhang et al., 2019b] Zhang, Y., Hare, J., and Prügel-Bennett, A. (2019b). Fspool: Learning set representations with featurewise sort pooling.


Appendix A

Hyper-parameters of models

Parameter                              Value
Dimensionality of latent space         512
Dimensionality of hidden layers        512
Learning rate                          0.0003
Inner learning rate                    800
Scaling of repr loss for outer loss    0.1
Inner iterations                       20
Mask as feature                        True

Table A.1: Hyper-parameters of the DSPN training runs on different datasets

Parameter          WFLW       Cat faces
Learning rate      0.0001     0.0001
Optimizer          Adam       Adam
Image size         256x256    256x256
Heatmap size       64x64      64x64
Nesterov           False      False
Scaling factor     0.25       0
Rotation factor    30         0
Flip               True       False

Table A.2: Hyper-parameters of the HRNet training runs on the WFLW and cat faces datasets


Appendix B

Sample of results

B.1 DSPN


Figure B.2: Sample of DSPN predictions on WFLW

(a) Make-up (b) Occlusion (c) Large pose

(d) Illumination (e) Expression (f) Blur


B.2 HRNet

Figure B.5: Sample of HRNet predictions on cat faces


(a) Blur (b) Make-up (c) Large pose

(d) Illumination (e) Expression (f) Occlusion


Appendix C

Reproduction guide

Note: Tested with Anaconda on Ubuntu 18.04.4 LTS and Ubuntu 20.04 LTS.

Setup instructions:

• Install CUDA 10.2.

• Create and activate a conda environment.

• In this conda environment, install torch as described here.

• For the next step, we need to make sure we have version <8.* of gcc and g++. Here is a guide to downgrade if necessary.

• Clone into the fancy branch of neuralnet-pytorch on GitHub:

  git clone -b fancy https://github.com/justanhduc/neuralnet-pytorch.git

• Change directory to ./neuralnet-pytorch/neuralnet_pytorch/extensions/cuda/emd_c/ and open the file emd_kernel.cu in your preferred text editor.

• Find these lines at the beginning of the file:

  #define CHECK_CUDA(x) AT_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor")
  #define CHECK_CONTIGUOUS(x) AT_CHECK(x.is_contiguous(), #x " must be contiguous")

  and replace them with:

  #define CHECK_CUDA(x) TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor")
  #define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")

  (Simply change AT_CHECK to TORCH_CHECK.)

• Browse back to the neuralnet-pytorch base directory and run:

  python setup.py install

• Install other necessary packages:

  pip install matplotlib pandas scipy Pillow tensorboardX h5py tqdm

• Now that everything the code from this project needs is set up, we can install the extended DSPN package. In a fresh directory, clone into https://github.com/FlorianSchroevers/dspn.git. Simply run python train.py --help to see how to continue.
