
MSc Artificial Intelligence

Master Thesis

Face Anonymisation in Face Detector

Training Data

by

Matthew van Rijn

10779353

28th July 2020

36 EC March 2019 - July 2020

Supervisor:

S. Klomp MSc

Examiner:

Prof. dr. C.G.M. Snoek

Assessor:

dr. P.S.M. Mettes

Vinotion B.V.


Abstract

Privacy is an increasing concern for the large-scale collection of faces required to train face detectors. Current face anonymisation methods can generate realistic-looking replacement faces, but these are not designed to be used as training data. In this thesis, we seek to determine the best method for anonymising faces for use as training data for face detectors. As a part of this, we propose a face anonymisation method that uses two loss functions not previously used in face anonymisation, named Detection Loss and Perceptual Loss [1]. The use of these loss functions is designed to optimise the anonymised faces for training a face detector.

We perform experiments to show the impact of the individual loss functions, as well as other components of the face anonymiser, on the anonymised faces. To measure the impact for use as training data, face detectors are trained on the anonymised faces. The performance of these face detectors is compared to ones trained using original faces or faces anonymised using other common anonymisation methods.

The results show that the Perceptual Loss raises the performance of a face detector trained on the anonymised faces significantly. Detection Loss has the opposite effect, lowering the performance of the face detector. We also find that facial landmarks are an essential component of the anonymisation system, due to the way they improve the structure of faces and force variety in them. It is concluded that while all tested anonymisation methods reduce the suitability of faces to be used as training data for a face detector, our method using Perceptual Loss causes the smallest performance loss amongst methods that completely replace the face.


Contents

1 Introduction
  1.1 Research Questions
  1.2 Contributions
  1.3 Outline
2 Related Work
  2.1 Face Detection
    2.1.1 Evaluation
  2.2 Early Anonymisation Methods
  2.3 Replacement Anonymisation
  2.4 Generative Face Anonymisation
    2.4.1 U-Net
    2.4.2 Landmarks
3 Method
  3.1 Input Preparation
  3.2 Generator
  3.3 GAN Loss
  3.4 Detection Loss Function
  3.5 Perceptual Loss Function
    3.5.1 Implementation Details
4 Experimental Setup
  4.1 Datasets
    4.1.1 WIDER FACE
    4.1.2 PIPA
  4.2 Training and Parameters
  4.3 Evaluation
5 Experiments and Results
  5.1 Experiment 1: Detection Loss
  5.2 Experiment 2: Perceptual Loss
  5.3 Experiment 3: Combined Losses
  5.4 Experiment 4: Landmark Influence
  5.5 Experiment 5: Face Resolution
  5.6 Experiment 6: Generator Input Obfuscation
  5.7 Experiment 7: Face Detection
  5.8 Failure Cases


Chapter 1

Introduction

Face detection is one of the oldest and most commonly researched computer vision applications. It is often used as a component of more advanced applications, such as in biometric security systems or in re-identification for automatic tagging of pictures on social media. The face is a human’s most identifiable feature, and face detection applications are therefore very privacy sensitive. Face detection can be used as part of a mass surveillance system, for example.

Since it first became a research subject in the 1990s, face detection has gone through several periods [2]. Early rule-based methods would encode human knowledge of what a face should look like. In later years, methods based on more complex, yet still hand-selected, features and boosted decision trees gained ground. In recent years deep learning has become dominant in the field of face detection, and object detection in general [2].

Deep learning methods generally require far larger datasets than were previously necessary. The advent of deep learning in face detection has therefore increased the demand for large-scale face datasets. The creation of these datasets raises certain privacy concerns, since people in the images are fully recognisable. Obtaining consent from each subject in a dataset with hundreds of thousands of faces is obviously impractical. Modern privacy legislation, such as the EU’s GDPR, stipulates that personal data, including faces, should not be processed without explicit consent unless it is strictly necessary for the application and does not affect the interests or rights of the subject [3]. The use of faces in a face detection dataset would intuitively seem necessary, but concepts such as interests and rights leave room for interpretation.

There is already a wealth of face anonymisation methods [4]. Simple anonymisation methods such as blurring are commonly used on images taken in public, such as in Google’s Street View service. There are also more advanced anonymisation methods that replace faces with completely new ones generated with Generative Adversarial Networks (GANs) [5, 6, 7, 8]. However, these methods have not been designed for use on training data. Their goal is to generate realistic-looking faces, and not to preserve the variety of features important for training a face detector. Of these methods, [7] and [8] were produced in parallel with this thesis.

The creation of an anonymisation method that removes identifiable information from a dataset while preserving these features would be of huge benefit to privacy-minded computer vision companies, or those who want to be on the safe side of privacy laws. It would allow new datasets to be created and published more easily, which would benefit the research field in general. The main focus of this thesis is to explore such methods. The goal is to create an anonymisation method after which the use of original personal data in training datasets can no longer be considered necessary.

The disconnect between the learning goal of the face anonymiser and the needs of the face detector is addressed in this thesis by extending an existing type of GAN-based anonymiser with two additional loss functions. The first is a newly proposed Detection Loss, which is determined by how well an existing detector is able to detect the anonymised faces. This should strengthen features that are important for face detection. The second is the existing Perceptual Loss function [1]. Face anonymisation can be considered a form of image translation, and this loss function is used in other forms of image translation such as sharpening and style transfer. Perceptual Loss should be well suited to the anonymisation of training data, since it values the preservation of high-level features important for face detection, such as eyes and ears, over smaller details such as freckles. Neither of these loss functions has previously been used for face anonymisation.


1.1 Research Questions

The primary goal of this thesis is to research methods of anonymising faces without making them unusable for training face detectors. A new face anonymisation method is proposed for this purpose, which builds on existing GAN-based anonymisers, and integrates face detection into the training process. Specifically, we want to answer the following research questions:

Main Research Question: “How can we anonymise faces without reducing their suitability to train effective face detectors?”

To help answer this question, we pose four subquestions. They are placed in an order where the answers to earlier questions may help to answer later questions.

Question 1: “What is the influence of the different components of a face generator on the generated faces and their value as training data?”

Face anonymisation methods often consist of several (optional) components. For example, the GAN-based anonymiser from [6] has a component that removes the original face, and a component where facial landmarks are inserted. An understanding of what each component is used for is helpful when making changes and finding improvements.

Question 2: “How do the additions of Perceptual and Detection Loss influence the generated faces and their value as training data?”

As noted above, both Perceptual Loss and Detection Loss have properties that could make anonymised faces more suitable for training detectors. This thesis explores to what extent this is true by using these losses separately, together, and as an addition to a GAN-based anonymiser.

Question 3: “Which of the considered anonymisation methods are most suitable for anonymising training data?”

Answering this question is important for the practical application of face anonymisers on training data. There are two measures to consider here. The first is the performance of the face detector trained on the anonymised data, and the second is the level of anonymisation. In most cases there is a tradeoff between the two. There may be multiple suitable methods depending on the amount of anonymisation required by the user.

Question 4: “Can the use of original faces for training face detectors be considered unnecessary?”

This question is asked due to its relevance to the privacy laws described earlier in the introduction. While it may never be possible to prove that the use of original faces is strictly necessary, a well-performing anonymisation method could certainly prove that it is unnecessary. This would, for example, be the case if a face detector trained on anonymised faces performs equally well as one trained on original faces.

1.2 Contributions

This thesis makes three main contributions to the field of face anonymisation:

(i) A new neural GAN-based face anonymisation method using Perceptual and Detection Losses.

(ii) A study of the impact of new and existing face anonymisation methods on the suitability of training data for training face detectors.

(iii) A better understanding of how different face generator components contribute to generated faces.

1.3 Outline

This thesis is structured as follows. Chapter 2 (Related Work) gives an overview of existing face anonymisation and detection methods and their properties. Chapter 3 (Method) introduces the architecture of our proposed anonymisation method, and motivates its design choices. Chapter 4 (Experimental Setup) introduces the datasets used, and explains how we train and evaluate the face anonymisers. Chapter 5 (Experiments and Results) describes the experiments we perform, gives their results, and discusses them. Finally, Chapter 6 (Conclusion) summarises the results and explicitly answers the research questions.


Chapter 2

Related Work

This chapter provides the reader with background information about topics that are relevant in the following chapters. There are two research topics that play a prominent role in this thesis: face detection and face anonymisation. Before we can propose new anonymisation methods, it is necessary to explore what kind of possibilities there are and what methods are commonly used. In order to evaluate a new method for the purpose of face detection, an understanding of how state-of-the-art face detectors work is also required.

The chapter is divided into four sections, the first of which (Section 2.1) covers modern face detectors and the standard evaluation measures used with them. The following three sections cover various types of face anonymisation. Section 2.2 discusses simple methods such as blurring, Section 2.3 looks at methods that replace faces with other existing faces, and finally Section 2.4 goes into detail about modern generative solutions.

2.1 Face Detection

Face detection has gone through several periods, but today deep convolutional neural networks (DCNNs) are dominant. To get a good idea of the impact of anonymisation on face detection, we evaluate the impact using modern methods that perform at, or close to, state-of-the-art level.

As face detection is a subproblem of generic object detection, it shares many of the same challenges, but also many of the latter’s advancements. Face detection methods can broadly be divided into one-stage and two-stage categories. Two-stage detectors have one stage that proposes regions of interest, which are then normalised and classified using a separate network, as in the generic object detector Faster-RCNN [9]. Single-stage detectors on the other hand replace the region proposal stage with dense predefined anchor boxes at various scales and aspect ratios, such as in the SSD [10] detector. Single-stage methods are generally more efficient because there is no region proposal step, and features for the entire image are calculated using a single forward pass. This is at the expense of localisation accuracy, due to the predefined bounding boxes. Not all object detectors fit cleanly into the two categories. For example, FCOS [11] is a single-stage detector that does not use anchor boxes.

For the purpose of face detection, incremental improvements have been made to both the one-stage and two-stage methods. New detectors are often focused on improving one aspect of face detection. The one-stage detector EXTD [12], for example, replaces the large backbones used in previous detectors with a lightweight one to improve CPU and mobile performance. Alternatively, the two-stage Tiny Face detector [13] trains detectors at multiple scales to improve the detection of very small faces.

For the popular face detection dataset WIDER FACE [14], a leaderboard is maintained showing the top-scoring face detectors on this dataset. At the time of writing it is led by the Automatic and Scalable Face Detector (ASFD) [15], though the performance difference with other top detectors [16, 17, 18, 19] is small. Therefore, for evaluating the effects of anonymisation, it may be preferable to use a slightly worse performing detector if it is faster to train or has an open-source implementation.

2.1.1 Evaluation

To evaluate the performance of a face detector its detections must be compared to the ground truth. The procedure for this is fortunately quite standardised, with most studies using a variant of the PASCAL VOC evaluation [20]. Two things are considered in the evaluation of a bounding box: is it positioned correctly, and is it classified correctly? This is done using the standard object detection measure Intersection over Union (IoU) and a classification score threshold, respectively. For evaluation over the whole dataset the Average Precision (AP) is used. These measures are explained below.

The IoU is an overlap measure that, as the name suggests, divides the area of the intersection of two bounding boxes over the area of their union. This produces a value between 0 and 1 where 0 means there is no overlap and 1 indicates the bounding boxes are identical. For PASCAL VOC evaluation, a localisation is considered correct if its IoU with the ground truth is greater than 0.5. An example calculation of the IoU is shown in Figure 2.1.


Figure 2.1: A visual example of an Intersection over Union overlap calculation. The black bounding box is the ground truth and the red bounding box is a prediction.
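As a concrete illustration, the sketch below computes the IoU of two axis-aligned boxes; the (xmin, ymin, xmax, ymax) coordinate convention is an assumption (WIDER FACE annotations themselves use (xmin, ymin, width, height)).

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    # Intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a correct localisation if iou(prediction, ground_truth) > 0.5.
```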

An inherent property of a retrieval task such as face detection is that there will be a tradeoff between precision and recall. In this case it is determined by the classification score threshold. Lowering this threshold will increase the number of true positives, and therefore the recall, at the expense of introducing false positives and decreasing the precision. To get an overall picture of the performance, the AP calculates the precision using various thresholds that result in evenly spaced recall values. For PASCAL VOC, 11 recall values are used between 0 and 1. The AP is the mean of the precision at these 11 recall values, which is an approximation of the area under the precision-recall curve.
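A minimal sketch of this 11-point interpolated AP, assuming precision and recall arrays obtained by sweeping the score threshold from high to low:

```python
import numpy as np

def average_precision_11pt(recalls, precisions):
    """11-point interpolated AP as used in PASCAL VOC evaluation."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        # Interpolated precision: the highest precision achieved at recall >= r.
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap
```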

This evaluation method appears to reward correct classification more than exact localisation. More confident predictions will likely increase the AP, but a prediction with an IoU of 0.95 will not be evaluated as better than one with an IoU of 0.51. This may give an advantage to one-stage detectors with less accurate localisation, but for many applications it is preferable to detect the presence of a face than to locate it perfectly. In some cases the AP is calculated for a range of IoU values, but this is not universal. Finally, it is common to split the evaluation to control for certain face properties, for example scale and occlusion in PASCAL VOC. The properties may also be used to assign an easy, medium or hard label to faces, and evaluate based on that, as is done for WIDERFACE [14].

2.2 Early Anonymisation Methods

The focus of early face anonymisation research was on image editing techniques such as blurring, blocking and pixellation [21]. These methods are very simple to apply, and are therefore used at a large scale to this day. Anonymisation is provided by removing identifying features of the face, or making them less detailed. Therefore, the level of anonymisation these methods provide can change significantly depending on how strongly they are applied. Since face detectors learn to detect faces based on the same features that are obfuscated by these early anonymisation methods, it is unlikely that blurring, blocking or pixellation are suitable for anonymising training data to a high degree. However, since they are commonly used, it makes sense to include them in experiments as a baseline for comparison.


Examples of blurring, blocking and pixellation applied to a face are shown in Figure 2.2. These three methods are explained in more detail below.

Figure 2.2: From left to right, an image and its anonymised versions using blurring, pixellation and blocking.

Blocking

Blocking is an anonymisation method where features are anonymised by replacing them by a solid coloured box. The only information required to perform blocking is the position of the relevant features, for which a face bounding box can be used. It also requires almost no computation. By removing all information, blocking provides the highest possible level of anonymisation. For many applications though, the extra anonymisation is not worth it considering the following drawbacks. Blocked features stand out from the rest of the image and draw a viewer’s attention, which is bad if the images are meant for people to look at. For training data anonymisation it is also bad because a face detector trained on blocked images can only learn to detect black boxes.

Blurring

Blurring is an anonymisation method that smoothes and thereby reduces the detail of image features [21]. This makes it harder to recognise a specific face in an image, but still retains some structural information that can be used to train a detector. Since there is a certain amount of information left about the original features, the level of anonymisation is lower than that of blocking.

The smoothed features are obtained by applying a filter to the desired area. Equation 2.1 is typically used to create the Gaussian filter. The strength of the blur is proportional to σ, the standard deviation of the Gaussian distribution.

G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (2.1)
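A minimal sketch of blurring a face region with a Gaussian filter, using OpenCV; the σ value and the (x, y, width, height) box format are illustrative choices.

```python
import cv2

def blur_face(image, box, sigma=5.0):
    """Blur the face region given by `box` = (xmin, ymin, width, height) in place."""
    x, y, w, h = box
    face = image[y:y + h, x:x + w]
    # A kernel size of (0, 0) lets OpenCV derive the kernel from sigma (Equation 2.1).
    image[y:y + h, x:x + w] = cv2.GaussianBlur(face, (0, 0), sigmaX=sigma)
    return image
```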

Pixellation

Pixellation can be viewed as being in-between blocking and blurring. It effectively lowers the resolution of features to remove information, making them look more generic. Pixellation is applied by placing a grid over the face, and replacing the pixel values within each cell by their mean value. The smaller the grid, the larger the level of anonymisation, with a 1×1 grid being equivalent to blocking. Using a typical grid size the amount of information removed is larger than when blurring with a typical standard deviation, and smaller than with blocking.
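One way to compute the per-cell means described above is to downscale the face to the grid resolution and upscale it again with nearest-neighbour interpolation; a sketch with an illustrative grid size:

```python
import cv2

def pixellate_face(image, box, grid=8):
    """Pixellate a face by averaging pixel values within a grid x grid mosaic."""
    x, y, w, h = box
    face = image[y:y + h, x:x + w]
    # Downscale to the grid resolution (approximate cell means), then upscale
    # with nearest-neighbour interpolation to restore the original size.
    small = cv2.resize(face, (grid, grid), interpolation=cv2.INTER_AREA)
    image[y:y + h, x:x + w] = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
    return image  # grid=1 reduces to blocking with the mean face colour
```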

Full Image

An inherent advantage of these early anonymisation methods is that they can be applied independently of the face position. It is therefore possible to blur, block or pixellate faces at test time, without knowing where they are positioned, simply by applying the obfuscation to the entire image. For visual purposes this may not be beneficial, but for the anonymisation of training data it can be advantageous. If only the face is anonymised, the obfuscation itself becomes an artificial feature and a face detector may learn to associate blurred or pixellated areas with faces. When the detector is then used to detect new (unblurred) faces, it will fail to do so. If the whole image is obfuscated, this specific problem will not occur.


A drawback of obfuscating the complete image as opposed to the individual faces is that the level of anonymisation becomes dependent on the face size. Larger faces will appear more recognisable than smaller ones. To sufficiently anonymise large faces a strong blur or small-grid pixellation is required, but this can make smaller faces disappear completely.

Drawbacks

The level of anonymisation of the early anonymisation methods is also not always sufficient. Whilst blocking logically makes face identification impossible, blurring and pixellation do not prevent machine identification in some cases [6, 21]. Furthermore, it has long been possible to de-blur images using methods such as deconvolution [22], which reveals that a significant amount of detail is still present. This is bad for privacy, but this detail could prove useful to a face detector. Blurring may therefore provide a good balance between protecting privacy and preserving a dataset’s suitability for use as training data.

2.3 Replacement Anonymisation

As digital processing of faces became more popular, a new type of face anonymisation method appeared where the faces are replaced with different faces entirely. These face-replacement methods improve upon the early anonymisation methods discussed in the previous section in two ways. Firstly, by replacing the original face instead of merely deforming it, a greater level of privacy protection is achieved. Secondly, the faces look more natural, which is a benefit if they are meant for display to humans, or if the features are needed to train a face detector.

One common replacement-based anonymisation method is the k-Same algorithm [23]. This method replaces the face with the average of the k closest faces. The closest faces are determined by calculating the Euclidean distance between the faces projected into eigenspace. Unlike methods such as blurring, k-Same produces an actual face with clear eyes, nose, mouth and ears. The level of privacy protection can be increased by increasing k, but this comes at the cost of face variety as it makes the anonymised face approach the dataset mean. A drawback of k-Same is that all faces must be aligned to be able to calculate distances between them and obtain average faces. Privacy protection is also not guaranteed if a person’s face appears multiple times in the dataset.

In the years following, several improvements were made to the k-Same algorithm to address some of its shortcomings. The k-Same-Select [24] algorithm partitions faces using properties such as gender and expression type, and uses faces with the same properties to create a replacement. This improves the variation, and therefore ’data utility’, of the anonymised faces, which is shown using a face classifier. To improve the quality of faces, the k-Same-M algorithm [25] fits generative parametric models to all faces, between which it calculates the distance. This eliminates artefacts caused by misaligned faces, especially for higher values of k.

An alternative to using averaged faces is to replace faces with those of other people directly. This is a method proposed in [26] for anonymising entire bodies. The bodies are segmented and removed from the image. The background is then inpainted, after which the most similar body is placed in the location of the original. Such an approach helps to preserve the variety in the dataset and, because they are real people, produces natural-looking replacements. Whether the segmentation-mask approach is suitable for faces, however, is unclear. This method also requires enough people to consent to be used as replacements.

Yet another type of replacement anonymisation is abstraction [4]. For abstraction, the object is removed and replaced with an abstract representation of it. For example, a face could be replaced with a series of dots to show the positions of the eyes, nose, mouth and ears. This type of anonymisation removes all information but the position of an object, making it unsuitable for applications such as training a face detector.

2.4 Generative Face Anonymisation

In recent years, just like with face detection, deep convolutional neural networks have become successful in face anonymisation. The biggest change is the advent of the Generative Adversarial Network (GAN) [5] and its variants. These neural networks are able to generate new faces from noise, without major privacy concerns.


State-of-the-art GANs for face generation such as [27] produce very detailed high-resolution faces. These faces are not suitable to be used to anonymise faces within a larger image, because they do not blend into the image. Dedicated face anonymisation GANs such as CIAGAN [8], DeepPrivacy [7] and [6, 28] solve this issue by generating new faces from the area around the face instead of random noise, similar to inpainting. The GAN’s discriminator then judges the generated face and the area around it. This results in properly blended faces.

The generative face anonymisation methods [6,7,8,28] typically have a similar generative architecture. Since the face anonymisation technique proposed in Chapter 3 is also based on this architecture, we go into more detail about the shared components in the sections below.

Evaluating the output of a GAN is often done visually. However, there are also some common objective evaluation metrics to evaluate certain aspects of the result. Two of these are the Inception Score (IS) [29] and the Fréchet Inception Distance (FID) [30]. The IS measures both the quality and diversity of a dataset through the class probability distribution produced by feeding images through an Inception network [31]. The FID also evaluates the diversity of a dataset, but instead uses intermediate features from the Inception network.

2.4.1 U-Net

The typical architecture of the generator within a face anonymisation GAN is that of a U-Net [32]. A U-Net is an autoencoder that provides additional skip connections between the encoder and decoder. It consists of a series of convolutional blocks, between which the feature resolution is downscaled in the encoder and upscaled in the decoder. The rescaling can be performed entirely convolutionally. Alternative methods include max-pooling for downscaling, and uppooling or bilinear interpolation for upscaling.

A U-Net’s skip connections work by concatenating the outputs of convolutional blocks in the encoder onto the inputs of blocks in the decoder. These connections allow high-resolution spatial information to be recovered [33], which is useful to blend the generated face with the context and helps maintain the pose of the face through the landmarks.

Figure 2.3: Schematic overview of a U-Net [32] when applied as a generator for face anonymisation.

Initially, U-Net was used for image segmentation. The convolutions were performed without padding, causing the output resolution to be lower than the input resolution. For face anonymisation it is more important to maintain a high output resolution to allow enough detail to be generated. This is why convolutions in anonymisation U-Nets are generally padded. A schematic of a U-Net applied to face anonymisation is shown in Figure 2.3.
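A minimal PyTorch sketch of one decoder step with such a skip connection; the function and argument names are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

# One U-Net decoder step: upscale the features, concatenate the stored output
# of the mirrored encoder block (the skip connection), then apply the next
# padded convolutional block.
def decoder_step(x, encoder_features, conv_block):
    x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    x = torch.cat([x, encoder_features], dim=1)   # skip connection by concatenation
    return conv_block(x)
```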

2.4.2 Landmarks

In addition to the U-Net, another method shared between many GAN-based face anonymisers [6, 7, 8] is the use of landmarks to represent the positions of certain parts of the face. Structure guidance such as through landmarks can improve the quality of complex structures in generated images [34]. Since landmarks determine the facial structure, they can be used to preserve the pose of the original face, or conversely, change it to provide additional anonymisation. Figure 2.4 shows the landmarks as used in [6].

Figure 2.4: An example of landmarks (in white) indicating the positions of the eyes, nose, mouth and face outline. The face is part of the PIPA [35] dataset.

Whilst landmark use is common, the specific way in which they are used varies. In [6] and [7], landmarks are one-hot encoded into a k × width × height matrix, where k is the number of landmarks. The matrix is concatenated to the input image before entering the U-Net. In [8], on the other hand, landmarks are represented as a single-channel image with dots for landmarks. The other main difference is the number of landmarks used. In [6], 68 landmarks are used. This number is reduced in [7] and [8] to make it harder to identify a person through the landmark positions.
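Sketches of the two landmark representations mentioned here; landmark coordinates are assumed to be pixel positions within the crop.

```python
import numpy as np

def landmarks_to_channels(landmarks, height, width):
    """k landmarks as a k x height x width one-hot matrix, as in [6] and [7]."""
    maps = np.zeros((len(landmarks), height, width), dtype=np.float32)
    for i, (x, y) in enumerate(landmarks):
        maps[i, int(np.clip(y, 0, height - 1)), int(np.clip(x, 0, width - 1))] = 1.0
    return maps

def landmarks_to_dot_image(landmarks, height, width):
    """All landmarks as dots in a single-channel image, as in [8]."""
    img = np.zeros((1, height, width), dtype=np.float32)
    for x, y in landmarks:
        img[0, int(np.clip(y, 0, height - 1)), int(np.clip(x, 0, width - 1))] = 1.0
    return img
```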


Chapter 3

Method

This chapter proposes a new face anonymisation method. The method is a member of the GAN-based family of face anonymisers. Existing GAN-based anonymisers are designed to generate natural-looking replacement faces that are able to withstand identification by both humans and computers. Our new method has a slightly different goal: in addition to withstanding identification, faces generated by this new model should be optimised for training a face detector.

To achieve this goal, the new anonymisation method extends a GAN similar to that proposed in [6] with two additional loss functions. The first is the Detection Loss, which measures how detectable a generated face is with a face detector pretrained on unobfuscated images. This is meant to emphasise features in the generated face that are important discriminators for a face detector. The second is the Perceptual Loss [1], which judges whether a generated face has the expected features of a real face. This is meant to allow the removal of identifiable details from faces while preserving the main features of a face.

A general representation of this new anonymisation method is shown in Figure 3.1. Each rectangle represents a component and each component is described in one of the following sections. First, Section 3.1 describes the input of the face anonymiser. Section 3.2 gives the architecture of the U-Net generator. Afterwards, Section 3.3 introduces the GAN Loss. Finally, the face anonymisation contributions, Detection Loss and Perceptual Loss, are described in Sections 3.4 and 3.5 respectively.


Figure 3.1: A schematic overview of the proposed face anonymisation method. Each rectangle represents a component that is explained in more detail in the sections below. The components in the grey shaded area are new contributions.


3.1 Input Preparation

The first step of any generative face obfuscator is to collect and prepare the information given to the face generator. The information in this context refers to the input given to the generator which allows it to generate a suitable replacement for the face being anonymised. For this face anonymisation method we prepare two sources of information for the generator: the context and the landmarks.

Context is a source of information that consists of the areas directly surrounding the face. Its most important function is to allow the generator to blend the face with the surroundings, so that there is no clear border around the anonymised face. In addition, context provides a source of randomness needed to produce distinct faces. Though the exact size of the context varies, it is commonly available to GAN-based anonymisers [6,7]. For this method, we make use of a context area three times the size of the face itself. This area is selected by extending the face’s bounding box 50% to each side, 10% to the top and 90% to the bottom. The reason to extend the bounding box further downward than upward is that the area below the face is typically the body, and therefore more relevant to the face than the area above.

Once the bounding box has been extended, the area is sliced from the image and resized to 128×128. In some cases the head is close to the edge of the image, making some of the context area fall outside the image. These areas are filled by repeating the nearest edge pixel, because this is the most likely colour that the context would have had, had it been in frame. Finally, to ensure sufficient anonymisation of the face, the face area is replaced by solid black.

Figure 3.2: Example of an image input into the GAN face obfuscator before (left) and after (right) preparation. On the left, the annotated bounding box is shown in blue. On the right, the image is cropped to the context area, with the head blacked out and the face landmarks shown in white.
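A NumPy/OpenCV sketch of this preparation step; the function name and the (x, y, width, height) box format are assumptions, and integer bounding boxes are assumed for simplicity.

```python
import numpy as np
import cv2

def prepare_context(image, face_box, out_size=128):
    """Crop and prepare the context area around a face box (x, y, w, h).

    The box is extended 50% to each side, 10% up and 90% down, out-of-frame
    areas are filled by repeating edge pixels, the face itself is blacked out,
    and the result is resized to out_size x out_size.
    """
    x, y, w, h = face_box
    x1, x2 = int(x - 0.5 * w), int(x + 1.5 * w)
    y1, y2 = int(y - 0.1 * h), int(y + 1.9 * h)

    # Pad the image with repeated edge pixels so the extended crop is always valid.
    pad = max(0, -x1, -y1, x2 - image.shape[1], y2 - image.shape[0])
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    crop = padded[y1 + pad:y2 + pad, x1 + pad:x2 + pad].copy()

    # Black out the face area itself to guarantee anonymisation of the input.
    fx, fy = x - x1, y - y1
    crop[fy:fy + h, fx:fx + w] = 0

    return cv2.resize(crop, (out_size, out_size))
```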

The second source of information for the generator is the face landmarks. These are points that represent the positions of fixed points of the face, for example the eyes, nose, mouth and chin. This information can be used by the generator to make a new face with the same pose as the original, something for which the context does not always provide enough information. While this may not be important for, and may even be damaging to, the strength of the anonymisation, it does make the replacements look more natural. More importantly, it guarantees that the distribution of head poses in the original dataset is maintained in the anonymised dataset, which is important when it is used to train a detection model.

A drawback of using landmarks is that they are not annotated in most datasets. This means landmarks must be detected for faces before they can be anonymised. This anonymisation method makes use of the facial keypoint detector from the dlib package, which detects 68 landmarks. These landmarks are represented as 68 one-hot encoded channels. This means the generator input for a 128×128 RGB context area (3 channels) becomes 71×128×128. An example of the input preparation process is shown in Figure 3.2, where the landmarks are shown as white dots for the purpose of visualisation. Given a bounding box, the landmark detector will always return a set of landmarks, even if it is merely a best guess. This allows all faces to be anonymised, including occluded or very small faces, though in some cases with an incorrect pose.
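A sketch of landmark detection with dlib and the assembly of the 71-channel generator input; the model file is the standard dlib 68-point shape predictor (its path is an assumption), and landmark coordinates are assumed to already be expressed in the 128×128 crop space.

```python
import dlib
import numpy as np

# dlib's 68-point shape predictor requires its pretrained model file.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(gray):
    """Return a list of 68 (x, y) landmark tuples per face found by dlib."""
    return [[(p.x, p.y) for p in predictor(gray, rect).parts()]
            for rect in detector(gray)]

def generator_input(context_rgb, landmarks):
    """Stack the 3-channel 128x128 context with 68 one-hot landmark channels."""
    h, w, _ = context_rgb.shape
    channels = np.zeros((68, h, w), dtype=np.float32)
    for i, (x, y) in enumerate(landmarks):   # coordinates in crop space
        channels[i, int(np.clip(y, 0, h - 1)), int(np.clip(x, 0, w - 1))] = 1.0
    rgb = context_rgb.transpose(2, 0, 1).astype(np.float32) / 255.0
    return np.concatenate([rgb, channels], axis=0)   # shape (71, 128, 128)
```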

3.2 Generator

The component responsible for anonymising the faces is the generator. It is given information about the face in the form of context and landmarks and must use these to generate a natural-looking face that is suitable as training data for a face detector. The common U-Net architecture introduced in Section 2.4 is used as a generator for this anonymisation method.

The U-Net consists of a series of convolutional blocks, as shown in Figure 2.3. A single convolutional block is composed of two layers of 3×3 convolutions with Rectified Linear Unit (ReLU) activation [36]. Padding is used to maintain the image resolution on the output, and allow 1:1 copying in the skip connections. Batch normalisation is applied after each convolutional layer [37].

As a type of autoencoder, the U-Net first encodes the input to a bottleneck, which is subsequently decoded to a face. The encoding step consists of five convolutional blocks with 2×2 max-pooling applied between them, where each block doubles the number of feature channels. The output of each block is stored and appended to the input of the opposite convolutional block in the decoding step as a skip connection. The decoding step is the mirror of the encoding step. Each convolutional block halves the number of channels and bilinear upsampling is used to increase the feature resolution, because it is less likely to produce checkerboard artefacts than deconvolution.

The output of the final convolutional block is a 3×128×128 image, with pixels for both the face and the context area. The context pixels are discarded and replaced with the original context, since the discriminator must be able to judge the blending between face and context. Because of this, the generated pixels in the context area do not affect the losses.
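A compact PyTorch sketch of a generator following this description; the base channel width and the output activation are assumptions, and the exact layer counts of the thesis implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    """Two padded 3x3 convolutions, each followed by batch normalisation and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class UNetGenerator(nn.Module):
    def __init__(self, in_ch=71, base=64):               # base channel count is an assumption
        super().__init__()
        chs = [base * 2 ** i for i in range(5)]           # each encoder block doubles the channels
        self.encoder = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.encoder.append(conv_block(prev, c))
            prev = c
        self.decoder = nn.ModuleList()
        for c in reversed(chs[:-1]):                      # mirror of the encoder, halving channels
            self.decoder.append(conv_block(prev + c, c))
            prev = c
        self.out = nn.Conv2d(prev, 3, kernel_size=1)      # 3 x 128 x 128 output

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.encoder):
            x = block(x)
            if i < len(self.encoder) - 1:
                skips.append(x)
                x = F.max_pool2d(x, 2)                    # 2x2 max-pooling between blocks
        for block, skip in zip(self.decoder, reversed(skips)):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = block(torch.cat([x, skip], dim=1))        # skip connection from the encoder
        return torch.sigmoid(self.out(x))                 # output activation is an assumption
```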

3.3 GAN Loss

The GAN Loss is a combination of the adversarial loss and an L1 reconstruction loss, as used in [6]. The general method is shown in Figure 3.3. The purpose of the adversarial loss is to judge whether faces can be recognised as anonymised or original. This is done through a discriminator network, which is given a face with context as input and outputs whether it believes the face to be real or anonymised. By repeatedly training the generator to fool the discriminator, and the discriminator to better distinguish between original and anonymised faces, the generator should eventually produce faces that look just as real as the original ones.


Figure 3.3: An overview of the GAN-loss; a combination of an adversarial loss term and a reconstruction loss term.

The discriminator used is a DCGAN [38]. The structure of a DCGAN discriminator is a series of strided convolutional layers that downsample the image to a single scalar. The convolutional layers use LeakyReLU activation with a slope of 0.2, and batch normalisation [37] is applied between each convolution. The activation of the final score is sigmoid, turning the output into the probability of the face being original.
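A PyTorch sketch of such a discriminator; the number of layers and channel widths are assumptions, since the thesis only specifies strided convolutions, LeakyReLU with slope 0.2, batch normalisation, and a sigmoid output.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """DCGAN-style discriminator: strided convolutions down to a single score."""

    def __init__(self, base=64):
        super().__init__()
        layers, in_ch = [], 3
        for i in range(5):                               # 128 -> 64 -> 32 -> 16 -> 8 -> 4
            out_ch = base * 2 ** i
            layers += [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = out_ch
        layers += [nn.Conv2d(in_ch, 1, kernel_size=4),   # 4x4 feature map -> single scalar
                   nn.Sigmoid()]                         # probability the face is original
        self.net = nn.Sequential(*layers)

    def forward(self, x):                                # x: 3x128x128 face with context
        return self.net(x).view(-1)
```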

The other component of the GAN Loss is reconstruction loss, which is a simple L1 loss between the original face and the anonymised face. The reconstruction loss is valuable because it helps the generator quickly produce something face-like, stabilising the adversarial training. The reconstruction loss is not a focus of this method and is not subject to experimentation in Chapter 5.


3.4 Detection Loss Function

The Detection Loss is a loss function we propose for anonymisation that measures how detectable an anonymised face is by passing it to a face detector. The idea behind this loss is that a face that is easily detected by a face detector is likely to contain the features important for training a face detector. In this thesis we have a face detection problem and a face anonymisation problem, and Detection Loss unites these into one system by allowing information to flow from a face detector into the face generator. This should cause the important features to be automatically emphasised.

The Detection Loss is based on the MultiBox Loss of the Dual-Shot Face Detector (DSFD) [19], the same detector that is used throughout this thesis. Like most object detectors, DSFD uses a loss function with confidence and location terms when comparing to the ground truth. Equation 3.1 shows the Detection Loss calculation, where det(ŷ) are the DSFD predictions for the anonymised image, and gt(y) is the ground truth. These values are used by the DSFD MultiBox Loss L_MB, resulting in the Detection Loss L_d. Detection Loss is used to assist the GAN, and is weighted by factor λ_d before being summed with the overall loss.

\mathcal{L}_d = \mathcal{L}_{MB}(\mathrm{det}(\hat{y}), \mathrm{gt}(y)) \qquad (3.1)

While the other loss functions of our anonymisation method work on individual faces, face detectors are designed to detect within whole images. If we only input the face, it would be resized to a (large) fixed size and the detection would be trivial. This may cause an emphasis on features that are important for detecting large faces, and hurt the small-face detection performance of a face detector trained on the anonymised data. For this reason, anonymised faces are input into DSFD with the same size and position as the original faces.

One thing that should be avoided in the calculation of Detection Loss is the presence of other faces in the image passed to the face detector. If DSFD detects these, they will interfere with the loss calculation, and make it more computationally expensive. To avoid this, the anonymised face is not placed in the original image but instead rescaled to the size of the original face together with the context area, after which the edge pixels are repeated up to the edge of a canvas with the same size as the original image. This way we can be sure there is exactly one face to be detected, except in the rare case when the context area contains another face. An overview of the Detection Loss system is shown in Figure 3.4.


Figure 3.4: A schematic overview of the Detection Loss calculation.
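A NumPy sketch of this canvas construction; in training the same operation would need to run on tensors so that gradients can flow back to the generator, and the function and argument names are illustrative.

```python
import numpy as np
import cv2

def place_on_canvas(face_with_context, context_box, canvas_h, canvas_w):
    """Rescale the generated 128x128 face-with-context patch to its original
    size and position, then repeat its edge pixels up to the image border so
    the canvas matches the original image size."""
    x, y, w, h = context_box
    patch = cv2.resize(face_with_context, (w, h))
    return np.pad(
        patch,
        ((y, canvas_h - y - h), (x, canvas_w - x - w), (0, 0)),
        mode="edge",              # repeat the edge pixels out to the border
    )
```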

3.5 Perceptual Loss Function

The Perceptual Loss function [1] measures the difference between the intermediate features of two images, where the features are extracted from a pretrained network. Intermediate features represent higher-level entities than the raw pixels used in, for example, a reconstruction loss, and are discriminative for the goal the network was trained for. Therefore, if features are extracted from a classification network, the Perceptual Loss will emphasise high-level properties that are discriminative for classification, over small details, such as those that risk identifying an individual. This should make the loss function well-suited to the problem we are addressing.

The core of the Perceptual Loss is the network from which features are extracted, which is called the Loss Network. In the original paper, the Loss Network is the backbone of a generic image classifier. Such a backbone is very large, and due to the similarities between classification and detection, should provide plenty of features that are discriminative for a face. For this reason we make use of the same Loss Network, details of which are given in Section 3.5.1.

A useful feature of Perceptual Loss is the option to adjust the place from which the features are extracted. Depending on the Loss Network, features may be extracted from one or more different places to adjust to the desired effect. Early layers in the Loss Network will typically contain simple features at a high resolution, whilst later layers will contain more complex features at lower resolutions. Intuitively, the use of features from later layers will therefore result in more abstract faces, which may be beneficial for the level of anonymisation, but detrimental to the face detector.

One risk of using Perceptual Loss is that it may lead to a decrease in variety. Since it values most greatly the presence and positioning of high-level features, the generator may learn to produce generic versions of these for all faces, influenced by the mean of the dataset. This loss of variety could damage the performance of a face detector trained on the anonymised dataset.

In this thesis the primary purpose of Perceptual Loss is to assist the GAN, yet it fulfils a similar purpose to the GAN’s discriminator, namely judging whether generated faces are visually correct. This means the Perceptual Loss can also be used together with a generator to completely replace the GAN. Without a discriminator and the adversarial training process, the anonymisation model is greatly simplified. Whether this is advantageous will be subject to experimentation in Chapter 5.

Figure 3.5: Perceptual Loss system overview. The figure shows the original (top) and anonymised (bottom) faces being fed through a pretrained image classification backbone called the Loss Network. The mean difference between the features of the two faces at different depths (L0-L4) forms the Perceptual Loss.

3.5.1 Implementation Details

The Loss Network used for Perceptual Loss in our anonymisation method is a VGG-16 backbone [39] pretrained on the ImageNet dataset [40], just as in the original paper [1]. A VGG-16 network consists of five convolutional blocks and one fully connected block. Due to max pooling, the resolution of the feature maps is reduced to half the width and height between each block. The fully-connected layers at the end of the VGG-16 are not used in the Loss Network.

\mathcal{L}_{p_x} = \frac{1}{CHW} \left\| \phi(\hat{y}) - \phi(y) \right\|_2^2 \qquad (3.2)


Figure 3.5 shows how features are extracted after each of the five convolutional blocks, which are numbered layer 0 through layer 4. As calculated using Equation 3.2, the Perceptual Loss at a single layer is the Euclidean distance between the feature representations of the original and anonymised images. In this equation, C, H and W represent the dimensions of the extracted feature map ϕ, y is the original face, and ŷ is the anonymised face. The all-layer Perceptual Loss L_p used by our proposed anonymisation method is the mean of these per-layer losses; by default all five layers are included (see Section 4.2).
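A PyTorch sketch of this loss using the torchvision VGG-16; exactly where "after each convolutional block" falls in the layer list is an assumption (here the features are tapped after each max-pooling layer).

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """Mean of the per-layer loss of Equation 3.2 over five depths of a frozen VGG-16."""

    # Indices in torchvision's vgg16().features after each of the five blocks
    # (taken here after the max-pooling layers; an assumption).
    TAPS = (4, 9, 16, 23, 30)

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False        # the Loss Network itself is never trained
        self.vgg = vgg

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.TAPS:
                feats.append(x)
        return feats

    def forward(self, anonymised, original):
        loss = 0.0
        for f_hat, f in zip(self._features(anonymised), self._features(original)):
            # (1/CHW) * ||phi(y_hat) - phi(y)||_2^2 is the mean squared feature difference.
            loss = loss + torch.mean((f_hat - f) ** 2)
        return loss / len(self.TAPS)       # mean over layers 0-4
```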


Chapter 4

Experimental Setup

Several experiments are defined to test the anonymisation method newly proposed in Chapter 3 and to help answer the research questions posed in the introduction. Before performing these experiments, important decisions must be made regarding the datasets and parameters used, and how the results should be evaluated. This chapter covers those decisions. In Section 4.1 the chosen datasets are introduced and analysed. Then, in Section 4.2, the training procedure and parameters are given. Finally, Section 4.3 describes how the results are evaluated.

4.1 Datasets

In this thesis two model types that require training data of faces are trained. For the face detector the WIDER FACE dataset [14] is used. This dataset is the logical choice for training the face detector because it is large and contains a great variety of different settings and face sizes. It is also widely used in face detection, which allows for the results to be compared more easily with the state of the art.

To anonymise the WIDER FACE dataset a face anonymiser must be trained. If this anonymiser is also trained on WIDER FACE, there is a risk of identifying information leaking into the anonymised dataset through the generator. Another reason not to use WIDER FACE for the anonymiser is that most faces are small, which means they have little detail, making landmarks hard to detect.

The dataset chosen to train the anonymiser is the PIPA dataset [35]. This dataset is chosen because it has high-resolution faces that allow the generator to learn to generate detailed faces, making it easy to detect landmarks. It is also the dataset used in [6], the face anonymiser to which ours is most similar.

When using two different datasets the domain gap must be considered. This gap is caused by differences such as face size, pose distribution and context type. For our anonymisation method, we do not believe this should be a problem. As a part of the input preparation, all faces are resized to the same size. This means that to the anonymisation model, there is no difference between a large face that has been downscaled and a face that was already small. By downscaling a dataset of large faces, the generator is trained with the highest-quality examples. Both datasets also have a large variety of poses and context types.

In this section a short analysis is given of both datasets. This allows for better interpretation of the results of the experiments. WIDER FACE is analysed in Section 4.1.1 and PIPA is analysed in Section 4.1.2.

4.1.1 WIDER FACE

Split        Images    Faces
Training     12 876    157 021
Validation    3 226     39 123
Test         16 101    197 559
Total        32 203    393 703

Table 4.1: The number of images and faces in each of the WIDER FACE splits.

The WIDER FACE dataset [14] is a dataset containing bounding box-annotated faces, which is used as the target dataset for the anonymisation method described in Chapter 3. The dataset contains three splits: training (40%), validation (10%) and test (50%). The training split is anonymised and used to train a face detector that is evaluated on the original validation set. The annotations for the test split are withheld by the authors, so the validation set is used as the test set in the experiments.

In Table 4.1 the sizes of the dataset splits are shown. Figure 4.1 shows an example image from the dataset. In the remainder of this section the properties of the WIDER FACE dataset are analysed. This is important to be able to understand the results of experiments that are related to these properties.

Figure 4.1: Example image from the WIDER FACE dataset with face bounding boxes shown in green.

Qualitative Properties

In addition to a bounding box of the face in the format (xmin, ymin, width, height), each face is also annotated with five properties that help to determine its quality, shown below. The percentages show the proportion of examples that exhibit each property.

• Blur, indicating whether the face appears blurred when viewed close up. (86.4%)
• Occlusion, indicating whether part of the face is occluded. (40.1%)
• Illumination, indicating whether the face is extremely lit. (5.3%)
• Pose, indicating whether the face has an irregular pose. (3.9%)
• Expression, indicating whether the face has an exaggerated expression. (1.1%)

The properties blur and occlusion are indicated with a value of 1 or 2, which tells us the strength of the effect.

Size

The size of the faces is an important property of a face dataset. Depending on the source of the images this can vary wildly. Images of large crowds will generally have very small faces, while images of family gatherings will have much larger ones. The WIDER FACE dataset sources its images from 61 event types, and boasts “a high degree of variability in scale, pose and occlusion”.

Of the 61 event types, the three with the largest mean face size are pictures from press conferences, pictures of couples and pictures of surgeons. These categories have a mean face size of approximately 200×200px. The smallest faces are found in pictures of people marches, demonstrations and parades, with mean face sizes of about 30×30px. The mean face size of the entire dataset is 62×62px, and the median size is 18×18px.

In general, the smaller the faces in an event category, the more examples of them there are. This is shown by Figure 4.2. The figure also shows that the median face sizes are significantly smaller than the mean ones, suggesting that a large proportion of the dataset consists of small faces. This can also be seen in Figure 4.4, where the distribution of face sizes is shown.


Figure 4.2: The number of face examples in each of 61 event categories, compared to the mean and median face sizes.

Figure 4.3: The distribution of face sizes amongst the three difficulty splits of the WIDERFACE validation set.

Typically, evaluation on this dataset is split by difficulty. Figure 4.3 shows that there is a strong correlation between difficulty and size, though there is some overlap. This suggests that in addition to size, qualitative properties such as pose, illumination or occlusion play a role in determining the difficulty. In our experiments, we follow the convention of evaluating by difficulty.

In addition to the overall distribution of face sizes, Figure 4.4 also shows the distribution for images exhibiting the properties described in Section 4.1.1. It shows that while a large proportion of the faces are small, there is a decent number of medium- to large-scale faces in the dataset as well. The faces that appear blurry follow a similar distribution, since this property is present on a large majority of faces. However, faces that are not blurry are most commonly medium sized. This is not surprising, because it is hard to fit enough information in a small face for it not to appear blurry.

The figure also shows that occlusion is present slightly more in small faces than in large ones. This is likely because many of the smallest faces are from crowd photographs, and in these people often occlude each other. Finally, irregularities in expression, pose and illumination appear slightly more frequently in medium and large sized faces than in small ones. The effect is strongest for exaggerated expressions.


Figure 4.4: Distribution of examples with various properties over different face sizes. In each subplot the overall distribution is shown for comparison in blue. The x-axis is the square root of the number of pixels in the face bounding box, so a 60×40 face would appear under 48, since ⌊√(60 · 40)⌋ = 48.

4.1.2 PIPA

Figure 4.5: A typical example of an image in the PIPA dataset. Note that the image has a high resolution and only a few faces.

The PIPA dataset [35] is a dataset of social media images used to train our face anonymiser. Figure 4.5 shows an example of an image from this dataset. Because most of the images come from small gatherings, they contain on average fewer, but higher-resolution, faces than the images in WIDER FACE. This means PIPA has many detailed faces, which are well-suited for use as training data for a face anonymiser. PIPA has previously been used in this capacity in [6, 28].

                      All Faces                       Faces with Landmarks
                      Train + Validation   Test       Train + Validation   Test
Number of Images      25184                758        12498                356
Number of Faces       38062                1671       15903                437
Mean Face Size        156px                           195px
Median Face Size      131px                           167px

Table 4.2: The size of the PIPA dataset by data split, as well as the mean and median face sizes. On the left side are the overall statistics, and on the right are the statistics of faces for which landmarks could be detected.

Table 4.2 gives a number of statistics on the dataset. Compared to WIDERFACE, the number of images is similar, but the number of faces is far smaller. This is because an average image in PIPA contains approximately 1.5 faces, whilst an average image in WIDERFACE has more than 12. The numbers for the train and validation sets are combined, because we merge the validation set into the training set. The mean and median face sizes are calculated over the whole dataset. Compared to the WIDERFACE mean and median face sizes of 62px and 18px respectively, these are much larger.

The dataset does not come with annotated face landmarks, so these are detected using the method from Section 3.1. Because we want the training examples to be as high-quality as possible, we retain only faces directly detected by the landmark detector, without being given a bounding box. This ensures that the landmarks used for training are almost always correct. In Table 4.2, these are referred to as ‘Faces with Landmarks’.

On average, these ‘Faces with Landmarks’ are larger than the dataset as a whole. Figure 4.6 shows the distribution of face sizes for both the whole dataset and this subset. In this histogram we see that most faces smaller than 100×100px are removed using this method, while most larger faces are kept. This should not negatively impact the anonymisation of small faces, because all examples are resized anyway.

Figure 4.6: A histogram of face sizes in the PIPA dataset, showing the distribution of all faces in blue, and the distribution of those for which keypoints could be detected in orange.


4.2 Training and Parameters

There are a number of parameters that affect the training of our anonymisation method. These include typical parameters such as the learning rate and number of training iterations, but also the relative weights of the various loss functions. There are too many parameter values to empirically optimise each one, so where possible the same values are used as in previous works. For parameters without a known value from previous work, the value used is obtained through trial and error. Only the parameter values directly related to our contributions, such as the new losses, are determined using formal experiments in Chapter 5.

Symbol  Function                   Value      |  Symbol  Function                      Value
λd      Detection Loss weight      12         |          Batch size                    64
λp      Perceptual Loss weight     1          |          Detection Loss interval       5
λg      GAN Loss weight            1          |  β1      Adam parameter                0.9
        Use Landmarks              Yes        |  β2      Adam parameter                0.999
        Input resolution           128×128    |  αd      Discriminator learning rate   2·10⁻⁵
        Input obfuscation type     block      |          Reconstruction loss ratio     50:1
        Training iterations        10000      |          Scheduler interval            20%
αg      Generator learning rate    5·10⁻⁴     |          Scheduler multiplier          ½
Table 4.3: Table of parameters that affect the training of the anonymiser. The parameters on the left are subject to change in the experiments, whilst the ones on the right are fixed at all times.

An overview of parameters and their values is shown in Table 4.3. The values shown are the defaults; in a large number of experiments one or more of them are changed. The parameters for which this happens are listed in the first group of the table, whilst those that remain fixed throughout the experiments are listed in the second group.
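For reference, the defaults from Table 4.3 could be gathered in a single configuration object; this is only an illustrative sketch and the field names are our own.

```python
from dataclasses import dataclass


@dataclass
class AnonymiserConfig:
    # Parameters varied in the experiments (first group of Table 4.3)
    lambda_det: float = 0.5          # Detection Loss weight
    lambda_perc: float = 1.0         # Perceptual Loss weight
    lambda_gan: float = 1.0          # GAN Loss weight
    use_landmarks: bool = True
    input_resolution: int = 128
    input_obfuscation: str = "block"
    training_iterations: int = 10000
    lr_generator: float = 5e-4

    # Fixed parameters (second group of Table 4.3)
    batch_size: int = 64
    detection_loss_interval: int = 5
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    lr_discriminator: float = 2e-5
    reconstruction_loss_ratio: float = 50.0
    scheduler_interval: float = 0.2   # fraction of iterations between LR steps
    scheduler_multiplier: float = 0.5
```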

Three of the most important parameters are the loss function weights (λg, λd and λp), which weight the individual loss functions within the total loss function L, as shown in Equation 4.1. The default values for λd and λg are chosen based on the results of experiments. The same is true for the use of landmarks, the input resolution and the input obfuscation type.

L = λgLg + λdLd + λpLp    (4.1)
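A minimal PyTorch-style sketch of this weighted combination, assuming the individual loss terms are computed elsewhere and using the defaults from Table 4.3:

```python
import torch


def total_loss(loss_gan: torch.Tensor,
               loss_det: torch.Tensor,
               loss_perc: torch.Tensor,
               lambda_g: float = 1.0,
               lambda_d: float = 0.5,
               lambda_p: float = 1.0) -> torch.Tensor:
    """Weighted sum of the three generator losses (Equation 4.1)."""
    return lambda_g * loss_gan + lambda_d * loss_det + lambda_p * loss_perc
```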

The generator and discriminator learning rates are both affected by a learning rate scheduler, which halves the learning rate (scheduler multiplier) after every 20% of the total training iterations (scheduler interval). The optimiser for both is Adam [41], with standard values for β1 and β2. The batch size is chosen to be the largest power of two that fits in memory. The number of iterations is chosen because it produces reasonable face quality in a reasonable amount of time.
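As a sketch, this optimiser and scheduler setup could look as follows in PyTorch; `generator` and `discriminator` stand for the networks from Chapter 3, and we assume the schedulers are stepped once per training iteration.

```python
import torch


def build_optimisers(generator: torch.nn.Module,
                     discriminator: torch.nn.Module,
                     total_iterations: int = 10000):
    """Adam optimisers plus step schedulers that halve the learning rate
    after every 20% of the total training iterations."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=5e-4, betas=(0.9, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-5, betas=(0.9, 0.999))

    step = max(1, int(0.2 * total_iterations))
    sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=step, gamma=0.5)
    sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=step, gamma=0.5)
    return opt_g, opt_d, sched_g, sched_d
```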

Each loss function has an additional parameter that affects it. For the Detection Loss this is the Detection Loss interval: due to its computational cost, the Detection Loss is not applied at every iteration, but only once per interval. A component of the GAN Loss is the reconstruction loss; the ratio between the reconstruction loss and the adversarial loss is set by the reconstruction loss ratio. The default value of 50 brings this loss to a similar magnitude as the adversarial loss, and is the same as in [6]. One reason for this high value is that the reconstruction loss is calculated over the entire generator output, of which only the face area changes and contributes to the loss. For the Perceptual Loss, in addition to the weight λp, the VGG layer(s) from which features are extracted must be chosen. By default the mean over the losses of layers 0, 1, 2, 3 and 4 is used.
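The sketch below illustrates one common way to implement such a multi-layer VGG Perceptual Loss, splitting a pretrained VGG19 after each of five ReLU blocks; the exact backbone, split points and distance function used in the thesis may differ.

```python
import torch
import torch.nn as nn
import torchvision


class VGGPerceptualLoss(nn.Module):
    """Perceptual loss over features from five VGG19 slices ('layers' 0-4).

    Inputs are assumed to be normalised with ImageNet mean and std.
    """

    def __init__(self, layer_ids=(0, 1, 2, 3, 4)):
        super().__init__()
        feats = torchvision.models.vgg19(pretrained=True).features.eval()
        # Five slices, each ending just after a ReLU block (relu1_1 ... relu5_1).
        bounds = [0, 2, 7, 12, 21, 30]
        self.slices = nn.ModuleList(
            nn.Sequential(*feats[bounds[i]:bounds[i + 1]]) for i in range(5)
        )
        for p in self.parameters():
            p.requires_grad = False
        self.layer_ids = layer_ids
        self.criterion = nn.L1Loss()

    def forward(self, generated, target):
        x, y, losses = generated, target, []
        for i, block in enumerate(self.slices):
            x, y = block(x), block(y)
            if i in self.layer_ids:
                losses.append(self.criterion(x, y.detach()))
        return torch.stack(losses).mean()
```

With `layer_ids=(1,)` this reduces to a single-layer variant comparable to the layer 1 setting used in the experiments.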

4.3 Evaluation

Chapter 5 contains experiments on both face anonymisation and face detection. The face anonymisation experiments are primarily evaluated visually, because humans are good at spotting patterns of difference that may be caused by new loss functions or other changes to the model. We have selected six faces from the PIPA test set with various poses and context types, shown in Figure 4.7, to showcase our observations. Note that the observations are based on faces from the entire test set, not only those shown in the figures. To complement the subjective visual evaluation, the quality and variation of faces is measured using the objective Inception Score introduced in Section 2.4. For all experiments the faces shown are quite small, so it is best to view the results digitally and zoomed-in.
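For reference, the Inception Score can be computed as IS = exp(E_x[KL(p(y|x) || p(y))]); the sketch below is a simplified single-split version using the ImageNet-pretrained Inception v3 network from torchvision.

```python
import torch
import torch.nn.functional as F
import torchvision


@torch.no_grad()
def inception_score(face_batches, device="cuda"):
    """Simplified Inception Score over batches of anonymised faces.

    `face_batches` yields tensors resized to 299x299 and normalised for the
    ImageNet-pretrained Inception v3 network.
    """
    net = torchvision.models.inception_v3(pretrained=True).eval().to(device)
    probs = []
    for batch in face_batches:
        logits = net(batch.to(device))
        probs.append(F.softmax(logits, dim=1).cpu())
    probs = torch.cat(probs)                      # p(y|x) per face
    marginal = probs.mean(dim=0, keepdim=True)    # p(y)
    kl = (probs * (torch.log(probs + 1e-12) - torch.log(marginal + 1e-12))).sum(dim=1)
    return torch.exp(kl.mean()).item()
```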

The experiments concerning face detection are evaluated using the Average Precision, introduced in Section 2.1.1. This objectively measures the performance of a face detector.
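For reference, a sketch of the all-point interpolated Average Precision over a set of scored detections; the IoU-based matching of detections to ground-truth faces is assumed to have been done beforehand.

```python
import numpy as np


def average_precision(scores, is_true_positive, num_ground_truth):
    """All-point interpolated AP: area under the precision-recall curve.

    `scores` are detection confidences; `is_true_positive` marks whether each
    detection matched a previously unmatched ground-truth face (e.g. IoU >= 0.5).
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / max(num_ground_truth, 1)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # Make precision monotonically non-increasing (precision envelope).
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Example: three detections, two ground-truth faces.
# average_precision([0.9, 0.8, 0.3], [1, 0, 1], num_ground_truth=2) -> ~0.83
```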

Figure 4.7: The six faces from the PIPA test set selected to show the results of the experiments (top), and their counterparts prepared for generator input (bottom).


Chapter 5

Experiments and Results

In this chapter we perform seven experiments to help us answer the research (sub)questions posed in the introduction. The first six experiments concern the proposed face anonymiser and its components. The effects of Detection Loss and Perceptual Loss on the anonymised faces are determined in experiments one, two and three, and provide answers to subquestion two. Experiments four, five and six cover the effects of the landmarks, resolution and input obfuscation respectively, helping to answer subquestion one. Finally, experiment seven determines the effect of all tested variants of our anonymisation method on the performance of a face detector. The results of this are used to answer subquestion three.

5.1 Experiment 1: Detection Loss

What are the effects of the face Detection Loss on the anonymised faces?

In Chapter 3, a Detection Loss term is proposed to incorporate face detection performance as part of the generator’s training goal. The intuition behind this is that features that are heavily relied on by a face detector will be emphasised in the generated images. This experiment is designed to show whether this happens, and if so what the emphasised features are.

Setup

The first step of this experiment is to validate that the Detection Loss term is working as expected. This is done by training a face anonymiser using only this loss. The resulting anonymiser theoretically maximises the detectability of the face, so we expect it to generate something that resembles a face, though possibly abstractly. Another advantage of training an anonymiser using only Detection Loss is that it shows us what features are considered important for detection, which could help us spot and explain the differences when trained together with the adversarial loss. Using only Detection Loss, the face generator has converged by 500 iterations, so it is not trained for longer than this.

After the Detection Loss is validated, it is used in conjunction with the GAN from Section 3.3. Due to practical limitations caused by the computational expense of the Detection Loss, it is evaluated only on every fifth iteration. The biggest challenge is to find a balance between the loss terms where both have sufficient influence and the training remains stable. To this end, three anonymisation models are trained, using Detection Loss weights (λd) of 0.1, 0.5 and 1. Higher values result in training instability, possibly due to the effective learning rate becoming too high.
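A sketch of how this interval can be realised in the generator's training loop; the function and batch-field names are our own, and the loss functions themselves are assumed to be defined as in Chapter 3.

```python
def generator_loss(iteration, generator, batch, gan_loss_fn, detection_loss_fn,
                   lambda_g=1.0, lambda_d=0.5, detection_interval=5):
    """Compute the generator loss for one iteration; the Detection Loss term
    is only evaluated every `detection_interval` iterations to limit its
    computational cost."""
    anonymised = generator(batch["input"])
    loss = lambda_g * gan_loss_fn(anonymised, batch)
    if iteration % detection_interval == 0:
        loss = loss + lambda_d * detection_loss_fn(anonymised, batch["face_boxes"])
    return loss
```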

Results

In Figure 5.1 seven faces have been anonymised using the model trained only with Detection Loss. To a human, the results look like faces, though they do not look natural. Several facial features are clearly represented: eyes, nose, lips, eyebrows and hair. Some face edges are also visible. Ears are only visible in some of the examples.

In addition to the facial features with the same pose as the original images, secondary faces with different poses are visible in most examples. This could be a consequence of maximising the detectability using a face detector trained to detect faces with many poses. The face detector is never explicitly taught that a face should have only one set of eyes, nose and mouth, and may therefore consider a face with two sets more ‘face-like’.

One thing the Detection Loss does not appear to encourage is proper blending between the generated face and the context area. There is a clear and sudden border between the two. The context area is available to the Detection Loss function, so there is no technical reason that it could not blend the border, but it appears that this is not sufficiently valued by this face detector. In conclusion, these images show that the Detection Loss is functioning broadly as expected.

Figure 5.1: Examples of faces from the PIPA test set (top) and their anonymised variants using the detection-loss-only anonymiser (bottom).

The faces generated by the three anonymisation models using Detection Loss are shown in Figure 5.2. The base GAN images are shown for reference with λd = 0. A significant visual difference is caused by the Detection Loss. In the images with higher weight, the shapes of features such as the eyes look similar to what we saw in Figure 5.1. The faces generated with Detection Loss look less natural than those generated without, but Figure 5.1's detection-optimised faces suggest that this may not be a problem for training a face detector.

The addition of Detection Loss also causes the blending between face and context to become worse, just as was seen when using Detection Loss alone. The strength of the Detection Loss' effects scales with the weight of the loss. At λd = 0.5 and above the effects are very pronounced, whereas at λd = 0.1 they are far more subtle.

Figure 5.2: Examples of faces from the PIPA test set and their anonymised variants using the anonymisers trained with Detection Loss. The Detection Loss weight is λd.

Model            Inception Score
GAN Only         4.863
λd = 0.1         4.875
λd = 0.5         4.939
λd = 1           4.898
Detection Only   3.464

Table 5.1: Inception Scores for faces generated by the four anonymisation models using Detection Loss.

Finally, the Inception Scores are shown in Table 5.1 for faces anonymised by the face anonymisers trained for this experiment. The addition of Detection Loss does not significantly change the score, even when using a high weight. This suggests that the Inception network still recognises the new faces as objects, though not necessarily face-related. For this reason the similar scores do not necessarily lead to similar face detection performance in experiment 7. The Inception Score using only Detection Loss is significantly lower, but this is expected because the faces look unlike anything normally found in a picture, and have little variety.

5.2 Experiment 2: Perceptual Loss

What are the effects of the Perceptual Loss on the anonymised faces?

In addition to the Detection Loss, we propose to use Perceptual Loss for anonymisation, as described in Chapter 3. The reason for its inclusion is similar to that of Detection Loss, in that it will emphasise certain features. The specific features that are emphasised will differ depending on the layer(s) from which features are extracted. This experiment will show what features these are.


The Perceptual Loss uses features from an ImageNet-pretrained VGG backbone, which can be extracted at various depths. To find out what feature extraction depth, or combination thereof, is most suitable, several depths and combinations are compared. To test whether, as hypothesised in Section 3.5, Perceptual Loss can be used as a standalone loss, faces are generated both with and without the GAN Loss of Section 3.3.

Setup

Just as with the Detection Loss, it is helpful to first train face anonymisers using only Perceptual Loss to see what kind of facial features result from optimising only this loss function. Doing so also helps to select the feature extraction layer(s). Six face anonymisers are trained: five use features from VGG ReLU layers 0, 1, 2, 3 and 4 respectively, and one uses the mean of the Perceptual Loss over all five. Training using only the Perceptual Loss is more stable than with GAN Loss, so a learning rate (αg) of 10⁻³ is used.

Subsequently, several models are trained with both Perceptual and GAN Loss. Just as with Detection Loss, the challenge is to find the right balance between them. Therefore models are trained using three Perceptual Loss weights: λp = 0.1, 0.5 and 1. To restrict the number of combinations tested to a practical number, this is not done for all of the Perceptual Loss layers. Anonymisation models with both GAN and Perceptual Loss are only trained for layer 1 and all layers combined, because these give good results in the previous step. This means six face anonymisers are trained in total for this experiment.

Results

Examples of faces generated using only the Perceptual Loss are shown in Figure 5.3. This figure shows that the choice of feature layer makes a big difference to the result. At layer 0 the face itself is high-quality, but other parts of the head such as the hair and ears appear simple and blurry. The faces from layer 1 are similar, except that they are noticeably better blended with the context. The blurriness using these layers occurs mostly at the top of the head, where there are no landmarks to assist the generator. Starting at layer 2, some more structure starts to appear in the hair and around the edges of the head. At this depth imperfections in the structure of the face start to appear, showing the effects of a lower feature resolution. This becomes steadily worse over layers 3 and 4 until the generated faces barely look human anymore. Finally, the bottom row displays examples using all five feature layers. This results in a balance where the structure remains intact, the hair is quite detailed and the blending is reasonably good.

In all images the reproduction of the pose from the original is accurate. To a human viewer the level of anonymisation is also good. It is on the basis of these results that layer 1 and the combination of all five layers were selected for further experimentation: layer 1 for its excellent face quality and blending, and the combination of all layers because it does well overall.


Figure 5.3: Examples of generated faces using anonymisers trained with layers 0-4 of Perceptual Loss only. The bottom row shows faces generated using all five layers combined. The original faces are shown for comparison.

The results of combining Perceptual Loss with GAN Loss can be seen in Figures 5.4 and 5.5. What is immediately apparent is that Perceptual Loss does not have as big a visual impact as Detection Loss. This is logical because, as seen above, both Perceptual Loss and GAN Loss alone generate natural-looking faces, so there will not be such extreme differences.

The most visually distinct set of faces are those trained using Perceptual Loss layer 1 (Figure 5.5) and λp = 1. Here we see the same simplicity or ‘cleanness’ that we saw in the early layers of Perceptual Loss only. An explanation of why these faces are more distinct than their λp = 1 counterparts using all Perceptual Loss layers (Figure 5.4) could be that the all-layer Perceptual Loss faces are more similar to the GAN-only ones than the layer 1 Perceptual Loss faces.

Just as for Detection Loss, Inception Scores are calculated for the faces anonymised with the Perceptual Loss models, and are shown in Table 5.2. In general, the faces anonymised with Perceptual Loss models have slightly lower scores than those anonymised without, or with Detection Loss. The differences are minor, but it could be that Perceptual Loss leads to the faces becoming more similar. The biggest outlier is the model using all-layer Perceptual Loss with λp = 0.5, though visually there is nothing obvious that explains this lower score.


Figure 5.4: Examples of generated faces using the GAN + Perceptual Loss (all layers) anonymiser trained with various Perceptual Loss weights λp. The original and GAN-only (λp = 0) faces are shown for comparison.

Model                           Inception Score
GAN Only                        4.8632
Layer 1 & λp = 0.1              4.7854
Layer 1 & λp = 0.5              4.8028
Layer 1 & λp = 1                4.7688
All Layers & λp = 0.1           4.8274
All Layers & λp = 0.5           4.5964
All Layers & λp = 1             4.6503
Perceptual Only (layer 0)       4.7935
Perceptual Only (layer 1)       4.7401
Perceptual Only (layer 2)       4.6279
Perceptual Only (layer 3)       4.7372
Perceptual Only (layer 4)       4.8554
Perceptual Only (all layers)    4.7009

Table 5.2: Inception Scores for faces generated by the anonymisation models using Perceptual Loss.
