
Master Thesis

Localization Confidence to address

Occlusions in Face Detection

by

Alexander H.I.P.R. Hustinx

11246545

February 3, 2020

36 EC

February ‘19 - February ‘20

Supervisors:

Prof. Theo Gevers

Dr. Davide Zambrano

Drs. Casper Thuis

Assessor:

Prof. Jan-Mark Geusebroek

Sightcorp


ABSTRACT

Object and face detectors propose many overlapping bounding boxes containing the same objects. These detectors rely on non-maximum suppression (NMS) to reduce duplicate detections and only preserve the best ones. Almost all of these NMS strategies base their deterministic decisions on classification confidence. B. Jiang et al., 2018 have shown there is a misalignment between classification confidence and intersection over union (IoU) between a detection and matched ground truth. This misalignment negatively influences the resulting suppressions.

First, through a case study conducted on two object detection datasets and three face detection datasets, we conclude that this misalignment is present for face detection as well as object detection. We also show that this misalignment is even worse for occluded faces. To address this misalignment we used an IoU-net to predict the localization confidence of each detection. This localization confidence is used in IoU-guided NMS, instead of classification confidence. Additionally, we present our own Soft IoU-guided NMS strategy, which increases the mean average precision (mAP) of our model at the cost of localization accuracy.

Next, we analyze the influence of two hyperparameters of the IoU-Net, and of the IoU threshold for different NMS strategies. Building on these analyses we choose the ideal hyperparameters and thresholds to use during our further experiments. These experiments were conducted on the WIDER FACE dataset for face detection.

We find that for face detection our implementation of IoU-net overall works better than our baseline in every observed case, for mAP as well as localization accuracy. This is in part due to the additional constraint put on the shared feature maps, attributed to the multi-task learning approach. For all cases IoU-guided NMS and Soft IoU-guided NMS outperform Greedy and Soft NMS, respectively. However, when jointly training they perform similarly. Finally, we conclude that for normally and heavily occluded faces specifically, IoU-net also performs better than the baseline. We also present multiple ways to potentially further increase the performance.


ACKNOWLEDGEMENTS

This Master thesis signifies the end of a journey.

First, I would like to thank my colleagues at Sightcorp for making my year-long graduation a lot more bearable and providing me with a fun working environment. In particular Davide Zambrano and Casper Thuis, who have helped me throughout, even after pursuing their next challenges. Without them this project would not have been the same.

Next, I would like to thank my parents and brother who supported me through college.

Finally, I would like to thank my friends from Vuurdoop, as well as Jonas, Nout, Arent and Pim, for all the fun times over the years.


TABLE OF CONTENTS

Abstract

Acknowledgements

Table of Contents

Chapter I: Introduction

1.1 General introduction

1.2 Problem definition

1.3 Document outline

Chapter II: Related work

2.1 Object detection

2.2 Face detection

2.3 Bounding box suppression

Chapter III: Datasets and Metrics

3.1 Datasets

3.2 Evaluation metrics

Chapter IV: Case study: Misalignment classification confidence - IoU

4.1 Measuring the misalignment

4.2 Misalignment in object detection

4.3 Misalignment in face detection

Chapter V: Methodology

5.1 Faster R-CNN

5.2 IoU-net

5.3 Non-maximum suppression

Chapter VI: Experiments and Results

6.1 Experimental setup

6.2 Non-maximum suppression

6.3 Extending baseline with IoU-Net

6.4 Joint training

6.5 Performance

Chapter VII: Discussion

7.1 Correlation localization confidence and IoU

7.2 Performance NMS strategies

7.3 Overall performance IoU-net

7.4 Performance on occluded faces

Chapter VIII: Conclusion

Chapter IX: Future work

Bibliography

Appendix A: Pearson correlations WIDER FACE

Appendix B: Math jittering ground truth


Chapter 1

INTRODUCTION

This chapter starts with a general introduction to computer vision and object and face detection. This is followed by a definition of the problem that will be addressed. After this, the outline of the thesis is described.

1.1 General introduction

Generally, humans don’t have any problems interpreting the world around them. Shortly after someone is born they learn how to interact with the world around them: first seeing colors and shapes, and afterwards recognising that these colors and shapes make up animate objects or still scenes. Humans have close to no problem recognising objects and people, and distinguishing them from their respective backgrounds. This is not unique to humans, as it is also observed in many other animals.

For computers however, it is not as straightforward to distinguish objects in images. An image in a computer is simply a large collection of zeros and ones. Making a computer learn how to recognise objects or people is therefore a lot more challenging. Whereas a human has no problem handling scale, rotation, occlusion, etc., this certainly is a challenge for computers.

In the ‘60s researchers started looking at solving computer vision. They saw computer vision as a step towards artificial intelligence. Since then, the field has grown enormously with many different tasks, e.g. medical imaging, people counting, emotion recognition, etc. Over the years many different approaches have been taken to tackle the challenges. Deep learning is the most recent of these approaches. A popular method, Convolutional Neural Networks (CNN), has taken the field by storm, achieving never before seen performance (Sidike et al., 2018). Even with these new methods, object and face detection still face many of the same challenges. For this thesis we will mainly focus on face detection, as well as the additional challenge of occlusions in faces.


1.2 Problem definition

Face detection is a fundamental problem that has been around since the early ‘70s. Faces hold a lot of information, e.g. age, identity, emotion. To be able to extract this information from photos or videos it is important to first accurately localize faces. For most straightforward cases with many constraints, e.g. a face on an ID card, this is already very accurate. However, for the harder, unconstrained cases, like occlusions, more problems occur.

B. Jiang et al., 2018 describe a misalignment between classification confidence and overlap (IoU) between a detection and its matched ground truth, for object detection. They find that for object detection classification confidence is not a reliable reflection of localization accuracy. Yet, most bounding box suppression and refinement methods use classification confidence to decide which detections are more accurately localized than others. This is a distressing discovery, because as a result proposed bounding boxes might miss crucial information of the face. They propose IoU-net to explicitly predict the overlap between a detection and a ground truth, the localization confidence. Therefore, as part of our research we hope to answer the questions: Does the misalignment between classification confidence and IoU also occur within the face detection domain? Can IoU-net be used to address this misalignment for face detection?

In parts of Asia it is common to wear masks, similar to surgical masks. Contrary to popular belief, nowadays this is not due to pollution, or to prevent the wearer from becoming ill. Instead, it is a form of social courtesy, where the wearer wants to avoid transmitting their germs to others. It is said that this form of social courtesy might in the future become more popular in the rest of the world as well. These masks however occlude parts of the face which are important for localization. Occlusions in face detection can cause faces not to be localized correctly, or sometimes not at all (as seen in Figure 1.1). This can cause more problems further along the process of extracting useful information. For example, if a face is incorrectly localized, cutting off the chin, facial landmark placement is almost impossible. As such, the final part of our research is aimed at answering the question: Can IoU-net be used to increase detection performance on occluded faces?

Figure 1.1: Example of (left) a non-occluded and (right) a heavily occluded face, where the green boxes are the ground truths and the blue box is the detection.

Note for the reader: throughout our research it was not the goal to reach state-of-the-art performance. Achieving state-of-the-art performance builds on many extensive finetuning processes, including multi-scale testing (Thuis, 2018), hard negative mining (Singh and Davis, 2018; Singh, Najibi, and Davis, 2018), and more. Instead, we design a baseline and hope to find fundamental improvements by comparing our performance to it. In this thesis the following contributions are made:

• We find a similar misalignment to that described in (B. Jiang et al., 2018) for the face detection domain.

• We experiment with a well-known object detection framework (Faster R-CNN) and a recently proposed addition to two-stage object detectors (IoU-net).

• We show that the additional constraints posed by IoU-net improve the overall performance of both object and face detection.

• We propose a combination of the Soft NMS (C. Zhu et al., 2019) and IoU-guided NMS strategies, showing it improves the overall mean average precision for face detection.

• We experiment with several hyperparameters of the IoU-net framework to find its optimal settings.

• We experiment with the performance of IoU-net specifically for occluded faces.

1.3 Document outline

In this document, first related work in the field of computer vision, and more specifically object and face detection, will be described. Next, we describe the datasets and metrics we used throughout our research, as they are important for a broader understanding of the subsequent chapter: a case study about the misalignment between classification confidence and intersection over union for object detection and face detection. Afterwards, the methods applied towards researching and addressing the problem, including the models and non-maximum suppression strategies, will be described. This is followed by a description of the experimental setup, the baselines, the actual experiments that were run and their results, which will then be analyzed and discussed. Finally, the conclusion of the research will be presented and some possible future work in the field will be discussed.


Chapter 2

RELATED WORK

This chapter describes related work in the fields of computer vision and face detection. First we describe some influential object detection papers. This is followed by modern face detection methods that build on these papers. Finally, popular variants of non-maximum suppression are discussed, along with a flaw in traditional box suppression.

2.1 Object detection

In the past decade many new deep learning methods have been developed, achieving remarkable performance compared to more classical approaches like template matching (LeCun, Bengio, and G. Hinton, 2015); more specifically, these methods use Convolutional Neural Networks (CNNs) (Zhao et al., 2019b). Krizhevsky, Sutskever, and G. E. Hinton, 2012 was one of the first papers using deep CNNs for object classification, greatly outperforming traditional object detectors using hand-crafted features (Krizhevsky, Sutskever, and G. E. Hinton, 2012). Their research has shown that CNNs are able to learn robust and high-level feature representations of images. Girshick et al. later experimented with using this for object detection and proposed using Regions with CNN features (Girshick et al., 2014). Based on their work many new methods have been developed. Object detection is approached in one of two ways: one-stage detectors and two-stage detectors. One-stage detectors attempt to localize and classify objects in one step, whereas two-stage detectors employ a coarse-to-fine strategy, first localizing objects and afterwards classifying them. There is a speed and accuracy trade-off between the two approaches, one-stage detectors being faster but usually having lower performance (J. Huang et al., 2017). However, speed is not relevant to our research goals, thus we focus mainly on two-stage detectors.

Most modern deep learning detectors are first trained on a large scale dataset, e.g. ImageNet. ImageNet is an object classification dataset with over 5,000 synsets and 3.2 million images, making it a great dataset to train models on for classification (Deng et al., 2009). Models pretrained on ImageNet are often also used as a base network to extract features from an image. Though it has been stated that using a pretrained base network does speed up convergence, it does not always converge to the best performing model (He, Girshick, and Dollár, 2019). Some of the most commonly used base networks are VGG, ResNet and DenseNet (Simonyan and Zisserman, 2014; He, X. Zhang, et al., 2016; Iandola et al., 2014). In recent years many new object classifiers have been applied to object detection as a base network too, increasing speed and performance (Xie et al., 2017; He, Gkioxari, et al., 2017; Chollet, 2017; Howard et al., 2017; Z. Li et al., 2017).

Originally, R-CNN (Girshick et al., 2014) used selective search to get a set of candidate bounding boxes for a single image (Uijlings et al., 2013). These bounding box proposals were then scaled to a fixed size image and fed into the pretrained CNN base network. The resulting features were then used as input for linear classifiers (e.g. SVMs) to predict the object classes of those regions. A drawback of this approach was the large number of overlapping region proposals, resulting in a slow detection speed.

Shortly after R-CNN, Fast R-CNN (Girshick, 2015) was designed. As the name suggests, Fast R-CNN was a lot faster: it first extracted the feature map using the base network, followed by selective search. The proposals were applied to the feature map, thereby reducing the redundant computation on overlapping boxes. To enable the selection of proposals from the feature map, RoI pooling was introduced. RoI pooling is a type of max pooling used to project the features of a RoI on the original feature map to a fixed size window. However, the original RoI pooling was found to have a location misalignment caused by quantization. Later, RoIAlign was designed as part of (He, Gkioxari, et al., 2017) to address this location misalignment. It uses continuous values to more properly align input pixels. Afterwards, bilinear interpolation is used to compute the resulting floating-point location values in the input. Fast R-CNN also used a multi-task loss to allow the single network to both predict classification confidence and regress bounding box coordinates.

The next improvement in the field of two-stage detectors was Faster R-CNN (Ren et al., 2015). Faster R-CNN suggested a region proposal network (RPN), which allowed the network to propose regions of interest (RoIs). This sped up the network even further, making it one of the first end-to-end trainable, near-realtime modern object detectors. Although Faster R-CNN addressed the speed bottleneck of its predecessor (Fast R-CNN), there is still computation redundancy at the detection stage. Several improvements were proposed to address this, e.g. R-FCN (Dai et al., 2016) and light-head R-CNNs (Z. Li et al., 2017).


Feature Pyramid Networks (FPNs) (Lin, Dollár, et al., 2017) were proposed based on Faster R-CNN. These networks use features of several different levels instead of only the network's top layer. They found that features in the top layers of CNNs helped with classification but not as much with localization. They instead use a top-down architecture with lateral connections to build high-level semantics at different scales. FPNs are now widely used in many well-performing object detectors (Zhao et al., 2019b).

2.2 Face detection

In the past the domains of object detection and face detection were considered to be entirely separate. As most classical approaches used hand-crafted features, face detectors were designed explicitly for face detection. Some such approaches include feature-based, template-based and appearance-based techniques. Feature-based methods are based on the localization of invariant facial features like eyes, nose, mouth, etc. Some such detectors are presented in (Fischler and Elschlager, 1973; Kanade, 1974; Moghaddam and Pentland, 1997; Wiskott et al., 1997). Template-based methods focused on template matching, where a manually defined template is compared to the contents of a sliding window. These methods were also referred to as active appearance models (Cootes, Edwards, and Taylor, 2001). Finally there were appearance-based methods. These used a sliding template approach, similar to template-based methods, but instead of manually creating the template, they learned templates from sets of images. The most popular of these methods is the Viola-Jones algorithm (Viola and M. J. Jones, 2004), a boosting-based technique using Haar features to determine facial features. It was shown to work well for frontal, non-occluded faces. Other such techniques include (Sung and Poggio, 1998; Romdhani et al., 2001).

Many modern deep learning based face detectors however are built on current object detectors, as they share similar challenges and concepts (X. Wu, Sahoo, and Hoi, 2020). As such, many of the current face detectors are built on generic object detection frameworks such as Faster R-CNN (H. Jiang and Learned-Miller, 2017a; H. Jiang and Learned-Miller, 2017b) or the one-stage SSD (W. Liu et al., 2016; S. Zhang et al., 2017a; Najibi et al., 2017). These algorithms focus more on learning robust feature representations. In order to handle the large scale variance in face detection, most of the detection frameworks are extended with multi-scale feature learning models. For instance, X. Sun, P. Wu, and Hoi, 2018a adapted Faster R-CNN, and S. Zhang et al., 2017b and J. Zhang et al., 2019 adapted the SSD framework, both with multi-scale features.


Though some models addressed the adaptation to face detection more simplistically, by changing the anchor scales and ratios (X. Sun, P. Wu, and Hoi, 2018b; H. Jiang and Learned-Miller, 2017b). To address occlusions in face detection there are two popular approaches: attention mechanisms and detection based on parts. Attention mechanisms are used to highlight the features of underlying face targets (J. Wang, Yuan, and Yu, 2017), whereas detection by parts uses concepts from deformable parts models, finding a face by separately finding components of the face (S. Yang, Luo, C. C. Loy, et al., 2017; Yan et al., 2014). As we are not looking to improve the state-of-the-art, we will only minimally adapt to the domain by matching our anchor scales and ratios to those of (X. Sun, P. Wu, and Hoi, 2018b).

2.3 Bounding box suppression

Detectors commonly return multiple overlapping bounding boxes, each with a classification confidence. During evaluation of a model only the first detection that overlaps a ground truth bounding box is scored as a true positive, while the others are treated as incorrect, redundant boxes. To reduce the number of overlapping detections, overlapping detections with a lower score are suppressed. This process is called non-maximum suppression (NMS). Greedy NMS has been a standard part of object detectors for many years. It was first proposed for edge and curve detection (Rosenfeld and Thurston, 1971), but later used for many more purposes, including work by Viola and Jones (Viola and M. Jones, 2001). Nowadays, classification confidence is commonly used to score bounding boxes and decide which will be suppressed. This can lead to incorrect suppressions, as traditional greedy NMS relies on a single overlap threshold. Soft NMS (Bodla et al., 2017) was originally proposed for densely populated scenes where objects overlap each other. Traditional greedy NMS would usually incorrectly preserve a single overlapping box, while suppressing the others. Soft NMS however does not suppress boxes; it reduces the scores of boxes based on how much they overlap. Highly overlapping boxes are therefore less likely to be considered, yet still may be. Several papers have also proposed learning to find an ideal threshold (S. Liu, D. Huang, and Y. Wang, 2019; Hosang, Benenson, and Schiele, 2017) based on the number of overlapping detections or clustering. Others have suggested predicting more scores per detection. B. Jiang et al., 2018 first described a misalignment between classification confidence and intersection over union (IoU) between a detection and a ground truth box. Their research shows a fundamental flaw in using classification confidence to assess localization accuracy.


To address it they suggest IoU-net, which predicts a localization confidence based on the overlap of a detection and a ground truth. This localization confidence is used during their IoU-guided NMS, essentially ignoring the classification confidence during NMS. Several other works (S. Wu, X. Li, and X. Wang, 2019; L. Zhu et al., 2019; Tychsen-Smith and Petersson, 2018; Z. Huang et al., 2019) suggest similar approaches, predicting a localization confidence, IoU or fitness score to be used for NMS together with classification confidence. Alternatively, (Zhou, D. Wang, and Krähenbühl, 2019) suggests detecting objects by keypoint estimation, modeling objects as single points and making box suppression obsolete. Their approach achieves a very high speed-accuracy trade-off. A soft variant has since been proposed (C. Zhu et al., 2019), as well as a face detection specific version (Xu et al., 2019). We focus on the fundamental flaw described by B. Jiang et al., looking into the misalignment between classification confidence and overlapping boxes as described in (B. Jiang et al., 2018), and researching its impact on occluded faces.
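To make the difference between the two suppression strategies concrete, the following is a minimal NumPy sketch of greedy NMS and the linear score-decay variant of Soft NMS described above. It is an illustration only; the function names, the decay function and the threshold values are our own choices and not the exact implementations used in the cited papers or later in this thesis.

import numpy as np

def iou_one_to_many(box, boxes):
    # IoU of one box against an array of boxes, all in (x1, y1, x2, y2) format.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    areas_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + areas_b - inter)

def greedy_nms(boxes, scores, iou_thr=0.5):
    # Keep the highest scoring box, discard every remaining box that overlaps it
    # by more than iou_thr, and repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        overlaps = iou_one_to_many(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_thr]
    return keep

def soft_nms(boxes, scores, iou_thr=0.5, score_thr=0.001):
    # Instead of discarding overlapping boxes, decay their scores (linear variant),
    # so highly overlapping boxes become unlikely, but not impossible, to be kept.
    scores = scores.copy()
    keep = []
    idx = np.arange(len(scores))
    while idx.size > 0:
        i = idx[np.argmax(scores[idx])]
        keep.append(i)
        idx = idx[idx != i]
        overlaps = iou_one_to_many(boxes[i], boxes[idx])
        scores[idx] *= np.where(overlaps > iou_thr, 1.0 - overlaps, 1.0)
        idx = idx[scores[idx] > score_thr]
    return keep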


Chapter 3

DATASETS AND METRICS

This chapter first describes the datasets used in this research, summarizing what they consist of, how many images and faces or objects they contain, what makes them special, and what they were used for during this research. Afterwards, the evaluation metrics used throughout this research are described, explaining how they work and what they are used for.

3.1 Datasets

This section briefly describes the datasets used during this research, and what they were used for.

Object detection

PASCAL VOC 2007

PASCAL VOC 2007 (Everingham and Winn, 2007), or PASCAL, is an object detection challenge created in 2007, accompanied by a small dataset created in 2006. There are 20 different object types for this challenge, labeled in the dataset. The goal of the challenge is to recognize objects of these object types in realistic scenes, i.e. not pre-segmented objects.

The overall splits are:

Split       # Images   # Objects
train       2,501      6,301
val         2,510      6,307
train+val   5,011      12,608
test        4,952      12,032

Table 3.1: PASCAL VOC 2007 dataset splits

PASCAL has also released the annotations for the test images (at a later date). This means the test split is commonly used for evaluation, and the training and validation sets can be used during training. For each object the annotations include the object type, bounding box coordinates, view (i.e. frontal, rear, left, right), a ‘truncated’ label denoting that the object does not entirely fit in the image, and a ‘difficult’ label denoting that the object will be ignored during evaluation.


During my research this dataset was mainly used while conducting experiments, and reproducing the results of B. Jiang et al., 2018 for object detection, as it is a relatively small dataset.

MS COCO 2014

MS COCO 2014 (Lin, Maire, et al., 2014), or COCO, is a large object detection dataset created in 2014 to place the question of object recognition in the context of scene understanding. The dataset is made up of images of complex everyday scenes containing common objects. There are 80 different object types, each claimed to be easily recognisable by a 4 year old. The dataset has more than 120k labeled images, with 886k object instances in the training and validation splits alone.

The most commonly used splits are:

Split               # Images
train               82,783
val                 40,504
test                40,775
minival             5,000
train+val-minival   117,266

Table 3.2: MS COCO 2014 dataset splits

Contrary to PASCAL, the annotations of the test images were not released publicly, so most evaluation is done on the validation set. The minival split is a smaller validation split, which allows the remaining validation images to be used for training. For each object the annotations include object type, bounding box coordinates, area of the bounding box, and a ‘crowd’ label which denotes if the object overlaps with others.

During our research this dataset was mainly used during the case study to confirm some findings presented in (B. Jiang et al., 2018).

Face detection

WIDER FACE

WIDER FACE (S. Yang, Luo, C.-C. Loy, et al., 2016), or WIDER, is a very large face detection dataset constructed in 2016 by using the images of the WIDER dataset that contain faces. It is organized based on 61 different event classes for the scenes the faces occur in (e.g. sports, protests, concerts, etc.). The dataset has more than 32k images, with a total of 393,703 labeled faces. This includes images with very crowded scenes, containing well over a hundred faces.

The overall splits are:

Split   # Images
train   12,880
val     3,226
test    16,097

Table 3.3: WIDER FACE dataset splits

For each face the annotations include the bounding box of the face, a ‘blur’ label denoting how badly the face is blurred, an ‘expression’ label denoting if the face has a non-neutral expression, an ‘illumination’ label denoting the degree of lighting on the face, an ‘occlusion’ label denoting the degree of occlusion, a ‘pose’ label denoting if the face is in an extreme pose, and an ‘invalid’ label denoting if the face is too small or too occluded to be used for evaluation. Splits that show the number of faces per annotation can be found in Table A.1 in Appendix A. The dataset is also evaluated with three degrees of difficulty: easy, medium, and hard. Each face falls into one or more of the difficulties, decided based on the detection rate of EdgeBox (Zitnick and Dollár, 2014).

During our research we are also particularly interested in occluded faces. The numbers of non-occluded, normally occluded and heavily occluded faces in the training and validation sets of the dataset are listed in Table 3.4.

Face type           train    val
non-occluded        86,339   21,437
normally occluded   26,554   6,856
heavily occluded    33,651   8,189

Table 3.4: Number of faces without occlusion, with normal occlusions and with heavy occlusions in WIDER FACE’s train and validation splits.

Because of the size of the dataset, the variety of faces, and the number of labeled faces, this is one of the most widely used datasets for face detection. It is also the main dataset used during my research on face detection.

MAFA

MAFA (Ge et al., 2017) is a large face detection dataset of faces with various orientations and degrees of occlusion, where at least one part of each face is occluded by e.g. a mask. It was created in 2017 to address the issue that there are few large datasets focusing on masked faces.

The overall splits are:

Split   # Images   # Faces
train   25,876     29,452
test    4,935      6,354
total   30,811     35,806

Table 3.5: MAFA dataset splits

For each face the annotations include the bounding box of the face, the location of the eyes, the locations of all masks, the face orientation, occlusion degree, mask type, and an ‘ignore’ label denoting that the face is ignored during evaluation. The occlusion is measured by looking at which regions are occluded: eyes, nose, mouth, and chin. The occlusion degree has three classifications: easy, where 1-2 regions are occluded; medium, where 3 regions are occluded; hard, where all (4) regions are occluded.

This dataset is briefly used in a case study to see if the findings in (B. Jiang et al., 2018) also occur for face detection.

FDDB

FDDB (Jain and Learned-Miller, 2010) is a small dataset created in 2010 to address the lack of common evaluation schemes for face detection, and better capture aspects of face appearances that manifest in real-world scenarios. It consists of 2,845 color and grayscale images with a total of 5,171 annotated faces.

For each face the annotations include an elliptical face region, the degree of occlusion, the pose, a label denoting that the face is out of focus, and a label denoting that it is of low resolution.

This dataset is briefly used in a case study to see if the findings in (B. Jiang et al., 2018) also occur for face detection.

3.2 Evaluation metrics

This section briefly describes the evaluation metrics used during this research, what they are used for, and how to calculate them.


Figure 3.1: Several examples of datapoints and their accompanying Pearson correlation coefficients. It is important to note that a higher absolute correlation means the data pairs have a lower spread. The example in the middle has an undefined Pearson correlation as the ... variance is zero. Source (Commons, 2010)

Pearson correlation

The Pearson correlation coefficient is a linear correlation measure between two variables. In our case these are the classification confidence (or, alternatively, the localization confidence) and the IoU between the detected bounding box and the ground truth bounding box.

It can be calculated by:

r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},

where $n$ is the sample size, $x_i, y_i$ are sample points from the variables $X, Y$, and $\bar{x}, \bar{y}$ are the sample means, e.g. $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

The Pearson correlation coefficient can take on a value in the range [−1.0, 1.0], where r < 0.0 implies a negative correlation and r > 0.0 implies a positive correlation. It is important to note that a high absolute correlation (|r|) means the data pairs have little spread, i.e. if all the datapoints are on a line the absolute correlation will be |r| = 1 (unless y is the same over all x). Note that this does not mean that the Pearson correlation is only 1 if all values lie on the unit line (y = x). Some examples can be seen in Figure 3.1.
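As a small illustration, the coefficient can be computed directly from its definition. The variable names below (confidences, ious) are placeholders for the pairs collected from a detector, not symbols from the original text.

import numpy as np

def pearson(x, y):
    # Pearson correlation coefficient between two equally sized samples.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum()))

# Example usage on detection pairs, restricted to IoU > 0.5 as in this thesis:
# r = pearson(confidences[ious > 0.5], ious[ious > 0.5])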



Figure 3.2: (a) Visual representation of how to calculate IoU. (b) Three different sets of boxes with their resulting IoU scores. (Rosebrock, 2016)

Intersection over Union

Intersection over Union (IoU), also known as Jaccard index, is a popular metric used to measure the degree to which two bounding boxes overlap.

In the field of object or face detection, IoU is often used to determine if two bounding boxes belong to the same object. More specifically, whether the bounding box of a detected object overlaps enough with the bounding box of a ground truth for them to be considered the same object.

IoU is a straightforward metric to compute. As the name says, the intersection between the two bounding boxes is divided by their union:

\mathrm{IoU}(D, G) = \frac{|D \cap G|}{|D \cup G|} = \frac{|D \cap G|}{|D| + |G| - |D \cap G|},

where $D$ is the detected object's bounding box, and $G$ is the ground truth object's bounding box. Figure 3.2a depicts a visual representation of how to calculate IoU.

The resulting IoU is a continuous value in range [0.0, 1.0]. A higher value means that the boxes are more similar to each other. So, an IoU score of 1.0 means that the two boxes are exactly the same. While an IoU score of 0.0 means that the boxes do not overlap at all. Some examples of overlapping bounding boxes and their IoU scores can be found in Figure 3.2b.
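A minimal implementation of this computation for axis-aligned boxes could look as follows; the (x1, y1, x2, y2) corner convention is an assumption for illustration.

def iou(det, gt):
    # Intersection rectangle of the two boxes.
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # |D| + |G| - |D ∩ G| equals the area of the union.
    area_det = (det[2] - det[0]) * (det[3] - det[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_det + area_gt - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # two half-overlapping boxes -> ~0.33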

In object detection IoU is often used to determine if a detection is a true positive (TP), a false positive (FP) or a false negative (FN). It is also possible to have a true negative (TN), but because we assume each image has an object of a searched-for type this is not taken into account. Additionally, taking TNs into account would serve a different purpose, e.g. background detection, which is not relevant for our research. If the IoU between a detected bounding box and a ground truth bounding box is higher than a certain threshold (usually 0.5), the detection is labeled as a TP. If the IoU between the boxes is lower than the threshold, or if it is higher but the ground truth has already been ’found’ by a different detection (i.e. the detections overlap), the detection is labeled as a FP. Finally, if the IoU between the boxes is higher than the threshold, but the detection is of the wrong class, the detection is labeled as a FN. This is also the case if a ground truth is not detected at all.

Precision/Recall curve

Precision/Recall curves, or PR curves, plot the precision and recall of a model.

Precision is given as the ratio of true positives to the total number of positive predictions, which includes the false positives. It describes how good a model is at predicting the correct class of an object.

\mathrm{Precision} = \frac{TP}{TP + FP}.

Recall is the true positive rate, given as the ratio of true positives to the total number of ground truth positives, which includes the false negative predictions. It describes how many of all ground truths are retrieved as correct predictions.

\mathrm{Recall} = \frac{TP}{TP + FN}.

Figure 3.3 shows a visual representation of precision and recall.

Both values are important in showing the retrieval performance of a model. As precision and recall are inversely related, the trade-off between them is important. This trade-off can be shown in a PR curve, where each datapoint in the curve represents the precision and recall at a certain cut-off point, e.g. IoU ∈ {0.01, 0.02, ..., 0.99, 1.0}.

(mean) Average Precision

The general definition of Average Precision (AP) is finding the area under the PR curve (AuC). This is the average precision over all recall values in range [0.0, 1.0].

AP = \int_{0}^{1} p(r)\,dr.

As precision and recall are both in range [0.0, 1.0], the AP also falls in range [0.0, 1.0].

Different datasets use different versions of AP. We will describe those of our two main datasets, PASCAL VOC and WIDER FACE. In both cases only the first correct detection (IoU > 0.5) is considered a TP, while the others are labeled as FP.


Figure 3.3: Visual representation of precision and recall (Walber, 2014)

PASCAL VOC uses the 11-point interpolated AP. In this case we consider the average of the maximum precision values at the 11 recall points {0.0, 0.1, ..., 1.0}.

AP = \frac{1}{11} \sum_{x \in \{0.0, 0.1, \ldots, 1.0\}} \max_{r \geq x} p(r),

where $p(r)$ refers to the precision at recall $r$.

This interpolated method is an approximation and is thus less precise. Therefore it has been revised in the PASCAL VOC 2012 challenge.

WIDER FACE uses the same version of AP as PASCAL VOC 2012, the actual AuC. It samples the curve at all unique recall values, whenever the maximum precision drops. Thus,

AP = \sum_{n} (r_{n+1} - r_n) \max_{\tilde{r} \geq r_{n+1}} p(\tilde{r}).

This method results in a more accurate estimation of the AuC. The mean average precision (mAP) for these datasets is the mean AP over all object classes. In the case of PASCAL VOC 2007 this is $mAP = \frac{1}{20} \sum_{k \in K} AP_k$, where $K$ is the set of all object classes. For WIDER FACE, as it only has a single object class (faces), $mAP = AP$.
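The WIDER FACE / PASCAL VOC 2012 style of AP (area under the monotone precision envelope, summed at every recall change) can be sketched as follows. The inputs are assumed to be precision and recall values ordered by decreasing confidence threshold; this is our own illustration, not the official evaluation code.

import numpy as np

def average_precision(recall, precision):
    # Pad the curve so it starts at recall 0 and ends at recall 1 with precision 0.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Replace each precision by the maximum precision at equal or higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles at the points where recall increases.
    changes = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[changes + 1] - r[changes]) * p[changes + 1]))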

Localization-Recall-Precision

Localization-Recall-Precision (LRP) Error is a new metric specifically designed for object detection (Oksuz et al., 2018). The authors suggest that AP has a number of shortcomings. Most importantly, it does not take the localization accuracy of a detection into account. LRP is comprised of three components: localization, FP rate and FN rate.

Localization (LRP LocError) is a metric similar to IoU-loss, often used in object detection. It represents the IoU tightness of correct detections.

LRP_{IoU}(X, Y) = \frac{1}{TP} \sum_{i=1}^{TP} \left(1 - \mathrm{IoU}(x_i, y_{x_i})\right).

$1 - LRP_{IoU}$ represents the average IoU over all correct detections.

FP rate measures the false positive rate.

LRP_{FP}(X, Y) = 1 - \mathrm{Precision} = 1 - \frac{TP}{|Y|} = \frac{FP}{|Y|}.

FN rate measures the false negative rate.

LRP_{FN}(X, Y) = 1 - \mathrm{Recall} = 1 - \frac{TP}{|X|} = \frac{FN}{|X|},

where $X$ is the set of ground truth objects and $Y$ is the set of detections (IoU > 0.5). Note that $|Y| = TP + FP$ and $|X| = TP + FN$.

Together these values are combined into the LRP Error.

LRP(X, Y) = \frac{\sum_{i=1}^{TP} \frac{1 - \mathrm{IoU}(x_i, y_{x_i})}{1 - \tau} + FP + FN}{TP + FP + FN}.

In our case $\tau = 0.5$.

LRP Error is in the range [0.0, 1.0], where a lower value means a better performance, both for the combined metric and its individual components. E.g. an LRP LocError, or $LRP_{IoU}$, of 0.0 would mean the detections have been localized perfectly.

We mostly use LRP LocError to show the localization accuracy of the model, also referred to as the quality of the detections. The LRP metric overall has been included to allow for further comparability.
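Assuming detections have already been matched to ground truths, the three components and the combined error can be computed directly from the definitions above. The sketch below is illustrative and omits guards for empty detection or ground truth sets.

def lrp_error(tp_ious, num_fp, num_fn, tau=0.5):
    # tp_ious: IoU of every true positive with its matched ground truth (all > tau).
    tp = len(tp_ious)
    loc_error = sum(1.0 - v for v in tp_ious) / tp            # LRP LocError
    fp_rate = num_fp / (tp + num_fp)                          # 1 - precision
    fn_rate = num_fn / (tp + num_fn)                          # 1 - recall
    total = (sum((1.0 - v) / (1.0 - tau) for v in tp_ious)
             + num_fp + num_fn) / (tp + num_fp + num_fn)      # combined LRP Error
    return loc_error, fp_rate, fn_rate, total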


Chapter 4

CASE STUDY: MISALIGNMENT CLASSIFICATION CONFIDENCE - IOU

B. Jiang et al. describe a misalignment between classification confidence and IoU in the object detection domain. As classification confidence is often used during non-maximum suppression (NMS) to decide which boxes will be suppressed and which will be preserved, this misalignment is distressing. Due to the misalignment, NMS might end up suppressing the more accurately localized boxes when basing its decisions on classification confidence. This chapter describes how to show and measure this misalignment. Afterwards, the misalignment in the object detection domain is once again shown. Finally, we show that the misalignment is also present in the face detection domain.

4.1 Measuring the misalignment

In the field of object detection intersection over union (IoU) is often used in unison with classification confidence to compute if a detected object is also the ground truth or if it is something else, e.g. a false positive. B. Jiang et al. show a misalignment between classification confidence and IoU, both visually and with a single unit metric, the Pearson correlation. Visually they show the misalignment with the graph in Figure 4.1a. They find that trained object detection models are usually either very certain (have a high classification confidence) for a higher IoU, or very uncertain (have a low classification confidence) for a lower IoU. In the graph this can be seen by looking at the IoU values, where IoU < 0.5, often results in a classification confidence of ∼0, while IoU > 0.5 often results in a classification confidence of ∼1. It is this property that can cause a greedy NMS strategy to incorrectly suppress higher IoU boxes. We are most interested in values of IoU > 0.5 because that is commonly the threshold used to declare whether an object is correctly localized or not.

Figure 4.1: Misalignment of classification confidence versus IoU with the ground truth bounding box for MS COCO 2014: (a) by B. Jiang et al., (b) our own reproduction, and (c) for PASCAL VOC 2007.

First all datapoints are gathered by running the detector and non-maximum suppression (NMS); in this case a Faster R-CNN model was used. The graph is constructed by sampling n classification confidence scores of these datapoints from m IoU intervals within (0.0, 1.0]. E.g. (n = 5, m = 500) would translate into 500 intervals {(0.000, 0.002], (0.002, 0.004], . . . , (0.998, 1.000]}, each with 5 classification confidence scores sampled (with replacement) from that interval. These pairs are then plotted to create a graph similar to the one shown in Figure 4.1a.
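The binning and sampling step can be sketched as follows. The arrays confidences and ious are assumed to hold one entry per detection kept after NMS, matched to its best overlapping ground truth; the names are our own.

import numpy as np

def sample_pairs(confidences, ious, n=5, m=500, seed=0):
    # Sample n classification confidences from each of m IoU intervals in (0, 1].
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, m + 1)
    sampled_iou, sampled_conf = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = np.where((ious > lo) & (ious <= hi))[0]
        if in_bin.size == 0:
            continue  # skip intervals that contain no detections
        picks = rng.choice(in_bin, size=n, replace=True)  # sample with replacement
        sampled_iou.extend(ious[picks])
        sampled_conf.extend(confidences[picks])
    return np.array(sampled_iou), np.array(sampled_conf)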

The single unit metric B. Jiang et al., 2018 and I used to measure correlation is the Pearson correlation coefficient. The reported Pearson correlation coefficient accompanying the data with IoU > 0.5 in Figure 4.1a is 0.217.

4.2 Misalignment in object detection

As reported by B. Jiang et al., they found a Pearson correlation coefficient of 0.217 on COCO for bounding boxes with an IoU > 0.5. This means there is a very low correlation between the classification confidence and the IoU of the detected bounding box and the ground truth bounding box.


Dataset                  Pearson, IoU > 0.5
COCO (B. Jiang et al.)   0.217
COCO (Ours)              0.311
PASCAL                   0.578
FDDB                     0.331
MAFA                     0.250
WIDER FACE               0.476

Table 4.1: Pearson correlation coefficient between the variables: classification confidence, and IoU between detected bounding box and ground truth bounding box. The top section includes two object detection datasets, while the bottom section includes three face detection datasets.

In this research it has also been reproduced on COCO. An object detector based on Faster R-CNN was trained on COCO to collect the classification confidence - IoU pairs. Figure 4.1b shows the results after sampling 5 pairs per bin, for 500 bins, when keeping the top 100 bounding boxes after NMS. The plots show that the two graphs are very similar. Additionally, these experiments were also conducted on PASCAL and their results are shown in Figure 4.1c. The top section of Table 4.1 shows the Pearson correlation coefficient for IoU > 0.5 for object detection on COCO and on PASCAL. The coefficients for the cases where IoU > 0.5 can also be considered similar. The (slight) difference between the results can be caused by several factors. Most importantly, B. Jiang et al. used a ResNet-FPN backbone, while in these experiments a ResNet101-Faster R-CNN was used; ResNet-FPN has been shown to improve detection results (Lin, Dollár, et al., 2017). Model performance is very important when collecting the pairs, as a better model will more accurately find the correct bounding boxes. However, it might be overconfident for less accurate bounding boxes (Hall et al., 2018). Overall, we state that the results have been reproduced adequately.

4.3 Misalignment in face detection

As we know this misalignment is present in the field of object detection, we are also interested to know if it is a problem in the field of face detection. Similar to testing to what degree it occurs in the field of object detection, a face detector was trained using Faster R-CNN to collect the classification confidence - IoU pairs. This face detector was trained on the WIDER FACE training split and used to collect the pairs on the face detection datasets FDDB (training split), MAFA (test split) and WIDER FACE (validation split). Plots for these datasets can be found in Figure 4.2.

Figure 4.2: Misalignment of classification confidence versus IoU with the ground truth bounding box for (a) FDDB, (b) MAFA and (c) WIDER FACE. The difference in the number of datapoints is due to FDDB and MAFA having fewer faces in their datasets. However, we can still see a similar shape in all three plots.

Overall the shape of the graphs is very similar. However, there are fewer available datapoints in FDDB and MAFA than in COCO and WIDER, the reason being that there are fewer images in those datasets. Also visible in the graphs is that FDDB and MAFA have fewer high-IoU results. This is because the face detector was trained on WIDER, which has tighter bounding boxes as targets; see Figure 4.3 for two examples.

The bottom section of Table 4.1 shows the Pearson correlation coefficient for IoU > 0.5 for these datasets. As can be seen from the results, similar to the misalignment in the object detection domain, the pairs in these datasets also show a low correlation between classification confidence and the IoU between the detected bounding box and the ground truth bounding box. We therefore find that the misalignment is also present in the face detection domain.



Figure 4.3: Face detection results from the Faster R-CNN model used to collect the pairs, where the red bounding boxes are the detected boxes and the green bounding boxes are the ground truth boxes, (a) is an example image from FDDB, (b) is an example image from MAFA.

Split               Pearson, IoU > 0.5
Overall             0.476
Normal occlusions   0.394
Heavy occlusions    0.343

Table 4.2: Pearson correlation coefficient between the variables: classification confidence, and IoU between detected bounding box and ground truth bounding box. When considering only normally and heavily occluded faces the correlation between the two variables is even lower.


Misalignment in occluded faces

We have confirmed that the misalignment is present in face detection. Part of our research however is addressing occluded faces. As such, we have also gathered the data pairs for the normally and heavily occluded splits of the WIDER FACE dataset, separately. These data pairs were gathered using the same Faster R-CNN model trained on WIDER FACE. Plots for these splits can be found in Figure 4.4 and the Pearson correlations belonging to the splits, as well as that of the entire WIDER FACE dataset can be found in Table 4.2.

Figure 4.4: Misalignment of classification confidence versus IoU with the ground truth bounding box for the occluded splits of WIDER FACE: (a) normally occluded faces, (b) heavily occluded faces.

In conclusion, we believe that the similar shape in Figures 4.1 and 4.2, and the low Pearson correlations shown in Table 4.1, are enough proof that the misalignment is also present in the domain of face detection, albeit less strongly. However, this misalignment is stronger when considering only occluded faces. The low correlation for the MAFA dataset supports this further, as the dataset is built from only occluded faces. With these findings we conclude that face detection might also benefit from localization confidence predicted by an IoU-net (as first proposed by B. Jiang et al., 2018 for object detection). This stronger misalignment for occluded faces suggests that occluded faces might benefit even more from localization confidence than faces in general would.

More information on the WIDER FACE dataset with respect to the Pearson correlation coefficient, as well as graphs showing the pairs, can be found in Appendix A.


Chapter 5

METHODOLOGY

This chapter describes the architecture and workings of the Faster R-CNN and IoU-Net models used during this research, including some information about the domain adaptation from object detection to face detection. Afterwards, non-maximum suppression is described, including our proposed strategy.

5.1 Faster R-CNN

During this research the Faster R-CNN (Ren et al., 2015) architecture was used as a baseline. The main reasons for this are that it is a very popular two-stage detector, that it is suggested by the authors of the paper proposing IoU-net, and that those authors themselves use a similar network. Faster R-CNN is an end-to-end trainable, two-stage object detector. It consists of an RPN and a Fast R-CNN, both sharing the same CNN base network. First, it uses a region proposal network (RPN) to select regions of interest (RoIs) in an image. Afterwards, through a Fast R-CNN model, it regresses the resulting bounding boxes and classifies them.

Figure 5.1 shows the architecture of the Faster R-CNN model used in this research.

5.1.1 Base network

In object and face detection a base network serves as a feature extractor. It is usually composed of a stack of convolutional and pooling layers. It is common to use a network pretrained on ImageNet (Deng et al., 2009; Zhao et al., 2019b) as feature extractor, as ImageNet is a very large image classification dataset.

Figure 5.1: Faster R-CNN architecture used (image inspired by (B. Jiang et al., 2018)).


By retraining, dropping or finetuning the final layers of a pretrained network, the low and mid level features remain the same. This is called transfer learning and reduces the training time of a network. Our baseline uses a truncated ResNet101 base network (He, X. Zhang, et al., 2016), pretrained on ImageNet, to extract the feature map of the input image. However, as with most object and face detectors, the base network can be replaced by any standard CNN architecture, e.g. VGG or MobileNet (Simonyan and Zisserman, 2014; Howard et al., 2017).

5.1.2 Region proposal network

The region proposal network (RPN) uses the feature map of an input image to find a predefined number of regions which may contain objects. It outputs the coordinates of these regions, and the likelihoods of each of them to be foreground or background. During training the RPN learns to predict if a reference box is part of the foreground or background.

Anchors and outputs

To predict whether these reference boxes are fore- or background the RPN uses anchors. Anchors are bounding boxes of fixed sizes and aspect ratios. These anchor boxes, or reference boxes, are placed throughout an image. For each anchor the RPN then predicts whether it contains an object, and four offset coordinates to correct the anchor to the right position.

Each pixel in the feature map is considered when placing anchors, which means we end up with about W · H · k anchors in total, where W is the width of the feature map, H is the height of the feature map and k is the maximum number of proposals for each location. k is determined by the number of scales and ratios of each anchor box. Figure 5.2 shows an example of anchor box types, their placement near an object, and their presence in an image.

The RPN outputs offsets for the regions of interest (RoIs) and their likelihood of containing an object. The offsets are $\{\delta_{x_{center}}, \delta_{y_{center}}, \delta_{width}, \delta_{height}\}$ for each anchor. The likelihood of containing an object is a tensor containing a likelihood of being foreground and a likelihood of being background. A bounding box is labeled as foreground if the IoU between an actual object and the bounding box proposal is greater than 0.5. When the IoU is lower than 0.1 it is labeled as background. The RPN is trained to predict these labels.
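A minimal sketch of anchor generation on a feature map is shown below. The stride, scales and ratios are illustrative defaults and not the face-specific settings used later in this thesis; the ratio r is taken as width divided by height.

import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # One anchor per (scale, ratio) pair at every feature map location,
    # giving feat_h * feat_w * k anchors with k = len(scales) * len(ratios).
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)    # area s^2, ratio w/h = r
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)  # (feat_h * feat_w * k, 4) boxes in (x1, y1, x2, y2)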


Figure 5.2: (left) Example of 9 anchor box types, using 3 different sizes and 3 different aspect ratios. (center) Anchor box overlap on an object in an image. (right) Overall anchor box presence in an image (image by (Rey, 2018)).

5.1.3 Classifiers

The Fast R-CNN network further assesses the RoIs proposed by the RPN. It does this by observing the feature map of only the RoIs. The classifiers expect fixed length feature vectors. As such RoI pooling is used to ensure that the RoI will always result in a feature map of the same predefined size. These resulting feature maps are then used as input for two branches: the classification branch and the box regression branch.

RoI pooling

RoI pooling is a type of max pooling used to project the features of a RoI on the original feature map to a fixed size window. As such, after RoI pooling, the Fast R-CNN will always have the same size input to its classification and box regression branches. The RoI is divided into an H by W grid, where each sub-window is approximately of size h/H by w/W (e.g. 7x7). Afterwards, max pooling is applied to each of these sub-windows. The pooling operations are independently applied to each feature map channel. Figure 5.3 illustrates this process. It is possible to backpropagate through RoI pooling, similar to max pooling.

The original RoI pooling was found to have a location misalignment caused by quantization. RoIAlign was designed by He, Gkioxari, et al., 2017 to address this location misalignment. The original RoI pooling rounds down the operations h/H and w/W to integers. RoIAlign avoids such rounding and instead uses the continuous values to properly align the input pixels. Afterwards, bilinear interpolation is used to compute the resulting floating-point location values in the input. Later, B. Jiang et al., 2018 propose using PrRoIPooling instead of RoIAlign because it is continuously differentiable. Both these methods avoid rounding and use bilinear interpolation to address the location misalignment. We conducted some experiments and observed negligible differences in performance between the two. Yet, to more closely follow the implementation of IoU-net, we use PrRoIPooling.

Figure 5.3: The process of going from an image to a feature map by convolutions and pooling, followed by RoI pooling, which pools a RoI of the feature map onto a smaller fixed size feature map. (Image by F.-F. Li and Johnson, 2016.)
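PrRoIPooling is not part of the standard libraries, but the closely related RoIAlign is available in torchvision and, as noted above, performed comparably in our experiments. The snippet below is a usage sketch with made-up tensor sizes.

import torch
from torchvision.ops import roi_align

# Feature map of one image: (batch, channels, H/16, W/16) for a stride-16 backbone.
features = torch.randn(1, 256, 38, 50)

# One RoI in image coordinates, prefixed with its batch index: (idx, x1, y1, x2, y2).
rois = torch.tensor([[0.0, 64.0, 32.0, 256.0, 224.0]])

# Pool the RoI to a fixed 7x7 window; spatial_scale maps image to feature coordinates.
pooled = roi_align(features, rois, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])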

Outputs

The resulting pooled feature maps are used as input for the R-CNN. After two fully connected layers the pooled RoI is branched into the two output layers. The first is the classification head, which is a softmax estimator over the K + 1 classes (for face detection K = 1). This estimator outputs a discrete probability distribution per RoI. The second is the box regression head. For each class it predicts the offset relative to the original RoI.

5.1.4 Training

During training of the entire Faster R-CNN, the RPN and Fast R-CNN share features, as the R-CNN utilizes the proposals of the RPN. To allow the network to be trained end-to-end, rather than training them separately, 4-step alternating training is used. This involves first training the RPN and using the proposals to train the Fast R-CNN, after which the tuned Fast R-CNN is used to initialize the RPN, and so on, alternating between the two.


Multi-task loss

During training the model optimizes the RPN and Fast R-CNN alternatingly. The RPN or Fast R-CNN will be optimized using a loss for the classification and a loss for the box regression. For classification the RPN only considers two classes: foreground and background, while the Fast R-CNN considers K+ 1 classes (for face detection K = 1). However, by using a class specific approach we only consider the predictions of the ground truth class. As such the classification loss is:

L_{cls}(p_i, p_i^*) = -p_i^* \log p_i - (1 - p_i^*) \log(1 - p_i),

where $p_i$ is the predicted probability of anchor $i$ and the ground truth label $p_i^*$ is 1 when positive and 0 when negative.

The regression loss is the Smooth L1 loss (Girshick, 2015), also known as Huber loss, given as:

L_{reg}(t_i, t_i^*) = \mathrm{SmoothL1}(x = t_i - t_i^*) =
\begin{cases}
0.5x^2, & \text{if } |x| < 1 \\
|x| - 0.5, & \text{otherwise,}
\end{cases}

where $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box and $t_i^*$ is the ground truth box associated with the positive anchor. Smooth L1 loss basically combines the linear aspect of L1 loss and the quadratic aspect of L2 loss. As a result, outliers will not be punished as heavily as they would be with L2 loss. This helps prevent exploding gradients.

These losses are combined into a multi-task loss, as:

L(\{p_i\}, \{t_i\}) = \lambda_1 \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda_2 \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*),

where $N_{cls}$ is the mini-batch size during training (i.e., $N_{cls} = 256$) and $N_{reg}$ is the number of anchor locations (i.e., for face detection $N_{reg} \approx 4{,}800$). These terms are used to normalize the losses. $\lambda$ is a balancing parameter, used to weight the losses more adequately. Following Ren et al., 2015, we use $\lambda_1 = 1$, $\lambda_2 = 10$. It is important to note that through $p_i^*$ only the ground truth class is used, while the background class is ignored.
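A minimal PyTorch sketch of this combined loss is given below, assuming the predictions have already been matched to anchors. The tensor shapes and normalization constants follow the description above, but the function itself is our own illustration, not the exact training code.

import torch
import torch.nn.functional as F

def multi_task_loss(fg_probs, labels, box_preds, box_targets,
                    lam1=1.0, lam2=10.0, n_cls=256, n_reg=4800):
    # fg_probs:    (N,) predicted foreground probabilities p_i
    # labels:      (N,) ground truth p*_i, 1.0 for positive anchors, 0.0 for negative
    # box_preds:   (N, 4) predicted parameterized coordinates t_i
    # box_targets: (N, 4) target parameterized coordinates t*_i
    cls_loss = F.binary_cross_entropy(fg_probs, labels, reduction='sum')
    # Smooth L1 on the box offsets; p*_i masks out the negative anchors.
    reg_loss = (labels.unsqueeze(1) *
                F.smooth_l1_loss(box_preds, box_targets, reduction='none')).sum()
    return lam1 * cls_loss / n_cls + lam2 * reg_loss / n_reg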

5.2 IoU-net

IoU-Net, as proposed by B. Jiang et al., extends a two-stage object detector. On top of predicting the classification confidence and the bounding box coordinates, it predicts the localization confidence of the bounding box proposal. First the pooled feature map of a RoI is fed into part of the convolutional layers of the two-stage detector. Afterwards it is passed through two separate fully connected layers to predict the IoU between the bounding box proposal and a ground truth object. Figure 5.4 shows the architecture of the IoU-Net model used in this research. Like our implementation of Faster R-CNN, each fully connected layer is followed by batch normalization and a ReLU activation, except for the final layer.

Figure 5.4: IoU-Net architecture used (original image by (B. Jiang et al., 2018)).

(B. Jiang et al., 2018) show that IoU-Net can work with a FPN, Mask R-CNN, and Cascade R-CNN. They claim that it can essentially work with any two-stage object detector. The model used in this research uses a Faster R-CNN.

5.2.1 Additional classifier

IoU-net uses the provided RoI to infer the localization confidence. Similar to Faster R-CNN, it does so by observing the pooled feature map of the RoI. The pooled feature map is fed through two fully connected layers followed by a Tanh activation. The Tanh function normalizes the output to the range [−1, 1]. This range is used when calculating the loss between the predicted localization confidence and the ground truth. Finally, we convert it into a range which corresponds to Ω_train. This hyperparameter indicates the range in which the output localization confidence can occur. For instance, Ω_train = 0.5 means the resulting localization confidence can be in the range [0.5, 1.0]. To do this we use the following equation:

x' = \frac{(1 - \Omega_{train})\,x + (1 + \Omega_{train})}{2},

where x is the normalized localization confidence in range [−1, 1] and x' is the final localization confidence in range [Ω_train, 1.0]. Additionally, the value of Ω_train indicates the range of the target localization confidence of the jittered data.
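The sketch below illustrates, under assumed feature dimensions, how such a localization confidence head and the rescaling to [Ω_train, 1.0] could look in PyTorch; it is not the exact module used in this research:

    import torch
    import torch.nn as nn

    class IoUHead(nn.Module):
        # Pooled RoI features pass through two fully connected layers; the final
        # layer is followed by a Tanh so the raw output lies in [-1, 1].
        def __init__(self, in_features=7 * 7 * 512, hidden=1024):
            super().__init__()
            self.fc1 = nn.Linear(in_features, hidden)
            self.bn1 = nn.BatchNorm1d(hidden)
            self.fc2 = nn.Linear(hidden, 1)

        def forward(self, pooled_roi):
            x = torch.relu(self.bn1(self.fc1(pooled_roi)))
            return torch.tanh(self.fc2(x))

    def rescale_confidence(x, omega_train=0.5):
        # Map the Tanh output x in [-1, 1] linearly onto [omega_train, 1.0].
        return (1.0 - omega_train) / 2.0 * x + (1.0 + omega_train) / 2.0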

5.2.2 Training

In order to train the model to predict localization confidence, it needs an input bounding box and a target localization confidence. During training, the best overlapping ground truth bounding box and the RoI are used to calculate the target localization confidence, or IoU. The IoU-Net can use RoIs proposed by the RPN as input bounding boxes. However, training the Faster R-CNN will result in more accurate proposals from the RPN, and thus the target localization confidence of the training data would increase with it. This results in a training data imbalance, which in turn results in the model becoming more biased. Instead, B. Jiang et al., 2018 propose jittering ground truth bounding boxes.

Jittered RoIs

It is possible to synthesize training data. This is done by jittering, or adding noise to, the ground truth bounding box coordinates to create jittered RoIs. These jittered RoIs can then be used instead of the RPN proposals to calculate the target localization confidence. When jittering a ground truth bounding box, it is possible to generate many different bounding boxes, with different localization confidence values. This process gives us more control over the distribution of the training data.

Literally adding random noise to coordinates will create a normal distribution centered around ∼ 0.7 because of how IoU is calculated (Rezatofighi et al., 2019). Instead, to generate a more uniformly distributed sample batch of jittered bounding boxes, we take the following steps (a minimal code sketch follows the list):

1. Divide the sample range into many small intervals, e.g. [0.50, 1.00) into {[0.50, 0.51), ..., [0.99, 1.00)} .

2. Randomly select a ground truth from the batch, G.

3. Uniform randomly select an interval, [L, R), from the list of intervals created in 1.

4. Jitter each coordinate of G = {x_0, y_0, x_1, y_1} to get G' = {x'_0, y'_0, x'_1, y'_1}, until IoU(G, G') ∈ [L, R).
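To make these steps concrete, the following is a deliberately naive rejection-sampling sketch of steps 1-4 (the max_shift parameter and the function names are illustrative); the analytical bounds derived below avoid this trial-and-error:

    import random

    def box_iou(a, b):
        # IoU between two boxes given as (x0, y0, x1, y1).
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def sample_jittered_roi(ground_truths, omega_train=0.5, omega_jitter=100, max_shift=0.3):
        # Step 1: divide [omega_train, 1.0) into intervals of width 1 / omega_jitter.
        n = round((1.0 - omega_train) * omega_jitter)
        step = (1.0 - omega_train) / n
        intervals = [(omega_train + i * step, omega_train + (i + 1) * step) for i in range(n)]
        g = random.choice(ground_truths)            # step 2: pick a ground truth
        lo, hi = random.choice(intervals)           # step 3: pick a target IoU interval
        w, h = g[2] - g[0], g[3] - g[1]
        while True:                                 # step 4: jitter until the IoU fits
            jittered = (g[0] - random.uniform(-max_shift, max_shift) * w,
                        g[1] - random.uniform(-max_shift, max_shift) * h,
                        g[2] + random.uniform(-max_shift, max_shift) * w,
                        g[3] + random.uniform(-max_shift, max_shift) * h)
            target_iou = box_iou(g, jittered)
            if lo <= target_iou < hi:
                return jittered, target_iou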


Step 4 implies there are cases where the resulting IoU is not within the desired range. Note that the lowest value in the list of step 1 is Ω_train, e.g. in this case Ω_train = 0.50. We also indicate a step size of 0.01 in the example; this step size is 1 / Ω_jitter, e.g. in this case Ω_jitter = 100. Thus, Ω_jitter influences the number of intervals we generate data for. To reduce computation we calculate the upper and lower bounds of the distance d_i ∈ {d_0, d_1, d_2, d_3} from each coordinate to be in the desired IoU range. More explicitly, G' is defined as:

x'_0 = x_0 - d_0, \quad y'_0 = y_0 - d_1, \quad x'_1 = x_1 + d_2, \quad y'_1 = y_1 + d_3,

where x_0, y_0 are the top-left and x_1, y_1 are the bottom-right coordinates of the ground truth bounding box, and d_i ∈ {d_0, d_1, d_2, d_3} are the jitter values of {x_0, y_0, x_1, y_1}, respectively.

First we jitter x_0 with d_0, as d_0 depends on nothing. Then we jitter x_1 with d_2, as it only depends on x_0. Afterwards we jitter y_0 with d_1, as it depends on x_0 and x_1. Finally we jitter y_1 with d_3, as it depends on all other coordinates. Here d_i ∈ {d_0, d_1, d_2, d_3} is the distance from each respective coordinate. The equations required to calculate the lower and upper bounds for each d_i are worked out below.

If we consider only a single coordinate x_0 and treat the others as invariant, we can see it as the following problem:

for d < 0: \quad 0 ≤ L ≤ \frac{(w + d)h}{w \cdot h} ≤ R ≤ 1,

for d > 0: \quad 0 ≤ L ≤ \frac{w \cdot h}{(w + d)h} ≤ R ≤ 1,

where w and h are the width and height of a ground truth box, d is the amount by which the current box is jittered, and L and R are the lower and upper bounds of the desired IoU range.

Note that this is basically a simplified IoU calculation, which we can simplify further because for this explanation we only jitter a single coordinate, x_0, and the height is invariant:

for d < 0: \quad 0 ≤ L ≤ \frac{w + d}{w} ≤ R ≤ 1,

for d > 0: \quad 0 ≤ L ≤ \frac{w}{w + d} ≤ R ≤ 1.

We can simplify it further by assuming we want a value B, instead of L or R:

for d < 0: \quad \frac{w + d}{w} = B,

for d > 0: \quad \frac{w}{w + d} = B,

where 0 ≤ B ≤ 1.

Which we can solve for d to get:

for d < 0: \quad d = w(B - 1),

for d > 0: \quad d = \frac{w(1 - B)}{B}.

Finally, we convert B back into [L, R) to get the upper and lower bounds of d for both cases:

for d < 0: \quad B_l = w(L - 1), \quad B_r = w(R - 1),

for d > 0: \quad B_l = \frac{w(1 - R)}{R}, \quad B_r = \frac{w(1 - L)}{L},

where B_r is the upper bound, or the maximum value, of d_i and B_l is the lower bound, or minimum value, of d_i. The final value of d_i is determined by taking a random value in the interval [B_l, B_r). If B_l ≤ d ≤ B_r we will always have an IoU in range [L, R).
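For a single coordinate, the two cases can be turned into a small helper as sketched below; this is only an illustration, and the full four-coordinate procedure in Appendix B additionally accounts for the coordinates jittered before it:

    import random

    def jitter_bounds(w, lo, hi):
        # Bounds [B_l, B_r) on the jitter distance d for one coordinate of a box of
        # width w, so that the single-coordinate IoU stays in [lo, hi).
        shrink = (w * (lo - 1.0), w * (hi - 1.0))           # case d < 0 (box shrinks)
        grow = (w * (1.0 - hi) / hi, w * (1.0 - lo) / lo)   # case d > 0 (box grows)
        return shrink, grow

    def sample_jitter_distance(w, lo, hi):
        # Pick one of the two cases at random and sample d uniformly from [B_l, B_r).
        b_l, b_r = random.choice(jitter_bounds(w, lo, hi))
        return random.uniform(b_l, b_r)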

In order to allow for a bigger spread in coordinates, we allow d_0, d_1, d_2 to be in between the lower bound of case d < 0 and the upper bound of case d > 0, to get:

w(L - 1) ≤ d_0 ≤ \frac{w(1 - L)}{L}.

This briefly explains how the bounds can be calculated for a single coordinate. A more complete calculation of all coordinates, including an example, is described in Appendix B.

When the jittered bounding boxes are generated, the IoU is calculated between these jittered boxes (G') and their respective ground truth boxes (G). Generating the bounding boxes this way commonly results in a distribution of IoU values that closely resembles a uniform distribution. However, it is possible to sample from the uniform distribution in whichever way is desired.

Multi-task loss

Similar to the box regression layer, the loss function used in the localization confidence branch of the IoU-net is the Smooth L1 loss, and it is class specific. Thus the loss is:

L_{loc}(q_i, q_i^*) = \mathrm{SmoothL1}(x = q_i - q_i^*) =
\begin{cases}
0.5x^2, & \text{if } |x| < 1 \\
|x| - 0.5, & \text{otherwise,}
\end{cases}

where q_i is the predicted localization confidence and q_i^* is the ground truth localization confidence, both in range [−1, 1].

It is possible to jointly train the IoU-net using a multi-task loss. To do so we combine the multi-task loss of Faster R-CNN with that of the localization confidence branch. As a result the multi-task loss of the joint IoU-net becomes:

L(\{p_i\}, \{t_i\}, \{q_i\}) = \lambda_1 \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda_2 \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) + \lambda_3 \frac{1}{N_{loc}} \sum_i p_i^* L_{loc}(q_i, q_i^*),

where N_loc is the normalization term for the number of RoIs used (i.e., N_loc = 256), and the balancing parameter λ_3 = λ_1 = 1.
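A minimal sketch of this joint loss, extending the earlier two-term sketch with the localization-confidence term, is shown below; the shapes and names are again illustrative assumptions rather than the thesis implementation:

    import torch
    import torch.nn.functional as F

    def joint_iou_net_loss(p, p_star, t, t_star, q, q_star,
                           lam1=1.0, lam2=10.0, lam3=1.0,
                           n_cls=256, n_reg=4800, n_loc=256):
        # Classification, box regression and localization-confidence terms,
        # each normalized and weighted as in the equation above.
        l_cls = F.binary_cross_entropy(p, p_star, reduction='sum') / n_cls
        l_reg = (p_star * F.smooth_l1_loss(t, t_star, reduction='none').sum(dim=1)).sum() / n_reg
        l_loc = (p_star * F.smooth_l1_loss(q, q_star, reduction='none')).sum() / n_loc
        return lam1 * l_cls + lam2 * l_reg + lam3 * l_loc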

5.2.3 Domain adaptation

Both Faster R-CNN and IoU-net were originally designed for object detection. To allow them to work for faces, some adaptation of the hyperparameters is required. However, the goal of this research is not to tune all the hyperparameters of our models in order to find the best one, or to achieve state-of-the-art performance. Achieving state-of-the-art performance builds on many extensive finetuning processes, including multi-scale testing (Thuis, 2018), hard negative mining (Singh and Davis, 2018; Singh, Najibi, and Davis, 2018) and more. Instead, our objective is to evaluate whether IoU-net can be used for face detection and, more specifically, to study its performance on occluded faces compared to the baseline. Therefore we worked on a reasonable enough baseline to compare with. We minimally tuned the hyperparameters to better fit the face detection domain. This was done by comparing different research which uses popular object detection frameworks for face detection. As such, we changed the anchor ratios to [1, 1.5, 2], as faces are generally more oblong, and we changed the anchor scales to [16², 32², 64², 128², 256², 512²], as the WIDER FACE dataset contains many different scales of faces (X. Sun, P. Wu, and Hoi, 2018b).
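Expressed as a typical detection configuration, these settings might look as follows; the attribute names are illustrative and follow common Faster R-CNN configuration files rather than our exact code:

    # Illustrative configuration values only.
    ANCHOR_RATIOS = [1.0, 1.5, 2.0]              # slightly taller anchors suit faces
    ANCHOR_SCALES = [16, 32, 64, 128, 256, 512]  # anchor areas of 16^2 ... 512^2 pixels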

5.3 Non-maximum suppression

After an image has passed through the detection pipeline, we can be left with many overlapping bounding boxes. Often many of these bounding boxes will correspond to a single object or face. If we consider all these overlapping bounding boxes, many will be labeled as false positives, reducing the model's performance during testing. To address this, non-maximum suppression (NMS) was proposed (Rosenfeld and Thurston, 1971; Viola and M. Jones, 2001).

Greedy NMS

The original NMS, referred to throughout this thesis as Greedy NMS, selects the bounding box with the highest classification confidence and removes all bounding boxes overlapping it with an IoU higher than a given threshold (Ω_nms). Thus, with an ideal threshold, we will be left with only the best bounding box.

IoU-guided NMS

In our case study we show there is only a low correlation between classification confidence and IoU. As a result, using classification confidence to determine whether one bounding box is better than another is unreliable. IoU-net predicts localization confidence, which is trained to be the expected IoU between a detection and a ground truth. This localization confidence can be used during the post-processing of the model detections, i.e. during NMS or bounding box refinement. During NMS it can use this localization confidence to suppress overlapping bounding boxes, hopefully being left with better bounding boxes. This is achieved by extending the standard Greedy NMS to prioritise localization confidence over classification confidence. So instead, it selects the bounding box with the highest localization confidence and awards it the highest classification confidence of all bounding boxes overlapping it with an IoU greater than a given threshold, after which, like with Greedy NMS, it removes the overlapping boxes. We refer to this version of NMS as IoU-guided NMS, as originally proposed by B. Jiang et al., 2018. Note that IoU-guided NMS is also a greedy strategy, as it still selects bounding boxes based on the highest localization confidence.

The pseudocode for Greedy and IoU-guided NMS can be found in Algorithm 1.

Algorithm 1 Greedy and IoU-guided Non-Maximum Suppression.
Input: B = {b_1, ..., b_n}, S, V, Ω_nms
    B, set of bounding boxes,
    S, set of classification confidence scores, belonging to B,
    V, set of localization confidence scores, belonging to B,
    Ω_nms, NMS threshold.
Output: D, set of detected bounding boxes with classification confidence scores.

 1: D ← ∅
 2: while B ≠ ∅ do
 3:     m ← arg max S                      ▷ Greedy NMS
 4:     m ← arg max V                      ▷ IoU-guided NMS
 5:     B ← B \ {b_m}
 6:     s ← s_m
 7:     for b_i ∈ B do
 8:         if IoU(b_m, b_i) > Ω_nms then
 9:             s ← max(s, s_i)            ▷ IoU-guided NMS
10:             V ← V \ {v_i}
11:             B ← B \ {b_i}
12:             S ← S \ {s_i}
13:         end if
14:     end for
15:     D ← D ∪ {b_m, s}
16: end while
17: Return D
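A NumPy sketch of Algorithm 1 is given below; the function names and array handling are illustrative and do not correspond to the implementation used in our experiments:

    import numpy as np

    def iou_one_vs_many(box, boxes):
        # IoU of one box (x0, y0, x1, y1) against an (N, 4) array of boxes.
        x0 = np.maximum(box[0], boxes[:, 0])
        y0 = np.maximum(box[1], boxes[:, 1])
        x1 = np.minimum(box[2], boxes[:, 2])
        y1 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        area_box = (box[2] - box[0]) * (box[3] - box[1])
        area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return inter / (area_box + area_boxes - inter)

    def iou_guided_nms(boxes, scores, loc_conf, nms_thresh=0.5, iou_guided=True):
        # With iou_guided=False this reduces to Greedy NMS (line 3 of Algorithm 1);
        # with iou_guided=True boxes are ranked by localization confidence and each
        # kept box inherits the best classification score among the boxes it suppresses.
        boxes, scores, loc_conf = map(np.asarray, (boxes, scores, loc_conf))
        idx = np.arange(len(boxes))
        keep_boxes, keep_scores = [], []
        while idx.size > 0:
            rank = loc_conf[idx] if iou_guided else scores[idx]
            m = idx[np.argmax(rank)]
            rest = idx[idx != m]
            overlaps = iou_one_vs_many(boxes[m], boxes[rest])
            suppressed = rest[overlaps > nms_thresh]
            s = scores[m]
            if iou_guided and suppressed.size > 0:
                s = max(s, scores[suppressed].max())
            keep_boxes.append(boxes[m])
            keep_scores.append(s)
            idx = rest[overlaps <= nms_thresh]
        return np.array(keep_boxes), np.array(keep_scores)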

Soft NMS does not discard detections, instead it lowers the classification confidence of overlapping detections to reduce the likelihood of them being selected (C. Zhu et al., 2019). The degree to which the classification confidence is decreased depends on the IoU and the Soft NMS variation used. A larger IoU between two detections means the confidence score will be decreased more. Bodla et al., 2017 propose Linear-Soft and Exponential-Soft NMS, where the first will linearly decrease the confidence score based on the IoU between two detections and the second will


exponentially decrease it. Soft NMS has been shown to increase the mAP of object detectors, especially for classes with many overlapping objects.
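For reference, the two weighting functions could be sketched as follows; the Gaussian ("exponential") form and the default σ = 0.5 follow Bodla et al., 2017, and are illustrative here:

    import numpy as np

    def linear_decay(score, overlap):
        # Linear-Soft NMS weighting: the larger the overlap with the selected box,
        # the more the confidence is reduced.
        return score * (1.0 - overlap)

    def exponential_decay(score, overlap, sigma=0.5):
        # Gaussian ("exponential") weighting; sigma controls how quickly the
        # confidence decays with overlap.
        return score * np.exp(-(overlap ** 2) / sigma)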

Soft IoU-guided NMS

As Soft NMS improves the overall mAP of object detectors using classification confidence, the concept could also work with localization confidence. We propose to combine the two into Soft IoU-guided NMS, where we reduce the localization confidence as well as the classification confidence of overlapping bounding boxes. Ideally, this combines the desired features of Soft NMS and IoU-guided NMS to increase the overall localization quality of the remaining bounding boxes after suppression, as well as the number of correct bounding boxes (IoU > 0.5). During our research we used Linear-Soft NMS, and throughout this thesis we refer to it simply as Soft NMS. Likewise, for Soft IoU-guided NMS we chose to use the linear function.

The pseudocode for Soft and Soft IoU-guided NMS can be found in Algorithm 2.

Algorithm 2 Soft and Soft IoU-guided Non-Maximum Suppression.
Input: B = {b_1, ..., b_n}, S, V, Ω_nms
    B, set of bounding boxes,
    S, set of classification confidence scores, belonging to B,
    V, set of localization confidence scores, belonging to B,
    Ω_nms, NMS threshold.
Output: D, set of detected bounding boxes with classification confidence scores.

 1: D ← ∅
 2: while B ≠ ∅ do
 3:     m ← arg max S                      ▷ Soft NMS
 4:     m ← arg max V                      ▷ Soft IoU-guided NMS
 5:     B ← B \ {b_m}
 6:     s ← s_m
 7:     for b_i ∈ B do
 8:         if IoU(b_m, b_i) > Ω_nms then
 9:             s ← max(s, s_i)            ▷ Soft IoU-guided NMS
10:             v_i ← v_i (1 − IoU(b_m, b_i))
11:             s_i ← s_i (1 − IoU(b_m, b_i))
12:         end if
13:     end for
14:     D ← D ∪ {b_m, s}
15: end while
16: Return D
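Analogously, a NumPy sketch of Algorithm 2 with the linear decay is given below; again, the names and array handling are illustrative rather than the implementation used in this thesis:

    import numpy as np

    def box_iou(a, b):
        # IoU between two boxes given as (x0, y0, x1, y1).
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def soft_iou_guided_nms(boxes, scores, loc_conf, nms_thresh=0.5, iou_guided=True):
        # With iou_guided=False this is (Linear-)Soft NMS; with iou_guided=True it is
        # Soft IoU-guided NMS: boxes are ranked by localization confidence, inherit the
        # best overlapping classification score, and both confidences of overlapping
        # boxes are decayed linearly instead of the boxes being discarded.
        boxes = np.asarray(boxes, dtype=float)
        scores = np.asarray(scores, dtype=float).copy()
        loc_conf = np.asarray(loc_conf, dtype=float).copy()
        remaining = list(range(len(boxes)))
        keep_boxes, keep_scores = [], []
        while remaining:
            rank = loc_conf if iou_guided else scores
            m = max(remaining, key=lambda i: rank[i])
            remaining.remove(m)
            s = scores[m]
            for i in remaining:
                overlap = box_iou(boxes[m], boxes[i])
                if overlap > nms_thresh:
                    if iou_guided:
                        s = max(s, scores[i])
                        loc_conf[i] *= 1.0 - overlap
                    scores[i] *= 1.0 - overlap
            keep_boxes.append(boxes[m])
            keep_scores.append(s)
        return np.array(keep_boxes), np.array(keep_scores)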
