
MSc Artificial Intelligence
Track: Computer Vision

Master Thesis

Integral Channel Features with Random Forest

for 3D Facial Landmark Detection

by Arif Qodari (10711996)
February 2016
42 EC

Supervisor/Examiner: Prof. dr. Theo Gevers, Sezer Karaoglu
Assessors: dr. Leo Dorst, dr. Jacco Vink

Informatics Institute, University of Amsterdam


Abstract

Detecting facial landmarks is important for understanding human faces. While 2D image-based approaches have been well studied in the literature, 3D-based approaches remain challenging for several reasons, e.g. degraded performance on noisy data, long run times due to model complexity, and limited robustness against pose variations. In this thesis, we investigate the performance of random forest based models combined with integral channel features for 3D facial landmark detection. We study the influence of heterogeneous information computed from multiple feature channels on the accuracy of a 3D landmark detector. A variant of the random forest algorithm that utilizes multiple channel features is proposed to localize 3D facial landmarks. Multiple channel features provide rich and diverse information, and these features are efficiently computed using integral images from noisy RGB-Depth images. Finally, we present our experimental results evaluated on the Biwi Kinect dataset, which covers a large range of head pose angles. The results show that adding more channel features, specifically gray and gradient channels, improves the accuracy of our detector compared to using a single depth channel. Moreover, the additional gray and gradient channels also increase the robustness of the detector against head pose variations. We also demonstrate that our approach produces higher mean accuracy than a 2D-based state-of-the-art method.


Acknowledgements

I would like to thank Theo Gevers for giving me the opportunity to work on an interesting topic under his supervision.

Many thanks to Sezer Karaoglu for his valuable advice and feedback to improve the quality of this thesis. He helped me intensively with the project and the writing process. During the project, we had many meetings to discuss both the theoretical and technical aspects. Despite his busy schedule, I could always contact him to discuss the problems I faced.

I also want to thank my wife and my parents for supporting me along the way.


Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Thesis Outline

2 Random Forest for 3D Facial Landmark Detection
  2.1 Related Work
    2.1.1 3D Facial Landmark Detection
    2.1.2 Random Forest
  2.2 Training Forest
    2.2.1 Binary Test
    2.2.2 Objective Function
  2.3 Testing
    2.3.1 Vote Clustering

3 Integral Channel Features
  3.1 Related Work
  3.2 Integral Image
  3.3 Channel Features
  3.4 Evaluation on Computation Time

4 Experiments
  4.1 Experiment Setup
    4.1.1 Dataset Labelling
    4.1.2 Parameter Settings
    4.1.3 Evaluation Measure
  4.2 Results
    4.2.1 Number of Trees
    4.2.2 Accuracy
    4.2.3 Accuracy vs Efficiency
    4.2.4 Channel Features Comparison
      4.2.4.1 Frontal face dataset
      4.2.4.2 Full dataset
    4.2.5 Analysis of Performance under Head Pose Variations
    4.2.6 Comparison with 2D-based State-of-the-Art

5 Conclusion

List of Figures

2.1 The pipeline of the training process
2.2 Vote clustering
3.1 Integral image
3.2 Multiple registered image channels
4.1 Landmark annotation result
4.2 The influence of the number of trees
4.3 Accuracy
4.4 Accuracy vs efficiency in the frontal face test set
4.5 Accuracy vs efficiency in the full test set
4.6 Channel features performance comparison in the frontal face dataset
4.7 Examples of failure cases in the frontal face test set
4.8 Channel features performance comparison in the full test set
4.9 Examples of failure cases in the full test set
4.10 Performance under head pose variations
4.11 Examples of failure cases in large head poses

List of Tables

3.1 Average time needed to generate image channels
4.1 Summary of the performance of channel features
4.2 Evaluation result on 4-fold cross validation
4.3 Performance comparison with 2D-based method


Chapter 1

Introduction

1.1 Motivation

Facial landmark detection is the problem of detecting points of interest (e.g. eyes, mouth corners, and the nose tip) on human faces. Facial landmarks are important for many facial analysis tasks such as face recognition [1], facial expression recognition [2], and facial animation [3]. Therefore, detecting facial landmarks is essential for understanding faces.

There have been many approaches developed to robustly detect facial landmarks. These methods can be categorized into 2D-based and 3D-based approaches. 2D image-based approaches operate in 2D and hence return the detected landmarks in 2D coordinates, while 3D-based approaches detect 3D facial landmarks from depth images.

Facial landmark detection from 2D images has been well studied in the literature. Recently, a number of real-time approaches have achieved high detection accuracy on face images collected in the wild [4][5][6][7][8]. However, the performance of 2D image-based methods usually deteriorates under varying illumination conditions (e.g. highlights, shadows, and dim light). In addition, 2D image-based methods often require an initial face region obtained by a face detection algorithm. As a consequence, the performance of these methods is limited by the accuracy of the face detector.

Prior work on 3D-based approaches shows that they are more robust than 2D image-based approaches against lighting conditions and head pose variations [9][10]. For instance, the method proposed by Baltrušaitis et al. [9] integrates depth and intensity images to alleviate the problems caused by poor lighting conditions. Since depth information is independent of light, the appearance of objects is not affected by lighting conditions. The authors reported accurate results for detection and tracking of 3D facial landmarks.


Another relevant example is the method by Papazov et al. [10], which exploits depth features to detect 3D facial landmarks under varying head poses and obtains high detection accuracy in real time.

Detecting 3D facial landmarks remains a challenging problem for several reasons: performance issues due to noisy input data, run time issues due to computational complexity, and robustness issues for pose variations and local deformations. Another challenge is the lack of annotated 3D facial landmark datasets. Some of the prior work used a synthetic head model [9] or high-quality face scans [11][12], leading to a performance gap when tested on noisy depth data acquired by depth sensors. In our experiments, we used the Biwi Kinect dataset [13], which has more than 15K RGB-Depth frames covering various head poses. This dataset does not provide any landmark annotations; however, head rotation angles and head center locations are provided. In section 4.1.1, we present an algorithm to annotate rigid landmark points through all frames using the provided head rotation angles and center locations.

Random forests have been widely used in many computer vision tasks, including 2D [14][7] and 3D facial analysis [12][14][13]. A random forest consists of multiple decision trees; a single tree maps complex feature spaces into simpler decision spaces. Random forests have the capability to handle large data inputs efficiently. Moreover, the concept of randomness also helps to avoid overfitting.

A number of random forest based models have been proposed for 2D facial landmark detection. Dantone et al. [14] proposed a conditional regression forest to detect 2D facial landmarks in 2D images. The proposed regression forest is conditioned on global face properties, e.g. head pose. This method employs multiple channel features: raw and normalized gray values and Gabor filter banks [15] computed for varying parameters. The authors demonstrated the benefit of using a random forest based model to map complex high-dimensional features into a multi-class decision model. A similar approach was proposed by Kazemi et al. [7], namely Ensemble Regression Trees (ERT). Unlike the method proposed by Dantone et al., the ensemble trees work in a cascaded architecture and are trained by a gradient boosting algorithm. The method utilizes differences of intensity values at pairs of pixels as features. The combination of regression trees using these features predicts 2D facial landmarks accurately and efficiently (about 1 ms to process one image).


In the context of 3D facial landmark detection, Fanelli et al. [12] proposed a random forest to localize 3D facial landmarks under various facial expressions from high-quality face scans. The method utilizes generalized Haar-like features [16] computed from the depth channel. The authors reported real-time and accurate detection results.

Today, advances in device technology allow us to record depth information as well as RGB information at low cost, e.g., MS Kinect and Asus Xtion. Inspired by the work of Fanelli et al. [12], we employed a similar approach to detect 3D facial landmarks from RGB-D images. The difference is that we exploited multiple sources of information, i.e. RGB and depth, and analyzed the influence of data diversity on the 3D facial landmark detection performance.

A number of methods have shown that integral channel features [17] are effective for many computer vision tasks, including object recognition [18], pedestrian detection [19], and local region matching [20]. Integral channel features capture rich information from different and diverse channels in images. In addition, the features can be efficiently computed using integral images. For these reasons, we combined integral channel features with a random forest model to detect 3D facial landmarks.

Our main contribution is the combination of integral channel features with a random forest to detect 3D facial landmarks from RGB-D images. We study the influence of various channel features on the 3D facial landmark detection performance. We also investigate the robustness of our approach under varying head poses.

1.2 Goal

This thesis focuses on investigating the combination of different channel features for 3D facial landmark detection. We address the following research questions:

1. How can multiple channel features be integrated into a random forest based model for 3D facial landmark detection?

2. What are the best performing channels for detecting 3D facial landmarks?


1.3 Thesis Outline

In Chapter 2, we first summarize prior work related to 3D facial landmark detection and variants of random forest based approaches for facial analysis tasks. We then explain the details of the random forest algorithm specific to 3D facial landmark detection.

Chapter 3 describes the integral channel features and approaches for integrating the features into a random forest. It discusses three different channel types: depth, gray, and the gradient histogram. This chapter also provides a discussion of the computation time.

Implementation details (e.g. dataset annotation, parameter settings, and evaluation metric) and experiments are discussed in Chapter 4.


Chapter 2

Random Forest for 3D Facial Landmark Detection

Typical random forest algorithms work in a supervised way, i.e. the algorithm constructs trees from a set of training data annotated with the desired output labels. We call this the training process. A tree is constructed to maximize the information gain by mapping complex input spaces into simpler discrete (classification) or continuous (regression) output spaces. The mapping process is done in every non-leaf node, while each leaf node stores the information to be used for prediction. Once the forest is constructed, a testing process is conducted to evaluate the generalization ability of the trained forest on unseen data. A set of testing data is propagated down the trees, where each tree gives a prediction vote. The forest determines the final prediction by either averaging the votes or taking the majority vote.

This chapter discusses a specific variant of the random forest algorithm for 3D facial landmark detection. Section 2.1 presents a literature review related to 3D facial landmark detection and random forest based solutions for facial analysis. Training and testing are discussed in sections 2.2 and 2.3, respectively.

2.1 Related Work

2.1.1 3D Facial Landmark Detection

A number of methods have been proposed in the literature for detecting 3D facial landmarks from both noisy and high-quality input.


Baltrušaitis et al. [9] proposed a 3D Constrained Local Model (CLM-Z), an extension of the Constrained Local Model [21], for facial landmark tracking under varying pose. Depth and intensity channels were integrated to reduce missed detections caused by poor lighting conditions. This model has shown robust performance under varying lighting conditions and poses.

Ju et al. [22] combined a 3D shape descriptor with binary neural networks to detect the nose tip and eyes. The descriptor is invariant against illumination variations. The reported accuracy was over 99.6% in the presence of facial expressions.

Zhao et al. [23] introduced the Statistical Facial Model (SFAM) which combines local variations of texture and geometry around each landmark with global variations between landmarks. A robust fitting algorithm was proposed to localize landmarks under facial expressions and occlusions. Although high accuracy results were reported, the proposed algorithm is computationally expensive.

Papazov et al. [10] proposed Triangular Surface Patch (TSP) features extracted from 3D point clouds to jointly estimate the head pose and 3D facial landmarks. The authors demonstrated that these features are efficient to compute, viewpoint-independent, and insensitive to pose changes. The proposed approach achieves high accuracy in real time.

2.1.2 Random Forest

Random forest, as introduced by Breiman [24], is an ensemble learning method that consists of multiple decision trees [25]. Each tree in the forest is constructed from a randomly sampled subset of the training data. Starting from the root node, every non-leaf node generates a number of candidate splits and finds the optimal split of the incoming data. The optimal split φ* is defined as the one which maximizes the information gain:

$$\phi^* = \arg\max_{\phi} IG(\phi), \qquad (2.1)$$

$$IG(\phi) = H(P) - \sum_{i \in \{L,R\}} w_i\, H(P_i(\phi)), \qquad (2.2)$$

where w_i is the ratio of the input data propagated to each child node and H(P) is the uncertainty measure of the input set P. After the split, the results are sent to the left and right child nodes. The procedure is then repeated until all leaves are created.


In the context of 3D face analysis, random forest based approaches have been applied to estimate head pose from high-quality head scans [11]. The authors achieved real-time performance without requiring a Graphics Processing Unit (GPU). They extended their work [26] to noisy depth data obtained from a consumer depth camera and still obtained low regression errors, although the results were not as accurate as the previous system due to the noisier input data. In their subsequent paper [12], the authors extended their work to facial landmark detection. The method was evaluated on high-quality face scans containing facial expressions and head pose rotations, and high accuracy results were reported.

Another relevant work, by Fanelli et al. [13], proposed a random regression forest to steer the fitting of an Active Appearance Model (AAM) [27]. The authors achieved robust performance by integrating depth and intensity channels.

2.2 Training Forest

A forest is basically a collection of decision trees. To construct a decision tree T_t in the forest T = {T_t}, a set of randomly sampled training images is provided. Every image has multiple registered channels, which will be discussed in chapter 3. Next, a set of fixed-size image patches is extracted from each training image and each channel. The patches are extracted around the facial landmark points (positive samples) and outside the face region (negative samples). More specifically, a patch is considered a positive sample for a landmark point k if the distance d_k between the center of the patch and the landmark point is below a certain radius. We follow the parameter setting from [12], in which the radius is defined as one fifth of the radius of an average human face, i.e. d_k ≤ 0.2r, where r is the radius of an average human face. Figure 2.1 illustrates the pipeline of the training process.

Each patch P_i consists of multiple channel features I_i = (I_i^1, ..., I_i^C) and is annotated with a class label c_i ∈ {0, 1, ..., K} and an offset vector θ_i = (θ_i^1, ..., θ_i^K), where K is the number of landmark points and c_i = 0 means that the patch is sampled from the background, e.g. hair or body. The offset vector θ^k = (θ_x^k, θ_y^k, θ_z^k) represents the relative position of landmark point k from the patch center.

Each tree is constructed using a different set of training patches to make sure that the trees are less correlated. Reducing the correlation between any two trees in the forest reduces the error rate [24], because a single decision tree can be seen as a predictor with high variance. Adding more trees and averaging the results moves the final prediction closer to the actual value.


Figure 2.1: The pipeline of the training process: (1) RGB and depth images are aligned using the calibration matrix. (2) Multiple channels are generated from each image in the training set. (3) A set of positive and negative training patches is extracted from the registered image channels. (4) The training patches are used to construct the trees.

A tree is grown from its root node until all leaf nodes are created, i.e. when either the maximum tree depth is reached or fewer than a certain number of patches are left. The algorithm for growing a tree in the forest is summarized as follows:

1. Sample with replacement N training images from the original training set.

2. Randomly extract a number of positive and negative patch samples from the training images.

3. Starting from the root node:

   (a) Generate different sets of parameters for the binary tests {φ = (f, R1, R2, τ)}. The details of the binary test are described in section 2.2.1.

   (b) Perform the binary tests for all generated parameter sets.

   (c) Select the optimum parameters that maximize the objective function and store them in the current node. The details of the objective function are described in section 2.2.2.

   (d) Divide the incoming patches P into two subsets P_L and P_R and send them to the corresponding child nodes.

4. Repeat step 3 until all leaves are created. Once a leaf node L is created, it stores two kinds of information:

   (a) The probability of each class in that leaf, p(c = k|L), computed as the ratio of positive samples of class k that arrive at that leaf.


   (b) The distribution over offset vectors for each facial landmark. The distribution is modelled by a multivariate Gaussian, similar to [14]:

$$p(\theta^k \mid L) = \mathcal{N}(\theta^k;\, \bar{\theta}^k,\, \Sigma^k), \qquad (2.3)$$

where θ̄^k and Σ^k are the mean and covariance matrix of the offset vectors of facial landmark k.

2.2.1 Binary Test

As described in the previous section, a binary test is performed to split the incoming patches into two subsets. In order to find the optimum split, typically a large number of candidate splits is generated. This means generating a large number of candidate parameter sets and then evaluating them using the binary test. The binary test is defined as follows [12]:

$$\frac{1}{|R_1|} \sum_{q \in R_1} I_f(q) \;-\; \frac{1}{|R_2|} \sum_{q \in R_2} I_f(q) \;>\; \tau, \qquad (2.4)$$

where I is the image channel, f is the channel index, R1 and R2 are two rectangular sub-patches within the patch, and τ is a threshold. The parameters f, R1, R2, and τ are generated randomly, and the result of this test determines how to split the incoming image patches. A patch is sent to the right child node if the test returns true; otherwise it is sent to the left child node.

It can be derived from equation 2.4 that the test measure is the difference between the average values of two rectangular sub-patches. Using average pixel values reduces the effect of missing information in noisy data. Section 3.2 discusses how to compute the sum of pixel values over any rectangular region R using integral images.
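To make this concrete, the following minimal Python sketch shows how such a test could be evaluated on precomputed integral images; the helper names and the rectangle convention (integral images padded with a leading zero row and column, as in section 3.2) are ours, not the thesis implementation.

    def box_sum(ii, x0, y0, x1, y1):
        # Sum of pixel values over x0 < x <= x1, y0 < y <= y1 (Equation 3.2),
        # computed with four references into the integral image ii.
        return ii[x1][y1] + ii[x0][y0] - ii[x0][y1] - ii[x1][y0]

    def binary_test(integral_channels, f, R1, R2, tau):
        # Evaluate the test of Equation 2.4 for one patch: compare the average
        # values of two rectangular sub-patches R1 and R2 on channel f.
        ii = integral_channels[f]
        (ax0, ay0, ax1, ay1), (bx0, by0, bx1, by1) = R1, R2
        mean1 = box_sum(ii, ax0, ay0, ax1, ay1) / ((ax1 - ax0) * (ay1 - ay0))
        mean2 = box_sum(ii, bx0, by0, bx1, by1) / ((bx1 - bx0) * (by1 - by0))
        return (mean1 - mean2) > tau  # True -> patch goes to the right child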

2.2.2 Objective Function

As mentioned in section 2.1.2, a forest is trained to maximize the information gain in every node of a tree, which results in a minimum uncertainty measure. In our case, we want to classify image patches according to the facial landmark they belong to; therefore, the term H(P) in equation 2.2 is replaced by a classification uncertainty measure, defined by:

$$H(P) = -\sum_{k=0}^{K} p(c = k \mid P)\,\log p(c = k \mid P), \qquad (2.5)$$


where K is the number of classes (the number of landmarks + 1) and p(c = k|P) is the probability of class k in the patch set P. The probability p(c = k|P) is approximated by computing the ratio of positive patches for landmark k in the set P.

The complete objective function is obtained by substituting equation 2.5 into equation 2.2. The optimum split is the one which maximizes this objective function.
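As an illustration, the entropy of Equation 2.5 and the information gain of Equation 2.2 could be computed as in the following sketch (NumPy arrays assumed; the function names are our own):

    import numpy as np

    def entropy(labels, num_classes):
        # Classification uncertainty H(P) of Equation 2.5 for a set of patch labels.
        counts = np.bincount(labels, minlength=num_classes)
        p = counts / max(len(labels), 1)
        p = p[p > 0]  # treat 0 * log(0) as 0
        return float(-np.sum(p * np.log(p)))

    def information_gain(labels, go_right, num_classes):
        # IG(phi) of Equation 2.2 for one candidate split. `labels` holds the
        # class of each incoming patch (0 = background) and `go_right` the
        # boolean outcome of the binary test per patch.
        gain = entropy(labels, num_classes)
        for subset in (labels[~go_right], labels[go_right]):
            w = len(subset) / len(labels)  # ratio sent to this child node
            gain -= w * entropy(subset, num_classes)
        return gain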

2.3 Testing

Once a complete forest has been trained, we test its performance on unseen images. A set of dense patches is extracted from a test image with a predefined stride parameter, which controls the distance between patches. These patches are then sent down the trained trees. In each tree, the binary test with the stored optimum parameters leads a patch from the root node until it reaches a leaf node. The information in the leaf node is used to compute a prediction vote. So, for each patch P, we obtain a set of prediction votes from the trees.

However, not all votes are considered. A leaf node L is allowed to vote for the location of landmark point k only if the following conditions are met:

1. The probability of class k stored in the leaf node is higher than a threshold, i.e. p(c = k|L) ≥ t_prob.

2. The trace of the corresponding covariance matrix (Equation 2.3) is below a maximum variance, i.e. Tr(Σ^k) < t_var.

The optimal values for t_prob and t_var are 0.75 and 300, respectively; these values were found by trial-and-error experiments. This criterion ensures that only votes with high confidence are considered for prediction.

After sending all patches down the trees, K different sets of votes {v_i^k} are obtained. Each set represents the location candidates for the corresponding landmark k. Location candidates are calculated by adding the mean offset vector θ̄^k stored in the leaf node to the patch center coordinates.
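The vote collection step could be sketched as follows; the tree interface (route(), prob, mean_offset, cov) is an assumed, illustrative API rather than the actual implementation:

    import numpy as np

    def collect_votes(patches, forest, num_landmarks, t_prob=0.75, t_var=300.0):
        # Collect location votes per landmark from all trees. `patches` yields
        # (center_xyz, features) pairs for the densely extracted test patches;
        # each tree is assumed to expose route(features) -> leaf, where a leaf
        # stores prob[k], mean_offset[k] and cov[k] for every landmark k.
        votes = [[] for _ in range(num_landmarks)]
        for center, features in patches:
            for tree in forest:
                leaf = tree.route(features)
                for k in range(num_landmarks):
                    # only confident, low-variance leaves are allowed to vote
                    if leaf.prob[k] >= t_prob and np.trace(leaf.cov[k]) < t_var:
                        votes[k].append(center + leaf.mean_offset[k])
        return [np.asarray(v) for v in votes]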

Finally, mean shift clustering [28] is performed on each vote set k to obtain the final prediction. The next section describes the vote clustering algorithm.


2.3.1 Vote Clustering

Since our approach does not involve any face or head detection, a bottom-up clustering with a predefined radius (the radius of the average human face) is performed to localize head positions and to filter out outliers. Outliers are identified when the number of votes in the resulting cluster is below a threshold that defines the minimum number of votes. We follow [12] to set the threshold value.

Within each head cluster, mean shift clustering is performed for each landmark k. Mean shift is a non-parametric iterative algorithm that finds the mode of a density function. The algorithm assumes that the given data are sampled from a probability density function, where dense regions correspond to local maxima, i.e. the modes of the density function.

Starting from an arbitrary location, mean shift defines a window around it and computes the weighted mean of the data within the window. The window size is defined by a kernel function; there are many choices for the kernel, e.g. a flat kernel or a Gaussian kernel. Next, the center of the window is shifted to the new weighted mean. This procedure is repeated until convergence or until a maximum number of iterations is reached.

Given a set of landmark votes {v_i^k} and a Gaussian kernel K, the clustering procedure is summarized as follows:

1. Set the initial estimate m_{t=0}^k to the mean of the landmark votes.

2. Repeat until m^k converges or the maximum number of iterations is reached:

   (a) Update the weighted mean m^k:

$$m^k_{t+1} = \frac{\sum_{v^k_i} K(v^k_i - m^k_t)\, v^k_i}{\sum_{v^k_i} K(v^k_i - m^k_t)}, \qquad (2.6)$$

where $K(v^k_i - m^k_t) = \exp\!\left(-\frac{\|v^k_i - m^k_t\|^2}{2h^2}\right)$ and h = 0.2r. The bandwidth parameter h determines the size of the clustering window; we set its value to one fifth of the radius of the average face.


In each iteration, the window determined by the Gaussian kernel K is shifted to a denser region, so that, in the end, it reaches the peak of the density function. The final prediction for landmark k is given by the final value of the weighted mean m^k. Figure 2.2 illustrates the votes for all landmark points and the final prediction.

Figure 2.2: All votes for each landmark are represented as point clouds in different colors. The centers of the circles represent the final prediction of landmark positions.
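A minimal sketch of this clustering step, directly following Equation 2.6, might look as follows (the convergence tolerance and iteration cap are illustrative choices):

    import numpy as np

    def mean_shift(votes, r, max_iter=100, eps=1e-3):
        # Mean shift with a Gaussian kernel (Equation 2.6) on the 3D votes of
        # one landmark. The bandwidth is h = 0.2 * r, with r the radius of the
        # average human face. Returns the estimated mode of the vote density.
        h = 0.2 * r
        m = votes.mean(axis=0)  # initial estimate m_{t=0}
        for _ in range(max_iter):
            w = np.exp(-np.sum((votes - m) ** 2, axis=1) / (2.0 * h ** 2))
            m_new = (w[:, None] * votes).sum(axis=0) / w.sum()
            if np.linalg.norm(m_new - m) < eps:  # converged
                return m_new
            m = m_new
        return m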


Chapter 3

Integral Channel Features

In the previous chapter, we explained random forest based models for 3D landmark detection. The performance of these models is determined not only by the learning algorithm itself, but also by the feature representation. Thus, the choice of features is an important aspect of developing a robust landmark detector.

In this chapter, we discuss integral channel features and how these features are integrated into a random forest model. The idea of integral channel features is simple but effective. A number of image channels are generated from a given image. These channels can take many different forms: for instance, depth and color channels are obtained directly from an image, while other channels can be computed using a linear transformation (e.g. Gabor filters), a non-linear transformation (e.g. gradients), or even a pointwise transformation. Once the channels are generated, features such as local sums, histograms, and Haar features are computed for each channel. The features capture heterogeneous and rich information from the different channel types. Furthermore, these features can be computed efficiently using integral images.

In the first section, we summarize related work on integral channel features applied in different computer vision tasks. Section 3.2 describes integral images. Section 3.3 explains the different channel types and how the channels and features are integrated into a random forest model. Lastly, in section 3.4, we present an evaluation of the different channels in terms of computation time.

3.1 Related Work

The notion of integral channel features is inseparable from the concept of the integral image. The first work adopting integral images in the computer vision domain was that of Viola and Jones [29], who proposed cascaded AdaBoost classifiers with Haar-like features for object detection. They achieved real-time performance with high detection accuracy. Their work was a breakthrough in computer vision: their proposed feature representation proved efficient yet effective for object detection. Later, a similar framework was adopted in many other applications.

Integral channel features, in particular, have proven effective for many computer vision tasks, e.g. object recognition [18], pedestrian detection [19], and local region matching [20]. In the medical imaging domain, Tu et al. [30] introduced a probabilistic boosting tree framework with various image channels for MRI brain segmentation. The authors computed Gabor filter and edge response channels at different scales, combined with 3D Haar filter channels on top of them.

Dollar et al. [31] trained an edge detector using a large number of channel features, including gradients at various scales, Gabor filters, and Gaussian filters, and obtained high accuracy. In their subsequent paper [17], Dollar et al. explored different types of channel features and studied their performance for pedestrian detection. Their proposed method successfully outperformed other features, including Histograms of Oriented Gradients (HoG).

A variant of integral channel features, named aggregate channel features, was proposed by Yang et al. [32] to train a multi-view face detector. The authors adopted the Viola-Jones learning framework and utilized different types of color and gradient channels to deal with poses ranging from frontal faces to profile faces. The algorithm achieved high detection accuracy on face images in the wild.

Although integral channel features have been applied in many different tasks, only a few methods have utilized them in 3D-related problems. In this thesis, we are interested in exploiting multiple channel features computed from RGB-Depth images.

3.2 Integral Image

In image processing, the integral image is a representation that allows the sum of pixel values in any rectangular image area to be calculated efficiently. Figure 3.1 illustrates how the integral image works.

At each location (x, y), the integral image contains the sum of the pixel values above and to the left of (x, y). It is formally defined by:

$$I(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y'), \qquad (3.1)$$

where i is the input image and i(x', y') is the pixel value at location (x', y').

Once the integral image has been computed, the sum of values over any rectangular area (x0, y0, x1, y1) can be calculated in constant time O(1) using four references:

$$\sum_{\substack{x_0 < x \le x_1 \\ y_0 < y \le y_1}} i(x, y) = I(x_0, y_0) + I(x_1, y_1) - I(x_0, y_1) - I(x_1, y_0) \qquad (3.2)$$

Figure 3.1: (a) Input image and (b) computed integral image. The sum of values in region A can be computed using four references: L1 + L4 − L2 − L3.
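A minimal sketch of both operations, assuming a NumPy image, is given below; padding the integral image with a zero row and column is an implementation convenience, not part of the definition:

    import numpy as np

    def integral_image(img):
        # Integral image of Equation 3.1. A zero row and column are prepended
        # so that Equation 3.2 needs no boundary checks.
        ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
        ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
        return ii

    def box_sum(ii, x0, y0, x1, y1):
        # Sum over the rectangle x0 < x <= x1, y0 < y <= y1 (Equation 3.2),
        # an O(1) operation using four references.
        return ii[x1, y1] + ii[x0, y0] - ii[x0, y1] - ii[x1, y0]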

The concept of integral images has been extended in various ways. For instance, integral images can also be used to compute the local product of any rectangular area within an image, by taking the log of the pixel values before computing the sum, since exp(Σ_i log x_i) = Π_i x_i. Lienhart and Maydt [33] extended the integral image representation to compute the sum of pixels in rotated rectangular regions. They proposed rotated Haar-like features with boosted classifiers for object detection; these features were reported to produce more robust and accurate detections. Another variant is the integral volume representation, the three-dimensional generalization of the integral image. Ke et al. [34] exploited such volumetric features in the spatio-temporal domain for event detection in video sequences. The method achieved real-time performance with low errors.


3.3 Channel Features

In section 2.2.1, we defined features as average pixel values over two rectangular regions within a patch. This kind of feature can be computed from any type of channel. The only prerequisite is that a channel C has to be translationally invariant: if two images I and I' are related by a translation, the generated channels C and C' must be related by the same translation. This criterion allows us to efficiently compute features from any rectangular region within the image channel. An image channel C is generated only once rather than for every image patch, and computing features in an image patch is done using integral images.

In this section, we study three different channel types: depth, gray, and gradient histograms. These channels are illustrated in Figure 3.2. The rationale behind selecting these channels is that they capture local information about the face surface and its contours. In particular, depth values describe the shape of the face surface, gray values capture the texture of the face surface, and image gradients capture the rate of texture change and edge responses along different angles.

Figure 3.2: Examples of generated image channels: depth, gray, and gradients along 4 different angles. The sum over any rectangular region within the image is computed using an integral image.

1. Depth channel

This channel is obtained directly from the RGB-D image. The sum over any rectangular region of the depth channel is computed directly using integral images.

2. Gray channel

The gray channel is generated from the RGB color channels. Normalized gray values are used to minimize the effect of illumination variations.

3. Gradient histogram

The algorithm to compute histograms using integral images was first introduced by Porikli [35]. Gradient histograms are the most commonly used variant of integral histograms. They are generated by quantizing the gradient orientations of the gray image into a number of angle bins, where each value within a quantized image is weighted by its gradient magnitude:

$$Q_\theta(x, y) = G(x, y)\,\mathbb{1}[\Theta(x, y) = \theta], \qquad (3.3)$$

where 1[·] is the indicator function, θ is a gradient angle, and G(x, y) and Θ(x, y) are the gradient magnitude and the quantized gradient angle at pixel location (x, y), respectively.

In our setting, instead of combining the quantized images into histograms, we adopt the quantized images themselves as multiple individual channels. The only parameter to be set here is the number of quantized images to compute; this parameter influences the performance of the model, and its impact is discussed in chapter 4.

This technique can also be applied to approximate HoG features, as in [19], by combining all quantized images into a histogram and normalizing it with a gradient image computed at a different scale.
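As an illustration of Equation 3.3, the sketch below computes the quantized gradient channels for a grayscale image; the derivative filter (numpy.gradient) and the orientation range [0, π) are our assumptions, not necessarily the thesis implementation:

    import numpy as np

    def gradient_channels(gray, num_bins=4):
        # Quantized gradient channels of Equation 3.3. Channel theta contains
        # the gradient magnitude at pixels whose quantized orientation is theta.
        gy, gx = np.gradient(gray.astype(np.float64))
        magnitude = np.hypot(gx, gy)
        angle = np.mod(np.arctan2(gy, gx), np.pi)  # orientation in [0, pi)
        bins = np.minimum((angle / np.pi * num_bins).astype(int), num_bins - 1)
        channels = np.zeros((num_bins,) + gray.shape)
        for theta in range(num_bins):
            channels[theta] = magnitude * (bins == theta)  # Q_theta = G * 1[Theta = theta]
        return channels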

Channel features are integrated into our random forest model as follows. Since RGB-D images are used, the RGB and depth images first have to be aligned. In the training process, each training image is transformed into multiple image channels, and a set of patches extracted from these channels is used to grow the trees. During training, the learning algorithm selects for every non-leaf node the optimum channel, i.e. the one that maximizes the information gain. The same approach is applied in the testing phase. Intuitively, the more channels used in the model, the richer the information the model can collect to classify patches correctly. However, adding more channels also increases the complexity of the model and the computation time. Our experiments study which combination of channel features produces the best performance for detecting 3D facial landmarks; the results are reported in chapter 4.

3.4 Evaluation on Computation Time

To demonstrate the efficiency of integral channel features, we perform experiments measuring the average time needed to generate each individual channel and each channel combination. All experiments are conducted on the same standard PC.

Table 3.1 shows the average time needed to generate the different channel types, including the computation of the integral images from the channels. Computing the integral images from the depth and gray images takes around 2 ms. The time to compute the sum of values over any rectangular region within a channel is also negligible, since it is an O(1) operation.

Channel Types                  Time (ms)
1 Gradient                     16.29
4 Gradients                    22.75
9 Gradients                    34.71
Depth + Gray + 1 Gradient      20.67
Depth + Gray + 4 Gradients     26.21
Depth + Gray + 9 Gradients     38.46

Table 3.1: Average time needed to generate image channels.


Chapter 4

Experiments

4.1 Experiment Setup

4.1.1 Dataset Labelling

We evaluated our model on the Biwi Kinect head pose dataset (available at http://fanelli.li/). The dataset contains 24 sequences of 20 subjects (14 men and 6 women) with more than 15K frames in total. Each frame has both an RGB image and a depth image, as well as information about the head rotation and location. The head rotation angles vary within ±60° pitch, ±75° yaw, and ±50° roll.

The dataset has no landmark annotations. To annotate landmarks for each subject, we used the following algorithm:

1. Manually annotate the landmark points in the first frame. Any facial landmark detector can also be used to automate this step; in our setting, we annotated the first frames using the 2D landmark detector proposed in [6]. This step results in facial landmark annotations in 2D coordinates.

2. From the 2D landmarks, compute 3D landmarks using the corresponding depth image and the camera intrinsic matrix:

$$p = M^{-1}x, \qquad (4.1)$$

where x is a vector representing a 2D landmark, p denotes the landmark in 3D, and M is the camera intrinsic matrix.


3. Shift the head center location to the origin of the coordinate system, i.e. subtract the head center position from the landmark positions.

4. Transform the landmarks with the inverse rotation matrix. This results in landmarks in 3D camera coordinates:

$$p_0 = R_1^{-1}\,p, \qquad (4.2)$$

where R1 is the rotation matrix of the first frame.

5. From frame 1 to frame N, transform the landmarks p0 using the rotation matrix of each frame. The final landmark positions are obtained by translating the transformed landmark positions with the head center location. Figure 4.1 illustrates landmark annotation results for different head poses.

$$p_n = R_n\,p_0 + h, \qquad (4.3)$$

where p_n and R_n are the final 3D landmarks and the rotation matrix at frame n, respectively, and h is the head center location.
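A sketch of this propagation step is given below; we assume the head center h is the one provided for each frame, and that landmarks are stored as row vectors:

    import numpy as np

    def propagate_landmarks(p0, rotations, head_centers):
        # Propagate the de-rotated first-frame landmarks p0 (a (K, 3) array,
        # Equation 4.2) to every frame via Equation 4.3: p_n = R_n p0 + h.
        annotations = []
        for R_n, h in zip(rotations, head_centers):
            annotations.append(p0 @ R_n.T + h)  # row-vector form of R_n p0 + h
        return annotations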

Figure 4.1: Examples of annotation results for different head poses. The landmarks (green dots) are visualized in 2D. The black dots represent landmarks that are not visible when projected in 2D.

We identified that for large head poses, several landmarks do not lie on the visible face surface, yet these points still appear when projected onto the 2D image, as illustrated by the third and fourth images in Figure 4.1. Considering this, we performed an additional step to verify whether a landmark is located on the face surface: a landmark that has neighbouring points of the point cloud within a certain radius is categorized as visible; otherwise it is not visible. This visibility information is used when evaluating the performance of the landmark detector; only visible landmarks are considered in the evaluation.
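The visibility test can be sketched as a simple radius query on the face point cloud; the radius value is a parameter of the annotation procedure:

    import numpy as np

    def is_visible(landmark, point_cloud, radius):
        # A landmark counts as visible if the face point cloud contains at
        # least one point within `radius` of it.
        d2 = np.sum((point_cloud - landmark) ** 2, axis=1)
        return bool(np.any(d2 < radius ** 2))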

Once the dataset was annotated, we followed the settings of [12] to split it into training and testing sets. The testing set contains only 2 subjects, a man and a woman with large pose variations (subjects 01 and 12). The remaining subjects are used in the training set, except for subjects 06, 17, and 19. These subjects have facial expressions and missing depth data at one or more fiducial points, e.g. the eye corners; as a consequence, the positions of the rigid landmark points for these subjects are hard to approximate.

In order to analyze the robustness of our landmark detector in the presence of head pose variations, we conducted two experiments. First, we trained and evaluated the model on a subset of frames having less than 20° head rotation (frontal faces). This constraint ensures that all landmarks are visible and surrounded by sufficient facial surface. In the second experiment, we relaxed the constraint and constructed a forest from the full training set; the trained trees were then evaluated on the unconstrained test set, using only the visible landmarks.

We study the optimum parameters of the random forest and the error thresholds, and compare different combinations of channel features. Moreover, we compare the performance of our landmark detector with a 2D-based method to gain insight into the advantages of our approach.

4.1.2 Parameter Settings

In order to fairly compare the performance of the channel features, we trained multiple forests using identical training images and patches. For training, we fixed the following parameters:

1. Number of image samples per tree: 1000 (frontal face set), 3000 (full set).

2. Maximum tree depth: 20.

3. Number of positive patch samples extracted from each image: 120.

4. Number of negative patch samples extracted from each image: 50.

5. Patch size: 20 × 20 pixels.

6. Minimum number of patches required for a split: 20.

7. Number of binary tests in each non-leaf node: 62,500 (2,500 different combinations of R1, R2, and f in Equation 2.4, each with 25 different thresholds τ).

In the testing phase, the following parameters are applied:

1. Threshold for the maximum variance: 300 (see section 2.3).

2. Threshold for the class probability: 0.75.

3. Bandwidth parameter for mean shift clustering: 0.2r, where r is the radius of the average face.

4.1.3 Evaluation Measure

We measure the error for each landmark as the Euclidean distance between the predicted location and the ground truth (Equation 4.4). We also measure the ratio of correctly detected points, where a point counts as detected if its error is below an error threshold. The optimum error threshold is discussed in section 4.2.3.

$$\mathrm{error}(y^k, t^k) = \sqrt{(y^k_x - t^k_x)^2 + (y^k_y - t^k_y)^2 + (y^k_z - t^k_z)^2}, \qquad (4.4)$$

where y^k and t^k are the predicted location and the ground truth location of landmark k in 3D coordinates, respectively.
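Both measures can be computed in a few lines; the 20 mm default below reflects the threshold chosen in section 4.2.2:

    import numpy as np

    def landmark_errors(pred, gt, threshold_mm=20.0):
        # Per-landmark Euclidean error of Equation 4.4 and the ratio of
        # landmarks whose error falls below the threshold.
        errors = np.linalg.norm(pred - gt, axis=1)  # (K,) distances in mm
        accuracy = float(np.mean(errors < threshold_mm))
        return errors, accuracy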

4.2 Results

4.2.1 Number of Trees

This experiment was conducted with three different channel combinations, on both the frontal face and the full dataset. The results are presented in Figures 4.2a and 4.2b. The graphs show the mean Euclidean error (in millimeters) as a function of the number of trees, with the maximum tree depth fixed to 20. Both graphs illustrate that adding more trees reduces the error of the landmark detector; the same trend holds for all combinations of channel features in both sets.

In Figure 4.2a, we can see that the accuracies for the Depth channel and the Depth + Gray channels stabilize at about 7 trees, while the accuracy for Depth + Gray + 4 Gradients converges even faster, after 3 trees. We note that when 7 trees are used, the Depth + Gray channels and the Depth + Gray + 4 Gradients channels perform equally well.

In Figure 4.2b, the combination of depth, gray, and 4 gradient channels outperforms the other combinations. The additional gray and gradient channels reduce the error especially for the landmarks with small variations in depth values, e.g. the eye corners.

Our following experiments are conducted with the optimal number of trees: 7 for the frontal face dataset and 20 for the full dataset.

Figure 4.2: The influence of the number of trees, measured with the mean Euclidean error, averaged over all landmarks and all images in the test set. (a) Frontal face test set [−20°, 20°]. (b) Full test set.

4.2.2 Accuracy

In section 2.2, we defined positive samples for each landmark k by a certain radius. For consistency, we also evaluate the accuracy of the detector with respect to an error threshold: any prediction with an error larger than the threshold is considered a missed detection. Figures 4.3a and 4.3b depict the accuracy as a function of the error threshold, evaluated on the frontal face and full test sets, respectively. Stable accuracy is achieved when the threshold is set to 20 mm. Once again, on the frontal face set the Depth + Gray channels and the Depth + Gray + 4 Gradients channels have similar performance, while on the full test set the Depth + Gray + 4 Gradients combination provides higher accuracy than the other combinations.

Figure 4.3: Detection accuracy as a function of the error threshold, averaged over all landmarks and all images in the test set. (a) Frontal face test set [−20°, 20°]. (b) Full test set.

4.2.3 Accuracy vs Efficiency

In this experiment we study the effect of the stride parameter in terms of accuracy and efficiency. We measured the average time needed to test a single image after it has been loaded into memory, and compared it with the resulting accuracy. Figures 4.4a and 4.4b show the evaluation results on the frontal face test set; Figures 4.5a and 4.5b show the results on the full test set.

The results illustrate that the value of the stride parameter is negatively correlated with the accuracy. Using a smaller stride yields higher accuracy (Figures 4.4b and 4.5b), but it comes at the expense of processing time (Figures 4.4a and 4.5a).

Figure 4.4: (a) Execution time as a function of the stride parameter. (b) Accuracy as a function of the stride parameter. Time and accuracy are averaged over all landmarks and all images in the frontal face test set.

Using larger stride values speeds up the process but decreases the accuracy. Comparing the results, we conclude that the choice of the stride parameter controls the trade-off between accuracy and efficiency. When execution time is not a constraint, a 5-pixel stride can be used, since it maintains high accuracy with a computation time still under 1 second; for real-time applications, larger stride values can be considered. Our following experiments are conducted with a 5-pixel stride.

Figure 4.5: (a) Execution time as a function of the stride parameter. (b) Accuracy as a function of the stride parameter. Time and accuracy are averaged over all landmarks and all images in the full test set.

4.2.4 Channel Features Comparison

Our approach differs from other facial alignment approaches in that we do not build a landmark or shape model beforehand to fit to a test image. This makes our detection results sensitive to the individual landmarks, and for this reason we evaluate each landmark separately. We present the performance results of the different channel features evaluated on both the frontal face and full test sets; Table 4.1 summarizes the experimental results.

Figure 4.6: Channel features performance comparison on the frontal face dataset (head pose range [−20°, 20°]). Note that in this dataset all landmarks are visible.

4.2.4.1 Frontal face dataset

Figure 4.6 illustrates the performance of the different combinations of channel features. The nose tip is the most correctly predicted landmark, followed by the inner eye corners. This is not surprising, since for nearly frontal faces the nose area is the most distinctive area of the face. To detect these landmarks, using only the depth channel already achieves 100% or close to 100% accuracy, and adding more channels does not yield further improvement.

In contrast, the chin is the landmark that is most often misplaced when only the depth channel is used. Since we use relatively small patches, this implies that the depth values around the chin are not distinguishable enough from other regions. Using an additional gray channel is effective in reducing the misdetection rate. Adding gradient channels also increases the accuracy, but still does not outperform the combination of depth and gray channels.

Another example of misdetection involves the outer eye corners and mouth corners. For a number of subjects, the detector wrongly predicts the mouth corners as outer eye corners. We identified that this happens when the features of these regions show little variation. Some examples of failure cases are presented in Figure 4.7.

Overall, the best performing channel combination for detecting 3D landmarks on frontal faces is depth, gray, and 4 gradient channels. This combination produces the highest mean accuracy and the lowest error, as shown in Table 4.1.

Figure 4.7: Examples of failure cases in the frontal face test set [−20°, 20°], randomly selected from all channel combinations. The chin and outer eye corners are the most often misplaced landmarks.

4.2.4.2 Full dataset

Figure 4.8: Channel features performance comparison on the full test set. The accuracy is computed only over the visible landmarks.

The performance results for large head pose variations are shown in Figure 4.8. Adding gray and gradient channels provides a significant improvement in the accuracy for the chin and outer eye corners. For the nose tip and inner eye corners, adding gray and gradient channels results in only a small improvement, since the depth channel alone already produces at least 85% accuracy.

However, the results also show that the chin is still the most difficult landmark to detect. When only the depth channel is used, our detector achieves only 32% accuracy; adding the gray and 4 gradient channels raises the accuracy to 68%. Even with this 36% improvement, its accuracy remains the lowest. The other landmarks reach at least 80% accuracy when the gray and 4 gradient channels are used.

A number of misdetection cases are shown in Figure 4.9, which compares landmarks obtained using only the depth channel with landmarks obtained using the additional gray and gradient channels.

Figure 4.9: Examples of failure cases in the full test set. First row: failure cases when only the depth channel is used. Second row: results for the same images when the gray and gradient channels are added.

The comparison results are summarized in Table 4.1 and lead us to the same conclusion as the previous experiment: the best performing channel combination for detecting 3D landmarks under large head pose variations is depth, gray, and 4 gradient channels.

Dataset        Channel Features              Mean Accuracy (%)   Mean Error (mm)
Frontal face   Depth                         90.62               16.00
               Depth + Gray                  95.49                8.03
               Depth + Gray + 1 Gradient     94.72                9.14
               Depth + Gray + 4 Gradients    95.90                7.82
               Depth + Gray + 9 Gradients    94.51               10.58
Full           Depth                         68.80               44.19
               Depth + Gray                  77.06               29.68
               Depth + Gray + 1 Gradient     77.31               26.54
               Depth + Gray + 4 Gradients    85.11               17.29
               Depth + Gray + 9 Gradients    84.74               18.03

Table 4.1: Summary of the performance of different channel features, averaged over all landmarks and all test images in the dataset.

Lastly, using the best performing setting, we performed a 4-fold subject-independent cross validation on the entire Biwi Kinect dataset. The result is presented in Table 4.2.

Mean Error   Chin    Nose Tip   R Eye Out   R Eye Inn   L Eye Inn   L Eye Out
42.38        62.18   38.68      39.74       38.50       36.89       38.27

Table 4.2: Evaluation result of the 4-fold subject-independent cross validation performed with the best setting. The numbers represent the Euclidean error in mm.

4.2.5 Analysis of Performance under Head Pose Variations

In this section, we further analyze the accuracy of our detector for different poses. To do this, we test our best performing model on a discretized test set: the test set was divided according to head pose into 20° × 20° bins, and the accuracy was computed for each bin separately. Hence, the performance of the detector is known for each discretized head pose. The result of this experiment is presented as a heat map in Figure 4.10.

Figure 4.10: Evaluation result on the test set, discretized into 20° × 20° bins of yaw and pitch angles. The colors and numbers represent the success ratio of the detector, averaged over the visible landmarks and test images in each bin. The optimal settings were applied (20 trees with Depth + Gray + 4 Gradient channels).

The heat map shows that the detector achieves its highest accuracy on frontal faces (−20° ≤ head pose ≤ 20°). This is consistent with our previous results (Table 4.1): for frontal faces the success ratio is 1 or close to 1. We can also see that the success ratios naturally decrease as the head pose angles become larger, especially beyond 40°.

For large poses, the detector often wrongly predicts areas that have similar texture or depth values to the ground truth as landmarks, even areas that do not belong to the face region. For instance, ears and hair are misdetected as eye corners, since they have similar textures as well as similar depth values. Another factor that contributes to the performance drop is the lack of training images for large poses: frontal faces have many more training images than faces with large poses. Adding more training images or oversampling the images with large poses could alleviate this.

Figure 4.11: Examples of failure cases for large pitch (top row) and yaw (bottom row) angles. The detector often mistakenly predicts areas such as the ear, hair line, and neck as landmarks.

4.2.6 Comparison with 2D-based State-of-the-Art

In the last experiment, we compared the performance of our detector against a 2D-based state-of-the-art method, Ensemble Regression Trees (ERT) [7]. We used the available source code from the DLIB library [36] and ran a 4-fold cross validation on the entire Biwi Kinect dataset. Since this method relies on face detection, we provided a 100 × 100 pixel ground-truth face bounding box as input. The trained model detects landmarks in 2D; these landmarks are then converted into 3D coordinates for evaluation. The result of this evaluation is presented in Table 4.3.

Method      Mean Error   Chin    Nose Tip   R Eye Out   R Eye Inn   L Eye Inn   L Eye Out
ERT [7]     57.85        80.72   107.36     39.34       28.91       31.82       58.97
RF (ours)   42.38        62.18   38.68      39.74       38.50       36.89       38.27

Table 4.3: Performance comparison with Ensemble Regression Trees [7]. The numbers represent the Euclidean error in mm.


We identified that the worst performance of ERT is in detecting the nose tip and chin. This seems anomalous, because ERT is an alignment-based method that involves a face shape model, so the performance across landmarks should be consistent. We investigated this case and found that the high error is caused by at least two factors: first, missing depth data around the landmark; second, the chin landmark is located on the face boundary, which has large differences in depth values compared to the neighbouring regions (e.g. the neck). As a consequence, small errors in the 2D image become much larger when the predicted landmark is projected into 3D coordinates.

In its original 2D setting, ERT achieves competitive performance compared to our detector. However, we noted at least two drawbacks: (1) an accurate face detection is needed to localize the face region, and (2) since the method works in 2D, it cannot handle missing depth data.

Our approach, on the other hand, is able to tackle these issues, and the algorithm itself operates fully in 3D space. In conclusion, our detector outperforms ERT on average, as indicated by the lower mean Euclidean distance in Table 4.3.


Chapter 5

Conclusion

In this thesis, we have presented a combination of a random forest based model with integral channel features to detect 3D facial landmarks from noisy RGB-D images. Our main focus was to study the influence of channel features on the performance of a 3D landmark detector. Multiple channel features were computed efficiently using integral images and integrated in every non-leaf node of the trees, where the learning algorithm chose the best performing channel, i.e. the one that maximizes the information gain.
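As a reminder of how a split is chosen, the sketch below illustrates the channel-wise selection in a non-leaf node. It is a simplified, classification-style illustration with hypothetical names (entropy, best_split); the actual objective function of the thesis is the one defined in Chapter 2, which operates on the vote distributions:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(patches, labels, candidate_tests):
    """Select the (channel, binary test) pair with the largest information gain.

    candidate_tests: iterable of (channel_id, test_fn) pairs, where test_fn
    maps the chosen feature channel of a patch to True (left) or False (right).
    """
    labels = np.asarray(labels)
    h_parent = entropy(labels)
    best_test, best_gain = None, -np.inf
    for channel_id, test_fn in candidate_tests:
        goes_left = np.array([test_fn(patch[channel_id]) for patch in patches])
        left, right = labels[goes_left], labels[~goes_left]
        if len(left) == 0 or len(right) == 0:
            continue  # degenerate split, skip
        gain = h_parent - (len(left) * entropy(left)
                           + len(right) * entropy(right)) / len(labels)
        if gain > best_gain:
            best_test, best_gain = (channel_id, test_fn), gain
    return best_test, best_gain
```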

In general, our experiments show that adding gray and gradient channels computed from the RGB image yields a clear improvement over using a single depth channel. More specifically, the combination of Depth + Gray + 4 Gradient channels is the best performing set among the combinations studied in this thesis. These channels boost the accuracy of the landmark detector even for the most difficult areas, e.g. the chin.

For frontal faces, all channel combinations perform similarly well, with the chin landmark as the only exception: chin landmarks are the ones most often misplaced, and adding gray and gradient channels helps the detector to localize them more accurately. For large pose variations, our best performing channel combination achieves 85% mean accuracy with an execution time under 1 second.

We also demonstrated that our approach works better than a 2D-based state-of-the-art method, namely Ensemble Regression Trees (ERT). While the 2D-based method fails to handle missing depth data, our approach is shown to be more robust in this respect. Moreover, our approach does not rely on a face detection method, as most 2D-based methods do.


For future work, there are many possible directions to explore. Since integral channel features can be computed from a countless number of channels, further exploration of other channel variants is worth studying. Another possible direction is to improve the learning algorithm towards a more efficient training procedure. The current algorithm finds the best split for each node by evaluating a large number of candidate splits; constructing a decision tree is therefore expensive when a large number of training patches is used. An optimization method could be employed to find the optimal split more efficiently.

The algorithm could also be used for tracking by performing detection on a frame-by-frame basis. For this purpose, temporal information should be incorporated to yield a more stable performance.
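As a minimal illustration of incorporating temporal information, per-frame detections could be smoothed with an exponential moving average. This is a hypothetical sketch, not part of the presented system; more elaborate schemes (e.g. Kalman filtering) would likely perform better:

```python
import numpy as np

def smooth_landmarks(frames, alpha=0.5):
    """Exponentially smooth per-frame 3D landmark detections.

    frames: iterable of (n_landmarks, 3) arrays of detected positions.
    alpha:  weight of the newest detection (alpha = 1.0 disables smoothing).
    """
    state = None
    for detection in frames:
        detection = np.asarray(detection, dtype=float)
        state = detection if state is None else alpha * detection + (1.0 - alpha) * state
        yield state
```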


Bibliography

[1] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lars Wolf. Deepface: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1701–1708. IEEE, 2014.

[2] Ziheng Wang, Shangfei Wang, and Qiang Ji. Capturing complex spatio-temporal relations among facial muscles for facial expression recognition. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3422–3429. IEEE, 2013.

[3] Thibaut Weise, Hao Li, Luc Van Gool, and Mark Pauly. Face/off: Live facial puppetry. In Proceedings of the 2009 ACM SIGGRAPH/eurographics symposium on computer animation, pages 7–16. ACM, 2009.

[4] Donghoon Lee, Hyunsin Park, and Chang D Yoo. Face alignment using cascade gaussian process regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4204–4212, 2015.

[5] Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177–190, 2014.

[6] Xuehan Xiong and Fernando De la Torre. Supervised descent method and its applications to face alignment. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 532–539. IEEE, 2013.

[7] Vahdat Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1867–1874. IEEE, 2014.

[8] Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Face alignment at 3000 fps via regressing local binary features. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1685–1692. IEEE, 2014.


[9] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 3d constrained local model for rigid and non-rigid facial tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2610–2617. IEEE, 2012.

[10] Chavdar Papazov, Tim K Marks, and Michael Jones. Real-time 3d head pose and facial landmark estimation from depth images using triangular surface patch features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4722–4730, 2015.

[11] Gabriele Fanelli, Juergen Gall, and Luc Van Gool. Real time head pose estimation with random regression forests. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 617–624. IEEE, 2011.

[12] Gabriele Fanelli, Matthias Dantone, Juergen Gall, Andrea Fossati, and Luc Van Gool. Random forests for real time 3d face analysis. International Journal of Computer Vision, 101(3):437–458, 2013.

[13] Gabriele Fanelli, Matthias Dantone, and Luc Van Gool. Real time 3d face alignment with random forests-based active appearance models. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1–8. IEEE, 2013.

[14] Matthias Dantone, Juergen Gall, Gabriele Fanelli, and Luc Van Gool. Real-time facial feature detection using conditional regression forests. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2578–2585. IEEE, 2012.

[15] Jitendra Malik and Pietro Perona. Preattentive texture discrimination with early vision mechanisms. JOSA A, 7(5):923–932, 1990.

[16] Constantine P Papageorgiou, Michael Oren, and Tomaso Poggio. A general framework for object detection. In Computer Vision, 1998. Sixth International Conference on, pages 555–562. IEEE, 1998.

[17] Piotr Dollár, Zhuowen Tu, Pietro Perona, and Serge Belongie. Integral channel features. In BMVC, volume 2, page 5, 2009.

[18] Ivan Laptev. Improvements of object detection using boosted histograms. In BMVC, volume 6, pages 949–958. Citeseer, 2006.

[19] Qiang Zhu, Mei-Chen Yeh, Kwang-Ting Cheng, and Shai Avidan. Fast human detection using a cascade of histograms of oriented gradients. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1491–1498. IEEE, 2006.


[20] Boris Babenko, Piotr Dollár, and Serge Belongie. Task specific local region matching. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.

[21] David Cristinacce and Timothy F Cootes. Feature detection and tracking with constrained local models. In BMVC, volume 1, page 3. Citeseer, 2006.

[22] Quan Ju, Simon O'Keefe, and Jim Austin. Binary neural network based 3d facial feature localization. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 1462–1469. IEEE, 2009.

[23] Xi Zhao, Emmanuel Dellandrea, Liming Chen, Ioannis Kakadiaris, et al. Accurate landmarking of three-dimensional facial data in the presence of facial expressions and occlusions using a three-dimensional statistical facial feature model. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 41(5):1417–1428, 2011.

[24] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[25] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and regression trees. CRC press, 1984.

[26] Gabriele Fanelli, Thibaut Weise, Juergen Gall, and Luc Van Gool. Real time head pose estimation from consumer depth cameras. In Pattern Recognition, pages 101–110. Springer, 2011.

[27] Timothy F Cootes, Gareth J Edwards, and Christopher J Taylor. Active appearance models. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):681–685, 2001.

[28] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603–619, 2002.

[29] Paul Viola and Michael Jones. Robust real-time object detection. International Journal of Computer Vision, 4:51–52, 2001.

[30] Zhuowen Tu, Katherine L Narr, Piotr Dollár, Ivo Dinov, Paul M Thompson, and Arthur W Toga. Brain anatomical structure segmentation by hybrid discriminative/generative models. Medical Imaging, IEEE Transactions on, 27(4):495–508, 2008.

[31] Piotr Dollár, Zhuowen Tu, and Serge Belongie. Supervised learning of edges and object boundaries. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1964–1971. IEEE, 2006.


[32] Bin Yang, Junjie Yan, Zhen Lei, and Stan Z Li. Aggregate channel features for multi-view face detection. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–8. IEEE, 2014.

[33] Rainer Lienhart and Jochen Maydt. An extended set of haar-like features for rapid object detection. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 1, pages I–900. IEEE, 2002.

[34] Yan Ke, Rahul Sukthankar, and Martial Hebert. Efficient visual event detection using volumetric features. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 166–173. IEEE, 2005.

[35] Fatih Porikli. Integral histogram: A fast way to extract histograms in cartesian spaces. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 829–836. IEEE, 2005.

[36] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
