Facial Expression Recognition in theWild: The Influence of Temporal Information

(1)

Facial Expression Recognition in the Wild:

The Influence of Temporal

Information

Steve Nowee

10183914

Bachelor thesis Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie University of Amsterdam Faculty of Science Science Park 904 1098 XH Amsterdam Supervisors Prof. dr. T. Gevers Informatics Institute Faculty of Science University of Amsterdam Science Park 904 1098 XH Amsterdam Dr. R. Valenti Informatics Institute Faculty of Science University of Amsterdam Science Park 904 1098 XH Amsterdam Friday 27thJune, 2014

(2)

Acknowledgements

I would like to thank Prof. dr. Theo Gevers and Dr. Roberto Valenti for agreeing to supervise my thesis and helping me come up with an interesting research topic.

Also, I would like to thank Sightcorp for the use of their facial expression recognition software CrowdSight.

Abstract

In this thesis, the influence of temporal information in the task of fa-cial expression recognition in the wild is studied. To investigate this influ-ence, several experiments have been conducted using three different methods of classification. More specifically, static classification using Naive Bayes and dynamic classification using Conditional Random Fields and Latent Dy-namic Conditional Random Fields. These classifiers have been applied to two types of features, extracted from the Acted Facial Expression in Wild data set [Dhall et al., 2012]. These features were static Action Unit (AU) intensity values and spatio-temporal Local Binary Patterns on Three Orthog-onal Planes (LBP-TOP). The highest achieved accuracy was 38.01%, 4% higher than the baseline, using the LDCRF classification on the LBP-TOP features. Comparing the performance of this dynamic classifier with that of the static Naive Bayes classification yields an improvement in accuracy of 30%. Furthermore, by comparing the performance of the LDCRF on the AU intensity features with the performance of the LDCRF on the LBP-TOP features, it was found that the use of the spatio-temporal LBP-TOP features resulted in an improvement in accuracy of 65%. Thus, showing a positive in-fluence of temporal information on the task of facial expression recognition in the wild.

(3)

1 Introduction

In recent years, automatic facial expression recognition has been an increas-ingly popular field of study. A accurate classification of how a person feels can help make the interaction between a user and a system, human-computer interac-tion(HCI), more natural and pleasant [Picard et al., 2001]. Such a system would have a better understanding of the needs of a user. This is also called affective HCI, interpreting and acting on the affect of the user. The affect is also reffered to as the experience of feeling or as emotion.

The form of human affect that can be employed in HCI is the emotional state or mood of a person. This emotional state can be determined through different means, for example through the use of linguistic or acoustic data. In this thesis, the focus is on the use of facial expressions for determining the emotional state of a person. Facial expressions are a major part of human non-verbal communica-tion, which are often used to show one’s emotional state. Research on automatic facial expression recognition has been conducted and has resulted in a high accu-racy of recognition [Cohen et al., 2003, Jain et al., 2011]. However, most of this research occurred using unrealistic data sets that consist of unnatural and lab con-trolled facial expressions, instead of using facial expressions as they would occur in reality, also referred to facial expressions in the wild. Facial expressions in the wild are less controlled than those in the unrealistic data sets and because of this, facial expression recognition in the wild is a more challenging task. Consequently, performance of facial expression recognition in the wild is overall lower than that of facial expression recognition using the lab controlled data [Gehrig and Ekenel, 2013]. However, in order to create applications and software that can be utilized in situations in a complex world, a facial expression recognition system based on such unrealistic data will not suffice. Thus, the performance of facial expression recognition in the wild should be increased as to create applications that will func-tion appropriately in the real world.

It has been discovered that dynamic facial expressions elicit a higher perfor-mance of emotion recognition by humans than static facial expressions [Alves, 2013]. Also, neuroimaging and brain lesion studies provide evidence for two dis-tinct mechanisms, one for analysing static facial expressions and one for dynamic facial expressions. Thus, it seems that the use of temporal analysis of facial expres-sions is employed by humans. Also, in the facial expression recognition research conducted using the unrealistic data, it was found that the use of temporal infor-mation and temporal classification methods achieved a high performance. These methods, using temporal analysis, have not been widely used in research of facial expression recognition in the wild, however. In this thesis I propose the use of such temporal analysis on facial expression recognition in the wild, to discover if it has a positive influence on performance or not. In this work, the performance of static and dynamic classification methods will be compared and analyzed. These methods will be applied to action unit (AU) intensity features and Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) features, extracted from a data

(5)

set of facial expressions in the wild.

The thesis is structured as follows. Firstly, related research will be discussed in Section 2. This is followed by the used features and employed methods of classifi-cation, which will be explained in Section 3. In Section 4, the utilized data set will be clarified. The following section, Section 5, will present the conducted experi-ments and will discuss their results. Lastly, Section 6 will conclude the findings of this thesis and discuss possible future work.

2 Related Work

To describe facial expressions in movements of the face or action units (AUs), Ekman and Friesen [1978] developed the Facial Action Coding System (FACS). Action units are fundamental facial movements, caused by muscular contraction or relaxation within the face. The process of recognizing facial expressions according to FACS is performed on static images of facial expressions at their peak. How-ever, working with FACS is a manual form of facial expression recognition and is thus a laborious task.

After the development of FACS by Ekman and Friesen, image and video pro-cessing started to be utilized in the analysis of facial expressions. In this, points on the face are tracked and extracted from images and videos. Also, the intensity of displayed AUs can be extracted. Using this data, patterns can be sought for a facial expression. For example, Cohen et al. [2003] proposed two different approaches for facial expression classification using image and video processing. One of these approaches was static classification. Static, in this sense, means on a single frame. In this approach Bayesian networks were employed, such as a Naive-Bayes clas-sifier and a Gaussian Tree-Augmented Naive Bayes clasclas-sifier (TAN). The other proposed approach used dynamic classification, which entails that it is not per-formed on a single frame, but also takes earlier information into account. In this manner, the temporal information of a displayed facial expression is included in the classification process. The proposed dynamic approach employed a multi-level Hidden Markov Model classifier. Cohen et al. found that the TAN achieved the highest accuracy. However, it was concluded that if there is not sufficient training data, this approach will become unreliable and the Naive Bayes classifier becomes a more reliable option.

Jain et al. [2011] proposed using other features than AUs. They proposed us-ing facial shape and facial appearance, with static and dynamic approaches of clas-sification. In this, the facial shape was represented by landmark points around contours in the face, e.g. eyebrows, eyes and lips. To extract these shapes, Gen-eralized Procrustes Analysis (GPA) was applied, after which the dimensionality of the features was reduced by Principal Component Analysis (PCA). The facial appearances were represented by applying the Uniform Local Binary Pattern (U-LPB) method. The dimensionality of the facial appearance features was reduced by applying PCA as well. For the classification of the facial expressions, Jain et

(6)

al. employed Support Vector Machines (SVM), Conditional Random Fields (CRF) and Latent-Dynamic Conditional Random Fields (LDCRF). Of these methods, the SVM is static, whereas the CRF and LDCRF are dynamic methods of classifica-tion. CRFs are discriminative models, defining a conditional probability over label sequences given a set of observations, and are similar to HMMs. The Extended Cohn-Kanade data set (CK+) was utilized in each of the conducted experiments. It was discovered that CRFs are a valid indicator for transitions between facial ex-pressions. However, subtle facial movement classification did not achieve a high performance when employing CRFs. To increase the performance of recognizing subtle facial movement, Jain et al. proposed to employ LDCRFs. In LDCRFs, a set of hidden latent variables is included between the observation sequences and the expression labels. It was discovered, that the use of the shape features resulted in an overall higher performance than in using the appearance features. As expected, the dynamic methods of classification, CRFs and LDCRFs, achieved a higher per-formance than the static SVM. Of the two dynamic methods, LDCRF classification achieved a higher accuracy of recognizing facial expressions than when employing CRF classification.

The previously described works all utilized data sets that were acquired in a ‘lab controlled’ recording environment. This means that the subjects face the cam-era directly and show obvious expressions, either acted or genuine. However, in human-human interaction, such conditions are no necessity. More naturally dis-played facial expressions, however, are classified with less accuracy by a computer. A more robust recognition of facial expressions in the real world might be achieved by training on a data set with more realistic conditions and in a more realistic en-vironment. One of such data sets is the Acted Facial Expression in Wild (AFEW) data set [Dhall et al., 2012]. This data set consists of video fragments, extracted from movies. In these fragments, the subjects do not always face the camera di-rectly, they might not keep their head still and the expressions might vary in their level of intensity. This makes the conditions for extracting data, such as tracking points and action units, more complicated. The baseline for this data set, achieved by applying a non-linear SVM to feature vectors consisting of Local Binary Pat-terns on Three Orthogonal Planes (LBP-TOP) values, lies at 34.4%.

The AFEW data set from the challenge of 2013 has been used by Gehrig and Ekenel [2013] to analyse the use of several features and methods of classifica-tion. The features that were used are Local Binary Patterns (LBP) features, Gabor features, Discrete Cosine Transform (DCT) and AU intensities. The classifiers em-ployed using these features were a Nearest Neighbor classifier, a Nearest Means classifier and SVMs with linear, polynomial and Radial Basis Function kernels. It was discovered that the highest performance was achieved by employing a SVM with an RBF kernel on the LBP-TOP features, which resulted in a performance equal to the baseline of the data set of 2013, which was equal to 27.27%. Also, a human evaluation has been performed and Gehrig and Ekenel discovered that human classification achieves only 52.63% accuracy with a data set such as the AFEW.

(7)

3 Methodology

The standard pipeline of automatic facial expression recognition consists of: detecting faces in visual data, representing these faces as a set of features and using these representations to compute some dependencies between a facial ex-pression and its features. Within such a pipeline, different features and classifiers can be employed. The following subsections are allotted to clarify the features and classifiers that have been used in this thesis.

3.1 Features

In facial expression recognition using image processing, the complete images are usually not used. Instead, the most characteristic information that represents an image, the features, is used. The use of such features scales down the size of used data, because only a relatively small amount of values is used per image. The fea-tures that have been used in this thesis are Action Unit Intensity and Local Binary Patterns on Three Orthogonal Planes. Both of these features will be explained in more detail in the following subsections.

3.1.1 Action Unit Intensity

As mentioned in Section 2, action units (AUs) are fundamental movements of the face that are caused by contractions and relaxations of facial muscles. The activity or intensity of these AUs can represent certain facial expressions, as de-scribed by the FACS [Ekman and Friesen, 1978]. Some examples of AUs can be seen in the left image of figure 1 and an example of how a facial expression can be decomposed into AUs with a certain intensity is shown in the right image of that figure.

Figure 1: Left: Examples of action units. Right: Example of facial expression decomposition in AUs.

For this thesis, facial expression recognition software called CrowdSight1has

(8)

been used. This software detects faces in an image or a video frame and extracts intensities from each of these detected faces, for nine AUs which can be found in table 1. The intensity values extracted from the AFEW data set, range from -315 to 210. Also, information about the head poses of the subjects can be retrieved. This information consists of pitch, roll and yaw values in radians, which cover the three degrees of freedom of the head. During the process of extracting the AU intensities, only faces with a yaw value between -15 degrees and 15 degrees were used to ensure a measure of reliability of the extracted AU intensities. Because, if this precaution would not be taken, parts of the to be analyzed faces would be rotated out of view and no AU intensity values could be extracted from these parts of the faces. Furthermore, AU intensities have only been extracted from videos that consist of ten or more frames, because videos with less than ten frames are likely to be uninformative.

AU Name Facial Muscle

1 Inner Brow Raiser Frontalis, pars medialis 2 Outer Brow Raiser Frontalis, pars lateralis

4 Brow Lowerer Corrugator supercilii, Depressor supercilii 9 Nose Wrinkler Levator labii superioris alaquae nasi 12 Lip Corner Puller Zygomaticus major

15 Lip Corner Depressor Depressor anguli oris 20 Lip Stretcher Risorius w/ platysma 23 Lip Tightener Orbicularis oris

28 Lip Suck Orbicularis oris

Table 1: The nine AUs extracted by CrowdSight.

AU intensities have been extracted from both the video fragments of the AFEW data set and the aligned faces per frame, per video fragment of the AFEW data set (see Section 4). It was expected that applying CrowdSight to the aligned face im-ages would yield more accurate AU intensity values and thus a more accurate clas-sification of facial expressions, because the aligned face images are already given and no face detection is necessary. In this process of face detection on the video fragments, wrongfully detected faces might occur. The results of both extraction methods will be discussed in Section 5.

3.1.2 Local Binary Patterns on Three Orthogonal Planes

In order to explain the Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) feature, the standard Local Binary Patterns (LBP) feature has to be explained as well. LBP features are used to describe spatial information of an image. In extracting LBP features the whole image is used and encoded into a his-togram, instead of using only a small set of keypoints of an image. The procedure

(9)

that achieves the encoding of the images functions as follows. For each pixel in an image or video frame, the surrounding eight pixels will be thresholded. If the pixel value of a surrounding pixel is higher than the threshold, it will be equivalent to a one and if its value is lower it will be equivalent to a zero. These eight binary values form a binary pattern, by concatenating the values clockwise. The process of thresholding a pixel’s surrounding pixels can be seen in figure 2.

Figure 2: Example of forming a binary pattern for one pixel (LBP).

The result will be a binary pattern for each pixel in an image. This set of binary patterns can be turned into a histogram by summing all similar occurrences of bi-nary patterns. There are two often used methods of computing this histogram. One option is to sum all similar occurrences of the binary patterns over the whole im-age. More often, however, the image is divided into sub-images and the binary patterns of each sub-image are summed to a histogram. Subsequently, all of the re-sulting histograms are concatenated into one histogram. This method is visualized in figure 3.

Figure 3: Example of computing an LBP histogram, by concatenating histograms of sub-images.

The LBP-TOP feature is an extension of the LBP feature. As noted before, de-scribing images using LBP features uses spatial information. In other words, each frame is analyzed in the spatial (X,Y)-space. The LBP-TOP features, however, also incorporate temporal information with the spatial information. Instead of an-alyzing single frames, a set of frames (a video) is analyzed in the spatio-temporal (X,Y,T)-space. In this space, X and Y still denote spatial coordinates and T now de-notes the temporal coordinates, or frame index. There are three orthogonal planes in this (X,Y,T)-space, which are the XY plane, the XT plane and the YT plane. The XY plane represents the standard spatial information, whereas the XT and the YT planes respectively represent changes in the row values and changes in the col-umn values in time. From each of these three planes, LBP features are extracted

(10)

and turned into a histogram, as explained in the previous paragraph. Then, lastly, the three resulting histograms are concatenated to form a single histogram. This process, of extracting LBP-TOP features, is shown in figure 4.

Figure 4: Example of the process of extracting LBP-TOP features.

The LBP-TOP features used in the experiments of this thesis were extracted by dividing the frames of each of the three orthogonal planes into sixteen non-overlapping blocks (4×4). From each of these individual blocks, the binary pat-terns were created and the histograms were computed. The computation of these histograms occurred using uniformity of the binary patterns. A binary pattern is uniform if it consists of at most two contiguous regions. The eight surrounding pixels, one binary value each, resulted in 256 possible binary patterns. Within these possibilities only 58 patterns are uniform, each of which is given an indi-vidual label. All non-uniform binary patterns are labelled using one unique label. This results in 59 labels and thus 59 bins per histogram. Per orthogonal plane, the histograms of the sixteen blocks were concatenated, resulting in three histograms. Lastly, these three histograms, one for each orthogonal plane, were concatenated, forming one feature vector per video. Each of these feature vectors consisted of 2832 values (16 × 59 × 3).

To investigate the influence of temporal information, different methods of clas-sification have been applied to the features that have been described in the previous subsections. This thesis will continue by describing said methods of classification.

(11)

3.2 Methods of Classification

In machine learning, classification can be defined as the procedure of catego-rizing observations based on their characteristics, or features. The three methods of classification that have been used for this thesis originate from a probabilistic modelling of data. These methods will be discussed in the following subsections.

3.2.1 Naive Bayes

The Naive Bayes classifier is a generative probabilistic method of classifying, using Bayes’ theorem. When using a Naive Bayes classifier, a strong independence is assumed between the sequences of features. Because of this assumed indepen-dence, any actual relation the sequences of features may have is neglected. For that reason, the Naive Bayes classifier is called naive. However, this same indepen-dence makes the Naive Bayes classifier a simple and fast method of classification. As noted above, Naive Bayes classification is based on probability. More specifically, the conditional probability of a class given a set of features. This probability can be written as:

p(C|F1, . . . , Fn) (1)

in which C denotes the class or label and F1, . . . , Fndenote the observed features.

This can be rewritten using Bayes’ theorem, resulting in:

p(C|F1, . . . , Fn) =

p(C)p(F1, . . . , Fn|C)

p(F1, . . . , Fn)

. (2)

Since all feature values Fiare known the denominator is a constant, thus only the

numerator is of importance. This numerator is equivalent to the joint probability of C and all features Fi, which can be rewritten using the chain rule:

p(C, F1, . . . , Fn) = p(C)p(F1, . . . , Fn|C)

= p(C)p(F1|C)p(F2, . . . , Fn|C, F1)

= p(C)p(F1|C)p(F2|C, F1) . . . p(Fn|C, F1, F2, . . . , Fn−1)

(3)

However, since the Naive Bayes classifier assumes independence between its fea-tures, each of the conditional probabilities of a feature will only depend on the class C, for example p(Fi|C, Fj, Fk, Fl) = p(Fi|C). For that reason, the joint

(12)

probability can be simplified to: p(C|F1, . . . , Fn) = 1 Zp(C, F1, . . . , Fn) = 1 Zp(C)p(F1|C)p(F2|C)(Fn|C) = 1 Zp(C) n Y i=1 p(Fi|C) (4)

in which Z denotes the probability of a sequence of features p(F1, . . . , Fn), which

was a constant since these features are known.

Despite the simplicity and ‘naivety’ of the Naive Bayes classifier, it generally achieves a high accuracy in a wide variety of classification tasks. It has been con-cluded in the past, that Naive Bayes classification competed with state-of-the-art decision tree classifiers of that time [Langley et al., 1992] and are still compet-ing with state-of-the-art now. Thus, one of the main reasons for uscompet-ing Naive Bayes classification for recognizing facial expressions, is its simplicity and relatively high performance. The other reason for using Naive Bayes classification in this thesis, is because of its independence assumptions. The assumed independence between the features make Naive Bayes classification a static classification. Frames will be analyzed apart from each other and there will be no dependency between the data in separate frames. By comparing the results from temporal classification methods with the results from Naive Bayes classification, the effect of the temporal classifi-cation can be examined.

3.2.2 Conditional Random Field

A conditional random field (CRF) is an undirected graphical model that is con-ditioned on observation sequences X and is often used for segmenting and labeling structured data. For example, CRFs have been used to segment and label docu-ments of text, either handwritten or machine printed, with higher precision than the use of Neural Networks and Naive Bayes classification [Shetty et al., 2007]. Also, it has been found that CRFs perform with higher accuracy than Hidden Markov Models (HMMs) in activity recognition on temporal data [van Kasteren et al., 2008]. An HMM is a statistical Markov model with unobserved (hidden) states and can be defined as a dynamic Bayesian network.

An example of the structure of a CRF can be seen in figure 5. In this figure, one can see that the nodes of the CRF graph can be divided into the observations X = {x1, . . . , xn} and the labels Y = {y1, . . . , yn}. On these two sets, X and Y ,

(13)

Figure 5: Example of a CRF, in which xidenote the observation sequences and yi

denote the label sequences.

Theoretically, a CRF can be motivated from a sequence of Naive Bayes classi-fiers. The result of putting Naive Bayes classifiers in sequence, is an HMM. Like a Naive Bayes classifier, an HMM is also generative. The counterpart of a generative classifier is a discriminative classifier. Where a generative classifier is based on a model of joint distribution p(y, x), a discriminative classifier is based on a model of conditional distribution p(y|x). The discriminative counterpart of the genera-tive HMM is the CRF. The modelling of the conditional distribution of a CRF is achieved with the following equation [Sutton and McCallum, 2006]:

p(Y |X) = 1 Z(X)exp ( _K X k=1 λkfk(yt, yt−1, Xt) ) , (5)

in which Z(X) denotes a normalization function:

Z(X) =X Y exp ( _K X k=1 λkfk(yt, yt−1, Xt) ) . (6)

To train a CRF, its parameters or weights, θ = {k}, have to be estimated. To

estimate these parameters, training data D = {X(i), Y(i)N_i=1} is used. In this data each X(i) = {x(i)₁ , x(i)₂ , . . . , x(i)_T } is an input sequence of observations and each Y(i) = {y(i)₁ , y₂(i), . . . , y_T(i)} is an output sequence of labels. The estimation is performed by maximizing a conditional log likelihood:

l(θ) =

N

X

i=1

logp(Y(i)|X(i)). (7)

The method used to optimize l in this thesis, is called Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [Bertsekas, 1999, Byrd et al., 1994]. Normal BFGS optimization computes an approximation of the Hessian, because the use of the full Hessian is not practical due to its quadratic number of parameters. The Hessian is a matrix of second-order partial derivatives of a function and describes the curvature of that function, making it useful for optimization. Even though the normal BFGS uses an approximation of this Hessian, it still requires quadratic

(14)

size. The L-BFGS method, instead of storing the full and dense approximation of the Hessian, only stores several vectors representing the approximations implic-itly. Storing vectors instead of full matrices, the memory requirement of L-BFGS becomes linear, which makes L-BFGS a good method when a large amount of fea-tures is used. For a full introduction to CRFs, please see [Sutton and McCallum, 2006].

In tasks such as activity recognition or facial expression recognition, often, dy-namic methods such as CRFs are used. Facial expressions in videos, for example, cannot be fully explained by features from each separate static frame of that video. Instead, some interdependency between the features in the frames should be taken into account. A method such as Naive Bayes, discussed in Section 3.2.1, does not take such interdependency into account. CRFs, however, do take this interde-pendency into account. This fact makes CRFs a good method of classification for facial expressions.

3.2.3 Latent Dynamic Conditional Random Field

As Jain et al. [2011] stated, CRFs are a reliable method to model transitions be-tween facial expressions. However, the distinction bebe-tween similar facial expres-sions is made by subtle changes. In the detection of these subtle changes, CRFs are less reliable. For that reason, Latent Dynamic Conditional Random Fields (LD-CRFs) were proposed. The structure of an LDCRF is similar to that of a CRF. Only, an LDCRF has an added layer of hidden states, or latent states. This structure can be seen in figure 6. In this figure, as with CRFs, the X = {x1, x2, . . . , xn} denote

observation sequences and the Y = {y1, y2, . . . , yn} denote the labels. The hidden

states are denoted by H = {h1, h2, . . . , hn}

Figure 6: Example of an LDCRF, in which xidenote the observation sequences, yi

denote the label sequences and hidenote the latent states.

The conditional probability of labels, Y , given observations, X, for an LDCRF is similar to that of a CRF. However, the hidden states H = {h1, h2, . . . , hn} have

(15)

to be incorporated. This results in the following conditional probability:

p(Y |X) =X

H

p(Y |H)p(H|X). (8)

The hidden states of an LDCRF are subject to a restriction: The sets of hidden states connected to different labels should be disjoint. In other words, if a hidden state is connected to label yi, it cannot also be connected to yj, where i 6= j. This

restriction can also be represented by the following relationship:

p(Y |H) =

1 ∀H_m∈ H_{Y m}

0 otherwise (9)

Using this restriction, the conditional probability equation can be simplified:

p(Y |X) = X

H:∀Hm∈HY m

p(H|X). (10)

This p(H|X) can be written in the same form as equation (5) in Section 3.2.2, resulting in: p(H|X) = 1 Z(X)exp ( _K X k=1 λkfk(ht, ht−1, Xt) ) , (11)

with Z(X) written as:

Z(X) =X H exp ( _K X k=1 λkfk(ht, ht−1, Xt) ) . (12)

As with the training of a CRF model, a parameter estimation must be performed to train an LDCRF model. To perform this parameter estimation, again the optimiza-tion algorithm L-BFGS, explained in 3.2.2, was used.

The three discussed methods of classification have been applied to the afore-mentioned features. These features have been extracted from the AFEW data set, which will be clarified in the following section.

4 Data Set

For all of the experiments in this thesis, the AFEW data set [Dhall et al., 2012] has been used. As mentioned before, this data set consists of video fragments of seven different facial expressions, extracted from 54 movies. These seven facial expressions are: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise. Also, the AFEW data set is used in a facial expression recognition in the wild

(16)

challenge: EmotiW2. These video fragments were obtained in .avi format and have been converted to .mp4 format, in order for the CrowdSight software to be able to process the video fragments. Since the video fragments are extracted from movies, the facial expressions shown are in a close to real-world environment. The actors in the video fragments move around, move their heads and show facial expressions with a varying level of intensity. Also, the fragments in the data set have a wide range of illumination conditions, making the environment more natural than lab controlled environments. The data set was divided into a train set and a test set. The number of fragments per facial expression and per train or test set can be seen in table 2.

Number of frames per expression

Anger Disgust Fear Happiness Neutral Sadness Surprise Total

Train set 92 66 67 108 104 84 57 578

Test set 64 40 46 63 63 61 46 383

Table 2: Number of video fragments per facial expression, per train or test set, in the AFEW data set.

Apart from the video fragments, sorted by facial expression, images of the aligned faces for each frame, landmark points for each face in each frame and LBP-TOP features were included for each video fragment.

5 Experimental Results

Using the features described in Section 3.1, several experiments have been conducted to investigate the influence of temporal analysis in facial expression recognition. To show this influence, the static Naive Bayes classification will be compared with the dynamic CRFs and LDCRFs. Also, all three of these methods will be compared to the baseline of the AFEW data set, which is 34.4%.

In these experiments, the AU intensity data has been used in two different man-ners. The first approach was simply using the data as it is: AU intensity values per frame. This leads to a frame-by-frame classification of facial expressions. How-ever, one can also argue that if the majority of frames in a video is classified cor-rectly, the whole video is classified correctly. Both the frame-by-frame and the majority of frames per video methods of evaluating have been used. For the sec-ond manner of using the AU intensity data, the median per AU per video was calculated. This resulted in nine median values per video, one for each AU. The classification using these median values resulted in one label per video, instead of one for each frame. Lastly, the LBP-TOP feature vectors of the AFEW data set have been used without any further processing.

(17)

In the following subsections the results of the three methods of classification found in Section 3.2 are presented. This is followed by a paragraph that discusses these results in more detail.

5.1 Naive Bayes

The experiments using the Naive Bayes classifier have been conducted using built-in Matlab functions of the NaiveBayes class3, both to train and to test the models. Also, the duration of the training and testing procedures has been com-puted. The results of the Naive Bayes classifier and the duration of training and testing are presented in table 3. The results are given for the AU intensity features, extracted from both the video fragments and from the aligned face images, and for the LBP-TOP features.

Time (s)

Training Testing Accuracy (%)

Video AU Int. Median 0.00833 0.00629 25.17

Video AU Int. Frame-by-Frame 0.0167 0.0195 24.69

Image AU Int. Median 0.00849 0.00502 19.11

Image AU Int. Frame-by-Frame 0.0322 0.0354 19.53

LBP-TOP 0.338 0.220 29.38

Table 3: Naive Bayes results on video and image extracted AU intensity features and LBP-TOP features.

There is an obvious difference in performance between the AU intensities ex-tracted from the video fragments and those exex-tracted from the aligned face images. The results of the AU intensity features extracted from the aligned face images are lower, even though it was expected that more reliable AU intensity values would be extracted using these images. Possible reasons for this difference will be discussed in Section 5.4.

Neither the performance of the AU intensity features nor the performance of the LBP-TOP features surpass the baseline of 34.4%, achieved with a Support Vector Machine on the LBP-TOP features. The resulting confusion matrices for the AU intensity median features, extracted from the video fragments, and the LBP-TOP features can be seen in tables 4 and 5 respectively.

(18)

An Di Fe Ha Ne Sa Su An 20 10 3.3 23.3 30 6.7 6.7 Di 26.7 0 6.7 6.7 33.3 26.7 0 Fe 40 0 10 0 30 10 10 Ha 13.6 0 4.5 54.5 9.1 13.6 4.5 Ne 3.6 14.3 7.1 10.7 46.4 14.3 3.6 Sa 10.5 0 5.3 31.6 31.6 15.8 5.3 Su 0 13.3 0 26.7 46.7 13.3 0

Table 4: Confusion matrix in percentages for Naive Bayes classification on median AU intensity features. An Di Fe Ha Ne Sa Su An 44.1 8.5 15.3 8.5 6.8 13.5 3.4 Di 30.8 10.3 5.1 10.3 17.9 15.4 10.3 Fe 43.2 4.5 15.9 6.8 25 4.5 0 Ha 22.2 3.2 9.5 42.9 14.3 6.3 1.6 Ne 13.1 1.6 8.2 8.2 42.6 21.3 4.9 Sa 20.3 6.8 11.9 20.3 16.9 23.7 0 Su 17.4 4.3 17.4 10.9 30.4 8.7 10.9

Table 5: Confusion matrix in percentages for Naive Bayes classification on LBP-TOP features.

When looking at the confusion matrix in table 4, it is clear that for some ex-pressions, for example Sadness, a large amount of videos is misclassified as being either Happiness or Neutral. This same applies for Surprise and Disgust. In ta-ble 5, the misclassifications are more spread out over all expressions. However, most expressions are misclassified as being Anger. Possible explanations for these misclassifications are given in Section 5.4.

5.2 Conditional Random Field

The experiments using the CRFs have been conducted using Matlab toolbox HCRF2.0b. As mentioned before, the optimization algorithm used in this proce-dure was L-BFGS. The number of used iterations has been set at 50, 100, 300 and 500 to examine its influence on accuracy and running time of the procedure. As with the Naive Bayes classification, the running time of the training and testing processes and the accuracy of the CRF experiments are listed in table 6.

(19)

Time (s)

50 iterations AU Int. Median 0.035 0.0009 23.02

AU Int. Frame-by-Frame 4.575 0.05 14.49

AU Int. Frame Majority 15.83

LBP-TOP 20.56 0.102 37.47

LBP-TOP 44.86 0.09 35.85

LBP-TOP 110.28 0.09 36.93

LBP-TOP 119.96 0.09 36.93

Table 6: CRF results on video extracted AU intensity features and LBP-TOP fea-tures. Note that the ‘AU Int. Frame Majority’ is part of the ‘AU Int. Frame-by-Frame’ model and thus they have the same training and testing time.

As one may have noticed, the results from the AU intensity data extracted from the aligned face images has not been incorporated in table 6, because the results on this data were overall lower than the results on the video fragment extracted AU intensity data. To see the results for the aligned face image extracted AU intensity features, view Appendix A.

The highest results using the CRF were achieved by applying it to the LBP-TOP features. For each variation of the number of iterations, the CRF applied to the TOP features surpasses the data set’s baseline. The best result on the LBP-TOP features was achieved with 50 iterations. However, overall, the results were best when using 300 iterations. Although, the difference in running time between 50 and 300 iterations is not insignificant. The training of the CRF on the LBP-TOP features takes more than five times as long when using 300 iterations, instead of 50 iterations.

In table 7, the confusion matrix for the performance of the CRF with 300 it-erations, applied to the LBP-TOP features, is shown. In this confusion matrix, the expressions Anger and Happiness are classified with a relatively high accu-racy. However, again, the misclassified expressions are for a large part classified as Anger and Neutral.

(20)

An Di Fe Ha Ne Sa Su An 59.3 13.6 5.1 6.8 10.2 5.1 0 Di 25.6 25.6 10.3 12.8 10.3 7.7 7.7 Fe 25 9.1 20.5 9.1 15.9 11.4 9.1 Ha 6.3 9.5 11.1 60.3 4.8 6.3 1.6 Ne 6.6 9.8 9.8 14.8 39.3 18.0 1.6 Sa 11.9 11.9 13.6 11.9 23.7 20.3 6.8 Su 13.0 13.0 17.4 8.7 23.9 4.3 19.6

Table 7: Confusion matrix in percentages for Conditional Random Field classifi-cation with 300 iterations, on LBP-TOP features.

5.3 Latent Dynamic Conditional Random Field

The experiments using LDCRFs were conducted using the Matlab toolbox HCRF2.0b, just as the experiments with the CRFs. However, instead of varying the number of iterations, these experiments were conducted with a variable num-ber of hidden states. The numnum-ber of iterations was set at 300, since this yielded the overall highest results in the experiments with the CRFs. The number of hidden states was set at two, three, four and five. Again, the duration of the training and testing procedures was computed, together with the accuracy. These results are shown in table 8. As with the CRF results, the results of the AU intensity features extracted from the aligned face images are not included in this table, due to the overall lower performance on this data. See Appendix B for these results.

(21)

Time (s)

2 Hidden states AU Int. Median 0.069 0.0009 23.02

LBP-TOP 202.93 0.182 37.74

LBP-TOP 289.72 0.28 38.01

LBP-TOP 377.35 0.407 38.01

LBP-TOP 440.12 0.423 37.74

Table 8: LDCRF results on video fragment extracted AU intensity features and LBP-TOP features. Note that the ‘AU Int. Frame Majority’ is part of the ‘AU Int. Frame-by-Frame’ model and thus they have the same training and testing time.

The overall highest performance is achieved by the LDCRF with three hidden states. Also, using the LDCRF with three hidden states on the LBP-TOP features results in the highest performance of all tested methods and features. The corre-sponding confusion matrix of the LDCRF with three hidden states on the LBP-TOP features can be seen in table 9.

An Di Fe Ha Ne Sa Su An 61.0 11.9 5.1 8.5 8.5 5.1 0 Di 23.1 25.6 10.3 15.4 10.3 7.7 7.7 Fe 29.5 4.5 22.7 11.4 11.4 9.1 11.4 Ha 6.3 7.9 11.1 65.1 4.8 3.2 1.6 Ne 6.6 9.8 8.2 11.5 42.6 19.7 1.6 Sa 11.9 11.9 11.9 13.6 20.3 20.3 10.2 Su 15.2 15.2 13.0 13.0 26.1 4.3 13.0

Table 9: Confusion matrix in percentages for Latent Dynamic Conditional Ran-dom Field classification with 300 iterations and thee hidden states, on LBP-TOP features.

(22)

The classifications of Anger and Happiness have a relatively high accuracy, just as with the CRF classification. When comparing the confusion matrix from the CRF, in table 7, with this confusion matrix, it can be seen that almost each expression has either the same performance or the performance of the LDCRF surpasses that of the CRF. Only the classification of Surprise achieves a lower accuracy and its misclassifications are spread out equally over the remaining six expressions.

5.4 Discussion

For none of the classification methods did the accuracy of the results using the AU intensity features surpass the baseline of 34.4%. Only using the CRFs and LDCRFs on the LBP-TOP features resulted in an improvement of accuracy over the baseline. However, the highest accuracy of 38.01% is still not high enough to be called reliable. A viable reason for this fact might be that in normal human interaction, we rarely show one single facial expression at a time. Some facial ex-pressions may be shown as a combination, such as Anger and Disgust, and other may be shown in sequence, for example Surprise and Fear. This makes the prob-lem of facial expression recognition in the wild not only a classification task, but also a disambiguation task between facial expressions. This increases the difficulty of the task and thus decreases its performance. This problem was also noted by Gehrig and Ekenel [2013]. In the AFEW data set, some of the video fragments show several expressions, while being labelled as only one of those expressions. The approach of Gehrig et al. was to simply remove the video fragments where more than one facial expression was shown, to form a revised subset. This some-what increased their achieved performance.

Another problem that was observable from the confusion matrices, was that a lot of misclassifications were classified as Neutral. The most probable cause for these misclassifications is that in some of the fragments there is a large amount of frames that show a neutral facial expression more than showing the facial expres-sion of its actual class. This qualifies the fragment as being of that certain class, but it has a high probability of being classified as Neutral, because of the amount of frames in which the face was actually neutral.

Also, the results on the AU intensity features that were extracted from the aligned face images were overall lower than the results from the AU intensity fea-tures extracted from the video fragments. It was expected that the results from the aligned face image data would yield better results, because these AU intensity values would be extracted more reliably from already found facial images. How-ever, examine the images in figure 7. These are examples of some of the aligned face images of the AFEW data set. It seems that the face detection, used to create the aligned face images, was not equally accurate in detecting the faces in all the fragments. For some of the videos, only images such as in figure 7 were given.

(23)

Figure 7: Examples of bad aligned face images from the AFEW data set.

The main comparison in this thesis was between static and dynamic, or tempo-ral, classification. The Naive Bayes classification being the static method of clas-sification and the CRFs and LDCRFs being the temporal methods of clasclas-sification. However, there is also a distinction to be made in the features that have been used. The LBP-TOP features are spatio-temporal features, computed by using temporal information combined with spatial information, whereas the AU intensity features are static values extracted from static images. This fact should make the LBP-TOP features more descriptive for the task of facial expression recognition, than the AU intensity features. This effect can be seen in the results of the Naive Bayes classi-fication. However, accordingly, that should make the classification of the CRF and LDCRF on the AU intensity features better than the Naive Bayes classification on these features. This, however, is not the case. By combining the temporal classi-fication methods, CRF and LDCRF, with the spatio-temporal features, LBP-TOP, however, the results surpass all other results.

Some important comparisons can be made between temporal and static features and temporal and static classifiers. One of these comparisons is between the LBP-TOP features and the AU intensity features, where the performance of LDCRF on the spatio-temporal LBP-TOP features is 65% higher than the performance of that same classifier on the static AU intensities. The other comparison is between the LDCRF classifier and the Naive Bayes classifier, where the performance of the LDCRF classifier on the LBP-TOP features is almost 30% higher than that of the Naive Bayes classifier on the LBP-TOP features. This shows a significant increase in performance of facial expression recognition in the wild, when using temporal information.

6 Conclusion

Results from several experiments have been presented, in order to investigate the influence of temporal analysis in the task of facial expression recognition in the wild. These experiments can be compared on the basis of being conducted with static or temporal classifiers and using static or spatio-temporal features. The highest performance was achieved by the temporal LDCRF on the spatio-temporal LBP-TOP features, which was 38.01%. This outperformed the baseline of the

(24)

AFEW data set with 4%.

By comparing the static and temporal classifiers it was found that an increase in performance of 30% was achieved. The comparison between the static AU in-tensity features and the spatio-temporal LBP-TOP features yielded that an increase in performance of 65% was achieved. This shows that the use of temporal classifi-cation methods and the use of temporal information in the features has a significant positive influence on the performance of facial expression recognition in the wild.

6.1 Future Work

Even though an accuracy above the baseline was achieved, there is still enough room for improvement. This section describes some possibilities for future work that might improve the accuracy achieved in this thesis.

An improvement could be achieved by removing the fragments that showed more than one facial expression, or that showed a certain facial expression only in a small amount of the frames with a neutral expression in the rest of the frames. This is similar to the data set revision proposed in [Gehrig and Ekenel, 2013]. In that manner, the classifiers will have a less difficult task, because there is no need for disambiguation between facial expressions in one fragment.

Another option that might improve the accuracy, is by removing all frames that are classified as the Neutral class, since a large amount of frames in the fragments will show a neutral face. This would remove the problem that causes many of the misclassifications to be classified as Neutral.

Also, fusion with other feature modalities such as acoustic or linguistic data can be attempted. In the EmotiW challenge of 2013, the winner achieved a per-formance of 41% using deep neural networks on a combination of visual features, audio features, recognition of activity and Bag of Mouth features [Kahou et al., 2013].

Lastly, the LBP-TOP features can be extracted from only a number of frames at a time, instead of the full fragment at once. This could help in making a dis-tinguishment between several facial expressions in one fragment, because these different facial expressions will be part of different sliding windows of frames, instead of being all computed into one big set of features.

References

Nelson T. Alves. Recognition of static and dynamic facial expressions: a study review. Estudos de Psicologia (Natal), 18(1):125–130, 2013.

Dimitri P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.

Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel. Representations of quasi-newton matrices and their use in limited memory methods. Mathemati-cal Programming, 63(1-3):129–156, 1994.

(25)

Ira Cohen, Nicu Sebe, Ashutosh Garg, Lawrence S. Chen, and Thomas S. Huang. Facial expression recognition from video sequences: temporal and static model-ing. Computer Vision and Image Understanding, 91(1):160–187, 2003.

Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia, 19 (3):0034, 2012.

Paul Ekman and Wallace V. Friesen. Facial Action Coding System: Investigator’s Guide. Consulting Psychologists Press, 1978.

Tobias Gehrig and Hazım K. Ekenel. Why is facial expression analysis in the wild challenging? In Proceedings of the 2013 on Emotion recognition in the wild challenge and workshop, pages 9–16. ACM, 2013.

Suyog Jain, Changbo Hu, and Jake K. Aggarwal. Facial expression recognition with temporal modeling of shapes. In Computer Vision Workshops (ICCV Work-shops), 2011 IEEE International Conference on, pages 1642–1649. IEEE, 2011. Samira Ebrahimi Kahou, Christopher Pal, Xavier Bouthillier, Pierre Froumenty, Ç aglar Gülçehre, Roland Memisevic, Pascal Vincent, Aaron Courville, Yoshua Bengio, Raul Chandias Ferrari, Mehdi Mirza, Sébastien Jean, Pierre-Luc Car-rier, Yann Dauphin, Nicolas Boulanger-Lewandowski, Abhishek Aggarwal, Jeremie Zumer, Pascal Lamblin, Jean-Philippe Raymond, Guillaume Desjardins, Razvan Pascanu, David Warde-Farley, Atousa Torabi, Arjun Sharma, Emmanuel Bengio, Myriam Côté, Kishore Reddy Konda, and Zhenzhou Wu. Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Inter-action, ICMI ’13, pages 543–550, New York, NY, USA, 2013.

Pat Langley, Wayne Iba, and Kevin Thompson. An analysis of bayesian classifiers. In AAAI, volume 90, pages 223–228. Citeseer, 1992.

Rosalind W. Picard, Elias Vyzas, and Jennifer Healey. Toward machine emotional intelligence: Analysis of affective physiological state. Pattern Analysis and Ma-chine Intelligence, IEEE Transactions on, 23(10):1175–1191, 2001.

Shravya Shetty, Harish Srinivasan, Matthew Beal, and Sargur Srihari. Segmen-tation and labeling of documents using conditional random fields. In Elec-tronic Imaging 2007, pages 65000U–65000U. International Society for Optics and Photonics, 2007.

Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning, volume 2. Introduction to statistical relational learning. MIT Press, 2006.

(26)

Tim L.M. van Kasteren, Athanasios K. Noulas, and Ben J.A. Kr¨ose. Conditional random fields versus hidden markov models for activity recognition in temporal sensor data. 2008.

Appendices

A

Results: Aligned face image AU intensity features with

CRF

Time (s)

Table 10: CRF results on aligned face image extracted AU intensity features. Note that the ‘AU Int. Frame Majority’ is part of the ‘AU Int. Frame-by-Frame’ model and thus they have the same training and testing time.

(27)

B

Results:Aligned face image AU intensity features with

LDCRF.

Time (s)

Table 11: LDCRF results on aligned face image extracted AU intensity features. Note that the ‘AU Int. Frame Majority’ is part of the ‘AU Int. Frame-by-Frame’ model and thus they have the same training and testing time.

Facial Expression Recognition in theWild: The Influence of Temporal Information