
Automatic Segmentation of Spontaneous Data using Dimensional Labels from Multiple Coders

Mihalis A. Nicolaou∗, Hatice Gunes∗,‡ and Maja Pantic∗,†

∗ Department of Computing, Imperial College London, U.K.
(michael.nicolaou08, h.gunes, m.pantic)@imperial.ac.uk
‡ School of Computing and Communications, University of Technology Sydney, Australia
† Faculty of EEMCS, University of Twente, The Netherlands

Abstract

This paper focuses on the automatic segmentation of spontaneous data using continuous dimensional labels from multiple coders. It introduces efficient algorithms for (i) producing ground truth by maximizing inter-coder agreement, (ii) eliciting the frames or samples that capture the transition to and from an emotional state, and (iii) automatically segmenting spontaneous audio-visual data for use by machine learning techniques that cannot handle unsegmented sequences. As a proof of concept, the algorithms introduced are tested using data annotated in arousal and valence space. However, they can be straightforwardly applied to data annotated in other continuous emotional spaces, such as power and expectation.

1. Introduction

In everyday interactions people exhibit non-basic, subtle and rather complex mental or affective states like thinking, embarrassment or depression (Baron-Cohen and Tead, 2003). Accordingly, a single label (or any small number of discrete classes) may not reflect the complexity of the affective state conveyed by such rich sources of information (Russell, 1980). Hence, a number of researchers advocate the use of dimensional description of human affect, where an affective state is characterized in terms of a number of (continuous) latent dimensions (Russell, 1980), (Scherer, 2000).

Spontaneous data and their dimensional annotations, provided by multiple coders, pose a number of challenges to the field of automatic affect sensing and recognition (Gunes and Pantic, 2010). The first challenge is known as reliability of ground truth. In other words, achieving agreement amongst the coders that provide annotations in a dimensional space is very challenging (Zeng et al., 2009). In order to make use of the manual annotations for automatic recognition, most researchers take the mean of the coders' ratings, or assess the annotations manually. How to best model inter-coder agreement levels for automatic affect analyzers remains largely unexplored. The second challenge is known as the baseline problem: having "a condition to compare against" in order for the automatic recognizer to successfully learn the recognition problem at hand (Gunes and Pantic, 2010). Automatic affect analyzers relying on the audio modality obtain such a baseline by segmenting their data based on speaker turns (e.g., (Wollmer et al., 2008)). For the visual modality the aim is to find a frame in which the subject is expressionless and against which changes in the subject's motion, pose, and appearance can be compared. This is usually achieved by constraining the recordings to have the first frame containing a neutral expression. Although expecting an expressionless state in spontaneous multicue or multimodal data is a strong and unrealistic constraint, automatic affect analysers depend on the existence of such a baseline state (e.g., (Petridis et al., 2009; Gunes and Piccardi, 2009)). Moreover, a number of machine learning techniques such as (coupled) Hidden Markov Models and Hidden-state Conditional Random Fields cannot handle unsegmented sequences; they require the data to have a class label for the entire sequence. To date, many automatic affect recognizers using audio-visual data and utilizing the aforementioned techniques segment their data manually (e.g., (Petridis et al., 2009)).

This paper provides solutions to all of the aforementioned issues. It (i) produces ground truth by maximizing inter-coder agreement, (ii) elicits the frames or samples that capture the transition to and from an emotional state (a baseline condition to compare against), and (iii) automatically segments long sequences of spontaneous audio-visual data to be used by machine learning techniques that cannot handle unsegmented sequences.

2. Data

As a proof of concept, the algorithms introduced are tested using data annotated in arousal (how excited or apathetic the emotion is) and valence (how positive or negative the emotion is) space to obtain sequences that contain either positive or negative emotional displays. We use the Sensitive Artificial Listener Database (SAL-DB) (Cowie et al., 2005; Douglas-Cowie et al., 2007) and the SEMAINE Database (SEMAINE-DB), both of which contain audio-visual spontaneous expressions.

2.1. Data Sets and Annotations

For both the SAL-DB and the SEMAINE-DB, spontaneous data was collected with the aim of capturing the audio-visual interaction between a human and an avatar with four personalities: Poppy (happy), Obadiah (gloomy), Spike (angry) and Prudence (pragmatic).

The SAL data has been annotated by a set of coders who provided continuous annotations with respect to valence and arousal dimensions using the FeelTrace annotation tool (Cowie et al., 2000; Cowie et al., 2005). FeelTrace allows coders to watch the audio-visual recordings and move their cursor within the 2-dimensional emotion space (valence and arousal), confined to [−1, 1], to rate their impression of the emotional state of the subject.

For SAL-DB, 27 sessions (audio-visual recordings) from 4 subjects have been annotated. 23 of these sessions were annotated by 4 coders, while the remaining 3 sessions were annotated by 3 coders. The SEMAINE-DB has also been annotated using FeelTrace along five emotional dimensions (valence, arousal, power, expectation and intensity) separately, by (up to) 4 coders.

2.2. Challenges

The time-based operation of Feeltrace presents us with the following challenges: (i) for the sessions coded, there is no one-to-one correspondence between the timestamps of each coder, (ii) throughout the annotation files, there are time intervals where annotations are not available, and (iii) annotations are not (always) synchronized with the audio-visual data stream.

We tackle the first issue by binning the annotations: annotations that correspond to one video frame are grouped together. The second point refers to missing annotations for some sets of frames. This could potentially be due to the following reasons: (i) the coder might not be certain about the annotation for that particular interval, (ii) the coder might release the mouse button for some other reason, (iii) the coders appear to stop annotating when the avatar is talking, and (iv) the CPU load may have an effect on the frequency of measurements being recorded. Finally, the third issue could possibly be due to the following: (i) the response time is expression dependent, i.e., positive expressions are perceived faster and more accurately than negative ones (Alves et al., 2008), and (ii) the lag caused by the CPU load may have an effect on the synchronization between the actual video played and the recording of the annotations.

Table 1: The inter-coder MSE after applying local normalisation procedures: normalizing to a standard deviation of one and a zero mean (GD), normalizing to zero mean (ZA), and no normalisation (NN).

          ZA MSE   GD MSE   NN MSE
Valence   0.046    0.93     0.072
Arousal   0.0551   0.9873   0.0829

3. Methodology

In this section we address the challenges identified when working with databases annotated in continuous dimensional spaces.

Algorithm 1: Binning the annotations of the coders: {set of bins, b} ← Binning()

// all members of any structures are considered to be zero
for each coder file c in the annotation files set do
    for each annotation a in coder file c with a timestamp of t do
        Determine bin b where t ∈ b
        b.val ← b.val + a.val
        b.arsl ← b.arsl + a.arsl
        b.annotCount ← b.annotCount + 1
    end
    for all bins b in the set of bins do
        Average b.val and b.arsl by dividing with b.annotCount
    end
end

Algorithm 2: Detecting crossovers in coder annotations: {PosCrossOver, NegCrossOver} ← DetectCrossovers(coder c)

// bstr is the binned structure; every member is an annotation of A-V values at that frame by the specific coder
for each f in bstr do
    if sign(bstr(f).val) ≠ sign(bstr(f − 1).val) then
        if sign(bstr(f).val) > 0 then
            Add f to the PosCrossOver structure
        else if sign(bstr(f).val) < 0 then
            Add f to the NegCrossOver structure
        end
    end
end

3.1. Annotation Pre-processing

This process involves determining normalisation procedures and extracting statistics from the data in order to obtain segments with a baseline and high inter-coder agreement.

Binning. Binning refers to grouping and storing the annotations together. As a first step, the measurements of each coder c are binned separately. Since we aim at segmenting video files, we generate bins which are equivalent to one video frame f. This is equivalent to a bin of 0.04 seconds (SAL-DB was recorded at a rate of 25 frames/s). The basic binning procedure is illustrated in Algorithm 1. The fields with no annotation are assigned a "not a number" (NaN) identifier.
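For illustration, a minimal Python sketch of this binning step is given below. The function name and the assumed input format (one list of (timestamp, valence, arousal) tuples per coder file) are ours and not part of the original implementation; annotations falling into the same 0.04 s bin are averaged, and empty bins are marked NaN, as described above.

import math

FRAME_RATE = 25                  # SAL-DB video rate (frames/s)
BIN_SIZE = 1.0 / FRAME_RATE      # one bin = one video frame = 0.04 s

def bin_annotations(annotations, n_frames):
    # annotations: list of (t, val, arsl) tuples for a single coder file (assumed format)
    # returns per-frame valence and arousal lists; frames with no annotation are NaN
    sums = [[0.0, 0.0, 0] for _ in range(n_frames)]      # [sum_val, sum_arsl, count]
    for t, val, arsl in annotations:
        b = int(t / BIN_SIZE)                            # bin (frame) index for timestamp t
        if 0 <= b < n_frames:
            sums[b][0] += val
            sums[b][1] += arsl
            sums[b][2] += 1
    valence = [s[0] / s[2] if s[2] else math.nan for s in sums]
    arousal = [s[1] / s[2] if s[2] else math.nan for s in sums]
    return valence, arousal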

Normalisation. The arousal and valence (A-V) measurements for each coder are not in total agreement, mostly due to the variance in human coders' perception and interpretation of emotional expressions. Thus, in order to deem the annotations comparable, we need to normalize the data. Similar procedures have been adopted by other works using SAL-DB (e.g., (Wollmer et al., 2008)).

We experimented with various normalisation techniques. After extracting the videos and inspecting the superimposed ground truth plots, we opted for local normalisation (normalizing each coder file for each session). This helps us avoid propagating noise in cases where one of the coders is in large disagreement with the rest (where a coder has a very low correlation with respect to the rest of the coders). As can be seen from Table 1, locally normalizing to zero mean produces the smallest mean squared error (MSE) for both the valence (0.046) and arousal (0.0551) dimensions. Varying the standard deviation results in values which are outside the range of [−1, 1] and generates more disagreement between coders.
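The local normalisation and the inter-coder MSE of Table 1 can be sketched as follows; the helper names are ours, NumPy is assumed, and NaN (missing) frames are simply ignored.

import numpy as np

def normalise_zero_mean(values):
    # locally normalise one coder file (one session) to zero mean, ignoring NaN frames
    v = np.asarray(values, dtype=float)
    return v - np.nanmean(v)

def inter_coder_mse(a, b):
    # mean squared error between two coders' tracks, over frames annotated by both
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mask = ~np.isnan(a) & ~np.isnan(b)
    return float(np.mean((a[mask] - b[mask]) ** 2))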

Statistics and Metrics. We extract two useful statistics from the annotations, with the motivation of using them as measures of agreement amongst the annotations provided: correlation (COR) and sign-agreement (SAGR). We start the analysis by constructing vectors of pairs of coders that correspond to each video session; e.g., a video session where four coders have provided annotations gives rise to six pairs. For each of these pairs we extract the correlation coefficient between the valence (val) values of each pair, as well as the percentage of sign-agreement in the valence values, which stands for the level of agreement in emotion classification in terms of positive or negative:

SAGR(c_i, c_j) = \frac{\sum_{f=0}^{|frames|} e(c_i(f).val,\, c_j(f).val)}{|frames|}    (1)

where c_i and c_j represent the pair of coders the sign-agreement metric is calculated for, and c_i(f).val stands for the valence value annotated by coder c_i at frame f. The function e is defined as:

e(i, j) = \begin{cases} 1 & \text{if } sign(i) = sign(j) \\ 0 & \text{otherwise} \end{cases}

The sign-agreement metric is of high importance for the valence dimension as it determines whether the coders agree on the classification of the emotional state as positive or negative. More specifically, such metrics provide information regarding the perception and annotation behaviour of the coders (i.e., to what degree the data is annotated similarly by different coders). In these calculations we do not consider the NaN values to avoid negatively affecting the results.

After these metrics (agreement, correlation) are calculated for each pair, each coder is assigned the average of the results of all pairs that the coder has participated in. In other words, the averaged metric \bar{m}_{S,c_j} with respect to coder c_j for a specific metric m (i.e., correlation or agreement) is defined as follows:

\bar{m}_{S,c_j} = \frac{1}{|S| - 1} \sum_{i \in S,\, c_i \neq c_j} m(c_i, c_j)    (2)

where S is the relevant session annotated by |S| coders, and each coder annotating S is denoted c_i ∈ S. Essentially, we calculate the averaged level of agreement of coder c_j with respect to the rest, by using the metric m. This is somewhat equivalent to the numerator of the modified Williams Index, which would be obtained by dividing this numerator by the averaged level of agreement of all the coders except c_j (Alberola-Lopez et al., 2004). Instead, we obtain the weighted average by using the \bar{m} values as weights, as shown in line 28 of Algorithm 4. The automatic segmentation process is based on the correlation metric (\bar{cor}) alone, as correlation experimentally proved stricter than sign-agreement in providing a better comparison between the coders.
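As an illustration only, the two pairwise statistics and the per-coder averaging of Equation 2 might be computed as below; the function names are ours, and coders is assumed to be a dictionary mapping a coder id to its normalised valence vector (NaN for missing frames).

from itertools import combinations
import numpy as np

def pairwise_metrics(a, b):
    # correlation (COR) and sign-agreement (SAGR) between two valence tracks, skipping NaN frames
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mask = ~np.isnan(a) & ~np.isnan(b)
    cor = float(np.corrcoef(a[mask], b[mask])[0, 1])
    sagr = float(np.mean(np.sign(a[mask]) == np.sign(b[mask])))
    return cor, sagr

def averaged_metric(coders, use_correlation=True):
    # assign each coder the average of its pairwise metric over all pairs it takes part in (Eq. 2)
    idx = 0 if use_correlation else 1
    totals = {c: [] for c in coders}
    for ci, cj in combinations(coders, 2):
        m = pairwise_metrics(coders[ci], coders[cj])[idx]
        totals[ci].append(m)
        totals[cj].append(m)
    return {c: float(np.mean(v)) for c, v in totals.items()}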

Interpolation. In order to deal with the issue of missing values, similar to other works reporting on data annotated in continuous dimensional spaces (e.g., (Wollmer et al., 2008)), we interpolated the actual annotations at hand. We used piecewise cubic interpolation as it preserves the monotonicity and the shape of the data.
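A possible realisation of this step, assuming SciPy is available, is the shape-preserving PCHIP interpolant sketched below; the helper name is ours, and frames outside the annotated range are left as NaN rather than extrapolated.

import numpy as np
from scipy.interpolate import PchipInterpolator

def interpolate_missing(values):
    # fill NaN gaps in a binned annotation track with monotone piecewise cubic interpolation
    v = np.asarray(values, dtype=float)
    frames = np.arange(len(v))
    known = ~np.isnan(v)
    if known.sum() < 2:
        return v                                   # too few annotated frames to interpolate
    pchip = PchipInterpolator(frames[known], v[known], extrapolate=False)
    filled = v.copy()
    filled[~known] = pchip(frames[~known])         # leading/trailing gaps remain NaN
    return filled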

Algorithm 3: Match crossovers across coders for each session, maximizing the number of coders participating: {MatchedCO} ← MatchCrossOvers(CrossOvers)

1:  for each session s do
2:      for i = 4 to 2 do
3:          // get as many coders as possible to agree (max. 4 and min. 2)
4:          for each crossover co in CrossOvers belonging to s do
5:              currentlyMatched ← {co}
6:              Find all crossovers co2 in CrossOvers which:
7:                  - belong to s
8:                  - are from different coders
9:                  - co2 ≠ co ∧ abs(co2.time − co.time) ≤ 0.5 seconds
10:             Add the co2 to currentlyMatched
11:             if length(currentlyMatched) = i then
12:                 mark all crossovers in currentlyMatched as seen
13:                 add currentlyMatched to MatchedCO
14:                 remove currentlyMatched from CrossOvers belonging to s
15:             end
16:         end
17:     end
18: end

3.2. Automatic Segmentation

The automatic segmentation stage consists of producing negative and positive audio-visual segments with a temporal window that contains an offset before and after (i.e., the baseline) the displayed expression. This process is presented in Algorithm 4, which makes use of Algorithms 2 and 3.

Firstly, we describe the actual time window that the audio-visual segment is supposed to capture. For instance, for capturing negative emotional states, if we assume that the transition from a non-negative to a negative emotional state occurs at time t (in seconds), we then have a window of [t − 1, t, t′, t′ + 1], where t′ seconds is when the emotional state of the subject returns to non-negative. The procedure is analogous for positive emotional states.

Detecting and Matching Crossovers. In Algorithm 2, for an input coder c, the crossing over from one emotional state to the other is detected by examining the valence values and identifying the points where the sign changes. Here a modified version of the sign function is used which returns 1 for values ≥ 0 (a valence value of 0 is never encountered in the annotations), −1 for negative values, and 0 for NaN values. Algorithm 2 accumulates all crossover points for each coder, and returns the sets of crossovers to a positive (PosCrossOver) and to a negative (NegCrossOver) emotional state. The output is then passed to Algorithm 3.
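A minimal sketch of this detection step follows; the modified sign convention described above is reproduced (values ≥ 0 map to 1, negative values to −1, NaN to 0), and the function names are ours.

import math

def msign(x):
    # modified sign: 1 for x >= 0, -1 for x < 0, 0 for NaN
    if math.isnan(x):
        return 0
    return 1 if x >= 0 else -1

def detect_crossovers(valence):
    # frame indices where one coder's valence crosses to a positive / to a negative state
    pos_crossovers, neg_crossovers = [], []
    for f in range(1, len(valence)):
        s_now, s_prev = msign(valence[f]), msign(valence[f - 1])
        if s_now != s_prev:
            if s_now > 0:
                pos_crossovers.append(f)
            elif s_now < 0:
                neg_crossovers.append(f)
    return pos_crossovers, neg_crossovers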

The goal of Algorithm 3 is to match crossovers across coders. For instance, if a session has annotations from 4 coders, due to the synchronization issues discussed previously, the frame (f) where each coder detects the crossover is not the same for all coders (for the session in question). Thus, we have to allow an offset for the matching process.


Figure 1: Two examples of interpolated valence ((a),(c)) and arousal ((b),(d)) plots from two individual segments produced by the segmentation procedure.


Figure 2: Valence annotations from two coders in SEMAINE-DB before and after applying pre-processing operations.

This procedure searches the crossovers detected by the coders and then accepts the matches where there is less than the pre-defined offset (time) difference between them. When a match is found, we remove the matched crossovers and continue with the rest. The existence of different combinations of crossovers which may match using the pre-defined offset poses an issue. By examining the available datasets, we decided to maximize the number of coders participating in a matched crossover set rather than minimizing the temporal distances between the participating coders. The motivations for this decision are as follows: (i) if more coders agree on the crossover, the reliability of the ground truth produced will be higher, and (ii) the offset amongst the resulting matches is on average quite small (less than 0.5 secs) when considering only the number of participating coders. Maximising the number of participating coders can simply be achieved by iterating over the entire set of crossovers. This is expressed by the loop beginning in line 2 of Algorithm 3. We disregard cases where only one coder detects a crossover due to lack of agreement between coders.
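The matching step can be sketched as follows; the data layout (a list of (coder_id, frame) pairs per session and per direction) and the helper name are our own simplification of Algorithm 3, and the 0.5 s window is expressed in frames.

FRAME_RATE = 25
MAX_OFFSET_FRAMES = int(0.5 * FRAME_RATE)        # 0.5 s matching window

def match_crossovers(crossovers, max_coders=4):
    # crossovers: list of (coder_id, frame) pairs for one session and one direction
    # returns matched sets, preferring matches with more participating coders (4, then 3, then 2)
    remaining = list(crossovers)
    matched_sets = []
    for target in range(max_coders, 1, -1):
        for co in list(remaining):
            if co not in remaining:
                continue                          # already consumed by an earlier match
            coder, frame = co
            group = [co]
            for other_coder, other_frame in remaining:
                if (other_coder not in [c for c, _ in group]
                        and abs(other_frame - frame) <= MAX_OFFSET_FRAMES):
                    group.append((other_coder, other_frame))
            if len(group) == target:
                matched_sets.append(group)
                for g in group:
                    remaining.remove(g)
    return matched_sets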

Figure 3: Example frames from an automatically extracted segment from SEMAINE-DB capturing the transition from a negative to a positive emotional state and back.

Segmentation Driven by Matched Crossovers. This procedure (illustrated in Algorithm 4) takes the output of Algorithm 3 and obtains the sets of matched crossovers (Algorithm 3, lines 6-7). An iteration over all sets of matched crossovers for the to-Negative transition is shown starting in line 7. mcos, mcos(i).f and mcos(i).c represent the current matched crossover, the frame where the i-th crossover (of the matched crossover) occurred, and the coder who detected the i-th crossover of mcos, respectively. mcos(i).val is the vector of valence measurements for the coder i participating in mcos. The crossover frame decision (for each member of the set) is made in lines 10:17, and the start frame of the video segment is decided. In order to capture 1 second before the transition window, the number of frames corresponding to the pre-defined offset is subtracted from the start frame. The ground truth values for valence are retrieved in lines 19:30 by incrementing the initial frame number where each crossover was detected by the coders. The procedure of determining combined average values continues until the valence value crosses again to a non-negative value. The endpoint of the audio-visual segment is then set to the frame including the offset after crossing back to a non-negative valence value. The ground truth of the audio-visual segment consists of the arousal and valence (A-V) values described in lines 24 and 28 of the algorithm. If only two coders agree in the detection of crossovers, their contribution is weighted by using the correlation metric (cor, calculated as described in Equation 2).
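The core weighting rule of Algorithm 4 (plain mean when three or four coders agree, correlation-weighted mean when only two agree) reduces to a few lines. The sketch below uses our own names, with cor_per_coder holding each coder's averaged correlation metric from Equation 2.

import numpy as np

def combined_value(values_per_coder, cor_per_coder):
    # values_per_coder: {coder_id: valence or arousal value at the current frame}
    # cor_per_coder:    {coder_id: averaged correlation metric of that coder (Eq. 2)}
    vals = np.array(list(values_per_coder.values()), dtype=float)
    if len(vals) >= 3:
        return float(np.mean(vals))               # 3 or 4 coders agree: plain average
    weights = np.array([cor_per_coder[c] for c in values_per_coder], dtype=float)
    return float(np.sum(vals * weights) / np.sum(weights))   # 2 coders: weight by correlation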

4. Experiments and Results

As a proof of concept, the algorithms introduced have been extensively tested on SAL-DB.

We first present in Fig. 1 two segments extracted by using Algorithm 4 for a transition to a negative emotional state. The first dashed vertical line represents the transition to that state, and the second one the transition out of it. In the plots, we present the A-V values after the interpolation; thus, at times no crossover may be observed in the valence values. As performance evaluation is a significant issue for any automatic system, in Table 2 we attempt to provide meaningful performance results of the introduced algorithms on SAL-DB. The table presents the performance of the automatic audio-visual segmentation procedure in terms of: (i) how well it is able to utilise the actual number of frames (# of frames), (ii) using the given data, how many audio-visual segments it is able to produce (# of segments), and (iii) how much overlap there is (overlap) between the segments, and between the positive and negative classes. The goal of the automatic segmentation procedure is then to utilise as many frames as possible from the given data to produce a high number of meaningful segments. Too much overlap between the segments or between the classes is unintended and undesirable, but expected to some degree due to the offsets before and after the transitions. By observing Table 2 we conclude that the algorithm fulfills its goal.

Table 2: Evaluation of the introduced segmentation algorithms using SAL-DB. The table presents the actual number of frames together with the utilised number of frames (# of frames), the number of audio-visual segments produced (# of segments) using the data at hand, and the intra-class (percentage of frames included in more than one segment within the same class) and inter-class (percentage of frames included in both classes) overlap.

subject #                        1        2        3        4
total # of frames                56162    80553    28583    88199
negative   # of frames           27389    46056    14554    43353
           # of segments         110      170      99       166
           intra-class overlap   6.42%    8.33%    4.53%    7.70%
positive   # of frames           23831    36034    13584    38589
           # of segments         110      149      91       174
           intra-class overlap   18.90%   14.18%   10.22%   11.60%
inter-class overlap              6.16%    7.39%    14.37%   9.92%
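For completeness, the overlap figures of Table 2 can be estimated from the extracted segments roughly as sketched below; the helper is hypothetical, segments are assumed to be (start_frame, end_frame) ranges, and the exact denominators used in the paper are an assumption on our part.

from collections import Counter

def overlap_stats(neg_segments, pos_segments):
    # neg_segments / pos_segments: lists of (start_frame, end_frame) ranges per class
    def frame_counts(segments):
        counts = Counter()
        for start, end in segments:
            counts.update(range(start, end))
        return counts

    neg, pos = frame_counts(neg_segments), frame_counts(pos_segments)
    # intra-class: fraction of a class's frames appearing in more than one segment of that class
    intra_neg = sum(1 for n in neg.values() if n > 1) / max(len(neg), 1)
    intra_pos = sum(1 for n in pos.values() if n > 1) / max(len(pos), 1)
    # inter-class: fraction of utilised frames appearing in both classes (assumed denominator)
    inter = len(set(neg) & set(pos)) / max(len(set(neg) | set(pos)), 1)
    return intra_neg, intra_pos, inter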

Algorithm 4: Segment and produce ground truth: Segmentation()

1:  for each coder annotation file c do
2:      // capture a transition to and from a neg. state to a non-neg. state
3:      // use the correlation (cor) for weighting when a match has 2 coders
4:      {PosCrossOver, NegCrossOver} ← DetectCrossovers(c)
5:      MatchedPos ← MatchCrossOvers(PosCrossOver)
6:      MatchedNeg ← MatchCrossOvers(NegCrossOver)
7:      for each matched set of crossovers mcos in MatchedNeg do
8:          // average time (frame) of crossing over to negative valence
9:          // a 0.5 second offset has been used
10:         if length(mcos) ≥ 3 then
11:             // agreement in 3 or 4 coders
12:             avgFrm = int( Σ_{i=0}^{|mcos|} mcos(i).f / length(mcos) )
13:         end
14:         else
15:             // 2 coders agree, weight using correlation (cor)
16:             avgFrm = int( Σ_{i=0}^{|mcos|} (mcos(i).f · cor(mcos(i).c)) / Σ_{i=0}^{|mcos|} cor(mcos(i).c) )
17:         end
18:         startFrm = avgFrm − 25
19:         incFrm ← 0
20:         repeat
21:             incFrm ← incFrm + 1
22:             if length(mcos) ≥ 3 then
23:                 // agreement in 3 or 4 coders
24:                 avgValence = Σ_{i=0}^{|mcos|} mcos(i).val(mcos(i).f + incFrm) / length(mcos)
25:             end
26:             else
27:                 // 2 coders agree, weight using cor
28:                 avgValence = Σ_{i=0}^{|mcos|} (mcos(i).val(mcos(i).f + incFrm) · cor(mcos(i).c)) / Σ_{i=0}^{|mcos|} cor(mcos(i).c)
29:             end
30:         until sign(avgValence) = 1 or avgValence is NaN
31:         // add offset after crossing back to non-negative (or NaN)
32:         endFrm = (avgFrm + incFrm) + 25
33:         // video is segmented in the range [startFrm, endFrm]
34:         // ground truth (valence/arousal) is averaged
35:     end
36:     // the process is repeated analogously for "to-Positive" crossovers (MatchedPos) - line 7
37: end

As a final step, we test the developed algorithms on the recently released SEMAINE-DB. Although the arousal and valence annotations of SEMAINE-DB do not contain NaN values, the steps to be followed for segmentation are similar.

Finally, a qualitative assessment of the proposed algorithms is provided by Fig. 2 and Fig. 3. Fig. 2 illustrates valence annotations from two coders in SEMAINE-DB before and after applying the pre-processing operations (for synchronization). Fig. 3 shows example frames from an automatically extracted segment from SEMAINE-DB using the presented algorithms. Overall, the produced segment appears to capture well the transition from a negative emotional state to a positive one, and back.

5. Conclusion

This paper introduced efficient algorithms for (i) producing ground truth by maximizing inter-coder agreement, (ii) eliciting the frames that capture the transition to and from an emotional state, and (iii) automatically segmenting spontaneous multimodal data to be used by machine learning techniques that cannot handle unsegmented sequences. As a proof of concept, the algorithms introduced have been tested using SAL and SEMAINE data annotated in arousal and valence spaces. Overall, the automatic segmentation procedures introduced appear to work as desired and output segments that capture the targeted emotional transitions well.

6. Acknowledgments

The work of Mihalis A. Nicolaou and Maja Pantic is funded by the European Research Council under the ERC Starting Grant agreement no. ERC-2007-StG-203143 (MAHNOB). The work of Hatice Gunes is funded by the European Community's 7th Framework Programme [FP7/2007-2013] under grant agreement no. 211486 (SEMAINE).

7. References

C. Alberola-Lopez, M. Martin-Fernandez, and J. Ruiz-Alzola. 2004. Comments on: A methodology for evaluation of boundary detection algorithms on medical images. IEEE Transactions on Medical Imaging, 23(5):658–660, May.

N. T. Alves, J. A. Aznar-Casanova, and S. S. Fukusima. 2008. Patterns of brain asymmetry in the perception of positive and negative facial expressions. Laterality: Asymmetries of Body, Brain and Cognition, 14:256–272.

S. Baron-Cohen and T. H. E. Tead. 2003. Mind reading: The interactive guide to emotion. Jessica Kingsley Publishers Ltd.

R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, and M. Schroder. 2000. Feeltrace: An instrument for recording perceived emotion in real time. In Proc. of ISCA Workshop on Speech and Emotion, pages 19–24.

R. Cowie, E. Douglas-Cowie, and C. Cox. 2005. Beyond emotion archetypes: Databases for emotion modelling using neural networks. Neural Networks, 18:371–388.

E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner, N. Amir, and K. Karpouzis. 2007. The HUMAINE database: Addressing the needs of the affective computing community. In Proc. of Int'l Conf. on Affective Computing and Intelligent Interaction, pages 488–500.

H. Gunes and M. Pantic. 2010. Automatic, dimensional and continuous emotion recognition. International Journal of Synthetic Emotions, 1(1):68–99.

H. Gunes and M. Piccardi. 2009. Automatic temporal segment detection and affect recognition from face and body display. IEEE Trans. on Systems, Man, and Cybernetics - Part B, 39(1):64–84.

S. Petridis, H. Gunes, S. Kaltwang, and M. Pantic. 2009. Static vs. dynamic modeling of human nonverbal behavior from multiple cues and modalities. In Proc. of ACM Int'l Conf. on Multimodal Interfaces, pages 23–30.

J. A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology, 39:1161–1178.

K. R. Scherer. 2000. Psychological models of emotion. In The Neuropsychology of Emotion, pages 137–162. Oxford University Press.

M. Wollmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. Cowie. 2008. Abandoning emotion classes - towards continuous emotion recognition with modelling of long-range dependencies. In Proc. of 9th Interspeech Conf., pages 597–600.

Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang. 2009. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 31:39–58.
