
MSc Artificial Intelligence

Master Thesis

A Real-Time Convolutional Approach To Speech Emotion Recognition

by

Jorn Engelbart

11331712

July 26, 2018

36 ECTS February 2018 - July 2018

Supervisors:

K Gavrilyuk

A Ghodrati

Assessor:

E Gavves

MessageBird

Abstract

In the task of Speech Emotion Recognition, the speaker's emotion is classified based on audio recordings of their voice. This research explores the use of machine learning models to recognize emotions in speech in real time. For this purpose, this research presents the DeepEmoNet architecture. DeepEmoNet uses a CNN-based architecture to classify emotion on raw audio input. The experiments show that this architecture achieves scores that are competitive with current state-of-the-art approaches. Furthermore, the experiments show that the model achieves those scores in real time.

Acknowledgements

I would like to thank my supervisors Kirill and Amir for helping me throughout the thesis. I would also like to thank all the people I could bounce ideas off during the thesis; it helped a lot. Finally, I would like to thank MessageBird for giving me the opportunity to write my thesis at the company.

Contents

1 Introduction
  1.1 Research Questions
2 Related work & Background
  2.1 Audio
  2.2 Speech Emotion Recognition
    2.2.1 Pipeline
    2.2.2 State-of-the-Art Approaches
    2.2.3 Limitations of Recurrent Neural Networks
    2.2.4 Convolutional Neural Networks for Audio
  2.3 Emotion
    2.3.1 Emotional Utterances
    2.3.2 Subjective Labeling
  2.4 Real-time Evaluation
  2.5 Challenges in SER
3 Method & Approach
  3.1 Problem Statement
  3.2 DeepEmoNet
    3.2.1 Network architecture
    3.2.2 Causal Pooling Blocks
    3.2.3 Dilations
    3.2.4 Attention Model
    3.2.5 Batch Normalization
    3.2.6 Loss function
  3.3 Implementation
4 Experimental Setup
  4.1 Data
    4.1.1 IEMOCAP
    4.1.2 MSP-IMPROV
    4.1.3 Notes on the datasets
    4.1.4 Data Pre-processing
    4.1.5 Data for Experiments
  4.2 Setup
    4.2.1 Metrics
    4.2.2 Training
    4.2.3 Inference
5 Results & Discussion
  5.1 Ablation Study
  5.2 Parameter tuning
  5.3 Comparison with State-of-the-Art Approaches
  5.4 Real-time performance of DEN
  5.5 Attention Mechanism Visualization
6 Conclusion
  6.1 Future Work

Chapter 1

Introduction

In recent years, human-computer interaction has moved towards a more human-like conversational style [37]. The introduction of digital personal assistants like Siri [2] and Alexa [1] has shown that a more natural style of interacting with computers can help with everyday tasks. Interacting with a computer used to be a static task in which an action leads to a direct result. With a natural conversational style of interaction, this is completely different. Of course, a certain combination of words can be directly linked to certain actions, but this does not qualify as natural language. Furthermore, on a technical level, it already assumes a perfect conversion from human speech to computer-interpretable text, which is a hard task in itself. For a natural conversation, the speech should not only be converted to text; this text should also be evaluated with the full context of the previous conversation in mind. Based on this complete context and the newly gained textual knowledge, the computer should be able to decide what action to take. Possible actions include responding with a clarifying question, executing the task and giving an appropriate response to the user, or doing nothing because the conversation has ended.

All of this depends on the interpretation the system has of the perceived information. This interpretation could be established using only the literal words the conversation partner expresses, but in natural human conversation, these words are not the only clues. A big part of what makes human conversation natural are the emotions and nuances expressed in the speaker's voice. One of the most striking examples of this is sarcasm: its use can reverse the meaning of a sentence without changing a single word. The same holds for saying the same sentence with different emotions; although more subtle, these variations can carry different interpretations and are very important for a fluent and dynamic natural conversation. Think of a situation where, during a human-computer conversation, the human speaker starts expressing themselves in an increasingly angry manner. While the words in the conversation might not indicate a frustrated speaker, the voice of the speaker can suggest this. When this frustration is perceived, the computer could change to a more calming tone of voice and de-escalate the situation. It is therefore important for natural human-computer interaction to have a sense of what emotion is expressed during a conversation. This research aims for a method that extracts these emotions from a voice recording in real time, such that it can be used in systems for natural human-computer interaction.

Building such a Speech Emotion Recognition (SER) model is a complex task that requires the extraction of features from unstructured data. Recent developments in machine learning have been shown to excel at such tasks. With techniques such as neural networks, highly complex computational tasks on complex data structures are being tackled. This includes Convolutional Neural Networks (CNN) [24], which achieve state-of-the-art results on image recognition tasks [18, 41, 39], and Long Short-Term Memory (LSTM) networks [19], which have proven able to uncover complex time series [25, 31]. These techniques have also been applied to audio-related tasks with state-of-the-art performance [42, 7, 33]. With these developments in mind, this research aims to develop a neural network approach that recognizes emotion in speech in real time. Real-time performance is essential for a natural feel in a conversational context. Current approaches [33, 7, 26] extract features from the raw audio before feeding it to the model, resulting in a pre-processing step before a machine learning model is applied. However, the recent development of WaveNet [42] has shown that machine learning models are capable of handling raw audio formats. With the WaveNet architecture in mind, the aim of this research is to use as little pre-processing on the audio as possible and work with the raw audio format. This aim makes the approach in this research differ heavily from previous approaches to the task of Speech Emotion Recognition (SER).

1.1 Research Questions

This research aims to produce a machine learning model that works with raw audio input and achieves real-time performance with that model. The research questions addressed are the following:

1. Is it possible to train a machine learning model able to classify emotion in speech from raw audio?

2. Is it possible to achieve real-time performance in emotion classification in speech with a machine learning model?


Chapter 2

Related work & Background

This chapter provides background on the Speech Emotion Recognition (SER) problem. Section 2.1 discusses the properties of (digital) audio. Section 2.2 explains approaches to SER up until now, as well as approaches to other audio tasks. Section 2.3 discusses research into emotion in speech on an acoustic level. Section 2.4 discusses the intuition behind real-time applications. Section 2.5 summarizes the challenges for SER that arise throughout this chapter.

2.1 Audio

Before going into the topic of SER, it is important to elaborate on the format of audio. Audio is perceived as a highly complex piece of data. Take music, for instance: while listening, not only can multiple instruments be distinguished, but those instruments can all produce a different pitch, timbre, and volume. This suggests that a complex data structure is needed to capture all these features. However, they can all be captured by a single waveform representing the amplitude over time [35]. In an analog setting, this waveform is continuous; for digital audio, it is approximated by discrete sampling of the waveform. Since this is an approximation of the continuous waveform, some quality is lost. The quality of the audio is expressed by its sample rate in Hertz (Hz), equal to the number of samples per second. To illustrate, the sample rate of a Compact Disc is 44.1 kHz, equal to 44100 samples per second [35]. For good-quality voice recordings a 16 kHz sample rate is sufficient, as used in Voice over IP (VoIP) [43].
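As a minimal illustration of this representation, the sketch below loads a recording as a one-dimensional waveform sampled at 16 kHz. It assumes the librosa package and a hypothetical file path; neither is prescribed by the thesis itself.

```python
# Minimal sketch: represent a recording as a 1-D waveform at 16 kHz.
# The librosa dependency and the file path are illustrative assumptions.
import librosa

waveform, sample_rate = librosa.load("recording.wav", sr=16000, mono=True)

print(waveform.shape)                   # (num_samples,) -- one float per sample
print(sample_rate)                      # 16000 samples per second
print(waveform.min(), waveform.max())   # amplitudes, roughly in [-1, 1]
```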

2.2 Speech Emotion Recognition

This section discusses the current SER approaches and approaches to other audio-related tasks. Section 2.2.1 elaborates on the feature extraction and classification pipeline of current approaches. Section 2.2.2 summarizes the state-of-the-art approaches considered in this research. Section 2.2.3 states the limitations of the current approaches. Section 2.2.4 discusses the recent use of Convolutional Neural Networks for audio analysis.

2.2.1 Pipeline

Multiple studies in the field of Artificial Intelligence have investigated the subject of SER. These studies all approach the problem in two stages. First, features are extracted from the raw audio waveform; then those features are used to classify the audio sample using a classification algorithm.

Feature extraction

Extracting features for SER is generally done in one of two ways. The first approach is to manually extract as many relevant features as possible from short segments of audio. An example is [7], where Mel-Frequency Cepstrum Coefficients (MFCC), chromagram-based features and spectrum properties are used. Especially the MFCC features are frequently used in speech tasks [34]. These features are based on the Mel scale, a frequency-binning method based on the human ear's frequency resolution: human ears resolve frequencies in a non-linear manner across the audio spectrum. With the use of the Mel scale, the speech data is parameterized on the frequency response (pitch) in the audio [44]. For the MFCC features, the log of these binned frequencies is taken and a discrete cosine transformation is applied. [12] compares different such feature sets (including a variant without the cosine transformation). This illustrates the highly handcrafted nature of the feature extraction. Other notable features are pitch-related and energy-related features [17].
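For illustration, the sketch below computes window-level MFCCs in the style of such handcrafted-feature pipelines. It assumes librosa; the 200 ms window and 100 ms hop are illustrative values, not settings taken from the cited works.

```python
# Sketch of window-level MFCC extraction as used by handcrafted-feature
# pipelines (librosa assumed; window settings are illustrative).
import librosa

waveform, sr = librosa.load("utterance.wav", sr=16000, mono=True)

mfcc = librosa.feature.mfcc(
    y=waveform,
    sr=sr,
    n_mfcc=13,                    # cepstral coefficients per window
    n_fft=int(0.200 * sr),        # 200 ms analysis window
    hop_length=int(0.100 * sr),   # 100 ms step between windows
)
print(mfcc.shape)                 # (13, number_of_windows)
```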

The second approach is to learn to extract features using a neural network. This can be done in an unsupervised way using autoencoders [9]. In this approach, the MFCC input feature space is reduced to a smaller latent feature space with an autoencoder architecture. The intuition for this technique is that the latent feature space contains a more meaningful representation of the speech data, since only features important for the reconstruction of the original input features are encoded. Similarly, [26] used a semi-supervised Conditional Variational Autoencoder (CVAE) to extract features. By using a CVAE, the input features are conditioned on an emotion class to obtain a more emotion-specific latent representation. Others have used supervised methods such as Convolutional Neural Networks (CNN) [24] to extract features from a visualization of audio with spectrograms [20] or time-frequency maps [33]. Figure 2-1 shows an example of a time-frequency map used in [33] to represent the audio input. The time-frequency maps are fed into a CNN layer before the flattened features from those layers are fed into the LSTM architecture classifying the samples.

Figure 2-1: Time-frequency map of audio, used as input in [33].

For both these approaches, the features are calculated over a short segment of audio, e.g. every 100 milliseconds. For a complete audio sample that spans more than this segment length, this results in an aggregation with the designed feature vector for every time span in the sample. This greatly reduces the number of time points that represent an audio sample of a certain length compared to the raw audio, where every part of the audio is expressed as an individual value representing the amplitude at the sample rate of the audio sample.

Classification

After obtaining relevant features, the second part of the model classifies the audio samples based on these features. Classification algorithms such as Hidden Markov Models [38] and Support Vector Machines (SVM) [10] are used in [22, 11]. A more neural approach is taken with Deep Neural Networks (DNN) [17] and Recurrent Neural Networks (RNN) [7]. [29] use a combination of a CNN followed by a Long Short-Term Memory (LSTM) network, called a Convolutional Recurrent Neural Network (CRNN). [33] use a similar CRNN approach, extended with an attention mechanism.

2.2.2 State-of-the-Art Approaches

The individual parts of feature extraction and classification were explained in the previous sections. This section gives an overview of the approaches this research is compared against and the techniques they use.

Chernykh et al. [7] use a set of 34 handcrafted features for every 200 milliseconds of audio, with a moving window step of 100 milliseconds; this results in 78 vectors of dimension 34 for an audio sample of 4 seconds. These vectors are fed into an LSTM architecture to classify the given sample. The model is trained using the Connectionist Temporal Classification (CTC) loss [15], which is specially designed to classify sequences where the timing in the samples is variable.

For the architecture by Mu et al. [33], audio samples are visualized with a time-frequency map, which is fed into a CNN architecture that in turn feeds into an LSTM architecture. Figure 2-1 shows an example of such a time-frequency map. Based on [32], an attention mechanism is introduced to this architecture that focuses the network on the most important part(s) of a sample for emotion recognition.

In the approach by Latif et al. [26], first a latent representation of audio segments is obtained using a CVAE. These representations are then fed into an LSTM architecture and trained using the CTC loss, similar to [7].

2.2.3 Limitations of Recurrent Neural Networks

The state-of-the-art approaches achieve their results with RNN-based models. However, these models are not able to handle raw audio data. A raw audio waveform consists of 16000 samples per second when using a 16 kHz sample rate. Section 2.2.1 describes how most approaches divide longer audio samples into short segments which are aggregated into a smaller handcrafted feature set. An example can be found in [7], where every 100 milliseconds (1600 samples) of audio is aggregated into 34 features. The aggregation of short audio segments is needed because the recurrent networks used in these approaches are difficult to train properly on long sequences, as they suffer from vanishing gradients, exploding gradients and gradient decay over layers [28]. Training them on raw audio waveforms with long sequences is therefore a hard task. Using the handcrafted features, the sequences are shortened, making it feasible to train the recurrent neural networks for classifying emotion in speech.

2.2.4 Convolutional Neural Networks for Audio

CNNs have shown incredible results on image classification tasks [24]. Using residual blocks in these CNN architectures has been shown to further improve results for image classification [18]. Both these techniques are used by [42] to analyze audio. [42] shows how a combination of a convolutional architecture with dilated convolutional layers and residual blocks achieves state-of-the-art performance in speech synthesis. Despite it not being the main focus of their research, the authors also report state-of-the-art performance in speech recognition using their architecture. With this, they show that their architecture is not only a solution for a specific problem but suitable for more audio-based tasks.

2.3 Emotion

This section discusses some aspects of the emotional utterances that are classified in the task of SER. Section 2.3.1 gives more information about the utterances themselves, while section 2.3.2 discusses the subjectivity of the labeling of the emotional utterances.

2.3.1 Emotional Utterances

For the task of recognizing emotion in vocal recordings, it is important to have some knowledge of the emotional utterances themselves. A lot of research has been done from a psychological and acoustic point of view [40, 22, 36]. An emotional utterance is not captured in a single moment; it is expressed over a certain period. The length of these periods is highly dependent on the speaker and the emotion expressed. However, it has been shown that an emotional utterance can be identified from a minimum length of 250 ms [22, 36].

2.3.2 Subjective Labeling

The identification of emotion is a subjective matter. Whereas most classification tasks have a well-defined label for the subject at hand (e.g. object classification in images), with SER the labels assigned to emotional utterances are not certain. One person can identify an utterance as angry while another person might consider it to be sad. [7] reported on the human error over the angry, sad, happy and neutral classes for the widely used IEMOCAP [5] dataset. They report a 70% human accuracy on the dataset when relabeling the audio samples. Figure 2-2 shows the confusion matrix for the human relabeling of the data. This not only shows the high subjectivity of the classification of emotion, but also shows that reported performances of models tackling this problem should be considered with this number in mind.

Figure 2-2: Confusion matrix for human performance on the IEMOCAP dataset.

2.4 Real-time Evaluation

As mentioned in chapter 1, this research aims for a model capable of real-time evaluation. Real-time applications have the fundamental requirement that they are able to respond to external events with short and deterministic delays [16, 14]. This is most easily reflected by the computational time it takes to evaluate a sample. However, this metric is highly dependent on the hardware that runs the model. Nevertheless, since the task of SER is intuitively aimed at use in a conversational setting, the measure for real-timeness is based on that setting. In such a setting, it does not feel natural to wait seconds for a model to classify the emotion of the sentence just expressed; it is expected that the classification can be done continuously while a sentence is being expressed.

2.5 Challenges in SER

This chapter discussed the task of SER. There are multiple challenging aspects to this problem. Firstly, since this research aims to use as few pre-processing steps as possible, a model that handles raw audio input is desirable. As discussed in section 2.2.3, using this raw format causes problems with existing RNN approaches. The challenge is to find a model that is able to handle the high frequency of time points in the data. Secondly, as discussed in section 2.3.1, the length of emotional utterances varies depending on the speaker and the emotion uttered. This not only requires a model to maintain long-term relationships in audio samples to extract emotional information, but also requires the model to extract only the relevant information from a longer frame of audio, since it should be able to capture possibly longer utterances. Lastly, as discussed in section 2.3.2, there is the subjective nature of the emotional utterances that need to be classified. This subjectivity makes for a challenging problem, since the model can only rely on the annotated dataset, a dataset on which even humans cannot obtain perfect scores.

Chapter 3

Method & Approach

This research aims for a model that is capable of classifying emotion in speech in real time with as few pre-processing steps as possible. In this chapter, the DeepEmoNet (DEN) model is presented. The DEN model is a completely end-to-end trainable convolutional approach for audio analysis that uses the raw audio waveform as input and addresses the challenges stated in section 2.5.

3.1 Problem Statement

The task of the model is to classify emotion in speech audio. The input data is digital audio sampled at 16 kHz. This means that the audio is encoded as a stream of 16000 one-dimensional floats per second that represent the amplitude. The output for this task is a classification over emotion classes. For this research four emotion classes are used: neutral, happy, sad and angry. The output of the model is therefore a probability distribution over the four classes. The class with the highest probability is considered the predicted output class.

3.2 DeepEmoNet

Figure 3-1: A schematic illustration of an example DEN architecture, with Batch Normalization (BN) applied throughout: input (dim=1, samples=2^16) → causal convolutional layer (filters=8, segments=2^16) → CCPB (filters=16, segments=2^14) → CCPB (filters=32, segments=2^12) → CCPB (filters=64, segments=2^10) → CCPB (filters=128, segments=256) → CCPB (filters=256, segments=64) → fully connected layer (units=256) → fully connected layer (units=128) → fully connected softmax layer (units=4), producing the importance value and the neutral, happy, sad and angry outputs.

3.2.1 Network architecture

Figure 3-1 depicts a schematic overview of the DEN network. The DEN network architecture is designed to deal with the raw audio waveform input. To be able to handle the raw audio input, the model takes a convolutional approach rather than an RNN-based approach. Most layers in the architecture are built up of Causal Convolutional Pooling Blocks (CCPB), as explained in more detail in section 3.2.2. The blocks consist of a causal convolutional layer and a pooling layer. The causal convolutions only allow the layer to use previous values of the input; it can never use any future values of the input. These causal convolutions allow the model to be used in real time. The convolutional layer extracts local features from the audio, while the pooling layer reduces the temporal dimension of the input data. The input is just a one-dimensional stream of floats; nonetheless, the temporal dimension of the input grows large quickly. At a sample rate of 16 kHz, this dimension grows by 16000 floats per second. By stacking a number of these CCPBs, the temporal dimension can be reduced from raw audio samples (individual floats) to feature vectors that each represent a couple of milliseconds of audio. These feature vectors representing short parts of audio are referred to as segments. The number and length of the segments depend on the input length, the pooling size and the number of convolutional pooling layers. Equation 3.1 shows how to compute the number of segments given input size |x|, pooling size p (with equal stride) and the number of layers |l|. The number of segments is rounded down to the nearest integer.

$\#\mathrm{segments} = \left\lfloor \frac{|x|}{p^{|l|}} \right\rfloor \qquad (3.1)$

Choosing the number and lengths of the segments for the emotion classification task has multiple aspects. Emotional utterances can be identified from samples of 250 ms or longer [33, 36, 22]. The length of these utterances varies heavily depending on the speaker and the emotion uttered. Therefore, choosing segments with a fixed length of 250 ms is not preferable. To cope with these varying lengths, the model uses a long input and smaller segments, where every segment gets assigned an importance factor. Here the input is significantly longer than a minimal emotional utterance and the segments are significantly smaller than a minimal emotional utterance. This allows the model to focus attention on a subset of the most important segments in the input data; moreover, segments with noise or silence can be given a lower importance. Section 3.2.4 describes this mechanism in more detail.

The input of the model is set to 2^16 samples, which equals 4.096 seconds at a sample rate of 16 kHz. This length is significantly longer than the minimal emotional utterance length of 250 ms and it resembles the input length of 4 seconds in [7]. This input is brought down to 64 segments, where each segment covers 64 ms of the input audio, significantly smaller than 250 ms. Using a pooling size of 4, equation 3.1 gives

$64 = \frac{2^{16}}{4^{|l|}}$

hence 5 layers of CCPBs are used.
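To make this arithmetic concrete, equation 3.1 can be evaluated with the small helper below; the helper is purely illustrative and not part of the thesis implementation.

```python
import math

def num_segments(input_length: int, pool_size: int, num_layers: int) -> int:
    """Equation 3.1: floor(|x| / p^|l|)."""
    return math.floor(input_length / pool_size ** num_layers)

# 2**16 input samples, pooling size 4 and 5 CCPB layers give 64 segments.
print(num_segments(2 ** 16, 4, 5))  # 64
```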

To prevent the pooling from costing the model a large number of the parameters that give it the expressive power to approximate complex functions [30], the number of filters in the convolutional layers is increased with every pooling step. This is illustrated in the annotation of figure 3-1. The one-dimensional input is fed into the first layer, which has 8 filters and no pooling. From there the number of filters is increased as the number of segments decreases. The input is of size 2^16 with dimension 1, while the last CCPB has 64 segments with 256 filters, equal to 2^14 values. This amounts to a reduction by a factor of 4. If the number of filters were kept equal to 8, this reduction would be by a factor of 128.

In order to give each convolution a bigger area of audio to grasp without having an equally big increase in computational effort, dilations are used in the convolutional layer of the CCPBs. A dilated convolution is a convolution that captures a larger input area by skipping a number of input values at a certain rate. This allows the convolutional layer to establish more long-term relationships in the data. These dilations are further explained in section 3.2.3.

Following the CCPBs, a series of fully connected layers is applied to the output for every segment. This leads into the final softmax layer that outputs a probability per emotion class. Next to the softmax layer, the fully connected layers also lead into a layer with a single output that indicates the importance of a segment. This importance value is used in the attention mechanism described in section 3.2.4.

Although an input size of 2^16 samples is chosen, this is not necessarily a constraint for the model. Since all segments are evaluated using convolutional layers and all the fully connected layers are only applied to the result of individual segments, the input length can be chosen arbitrarily. However, since the model uses dilations in the CCPBs, every individual segment benefits from having a sense of the previous segments; therefore choosing an input size larger than one segment benefits the prediction.

3.2.2 Causal Pooling Blocks

The DEN model uses the raw audio waveform. First, a layer of one-dimensional causal convolutions over the temporal dimension of the raw audio waveform is applied. Using causal convolutions, the model can only use previous values of the input. The first advantage of this is that the model can be used for evaluation in real time, since no future values are needed. Secondly, when audio is interpreted in real life, it is not possible to interpret future values before the current ones; if this order is reversed, e.g. with audio played backwards, the perception is completely different. The causal convolutions therefore keep the interpretation of audio correct [42].

After each convolutional layer, a mean-pooling layer is applied. This layer reduces the size of the temporal dimension by a factor equal to the pooling size. The combination of these layers can be seen as a Causal Convolutional Pooling Block (CCPB). Figure 3-2a shows an illustration of the combination of the two layers in a CCPB.
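As an illustration, a single CCPB could be expressed with standard Keras layers as in the sketch below. The kernel size, activation and placement of batch normalization are assumptions made for the example, not the author's released code.

```python
# Hypothetical sketch of one Causal Convolutional Pooling Block (CCPB):
# a causal 1-D convolution (no access to future samples) followed by mean
# pooling over time. tf.keras is assumed.
import tensorflow as tf
from tensorflow.keras import layers

def ccpb(x, filters, kernel_size=2, pool_size=4, dilation_rate=1):
    x = layers.Conv1D(
        filters,
        kernel_size,
        padding="causal",              # only past input values are used
        dilation_rate=dilation_rate,   # see section 3.2.3
        activation="relu",
    )(x)
    x = layers.AveragePooling1D(pool_size)(x)  # mean pooling shrinks time by 4
    return layers.BatchNormalization()(x)

# Example: a raw waveform of 2**16 samples with a single channel.
inputs = tf.keras.Input(shape=(2 ** 16, 1))
hidden = ccpb(inputs, filters=16, dilation_rate=2)
print(hidden.shape)  # (None, 16384, 16)
```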

3.2.3 Dilations

As mentioned in section 2.3.1, an emotional utterance can be identified from samples with a length of 250 ms or longer [33, 36, 22]; in terms of raw audio this translates to 4000 samples at a 16 kHz sampling rate. Hence, from the start of an utterance to its end there is a high number of samples.

Figure 3-2: Two illustrations of a Causal Convolutional Pooling Block b0 with a pooling size of 4 and a filter width of 2 in the convolutional layer: (a) without dilations; (b) with a dilation rate of 2.

This means that long-term relationships in the input data need to be maintained. The CCPBs capture these relations by combining multiple layers of convolutions and pooling. However, since there is only little overlap between the blocks, they are only able to capture relations within their own input (illustrated in figure 3-2a). In order to really capture long-term relationships, the causal convolutions in the CCPBs should have a sense of the previous blocks, a sense of history. This is achieved by using dilated convolutions in the blocks. These convolutions capture a larger input area by skipping a number of input values at a fixed rate. [42] shows that this technique is helpful when working with raw audio waveforms. It allows the network to grasp a larger input area without an equally large increase in computational effort, letting the convolution layer reach input values from previous blocks and establish longer relationships. Section 5.1 shows that the dilated convolutions boost the performance of the DEN model. Figure 3-2b shows CCPBs with a dilation rate of 2 in the convolutional layer. The figure illustrates how a single CCPB reaches into the input of the previous one.

3.2.4 Attention Model

Stacking these CCPBs results in a reduced temporal dimension. By reducing it to a single output and adding a fully connected layer that produces the output vector p, the model is able to classify an audio sample with this resulting prediction vector p.

Rather than reducing the temporal dimension to just one output with prediction vector p, the DEN model reduces the temporal dimension to 64 chronological outputs, each with its own prediction vector $p_n$, where $n \in \{1, \dots, 64\}$. In the simplest form, the prediction vectors per segment are summed and normalized to probabilities for classifying the emotion in the given audio sample. This results in a total prediction vector P as shown in equation 3.2.

$P = \mathrm{softmax}\Big(\sum_{n} p_n\Big) \qquad (3.2)$

However, [33] shows that not every part of the audio sample is equally important for the classification of emotion. Therefore an attention mechanism is added to the prediction vectors. Using an extra fully connected layer on the outputs, every segment receives a single attention value $a_n$, also called the importance value. By applying the softmax activation function over these values in the temporal dimension, a probability per segment is obtained. These probabilities per segment are used to take a weighted sum over all the prediction vectors, producing a weighted prediction vector $P_a$ for the full audio sample, as shown in equation 3.3.

$P_a = \mathrm{softmax}\Big(\sum_{n} a_n \cdot p_n\Big) \qquad (3.3)$

Section 5.1 shows that using the attention mechanism in the DEN model results in a higher accuracy. Section 5.5 presents a visual illustration of the mechanism in action during the conducted experiments.
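A small numerical sketch of this aggregation is given below, assuming 64 segments and 4 classes and using random values in place of real per-segment network outputs.

```python
# Numpy sketch of equations 3.2 and 3.3: per-segment prediction vectors p_n
# are combined either by a plain sum (eq. 3.2) or weighted by softmaxed
# attention values a_n (eq. 3.3).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

segments, classes = 64, 4
p = np.random.rand(segments, classes)   # per-segment prediction vectors p_n
a = np.random.rand(segments)            # per-segment importance values a_n

P_plain = softmax(p.sum(axis=0))                       # equation 3.2
weights = softmax(a)                                   # attention over segments
P_attn = softmax((weights[:, None] * p).sum(axis=0))   # equation 3.3

print(P_plain, P_attn)   # both are probability distributions over 4 emotions
```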

3.2.5 Batch Normalization

Figure 3-3 shows the distribution of all raw audio waveforms in the training set. Figure 3-4 shows four individual audio samples, with the raw audio waveforms on the left and the distributions of those waveforms on the right. Although the raw audio input ranges between -1 and 1, most of the values are situated close to 0. Even in the first example, where there is a lot of variation in amplitude, most samples lie in the range (-0.5, 0.5). This distribution could be improved by hand before fitting a model, but in the spirit of creating a complete end-to-end solution, the input is left as is. Instead, batch normalization [21] is applied to the input data and all consecutive layers. Batch normalization reduces the internal covariate shift in the layers by subtracting the mean and dividing by the standard deviation of a batch of training examples. Simultaneously, a trainable mean parameter β and standard deviation parameter γ are introduced, so that these learned parameters can be used during evaluation when the statistics of a full batch are not available. Using the trainable mean and standard deviation parameters of the batch normalization, the model learns to normalize the distribution of the input data and the hidden units of the consecutive layers. Algorithm 1, taken from [21], shows how batch normalization is applied to a layer with input values x_1...m. It shows how the batch mean and variance are used to normalize the values and how these relate to the trainable parameters γ and β.

Algorithm 1 Batch Normalizing Transform, applied to input x over a mini-batch

Input: values of x over a mini-batch B = {x_1...m}; parameters to be learned: γ, β
Output: {y_i = BN_{γ,β}(x_i)}, i = 1...m

$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$ (mini-batch mean)

$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$ (mini-batch variance)

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ (normalize)

$y_i = \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$ (scale and shift)
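Algorithm 1 translates directly into a few lines of numpy; the sketch below is illustrative, with an arbitrary epsilon and batch shape.

```python
# Numpy rendering of Algorithm 1 (batch normalizing transform) for one layer.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: mini-batch of values; gamma, beta: learned scale and shift."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

batch = np.random.randn(32, 128) * 0.1       # e.g. 32 examples, 128 units
y = batch_norm(batch, gamma=np.ones(128), beta=np.zeros(128))
print(y.mean(), y.std())                     # approximately 0 and 1
```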

Figure 3-3: Distribution of all raw audio waveforms in the training set. The upper image shows the plain distribution, whereas the lower image shows it in log space.

Figure 3-4: Four examples of a raw audio waveform and the distribution of that waveform. Note that the frequency axis (y-axis) of the distributions is in log space.

3.2.6 Loss function

To train the model, a loss function is required that can deal with the four-class emotion problem. The output is a probability distribution over the four classes; therefore the categorical cross-entropy loss is used. Equation 3.4 shows the loss function, where $y_c$ indicates the truth value for class $c$ and $p_c$ indicates the predicted probability for that class. Since only one class is correct, the value of $y_c$ is binary, either 0 or 1. When all values of p are correct (the true class receives probability 1), the value of the loss function is 0; if not, the value is higher than 0.

$\mathcal{L} = -\sum_{c} y_c \log(p_c) \qquad (3.4)$
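For illustration, the loss of equation 3.4 for a single prediction can be computed as follows (numpy sketch with made-up numbers).

```python
# Numpy sketch of the categorical cross-entropy loss in equation 3.4.
import numpy as np

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """y_true: one-hot target over the 4 classes; p_pred: predicted probabilities."""
    return -np.sum(y_true * np.log(p_pred + eps))

y_true = np.array([0.0, 1.0, 0.0, 0.0])   # true class (e.g. happy)
p_pred = np.array([0.1, 0.7, 0.1, 0.1])   # model output
print(categorical_cross_entropy(y_true, p_pred))  # ~0.357; 0 only for a perfect prediction
```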

3.3 Implementation

The DEN model is implemented using the Keras framework [8]. All the techniques used in the model have a standard implementation in this library, leading to a straightforward implementation.
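A condensed sketch of how such a model could be assembled in Keras is given below; the filter counts and pooling follow figure 3-1, but the activations and the exact attention wiring are assumptions rather than the author's actual code.

```python
# Hypothetical DEN-style model in tf.keras, following chapter 3.
import tensorflow as tf
from tensorflow.keras import layers

def build_den(input_len=2 ** 16, n_classes=4):
    x_in = tf.keras.Input(shape=(input_len, 1))
    x = layers.BatchNormalization()(x_in)
    # First causal convolutional layer, no pooling (8 filters).
    x = layers.Conv1D(8, 2, padding="causal", activation="relu")(x)
    # Five CCPBs with pooling size 4: 2**16 samples -> 64 segments.
    for filters in (16, 32, 64, 128, 256):
        x = layers.BatchNormalization()(x)
        x = layers.Conv1D(filters, 2, padding="causal",
                          dilation_rate=2, activation="relu")(x)
        x = layers.AveragePooling1D(4)(x)
    # Fully connected layers applied to each of the 64 segments.
    x = layers.BatchNormalization()(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(128, activation="relu")(x)
    p = layers.Dense(n_classes, activation="softmax")(x)  # per-segment p_n
    a = layers.Dense(1)(x)                                # per-segment importance a_n
    w = layers.Softmax(axis=1)(a)                         # attention over segments
    # Weighted sum over segments followed by a softmax (equation 3.3).
    out = layers.Lambda(
        lambda t: tf.nn.softmax(tf.reduce_sum(t[0] * t[1], axis=1))
    )([p, w])
    return tf.keras.Model(x_in, out)

model = build_den()
model.summary()
```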

Furthermore, to show the applicability in real time, a web browser version was implemented. This version was made using the TensorFlow.js library [4, 3], a JavaScript library that allows for WebGL-accelerated execution of both TensorFlow and Keras models. This implementation shows a real-time analysis of the microphone audio from the user. The tool is also capable of recording a sample of audio and showing an in-depth analysis. Furthermore, it was designed as a data collection tool, but insufficient data was collected to use this for training. Figures 3-5 and 3-6 show screenshots of the web implementation.

Figure 3-5: Web browser implementation of the DEN network, showing a real-time probability distribution over the emotion classes.


Figure 3-6: Web browser implementation of the DEN network, showing an in-depth analysis of a full audio sample.


Chapter 4

Experimental Setup

This chapter describes the setup of the experiments conducted in chapter 5. Section 4.1 gives insight into the data used for these experiments. Section 4.2 describes both the metrics used in the experiments and the way the training phase for these experiments is set up.

4.1 Data

For the experiments performed in this research two datasets are used: IEMOCAP [5] and MSP-IMPROV [6]. Sections 4.1.1 and 4.1.2 describe the two datasets, and section 4.1.3 gives some remarks on them. Section 4.1.4 gives more insight into the pre-processing of the data, and section 4.1.5 reflects on the way the data is used during the training of the model.

4.1.1 IEMOCAP

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an established dataset for emotion recognition tasks. It consists of recordings of ten speakers grouped in five sessions. In every session, a male and a female actor have multiple conversations, scripted or improvised. The recordings are segmented at the utterance level. Each utterance is annotated by at least three human annotators, and the emotion annotation with the highest number of votes wins. From this dataset only the neutral, happy, excitement, sad and angry categories are used. The happy and excitement classes are combined into one happy category, as in [5, 13]. This results in a dataset with the following number of utterances in each class: 1708 neutral (30%), 1636 happy (30%), 1084 sad (20%), 1103 angry (20%). The length of the utterances varies between 500 milliseconds and 30 seconds.

4.1.2 MSP-IMPROV

MSP-IMPROV is a dataset similar to IEMOCAP. It contains recordings of twelve speakers grouped in six sessions, in the same fashion as IEMOCAP. Again, every session features one male and one female actor. Not only scripted and improvised sessions were recorded; the natural conversations of the actors in between takes are used as well. The dataset is annotated in the same way as IEMOCAP: the recordings are segmented at the utterance level and scored by at least three human annotators, and the final annotation for an utterance is the most frequent annotation given by the human annotators. Since natural conversation is part of the set and these casual encounters are rarely sad or angry, the classes are not equally distributed. The dataset has the following number of utterances in each class: 3477 neutral (45%), 2644 happy (34%), 885 sad (11%) and 792 angry (10%). The length of these utterances varies between 400 milliseconds and 30 seconds.

4.1.3 Notes on the datasets

Both datasets are recorded in a similar fashion. In each session, two actors, one male and one female, have either a scripted or a non-scripted conversation. These conversations are recorded in a studio setting with almost zero noise. As there are no more than six sessions per dataset, the number of different voices is limited. All the actors are native speakers of American English.

Although these datasets are a good match for the problem at hand, it must be noted that the recordings do not reflect a real-life situation. In such a situation the participants are unlikely to all be native American-English speakers recorded in a studio-like setting with no background noise.

Another important note concerns the way in which the datasets are annotated. These datasets do not only consist of recorded audio but also feature video recordings and motion data. The annotators did not score the utterances on the audio alone; they were also shown the video recording. These video recordings may include more clues to what emotion the actor is expressing. This research uses solely the audio data, hence it cannot access all the information that was used by the annotators to judge the samples.

4.1.4 Data Pre-processing

As mentioned in chapter 3, the aim of this research is to provide a model that requires as few pre-processing steps as possible. If the audio is provided in stereo, it is converted to a mono format; furthermore, if the audio is not already sampled at a 16 kHz rate, it is resampled.

4.1.5 Data for Experiments

Both datasets are divided into a training, validation, and test set with 70%, 5% and 25% of the data, respectively. The test set is used solely after training is finished and never during training. The validation set is used to stop a training session early before overfitting on the training set.

4.2 Setup

This section describes the setup of the experiments conducted in chapter 5. Section 4.2.1 describes the metrics used in the subsequent experiments. Section 4.2.2 describes how the training is set up for the experiments.

4.2.1 Metrics

The experiments described in chapter 5 all use two forms of accuracy metrics: weighted and unweighted [33, 7]. Accuracy is simply measured as the number of correctly classified utterances divided by the total number of utterances. Weighted accuracy (WA) is the mean accuracy over all samples in the test set. Unweighted accuracy (UA) is the mean of the accuracies per emotion class. Equations 4.1 and 4.2 show the mathematical formulas for this. The naming of these metrics is unconventional; unweighted accuracy in particular is more often referred to as per-class accuracy. To keep consistency with the compared papers, the WA and UA terms are used.

N — the number of categories
C = {c_1, ..., c_N} — the set of categories
t_n — the number of correctly classified samples in category n
|c_n| — the total number of samples in category n

$\mathrm{WA} = \frac{\sum_{n=1}^{N} t_n}{\sum_{n=1}^{N} |c_n|} \qquad (4.1)$

$\mathrm{UA} = \frac{1}{N} \sum_{n=1}^{N} \frac{t_n}{|c_n|} \qquad (4.2)$

Important to note here is that the weighted accuracy is highly dependent on the distribution of classes in the dataset. For instance, consider a multi-class classification problem where one class takes up 60% of the used dataset. If the model just predicts that particular class, its weighted accuracy would be 60%. If the same model is measured using unweighted accuracy, the result is the sum of the per-class accuracies divided by the number of classes: a single class would have 100% class accuracy, while all the other classes would have a class accuracy of 0%. Assuming a four-class problem, this would result in only 25% unweighted accuracy.
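The two metrics can be written down in a few lines; the toy example below reproduces the 60% versus 25% illustration above.

```python
# Sketch of the two metrics: weighted accuracy (overall fraction correct) and
# unweighted accuracy (mean of the per-class accuracies), cf. eqs. 4.1 and 4.2.
import numpy as np

def weighted_accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

def unweighted_accuracy(y_true, y_pred):
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(per_class)

# Toy example: a majority-class predictor on an imbalanced label set.
y_true = np.array([0] * 6 + [1] * 2 + [2] * 1 + [3] * 1)
y_pred = np.zeros_like(y_true)
print(weighted_accuracy(y_true, y_pred))    # 0.6
print(unweighted_accuracy(y_true, y_pred))  # 0.25
```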

4.2.2 Training

To train the model, the categorical cross-entropy loss function is used, as mentioned in section 3.2.6. The optimization is done using the Adam optimizer [23] with an initial learning rate of 0.001. Every 14 epochs the learning rate is lowered by a factor of 10. The training data is fed into the network in batches of size 32. After each epoch of training data, the model is evaluated using the full validation set. The model with the lowest loss value on the validation set is saved and used for the final evaluation on the test set. The model is trained for a total of 55 epochs.

The training samples are 4.096 seconds of audio with a single emotion label. Longer sequences are split up into parts of 4.096 seconds. If a sample is shorter than 4.096 seconds, or a shorter part is left over after the split, the samples are zero-padded to the same size as the full-length samples. During training, the zero-padded parts of these samples are ignored in the loss function by introducing a sequence-length variable that nulls out the loss over these artificial padded values.
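A hypothetical Keras training configuration matching this description is sketched below; `model`, `x_train`, `y_train`, `x_val` and `y_val` are placeholders, and the padding mask described above is omitted for brevity.

```python
# Hypothetical tf.keras training setup: Adam with an initial learning rate of
# 0.001 lowered by a factor of 10 every 14 epochs, batch size 32, 55 epochs,
# keeping the weights with the lowest validation loss.
import tensorflow as tf

def lr_schedule(epoch):
    return 0.001 * (0.1 ** (epoch // 14))

callbacks = [
    tf.keras.callbacks.LearningRateScheduler(lr_schedule),
    tf.keras.callbacks.ModelCheckpoint(
        "den_best.h5", monitor="val_loss", save_best_only=True
    ),
]

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(
    x_train, y_train,                  # padded 4.096 s segments, one-hot labels
    validation_data=(x_val, y_val),
    batch_size=32,
    epochs=55,
    callbacks=callbacks,
)
```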

4.2.3 Inference

Evaluation of the model is done using the output probabilities of the network. Samples of 4.096 seconds are fed into the model, and the label with the highest probability in the output layer is considered the predicted class. For evaluation, these samples are split in the same way as they are for training purposes.

When using this model for evaluation on a stream of audio, a moving window is used. At a specific interval, the last frame of 4.096 seconds can be evaluated. This allows for real-time evaluation, for instance to visualize the probability distribution over the four emotion classes while a conversation is recorded, as shown in the web implementation described in section 3.3.
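The moving-window evaluation can be sketched as follows; `model` is a placeholder for a trained DEN-style network and the audio stream is synthetic.

```python
# Sketch of moving-window evaluation on a stream of audio: every `hop` samples
# the most recent 4.096 s (2**16 samples at 16 kHz) are classified.
import numpy as np

WINDOW = 2 ** 16                      # 4.096 seconds at 16 kHz
hop = int(0.042 * 16000)              # evaluate roughly every 42 ms

stream = np.random.randn(16000 * 30).astype("float32")  # stand-in for 30 s of audio

for end in range(WINDOW, len(stream), hop):
    window = stream[end - WINDOW:end]                    # last 4.096 s
    probs = model.predict(window[None, :, None], verbose=0)[0]
    print(f"t={end / 16000.0:.2f}s", probs)              # 4-class distribution
```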


Chapter 5

Results & Discussion

In this chapter, the experiments and results of the DEN model are presented. Section 5.1 describes the experiments that compare the different technical aspects of the proposed model. Section 5.2 describes the experiments that compare different parameter settings for the proposed model. Section 5.3 describes the experiments that compare the proposed model with state-of-the-art approaches. Section 5.4 describes the experiments comparing the real-time performance of the model. Section 5.5 presents some visual representations of the attention mechanism obtained from the model.

5.1 Ablation Study

The following experiments are conducted to justify the choices of techniques used in the DEN model. The three main pillars of the architecture are the dilations in the convolutional layers, the attention mechanism, and batch normalization. All three have an intuitive justification for being used for the problem at hand, as described in section 3.2. The experiments compare the performance of the full model to the performance of the architecture when one of these techniques is left out.

Section 5.5 explains figures 5-3 and 5-4, which depict a more in-depth analysis of how the attention helps the classification of a couple of audio samples. Since the application of the attention mechanism is easy to visualize, it can give a simple and intuitive view of how the model uses it.

Dilations   Attention   Batch Normalization   WA on IEMOCAP   UA on IEMOCAP
  yes          yes              no                44.9%           39.8%
  no           no               yes               57.0%           57.1%
  yes          no               yes               59.1%           62.0%
  no           yes              yes               60.1%           60.2%
  yes          yes              yes               62.1%           62.1%

Table 5.1: Results of the DEN model broken down per technique used in the model.

Table 5.1 shows the results of the DEN model with different techniques applied. Of the three highlighted techniques, batch normalization has the most significant impact on the performance of the model, as can be seen in the first two rows of the table. Only applying batch normalization yields roughly a 57% accuracy on both metrics without even using dilations or the attention mechanism. The other two techniques have a much smaller, but still significant, impact on the performance. Both dilations and attention applied individually on top of batch normalization yield a significant jump in both the weighted and the unweighted accuracy. Combined, the three techniques achieve the highest accuracy, 62.1% on both metrics.

5.2 Parameter tuning

This section shows the results of the parameter tuning of the DEN model. Three parameters with a big impact on the techniques used in the model are compared: the number of segments, the dilation rate of the convolutional layers, and the size of the last two fully connected layers. Table 5.2 shows the results of this experiment. The performance is best for the model with 64 segments and a dilation rate of 2. For the fully connected layers, there are two models that each have superior performance on one of the two metrics used. Since their performance is comparable, the model with the fewest parameters is chosen for the other comparisons. This is the model with 64 segments, a dilation rate of 2, and fully connected layers with 256 and 128 units, achieving a 62.1% accuracy on both the weighted and unweighted metric.

Segments   Dilation rate   Units FC layers   WA on IEMOCAP   UA on IEMOCAP
   16            2             256 - 128          59.8%           60.0%
   32            2             256 - 128          59.6%           59.5%
   64            2             256 - 128          62.1%           62.1%
  128            2             256 - 128          60.9%           61.2%
   64            1             256 - 128          60.1%           60.2%
   64            4             256 - 128          59.8%           60.3%
   64            8             256 - 128          58.8%           59.0%
   64            2             512 - 256          62.0%           62.9%
   64            2             512 - 512          61.0%           61.8%

Table 5.2: Results of the DEN model with different parameters.

5.3 Comparison with State-of-the-Art Approaches

The next experiments compare the proposed DEN model to the state-of-the-art approaches. These works all use RNN-based approaches to classify the audio. The results by Mu et al. [33], Chernykh et al. [7] and Latif et al. [26] are compared to the DEN model. These approaches all report their results on the IEMOCAP dataset. To provide a baseline for the MSP-IMPROV dataset, the Chernykh et al. model is reproduced and run on that dataset. These approaches report the highest performance in terms of accuracy on the complete test set. As Chernykh et al. already mention, a higher score was reported by Lee and Tashev, but they only use a subset of the IEMOCAP dataset, so no fair comparison can be drawn from their work.

Table 5.3 shows the results on the IEMOCAP and MSP-IMPROV datasets for the different approaches. For Chernykh et al. [7] both the original and the reproduced version are reported. The human performance on the IEMOCAP dataset from [7] is stated as well. The human performance is far from perfect, which can be explained by the fact that the classification of emotion is a very subjective task.

Only the approach by Mu et al. reports a better weighted accuracy. However, for that approach the unweighted accuracy of 56.4% is significantly lower than that of the DEN approach.

Comparing the results to the human performance shows that, while the percentage of correctly classified samples is far from perfect, the DEN model does approach human performance.

Paper             WA on IEMOCAP   UA on IEMOCAP   WA on MSP-IMPROV   UA on MSP-IMPROV
Chernykh et al.       54.0%           54.0%              -                  -
Reproduced [7]        54.0%           54.0%            52.1%              33.9%
Mu et al.             64.1%           56.4%              -                  -
Latif et al.          58.08%            -                -                  -
DEN Model             62.1%           62.1%            60.5%              47.5%
Random                25.0%           25.0%            25.0%              25.0%
Human [7]             69.0%           70.0%              -                  -

Table 5.3: Results of DEN versus related work, random and human performance.

Figure 5-1: Confusion matrices for (a) the reproduced Chernykh et al. model and (b) the DeepEmoNet model.

Figure 5-1 shows the confusion matrices for both the reproduced Chernykh et al. model and the DEN model. This figure illustrates how the DEN model has an above-average performance for all classes, whereas the Chernykh et al. model mostly struggles to correctly classify happy samples. Furthermore, comparing the two matrices, the DEN model shows a lower misclassification rate over all classes.

In chapter 2, the confusion matrix for human performance on the IEMOCAP dataset was shown. Figure 5-2 shows this confusion matrix next to the confusion matrix for the DEN results. Even though the human matrix shows an overall better performance, some interesting parallels can be identified. Both show difficulty in classifying the neutral class; for both, this class is frequently misclassified as sad. Furthermore, both figures show high correct classification rates for sad and angry and low confusion between these particular classes. The parallel between the human and machine classifications suggests a high correlation in the acoustic interpretation of the classified audio samples.

Figure 5-2: Confusion matrices for (a) the DeepEmoNet model and (b) the human performance on the IEMOCAP dataset (from [7]).

5.4 Real-time performance of DEN

The following experiments show the real-time applicability of the DEN model. The techniques used in the DEN model allow it to be executed in real time; however, to actually use it in real time the computational performance should allow for this. In this experiment, both the reproduced model by Chernykh et al. and the DEN model are tested on a MacBook Pro with a 2.8 GHz Intel Core i7 processor and 16 gigabytes of RAM. The models are tested from the moment the raw audio waveforms are loaded in memory, meaning that no I/O operations could interfere with the performance. Table 5.4 shows the average time needed to classify a single audio sample of 4 seconds and 4.096 seconds for the Chernykh et al. and DEN model, respectively. The feature extraction time per sample and the execution time of the model per sample are stated separately.

The results show that both approaches are capable of processing the audio samples in real time. The time it takes to process a sample is for both approaches far less than the length of the samples used. When considering the DEN model in a conversational context, the last four seconds could be evaluated every 42 milliseconds, more than 20 times per second. It is safe to say that such an evaluation speed would feel natural and real-time. When comparing the two approaches, it is clear that the execution time of just the model is faster for the approach by Chernykh et al. However, that model has a pre-processing step that extracts features from the raw audio waveform. When including this pre-processing step, the DEN model has superior computational performance.

                  Feature extraction (s/sample)   Model (s/sample)   Total (s/sample)
Reproduced [7]              0.254                       0.022              0.276
DEN model                     -                         0.042              0.042

Table 5.4: Performance of the DEN model and the reproduced [7] on a MacBook Pro with a 2.8 GHz Intel Core i7 processor and 16 gigabytes of RAM. Samples are 4 seconds and 4.096 seconds for the respective approaches. Results are shown in average seconds per sample.

5.5 Attention Mechanism Visualization

In figures 5-3 and 5-4 a visual representation of the attention mechanism is depicted. The figures each illustrate a different situation the mechanism acts on.

Figure 5-3a shows how the mechanism correctly assigns importance to several happy parts in the sample. The most highly scored part is the excited "Oh my God!" part of the sample, indicating that the mechanism correctly focuses on the part of the sample that sounds most happy.

Figure 5-3b shows how the attention mechanism focuses on one particular segment in order to correctly classify the sample. The mechanism correctly focuses the attention on the agitated "much" part of the sample. Furthermore, this sample shows that even though the words in the sample are not angry, the tone of the voice is correctly classified as angry.

Figure 5-4 shows that on a very quiet sample the attention mechanism correctly assigns high importance to the first utterance of the sad expression. This illustrates that the attention mechanism does not simply focus on volume, but finds a deeper understanding.

Figure 5-3: Analyses of single audio samples with 64 segments. The color of the upper bar indicates the predicted emotion, the color of the lower bar indicates the truth value. The color of every individual segment indicates the emotion classification for that part. The black bars on the bottom of every segment indicate the importance value per segment.


Chapter 6

Conclusion

This research focuses on the task of Speech Emotion Recognition (SER). The aim was to develop a machine learning model capable of classifying emotion in real time with as few pre-processing steps on the raw audio data as possible. To tackle this, the DeepEmoNet (DEN) architecture is presented in this research. This architecture is evaluated and compared to other state-of-the-art approaches for SER on two frequently used datasets for this problem.

The DEN architecture achieves a 62.1% weighted accuracy on the IEMOCAP dataset. Keeping in mind that the human performance for the same task is roughly 70%, this answers the first research question stated in this research. Is it possible to train a machine learning model able to classify emotion in speech from raw audio? Yes, the DEN architecture is able to classify emotion in speech using raw audio data as input.

Not only can the first research question be answered positively; the model also shows comparable or superior performance compared to state-of-the-art approaches. Two metrics are used to measure performance: the model outperforms all compared approaches on unweighted accuracy, and outperforms all but one approach on weighted accuracy. It can be argued that in the experiments performed the unweighted accuracy gives a fairer comparison, since it is not dependent on the distribution of classes in the datasets.

Even though the DEN model takes a fundamentally different approach to the task, it manages to outperform most other approaches. The DEN architecture uses the raw audio input, a one-dimensional waveform consisting of the amplitudes forming the audio. All other approaches compared in this research make use of explicit feature extraction before analyzing the data. The DEN architecture has no such feature extraction; using a combination of dilated causal convolutional layers, pooling in the temporal dimension, and batch normalization, the model is able to handle the raw audio input. The lack of explicit feature extraction results in a simpler and more straightforward evaluation pipeline.

The second aim of this research was to develop a model that is able to classify emotion in real time. It is hard to define real-timeness, since this depends highly on the hardware specification of the system that evaluates the model. However, this research shows that 4 seconds of audio can be evaluated in 42 milliseconds on a laptop without GPU acceleration. When considering this in a conversational context, the last four seconds could be evaluated every 42 milliseconds (more than 20 times per second), which answers the second research question. Is it possible to achieve real-time performance in emotion classification in speech with a machine learning model? Yes, the DEN architecture allows for real-time evaluation of emotion in speech.

As with the accuracy, the computational performance was compared to state-of-the-art approaches. Since these approaches do not report on this performance, only the one approach reproduced in this research is compared. When solely comparing the machine learning models of both approaches, the computational time of the DEN architecture is higher. However, the DEN architecture is faster by a factor of 5 when the pre-processing steps are taken into account. The lack of pre-processing steps was one of the specific aims of this research and it contributes to the goal of developing a real-time system.

6.1 Future Work

This research presents the DeepEmoNet architecture, which achieves state-of-the-art performance on the task of Speech Emotion Recognition (SER). There are several notes to make on this research that could be of interest for future work.

In chapter 4 several notes were made on the datasets used in this research. The data is recorded in a studio-like setting with a limited number of speakers, all speaking American English. Since the model learns to model this data, it is hard to achieve high accuracy in real-world settings where the variety of speakers is much larger. It would therefore be interesting to create a more realistic dataset with imperfect recordings and a mix of native and non-native English speakers, reflecting a real-world setting and style of conversation.

The same chapter also discusses the annotation procedure. The data was annotated by multiple annotators, where the class with the highest number of votes is selected. Furthermore, the annotators did not judge the samples based on the audio recordings alone; the datasets also feature video recordings. This gives the annotators more information about the expressed emotion than the model has. A dataset for a subjective task such as SER will always involve compromises in the annotation procedure. A good improvement would be a dataset annotated solely on audio by a large panel of annotators, in order to obtain the truest possible label for each sample.
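A minimal sketch of this majority-vote label selection is shown below; how ties are resolved in the actual datasets may differ from the assumption made here (returning no consensus label).

```python
from collections import Counter

def majority_label(votes):
    """Return the label with the most annotator votes, or None when the top
    labels tie (such samples are often discarded or re-annotated)."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

print(majority_label(["angry", "angry", "neutral"]))  # angry
print(majority_label(["happy", "sad"]))               # None (no consensus)
```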

This research uses the DEN architecture for SER; however, the architecture is not bound to this task. In future work, the same architecture could be applied to other audio tasks, or even to other signal-processing tasks. Chapter 1 already mentions the related problem of detecting sarcasm in speech, and one could also think of other audio-related tasks such as music genre classification or noise detection. This research does not address how the network scales to problems that require analysis of longer audio samples; this could be explored in the future.


