
Towards Laughing Machines:

Comparing Methods for

Generating Laughter Sounds

Madli Uutma 11400617

Master’s thesis Credits: 18 EC

Master’s in Information Studies: Data Science track

Internal Supervisor: Thomas Mensink, University of Amsterdam
External Supervisor: Zoltan Szlavik, IBM
Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam
2017-08-21


Towards Laughing Machines: Comparing Methods for

Generating Laughter Sounds

Madli Uutma

University of Amsterdam

madli.uutma@gmail.com

ABSTRACT

This paper compares two methods for laughter generation: training a model discriminatively and training it with a GAN. A feed-forward neural network and a recurrent neural network are trained discriminatively to predict the next 0.25 seconds of laughter based on previous sequences. The best performing model is then trained with a GAN to complete the same task. After the models have been trained, a longer sample of audio is generated using a seed sequence. The value of this work lies in proposing a method to compare discriminative training with GAN training. We use the AVLaughterCycle dataset to train both networks and the IBM laughter dataset to validate the results. We make initial steps in optimizing both models for laughter generation, generate laughter-like samples, and suggest future steps for generating laughter effectively using neural networks.

1. INTRODUCTION

The goal of the research presented in this paper is to explore and compare two methods for audio sequence prediction. Specifically, it investigates creating laughter sounds by training an RNN model in two ways: discriminatively and with a generative adversarial network (GAN) to predict laughter sequences. After the model has been trained to predict a sequence based on previous sequences, it is possible to generate a long burst of laughter using only one short seed sequence.

Laughter is an essential part of communication. By investigating how laughter can be generated by machines, we may enhance communication between machines and people. Effective communication between machines and humans has several benefits. For example, the likelihood of adopting new technologies depends on the user's attitude towards the technology [1], and this attitude can be improved with effective communication. Effective human-machine communication is actively studied: for instance, steps are being taken to help machines understand humans better [17], make robots sound more natural [15], and make them look more friendly [19]. However, to the best of our knowledge no papers have been published about synthesizing laughter. The communicative purpose of laughter is illustrated by several experiments in social psychology. Experiments show that people are several times more likely to laugh if they are around others [39]. This indicates that the purpose of laughter lies in communicating one's emotions. Laughter has been shown to promote group cohesion [7] and, if it is not overused, to encourage the listener to express happiness [28]. Producing original laughter might make the machine more personable and thus enhance human-machine communication.

Feed-forward neural networks might be a viable solution for predicting sequences. These networks take an input vector x and, through intermediate computations, output a vector y. They are often used for prediction tasks, for example predicting real estate prices [27], the risk of heart disease [34], and call volumes in a call center [14]. The problem of laughter generation can also be formulated as a prediction task: if the network learns to accurately predict laughter sequences, it is able to generate a long sequence of laughter by using a single input and continuously feeding the outputs of the network back in as inputs. Feed-forward neural networks thus provide a baseline for predicting laughter sequences, but as they do not take the previous samples into account, they are unable to learn complicated patterns in the data.

Recurrent neural networks perform better at predicting sequences than FFNNs. RNNs [31] are a family of neural networks used for processing sequential data, such as audio sequences [10]. Like feed-forward networks, RNNs accept an input vector x and produce an output vector y. However, the outputs of an RNN are influenced not only by one input, but by a history of outputs the network has generated in the past, as shown in Figure 1. RNNs have previously been used for text [12, 16, 36] and speech generation [45, 41].

RNNs and FFNNs can be trained discriminatively or with a GAN. When a model is trained discriminatively, the model's goal is to minimize the error between the predicted outcome and the ground truth. The goal of the GAN, in contrast, is to generate samples by taking a training set and learning to represent an estimate of its distribution [11, 9]. The process of training a GAN is illustrated in Figure 2. The network consists of two sub-networks: a generator and a discriminator. The generator transforms meaningless input, usually uniform noise, into samples of laughter and passes them to the discriminator. The discriminator receives both the generated samples of laughter and real samples of laughter, and is trained to discriminate between the two. The goal of the generator is to "fool" the discriminator and maximize the discriminator's loss function by producing realistic samples. The goal of the discriminator is to minimize the same loss function by examining the laugh and accurately determining whether it was generated or real. As the models train against each other, both networks' performance should increase and eventually the generated samples become virtually indistinguishable from real samples [9].

Figure 1. Generating sequences with an RNN compared with an FFNN. The blue arrows in the figure illustrate how the inputs (on the left side) are transformed to outputs (on the right side) with an FFNN. The RNN uses the same transformation, but in addition the outputs of the hidden layers are passed as input to the hidden layer at the next timestep and are used in the prediction of the next sequence. The added component of the RNN model is illustrated with black arrows. In the generation phase, a seed sequence (in this case the word "Generating") is inserted into the trained network. The network predicts the output for the seed sequence (in this case "audio") and inserts this as the input to the RNN. This continues until the end point of the sequence, which is specified by the user.

The problem of comparing two different training methods is interesting because GANs take a relatively new approach to training a network, and not much is known about how it compares to other training methods. GANs were introduced in 2014 and have mainly been used to generate realistic images, such as pictures of faces [6], cartoon characters [20] and animals [42]. Few papers have been published about using a GAN to generate audio, but previous research has encouraged exploring this possibility [30]. GANs have previously been used together with an RNN [43, 23, 8], but to the best of our knowledge, previous research has not compared these two networks on an audio generation task.

The objectives of discriminative training and training with a GAN can be made similar using conditional GANs (cGANs) [22]. In order to compare the RNN with the GAN, both networks need to have the goal of predicting a sequence based on a given sequence. cGANs have successfully been used to generate images of faces [6], digits [22] and rooms [38]. Thus, this type of network might also be a good tool for synthesizing laughter.

This thesis will explore and compare the two aforementioned methods - discriminative training and training with a GAN - for laughter generation. Two models will be trained discriminatively: an FFNN and an RNN. The better of these two will also be trained with a GAN. In an attempt to optimize both training methods, the following research questions were set. The subquestions are motivated in further detail in the Experiments section.

1. How does the output of the feed-forward neural network depend on the number of neurons in the hidden layers and the standardization method of the data when the model is trained discriminatively?

(a) Which number of hidden neurons in the network leads to the smallest validation error?

(b) What is the effect of standardizing data using the standard deviation and mean of the respective dataset, instead of using the standard deviation and mean of the training set?

2. How does the output of the recurrent neural network depend on the number of neurons in the hidden layers and the activation functions when the model is trained discriminatively?

(a) Which number of hidden neurons in the network leads to the smallest validation error?

(b) What is the effect of adding activation functions to the dense layers of the network?

3. How does the output of the recurrent neural network change when the model is trained with a cGAN?

4. Which model achieves the lowest test loss when predicting the next timestep of audio?

5. What characterizes the performance of both models when generating longer sequences using seed sequences from the training set and a previously unseen dataset?

2. RELATED WORK

The following sections explore how previous research in audio generation has motivated this project on comparing discriminative training with training with a cGAN for audio generation.

2.1 Generating audio using an RNN and an LSTM unit

This paper is partly inspired by a music generation model called GRUV, which is based on an RNN [26]. GRUV was trained to generate popular music. Many other music generators [2, 18, 3] use MIDI files as input, which can only be used to transfer note sequences for electronic musical instruments. In contrast, GRUV uses wave files as input, which can also be used to generate other types of audio, including laughter sounds; this work was therefore used to build the RNN model in this project.

A limitation of the simple RNN model is known as the "vanishing gradient" problem, where the weights of the earlier layers are updated slowly because, during backpropagation, the gradients of the earlier layers become very small. The problem can be overcome by using a Gated Recurrent Unit (GRU) or a Long Short-Term Memory (LSTM) unit instead of a simple RNN. According to [26], an LSTM unit was able to produce music that was more natural compared with the GRU unit. Thus, in this research the vanishing gradient problem was countered with an LSTM unit.
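To make the swap concrete, the minimal Keras fragment below (a sketch with assumed layer sizes, not the thesis code) shows how a simple recurrent layer can be replaced by an LSTM layer while the rest of the model stays unchanged.

```python
# Sketch: swapping a simple recurrent layer for an LSTM layer in Keras.
# The input shape (8 timesteps of 4000-value frames) follows section 4.1.2;
# the 128 hidden units are an arbitrary illustration.
from keras.models import Sequential
from keras.layers import SimpleRNN, LSTM

vanilla = Sequential([SimpleRNN(128, input_shape=(8, 4000))])  # prone to vanishing gradients
gated = Sequential([LSTM(128, input_shape=(8, 4000))])         # gated memory cell mitigates the problem
```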

A major issue concerning RNNs is the tendency of these models to overfit [5]. Previous research has found that reducing the number of hidden neurons is an effective tool for avoiding overfitting [44]. This motivated research questions 1(a) and 2(a) about finding the optimal number of hidden neurons for the RNN and the FFNN.


Figure 2. Generating sequences using a GAN (top) and a cGAN (bottom)

2.2 Generating audio using a GAN

Deep convolutional generative adversarial networks have been used to successfully generate images [30]. An attempt has also been made to apply this architecture to generating spectrograms of piano music [21]. This attempt was not successful, most likely because slight changes in the spectrogram can have a great influence on the entire sound signal. Thus, spectrograms are not used to generate audio in this project.

Continuous recurrent neural networks with adversarial training (C-RNN-GAN) have successfully combined an RNN with a GAN to generate classical music [23]. In contrast, the purpose of this project is to compare an RNN with a GAN by keeping the architecture of the generator consistent with the architecture of the RNN.

GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs [30]. GANs are difficult to train because the cost functions are non-convex, the parameters are continuous, and the parameter space is extremely high-dimensional [32].

2.2.1 Generating audio using cGANs

Conditional GANs differ from regular GANs in that some information is used to condition the discriminator and the generator. For example, where a regular GAN learns to generate digits in general, a cGAN can be conditioned to generate specific digits by feeding the class labels to the generator and the discriminator [22]. In our research, the networks are conditioned on an audio sample at time t to encourage the generator to generate the audio sample at time t + 1. This is explained in further detail in section 3.2.

In several articles, cGANs have successfully been used to generate images of faces [6], digits [22] and rooms [38]. In all of these works, successful results have been obtained when the generator is provided with two inputs: noise and conditioning data. However, in [13] it is observed that the generator learns simply to ignore the noise. Therefore, in [13] only conditioning data is given to the generator and noise is provided only in the form of dropout. Using this approach, several detailed images have nevertheless been created. Noise is usually provided to the generator so that it does not produce deterministic outputs [13]. In this project, the purpose of the GAN is to produce a very specific output. The task at hand is therefore deterministic, and omitting a noise vector as an input to the generator is justified.

Figure 3. PCM encoding of a single sine wave

3. METHOD

This section first discusses how predictions of audio sequences can be made with an RNN and an FFNN. We then compare two ways of training these models: discriminatively and with a GAN. Finally, we discuss how a longer sequence of laughter can be generated with the trained model.

3.1 Predicting audio sequences

Modeling a laughter sequence with a neural network can be done by estimating an audio sample based on previous sequences. This problem can be solved using FFNNs or RNNs. First, experiments were made with a simple FFNN to establish a baseline against which to measure the performance of the RNN. With an FFNN the problem can be defined as follows: given a vector $X_t$ at a time interval $t$, what is the most likely vector $X_{t+1}$ at time $t+1$? The new sequence represents the best guess for the sequence at the next time interval.

In contrast, the RNN does not make a prediction based on only the previous sequence, but on a number of previous sequences. With an RNN, the problem is defined as follows: given a set of vectors $X_t, X_{t-1}, \ldots, X_0$ at time intervals $t, t-1, \ldots, 0$, generate the most likely vector $X_{t+1}$ at time $t+1$. The difference between the two models is also illustrated in Figure 1.

The model for the RNN was based on [25]. The model has 2 fully-connected layers and one LSTM layer. The standard LSTM layer is used to prevent vanishing gradients [29]. The FFNN was set up so that it is comparable to the RNN. The LSTM layer of the RNN was substituted with another fully connected layer in the FFNN.
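The following Keras sketch illustrates one plausible layout of the two architectures under the assumptions above (layer sizes and shapes follow sections 3.2 and 4.1.2); it is not the author's exact code.

```python
# Sketch: a GRUV-style RNN with two fully connected layers around one LSTM
# layer, and an FFNN in which the LSTM is replaced by a third dense layer.
from keras.models import Sequential
from keras.layers import Dense, LSTM, TimeDistributed

FRAME = 4000       # samples per 0.25 s frame at 16 kHz
TIMESTEPS = 8      # 2 s of context for the RNN
HIDDEN = 1000      # hidden units per layer (varied in the experiments)

def build_rnn(hidden=HIDDEN):
    model = Sequential()
    model.add(TimeDistributed(Dense(hidden, activation='relu'),
                              input_shape=(TIMESTEPS, FRAME)))
    model.add(LSTM(hidden, return_sequences=True))          # tanh activation by default
    model.add(TimeDistributed(Dense(FRAME, activation='tanh')))
    return model

def build_ffnn(hidden=HIDDEN):
    # Same depth, but the LSTM is swapped for a dense layer and only a single
    # 0.25 s frame is used as input.
    model = Sequential()
    model.add(Dense(hidden, activation='relu', input_shape=(FRAME,)))
    model.add(Dense(hidden, activation='tanh'))
    model.add(Dense(FRAME, activation='tanh'))
    return model
```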


The sequences of audio which are used for prediction are representations of sound waves. In this work, laughter sounds were represented digitally with pulse-code modulation (PCM) values. These values correspond to the loudness of the sound and they contain information about the amplitudes of the waveforms in the audio [35].

Figure 3 describes how a sine wave (red curve) is transformed into PCM values (blue dots). In the PCM stream, the amplitude of the signal is sampled at uniform intervals - in this paper 16000 times per second, a frequency of 16 kHz. The regular intervals are indicated with the vertical lines. For each sample, the closest value on the y-axis is chosen. Thus, the PCM stream creates a discrete representation of the input signal. Each sample is rounded to the nearest integer value in the range ±32768.
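As an illustration, the snippet below sketches how such a 16 kHz, 16-bit PCM file can be read and scaled to the range used later in section 4.1.2 (the file name is hypothetical).

```python
# Sketch: reading one laughter wave file and scaling the PCM samples to [-1, 1].
import numpy as np
from scipy.io import wavfile

rate, pcm = wavfile.read('laugh.wav')   # pcm is an int16 array; rate is expected to be 16000
audio = pcm.astype(np.float32)
audio /= np.abs(audio).max()            # divide by the absolute maximum value
```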

3.2 Training the models

We explore two methods for training our models. First, the model was trained discriminatively: a prediction was made based on observed variables with an FFNN and an RNN. Second, the model was trained using a GAN.

Loss functions

The loss function used for the discriminative models was the mean squared error (MSE):

$$J(\theta) = \frac{1}{T}\sum_{t}\left(X_t - \hat{X}_t\right)^2. \qquad (1)$$

In the formula, $T$ denotes the number of timesteps, which is the specified number of input samples $x$ used to predict the next timestep $y$. In the case of the FFNN, the number of timesteps is 1. $X_t$ denotes the predicted sequence for a given time and $\hat{X}_t$ denotes the ground-truth sequence for the given time. The MSE loss function was used because it has previously been used to successfully generate music in [25]. The MSE loss function allows us to reconstruct the audio clip based on training samples.
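One reading of Eq. (1) as code is sketched below; in Keras, an equivalent objective (up to a constant factor) is obtained by compiling the model with loss='mse'.

```python
# Sketch of Eq. (1): average squared error over the T predicted timesteps.
import numpy as np

def mse_loss(predicted, target):
    """predicted, target: arrays of shape (T, frame_length); T = 1 for the FFNN."""
    T = predicted.shape[0]
    return np.sum((predicted - target) ** 2) / T
```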

The MSE is not usually used in GANs because the goal of the GAN is not to reconstruct the audio sample but to generate a new sample by learning the representation of the training data [10, 11]. As the final objective of this work is to also generate laughter, not only predict it, the loss function of the GAN may make it more successful in capturing the distribution of laughter clips and creating samples that fit into that distribution. Thus, a GAN can be more successful in generating original laughter. However, when conditioning the GAN, we keep it comparable with the discriminative models by making the optimal solution the reconstruction of the laughter clip. This is done by using a laughter vector $X_t$ as input to the generator and passing the discriminator two pairs of inputs. The formation of the two input pairs is illustrated in Figure 2. The first pair contains two sequences, $X_t + X_{t+0.25s}$, which occur consecutively in the training data; this pair is called the "true" pair. The other pair consists of a sequence followed by the generated sequence, $X_t + G(X_t)$; this pair is called the "fake" pair. In order to optimize its loss function, the generator must generate a sequence that is identical to the next timestep of audio. We expect that the combination of a binary cross-entropy loss with conditioning information will lead to the cGAN being able to predict laughter sequences in a more generalized way. The binary cross-entropy loss is implemented on the discriminator in the following way:

$$J\big((x_t, y_t)_{t=1}^{N}, D\big) = -\sum_{t=1}^{N} y_t \log D(x_t) - \sum_{t=1}^{N} (1 - y_t)\log\big(1 - D(G(x_t))\big), \qquad (2)$$

where $X_t$ is the input vector to the generator, which contains real laughter, $X_t + X_{t+0.25s}$ is a vector containing a pair of real laughter clips, and $X_t + G(X_t)$ is a vector that contains a true laugh and a vector produced by the generator. $N$ signifies the total number of samples given to the discriminator. The labels $y_t$ signify whether the input vector contains a generated part. As the ground truth only contains labels that are 0 or 1, only one term of the formula contributes to the loss for each pair of laughter passed to the discriminator: if the label is 1 ("real"), only the first term is relevant; if the label is 0 ("generated"), only the second term is relevant.

The loss function described in [9] is used for the generator. This is similar to the cross-entropy loss described earlier, but all of the examples that are passed to the discriminator during the training of the generator are labeled as "real", so only the first term is relevant. The generator's loss function is the following:

$$J^{(G)} = -\frac{1}{2}\log D\big(X_t + G(X_t)\big). \qquad (3)$$

The optimal point for the GAN is where both the generator and the discriminator have the lowest losses [32].

Training procedure

The structure of the discriminative models was described in section 3.1. The structure of the RNN will be used as the structure of the generator of the GAN. The structure of the discriminator has to be comparable to the structure of the generator. In [13], the discriminator has almost as many layers as the generator; thus, a feed-forward neural network with 3 fully connected layers was used as the discriminator. In [13], the number of hidden units is varied between the layers of both networks. To maintain simplicity and the similarity between this research and [25], in this work the number of hidden neurons was kept consistent throughout the layers of both networks.

Without a discriminator added to the models, the training procedure is the following:

1. All input sequences are passed through the model.

2. The (initially uniformly random) weights are assigned and the outputs are computed.

3. The loss is calculated.

4. The weights of the network are updated.

This process is repeated until the lowest loss is found on a validation set. Training the RNN model is more complex than training an FFNN because the RNN requires a set of inputs to compute the output of the model, as described in section 3.1.
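A minimal sketch of this loop is shown below; the array names, the number of epochs and the checkpoint file are assumptions, and build_rnn/build_ffnn refer to the architecture sketch in section 3.1.

```python
# Sketch of the discriminative training loop: train with MSE and RMSprop and
# check the validation loss every 10 epochs (cf. section 4.1.3), keeping the
# weights with the lowest validation error.
model = build_rnn()                               # or build_ffnn() for the baseline
model.compile(optimizer='rmsprop', loss='mse')

best_val = float('inf')
for epoch in range(400):                          # number of epochs is an assumption
    model.fit(x_train, y_train, epochs=1, verbose=0)
    if (epoch + 1) % 10 == 0:
        val = model.evaluate(x_val, y_val, verbose=0)
        if val < best_val:
            best_val = val
            model.save_weights('best_model.h5')   # hypothetical file name
```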

Compared with [25], several hyperparameters were changed for training. The learning rate was set to 0.05 because initial tests showed that a higher learning rate makes the model quickly overfit. Instead of 40 timesteps, like in [25], only 8 timesteps were used when the RNN was trained. This decision was made because GRUV was aiming to generate music and an average song is several minutes long. At the same time, the average length of a burst of laughter in our dataset was around 2.6 s (stdev = 3.1 s). As the target sequence was shorter, we also had to use fewer timesteps for prediction.

The activation function used in the first dense layer was a rectified linear unit (ReLU) because this is commonly used for feed-forward neural networks. The activation functions used in the second and third layers were the hyperbolic tangent (tanh). The second layer had a tanh activation function because this is the activation that was used for the LSTM layer in [25]. The third layer had a tanh activation function so that outputs could be positive and negative.

The training procedure of the generator of the GAN is similar to the training procedure of the discriminative models. However, with the GAN, only some sequences are used to generate outputs and update weights, and the generator's loss is computed through the discriminator, which means that the discriminator has to be trained before the generator. The cGAN training process is the following:

1. The training set was divided into a set of batches. As in [13], one batch consisted of 10 samples.

2. The first batch of audio samples was passed to the generator as input.

3. Using (initially uniformly random) weights, the generator created outputs.

4. Real samples with the same size as the generated outputs were extracted from the concatenated laughter audio file.

5. Two sets of data were produced for the discriminator: one where the inputs of the generator were concatenated with the generated outputs, and one where the inputs were concatenated with the real outputs. This is illustrated in Figure 2.

6. Labels for the discriminator were added: 0s were added as labels to the vectors with generated parts and 1s to the vectors with the real parts.

7. The discriminator was trained based on the outputs and labels from the previous two steps.

8. The generator was trained like the discriminative models outlined earlier in this section, with the following changes:

(a) The inputs were passed to the network that connects the discriminator and the generator, instead of just the generator network.

(b) Like in step 4, the generator's input and prediction were concatenated and passed over to the discriminator. This time no "real" pairs were passed to the discriminator.

(c) The discriminator's weights were set as not trainable while the generator was trained; they were set as trainable again once the generator's weights had been updated.

9. This process was repeated with the next batch. (A minimal code sketch of this loop follows the tables below.)

Table 1. FFNN losses per number of hidden units, standardized using the training set's mean and standard deviation

Hidden units   Validation loss   Training loss
100            0.1860            0.7918
200            0.1860            0.7989
400            0.1861            0.7955
1000           0.1860            0.7925

Table 2. FFNN losses per number of hidden units, standardized using each dataset's mean and standard deviation

Hidden units   Validation loss   Training loss
100            1.027             0.9984
200            1.027             0.9927
400            1.027             1.0006
1000           1.027             1.0025
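The sketch below puts steps 1-9 together in Keras; the shapes, hyperparameters and the `batches` iterator are assumptions rather than the exact thesis code.

```python
# Sketch of the cGAN training loop: the generator is the RNN from section 3.1,
# the discriminator is a three-layer feed-forward network that scores a
# conditioning frame concatenated with the real or the generated next frame.
import numpy as np
from keras.models import Sequential, Model
from keras.layers import Dense, LSTM, TimeDistributed, Input, Concatenate

FRAME, TIMESTEPS, HIDDEN = 4000, 8, 400

generator = Sequential([
    TimeDistributed(Dense(HIDDEN, activation='relu'), input_shape=(TIMESTEPS, FRAME)),
    LSTM(HIDDEN),                                  # last timestep only
    Dense(FRAME, activation='tanh'),               # the predicted 0.25 s frame
])

discriminator = Sequential([
    Dense(HIDDEN, activation='relu', input_shape=(2 * FRAME,)),
    Dense(HIDDEN, activation='relu'),
    Dense(1, activation='sigmoid'),
])
discriminator.compile(optimizer='rmsprop', loss='binary_crossentropy')

# Combined model used to train the generator through the discriminator. The
# discriminator was compiled above while trainable, so train_on_batch on it
# still updates its weights; inside `combined` they stay frozen (step 8c).
discriminator.trainable = False
seq_in = Input(shape=(TIMESTEPS, FRAME))
cond_in = Input(shape=(FRAME,))                    # conditioning frame X_t
fake_pair = Concatenate()([cond_in, generator(seq_in)])
combined = Model([seq_in, cond_in], discriminator(fake_pair))
combined.compile(optimizer='rmsprop', loss='binary_crossentropy')

for seq_batch, cond_batch, real_next in batches(train_set, batch_size=10):  # hypothetical iterator
    fake_next = generator.predict(seq_batch)                                # steps 2-3
    real_pairs = np.concatenate([cond_batch, real_next], axis=1)            # steps 4-5
    fake_pairs = np.concatenate([cond_batch, fake_next], axis=1)
    n = len(cond_batch)
    discriminator.train_on_batch(real_pairs, np.ones((n, 1)))               # steps 6-7
    discriminator.train_on_batch(fake_pairs, np.zeros((n, 1)))
    combined.train_on_batch([seq_batch, cond_batch], np.ones((n, 1)))       # step 8: "fool" the discriminator
```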

3.3 Generating laughter from a seed sequence

In this work laughter is generated using the same method as in [25]. After the neural networks have been trained, laughter can be generated by providing a seed sequence $S_t, S_{t-1}, \ldots, S_0$, representing audio waveforms at time intervals $t, t-1, \ldots, 0$, as input to the neural network. The network predicts the output sequence $P_{t+1}, P_t, \ldots, P_1$ based on the seed sequence, where $P$ represents an audio waveform at a time interval. We call this sequence the predicted sequence. Once the predicted sequence is generated, it is used as an input sequence to the neural network. From the output sequence that is then generated, the last output vector is added to the predicted sequence, and the new predicted sequence is used to predict the next sequence of audio. This continues until the end point specified by the user.

According to [25], larger seed sequences produce better results compared with shorter seed sequences, and generating a sequence that is approximately three times longer than the initial seed sequence tends to produce a coherent musical output that does not result in generation loops. Thus, based on 2 s of audio, a clip of 6 s is created. The generated sequence is then converted into a wave file.
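A sketch of this feedback loop is given below (frame counts follow the 0.25 s framing above; the handling of sequence versus single-frame outputs is an assumption to cover both model variants).

```python
# Sketch: generate 6 s of audio (24 frames) from a 2 s seed (8 frames) by
# repeatedly feeding the newest prediction back into the network.
import numpy as np

def generate(model, seed, n_frames=24, timesteps=8):
    """seed: array of shape (timesteps, frame_length); returns the generated frames."""
    context = [frame for frame in seed]
    generated = []
    for _ in range(n_frames):
        x = np.array(context[-timesteps:])[np.newaxis]      # shape (1, timesteps, frame_length)
        pred = model.predict(x)[0]
        next_frame = pred[-1] if pred.ndim == 2 else pred   # keep only the newest frame
        generated.append(next_frame)
        context.append(next_frame)
    return np.array(generated)                              # to be written out as a wave file
```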

4. EXPERIMENTS AND RESULTS

Preliminary experiments were conducted with three laughter databases. In addition, new data were generated by modifying the pitch, speed and volume of the laughter samples according to [33]. Preliminary tests showed overfitting on this dataset, even though the test, training and validation sets had all been constructed using stratified random sampling. Even when using linear regression and regularization, the overfitting persisted. We hypothesized that the overfitting might be due to how the data were modified or to inconsistencies between datasets. To better understand the cause of overfitting, only one dataset without any modifications was used for training in the next experiments.

Preliminary tests also showed that when the data were transformed using the Fourier transform, a simple feed-forward neural network overfit. When only the absolute values of the transformed data were used, the validation loss started decreasing with simple models. However, in the context of this thesis, the absolute values of the transformed data could not be used because, to the best of our knowledge, the audio cannot be reconstructed from the absolute values alone. To simplify the data even further, the raw PCM values were used in the next experiments.

The following experiments are divided into three parts. We first explore the result of predicting a timestep of audio with only a feed-forward neural network to establish a baseline for the RNN. Secondly, we build an RNN model according to [26] and measure how well a sequence of audio can be predicted compared with the FFNN. We finally add a discriminator to the RNN model to see how this changes the RNN's capacity to predict the next sequence of audio. The RNN and the GAN are compared by measuring their accuracy in predicting a timestep of audio. In addition, longer samples of audio are generated by the trained networks.

4.1 Experimental setup

4.1.1 Technical details and hardware

All experiments were run on the IBM Softlayer cloud using a GPU: an Nvidia Tesla K80 graphics card with 1 GB of memory and 2 cores. The Python scripts were written in Keras. A version of the GRUV model that had been made compatible with the newest Keras release was used [26].

4.1.2 Dataset

For the following experiments two datasets were used:

• AVLaughterCycle dataset [37] - labeled laughs from individual participants watching videos or acting out bursts of laughter. In total, the dataset contains 1854 laughs from 24 participants.

• IBM meetings dataset - meetings with groups of 3-5 people were recorded and spontaneous bursts of laughter were labeled. The dataset consists of 186 laughs from a variety of people. A more detailed description of the dataset can be found in [24]. The dataset is not publicly available.

The AVLaughterCycle dataset was split into a training, test and validation set. Laughter samples from participants 1-16 formed the training set, laughter samples from participants 17-20 formed the validation set, and samples from participants 21-24 formed the test set. The IBM dataset was used for additional validation. This division of the data was preferred over randomly assigning samples to the training, test and validation sets because it makes it possible to explain performance differences between the datasets in case they perform significantly differently.

All laughs that were shorter than 0.5 s were discarded. The inputs and outputs consisted of 0.25 s segments of audio with no overlap between different inputs. For each input, there was a corresponding output at the next time interval. In [25], the audio clips were concatenated together and the last clip of one song had the first clip of the next song as its corresponding output. We wanted to make sure that the neural network learns to represent a continuous laughter burst instead of concatenated bursts of laughter. Thus, the final 0.25 s of audio in a laughter burst was not used as an input because it did not have a corresponding output at the next timestep.

The train-test-validation split was approximately 70%-12%-16%. The training set consisted of 9875 samples (42 minutes), the validation set of 1705 samples (7 minutes), and the test set of 2233 samples (9 minutes) of laughter. The IBM dataset contained 2661 samples (11 minutes) of laughter and was used as another way of validating the model.

The PCM values were transformed to the range -1 to +1 by dividing all values by the absolute maximum value. The datasets were then standardized using z-score standardization. The model was trained using 2 s chunks of the training data that had been divided into 8 timesteps. During training, there was a corresponding 0.25 s target vector for every 0.25 s input vector. As in [25], all samples were 0.25 seconds long. However, the samples consisted of fewer values because [25] used a higher sampling rate: the sampling rate in GRUV was 44100 Hz and a 0.25 s vector of audio contained 11025 values, whereas in our project the sample vectors were 4000 values long.
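The segmentation and standardization described above can be sketched as follows (the dataset-level statistics and the helper name are assumptions, not the thesis code).

```python
# Sketch: drop laughs shorter than 0.5 s, cut the audio into non-overlapping
# 0.25 s frames (4000 samples at 16 kHz), scale to [-1, 1], z-score with
# dataset-level statistics, and pair each frame with the following frame.
import numpy as np

RATE = 16000
FRAME = int(0.25 * RATE)                                 # 4000 samples per frame

def frame_pairs(laugh_pcm, abs_max=32768.0, mean=0.0, std=1.0):
    """laugh_pcm: 1-D int16 array for one laughter burst; mean/std are dataset-level."""
    if len(laugh_pcm) < int(0.5 * RATE):                 # discard laughs shorter than 0.5 s
        return np.empty((0, FRAME)), np.empty((0, FRAME))
    audio = laugh_pcm.astype(np.float32) / abs_max       # scale PCM roughly to [-1, 1]
    n = len(audio) // FRAME
    frames = audio[:n * FRAME].reshape(n, FRAME)
    frames = (frames - mean) / std                       # z-score standardization
    return frames[:-1], frames[1:]                       # inputs and next-frame targets
```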

4.1.3 Evaluation

The performance of all models was estimated on the validation set. The models were evaluated by calculating the mean squared error between the ground truth and the predicted sequence, as in [25].

To save computational resources, the validation losses were calculated every 10 epochs instead of at every epoch. This was done for all models to make sure that they are compared on equal grounds.

4.2 How does the output of the feed-forward neural network depend on the number of hidden neurons and the standardization method of data when the model is trained discriminatively?

First, the next sequence was predicted with a feed-forward neural network. The purpose of this experiment was to estimate how well a sequence can be predicted with an architecture that does not contain an LSTM unit. As in [25], the RMSprop algorithm was used for optimization. The SGD algorithm was also tested, but we found that the RMSprop algorithm gave slightly smaller validation losses (0.18603) when testing with 100 hidden neurons compared to SGD (0.19328).

4.2.1 Optimal number of neurons in the hidden layer

Finding the optimal number of neurons in the hidden layer of the feed-forward neural network is necessary to build a model that does not overfit. The optimal number of hidden neurons was investigated under two conditions. First, the mean and standard deviation of the training set were used to standardize the validation data, as recommended in [4]. However, in this case the validation and test sets were not constructed from exactly the same distribution of audio. Thus, another experiment was done where the mean and standard deviation of the validation set were used for standardization.

Figure 4. Training and validation losses for an FFNN model with 100 hidden neurons when the validation set has been standardized using the standard deviation and mean of the training set

Standardizing data using the standard deviation and mean of the training set

Table 1 contains the lowest validation losses for each number of hidden neurons. As can be seen from the table, the number of hidden neurons had a slight effect on the training loss but not a significant effect on the validation loss. While the training loss was consistently around 0.8, the validation loss was around 0.18. Figure 4 depicts the validation loss with 100 hidden neurons. As can be seen from the figure, the validation loss decreases rapidly and stabilizes at around 5 epochs, while the training loss decreases more slowly. The same pattern of decrease can be seen when testing with 200, 400 and 1000 hidden neurons.

It is unusual for the validation loss to be smaller than the training loss. This phenomenon was inspected by visualizing the predictions of the network. Figure 5 illustrates the difference between the predictions and the actual values in the validation set. The figure shows that the amplitude of the predicted signal does increase a little when the amplitude of the true output signal increases. However, the network mostly predicts values that are close to 0. The validation error, which is smaller than the training error, might be occurring due to the validation set having a smaller variance compared to the training data. The standard deviation for the training set was 4110.5 PCM units, while it was only 1600.7 PCM units for the validation set. To see whether the small validation error was caused by the differences in the variance of the datasets, another experiment was performed where the standard deviation and mean of the respective datasets were used in the standardization process.

Standardizing data using the standard deviation and mean of the corresponding dataset

Each dataset was standardized individually, using that dataset's standard deviation and mean. This way, the relative changes in the validation set would be comparable to similar changes in the training set. Table 2 shows the results of using these scores as input to the neural network. As can be seen from the table, the number of hidden neurons did not change the lowest validation loss. In all conditions, the validation loss was around 1.0269. In this experiment, the complexity of the model did not have an effect on the validation loss, indicating that overfitting was caused by another factor.

Figure 5. Prediction of the validation set for 8 seconds of audio compared to the real values for the same time period

Figure 6. Comparison of the losses on the training, validation, test and IBM datasets using the FFNN model with 100 hidden neurons when the validation set has been standardized using the standard deviation and mean of the respective dataset

As another way of validating the model, the IBM dataset and the test set were used. Using another dataset for validation allows us to deduce how well the model generalizes to laughter recorded in other conditions. Figure 6 illustrates the losses for the training, validation and IBM datasets. As can be seen from the graph, neither the training nor the validation losses decrease over time for any of the datasets. This means that the feed-forward model is not learning to predict sequences more accurately over time.

The first experiments showed that the validation loss does not change depending on the number of hidden neurons. After the data was standardized based on each dataset’s individual standard deviation and mean, the validation loss was more similar to the training loss. Thus, in the following experiments this version of the dataset was used.


Table 3. RNN losses per number of hidden units - using activation functions on the dense layers

Hidden units Validation loss Training loss

100 1.0273 0.9192

200 1.0276 0.8929

400 1.0858 0.6702

1000 1.0267 0.9449

Table 4. RNN losses per number of hidden units - not using activation functions on the dense layers

Hidden units Validation loss Training loss

100 1.0453 0.8438

200 1.0638 0.7645

400 1.0864 0.6669

1000 1.1424 0.5029

4.3 How does the output of the recurrent neural network depend on the number of neurons in the hidden layers and the activation functions when the model is trained discriminatively?

The following experiments investigated the effect of adding an LSTM unit to the model. The model was kept as consistent as possible with [25] in order to replicate its results. The RNN model was also kept consistent with the FFNN model so that the results of both models could be compared.

4.3.1 Optimal number of neurons in the hidden layer

In the feed-forward model, the number of hidden neurons did not have an effect on the training or validation losses. To see whether the number of neurons has an effect for the RNN, and whether there is an optimal number of neurons that prevents the model from overfitting, an experiment was conducted in which the number of hidden neurons was varied.

The optimal number of neurons in the hidden layer was investigated under two conditions. First, a model was used which contained the same activation functions as had previously been used in the FFNN. This experiment was conducted to make the FFNN and the RNN comparable with each other. Secondly, the activation functions for the dense layers were removed to make the model consistent and comparable with [25].

Activation functions added to the dense layers

The model was built using a ReLU activation in the first dense layer and a tanh activation function in the last dense layer, for the reasons described in section 3.2.

Table 3 shows that the lowest validation loss (1.0267) occurred with 1000 hidden neurons. The validation loss is very similar to the validation losses achieved earlier with the FFNN model. The model is slightly overfitting, as the training loss is consistently lower than the validation loss. The top graphs in Figures 7 and 8 show how the training and validation losses change over time. As can be seen from Figure 7, the validation losses are slightly increasing, and the lowest validation loss for 1000 hidden neurons is reached at 20 epochs. The training losses are decreasing, as indicated in Figure 8. In contrast, when training the FFNN model, as described earlier, the losses for the training and validation sets stayed the same throughout the training process. This means that the LSTM unit caused the change in the validation and training losses. The model was learning, but not information that could be generalized to the validation set.

Figure 7. Comparison of the validation losses with a different number of hidden neurons using the RNN with activation functions (top) and without activation functions (bottom) on the dense layers

No activation functions added to the dense layers

In [25], the activation functions of the dense layers were removed to perform an affine transformation. An affine transformation is used when points are transformed between different dimensionalities; the transformation preserves collinearity and ratios of distances between points [40]. In the following experiment, the model was kept consistent with [25] and the activation functions were removed.

Table 4 shows that the lowest validation loss (1.0453) occurred with 100 hidden neurons. This validation loss was higher than the one achieved when activation functions were used on the fully connected layers. The model is greatly overfitting: as the results in the table show, the training loss is much lower than the validation loss, reaching 0.5029 with 1000 hidden neurons.

The bottom graphs of Figures 7 and 8 show how the training and validation losses change over time. As can be seen from the bottom graph in Figure 7, the validation losses increase quickly, and the lowest validation losses occur with 100 hidden neurons. The training losses decrease quickly, as indicated in the bottom graph of Figure 8. As this model overfits even more than the model where activation functions were added to the dense layers, it was not used for further experiments.

Figure 8. Comparison of the training losses with a different number of hidden neurons using the RNN with activation functions (top) or without activation functions (bottom) on the dense layers

In conclusion, the RNN models were more promising than the FFNN model because the change in losses indicated that the models were able to learn over time. Both of the RNN models showed signs of overfitting, but the model where activation functions were used in the dense layers was overfitting less. With this model, 1000 hidden neurons had the smallest validation loss, but the validation losses for all hidden layer sizes were similar. Thus, this was the model that was used in the GAN in the coming experiments.

4.4 How does the output of the recurrent neural network change when the model is trained with a cGAN?

The first experiment showed that the FFNN model did not learn over time. As the previous experiments showed that the RNN model with activation functions performs best, this model was used as the generator of the GAN. The lowest validation loss with the RNN was achieved with 1000 neurons, as indicated in Table 3, but as the difference between the conditions was small, 400 hidden neurons were used to save computational resources. The goal of the following experiments is to see whether adding a discriminator to the RNN enhances training.

As can be seen from Figure 9, both subnetworks trained against each other without either network's loss approaching 0. This indicates that the networks were balanced. Figure 10 shows the validation scores when the generator's weights are loaded into an RNN. As can be seen from the figure, the validation losses remain similar across the training process. The lowest validation loss, 1.0275, was achieved at 250 epochs. This means that the model did not learn to predict the next timestep over the course of training. This might have happened because the generator was not powerful enough to predict the next sequence on its own. However, the results of this experiment show that adding a discriminator to the RNN prevented the RNN from overfitting, as previous experiments without the added discriminator had shown increasing validation losses.

Figure 9. Comparison of the discriminator's and generator's losses during training

Figure 10. Validation loss of the cGAN
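The evaluation of the generator as a stand-alone RNN can be sketched as follows (the file name and the builder function are hypothetical; the layer layout must match the generator exactly).

```python
# Sketch: transfer the cGAN generator's weights into an identically shaped RNN
# and measure its validation MSE, as reported in Figure 10.
generator.save_weights('cgan_generator.h5')     # hypothetical file name
rnn_eval = build_generator_rnn(hidden=400)      # hypothetical builder, same layer layout as the generator
rnn_eval.load_weights('cgan_generator.h5')
rnn_eval.compile(optimizer='rmsprop', loss='mse')
val_loss = rnn_eval.evaluate(x_val, y_val, verbose=0)
```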

4.5 Which model achieves the lowest test loss when pre-dicting the next timestep of audio?

The best RNN model was trained discriminatively and with a GAN. The performance of the model after these two training conditions was compared on two test sets: the AVLaughterCycle test set and the IBM dataset. As can be seen from Figure 11, the GAN model performs better on both datasets than the RNN model, but neither of the models improves throughout the course of training. The RNN model is likely getting worse because it is overfitting. The GAN model is also not improving over time. This indicates that the model did not learn to predict the next timestep in a sequence of data.


Figure 11. Performance of the cGAN and the RNN on the test set and the IBM dataset

4.6 How do the models perform when generating longer sequences?

It is possible that instead of representing the next timestep, the GAN model learned to represent something else that is meaningful in terms of laughter generation. Individual pieces of generated audio were examined to find out what the GAN had started generating.

To visualize the output of the GAN and the simple RNN model, both models were given the task of producing an 8-second-long audio clip based on a 2-second seed. The weights at 400 epochs were chosen to illustrate the differences between the models because at that point their losses differed the most.

4.6.1 Generating audio from a seed sequence from the training set

Figure 12 shows how the GAN model and the RNN model performed when given a 2-second-long seed sequence from the training data. As can be seen, both models generate audio that is consistently close to zero, but the amplitude of the GAN model's output is higher than that of the RNN model. This pattern can also be seen when seed sequences from the test data are used.

4.6.2 Generating audio from a seed sequence from the IBM dataset

Figure 13 shows how the GAN model and the RNN model performed when given a 2-second-long seed sequence from the IBM dataset. Again, both models generate audio that is consistently close to zero and the amplitude of the GAN model's output is higher than that of the RNN model. However, with this dataset, the RNN model also deviates from its average pattern at 1.5 seconds.

4.6.3 Qualitative observations

When listening to the audio clips that both models generated using random seed sequences from the training and IBM datasets, it becomes apparent that the GAN model creates clips of noise. The RNN model, however, does occasionally produce audio clips that sound like laughter.

Figure 12. Generation of 6 seconds of audio from a training set seed sequence using a GAN (top) and an RNN (bottom)

Figure 13. Generation of 6 seconds of audio from an IBM dataset seed sequence using a GAN (top) and an RNN (bottom)

5. CONCLUSIONS AND FUTURE WORK

In this research we have introduced a way to compare discriminative training with training with a GAN when predicting audio sequences. Both models reached similar validation scores, but the RNN without a discriminator showed overfitting. Only the RNN model without a discriminator produced samples which sounded laughter-like.

When training the RNN in a discriminative way, we could not reproduce the results in [25], as the model quickly overfit on our dataset. There may be several reasons for this. For example, our dataset was not as big as the one used in [25], and the sample vectors were shorter, which might have made it more difficult to learn patterns in the data. Our datasets also proved to have large differences in variance. In addition, the model might be more suited to learning longer sequences, such as music, than laughter clips where the desired output is only a couple of seconds long.

We found that the RNN learned throughout the course of training while the FFNN model did not. In either case, the number of hidden neurons did not have a significant effect on the validation loss. By applying activation functions to the dense layers, the model was less likely to overfit on our data. We also found that when the validation set was standardized using its own mean and standard deviation, instead of the mean and standard deviation of the training set, its loss became more similar to the training loss.

When training the RNN with a GAN, we were able to train for more than 400 epochs without either of the subnetworks converging. However, with the architecture we used, the network did not start to accurately predict the next timestep of audio. This may be due to the overfitting of the RNN model that was used as the generator in the GAN, or to the specific architecture we used to build the cGAN.

As the generator of the GAN used an RNN model that was overfitting and the GAN itself did not show overfitting, we can deduce that adding the discriminator to the RNN prevented the model from overfitting. We have also validated our results on the IBM dataset, showing that the models perform similarly on their validation data and on a new dataset.

In conclusion, this work explored the first steps in generating laughter using neural networks. More work is needed to optimize both models. To optimize the RNN, tests can be done with more data recorded at a higher sampling rate and by standardizing each of the laughter bursts individually, instead of standardizing them on the dataset level. To optimize the GAN, it might be beneficial to add another objective to the network. Previous approaches to conditional GANs have found it beneficial to mix the GAN objective with a more traditional loss, such as MSE [13]. The goal of the discriminator remains unchanged, but the generator is asked not only to fool the discriminator but also to be near the ground-truth output.

6. ACKNOWLEDGEMENTS

Firstly, I would like to thank both of my excellent supervisors, who have created a stimulating environment for learning. I am very thankful to Zoltan Szlavik, who has made sure that the process of writing a thesis is also enjoyable. His creative and fun approach to complications has often left me inspired to see problems from a unique angle and to make use of cat memes when appropriate. His expertise, not only in machine learning but in many other areas of technology and science, has often led me to think about how to broaden the scope of this research and apply its findings elsewhere. I am equally thankful to Thomas Mensink, whose skill and knowledge in machine learning inspire me to continue learning about the subject after the completion of this thesis. I was often astonished to witness how quickly he was able to pinpoint potential problems, suggest new approaches and structure ideas. He made sure that the stress of writing a thesis stayed at an optimal level while keeping up high expectations of himself and others.

I am also grateful to everyone at the Centre of Advanced Studies at IBM, especially Nikita Galinkin for interesting discussions about this project and Jihong Ju for guiding me through using IBM Softlayer Cloud. Thank you to all the students who completed their MA thesis under Thomas Mensink's supervision - I learned a lot from you all.


7. REFERENCES

1. David T. Bill. 2003. Contributing Influences on an Individual's Attitude Towards a New Technology in the Workplace. Media, PA: Liquid Knowledge Group, Ltd (2003).

2. Elliot Waite, Douglas Eck, Adam Roberts, and Dan Abolafia. 2016. Project Magenta. (2016). https://magenta.tensorflow.org/

3. Feynman Liang, Mark Gotham, Matthew Johnson, Jamie Shotton, and Marcin Tomczak. 2016. BachBot. (2016). http://bachbot.com/

4. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics, New York.

5. Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems. 1019–1027.

6. Jon Gauthier. 2014. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014, 5 (2014), 2.

7. Matthew Gervais and David Sloan Wilson. 2005. The evolution and functions of laughter and humor: A synthetic approach. The Quarterly Review of Biology 80, 4 (2005), 395–430.

8. Arnab Ghosh, Viveka Kulharia, Amitabha Mukerjee, Vinay Namboodiri, and Mohit Bansal. 2016. Contextual RNN-GANs for abstract reasoning diagram generation. arXiv preprint arXiv:1609.09444 (2016).

9. Ian Goodfellow. 2016. NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv preprint arXiv:1701.00160 (2016).

10. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

11. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.

12. Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013).

13. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2016. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004 (2016).

14. Mona Ebadi Jalal, Monireh Hosseini, and Stefan Karlsson. 2016. Forecasting incoming call volumes in call centers with recurrent neural networks. Journal of Business Research 69, 11 (2016), 4811–4814.

15. Takuhiro Kaneko, Hirokazu Kameoka, Nobukatsu Hojo, Yusuke Ijima, Kaoru Hiramatsu, and Kunio Kashino. 2017. Generative adversarial network-based postfilter for statistical parametric speech synthesis. In Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017). 4910–4914.

16. Andrej Karpathy. 2015. The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog (2015).

17. James Kennedy, Séverin Lemaignan, Caroline Montassier, Pauline Lavalade, Bahar Irfan, Fotios Papadopoulos, Emmanuel Senft, and Tony Belpaeme. 2017. Child speech recognition in human-robot interaction: evaluations and recommendations. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, 82–90.

18. Ji-Sung Kim. 2016. Deep Jazz. (2016). https://deepjazz.io/

19. Jin-Gyu Lee, Bo-Hee Lee, Ju-Yeong Jang, Ja-Young Kwon, Keum-Hi Mun, and Jin-Soun Jung. 2017. Study on Cat Robot Utilization for Treatment of Autistic Children. International Journal of Humanoid Robotics (2017), 1750001.

20. Yifan Liu, Zengchang Qin, Zhenbo Luo, and Hua Wang. 2017. Auto-painter: Cartoon image generation from sketch by using conditional generative adversarial networks. arXiv preprint arXiv:1705.01908 (2017).

21. Bartosz Michalak. 2016. DCGAN and spectrograms. (2016). http://deepsound.io/dcgan_spectrograms.html/

22. Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).

23. Olof Mogren. 2016. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904 (2016).

24. Marjolein Nanninga, Yanxia Zhang, Nale Lehmann-Willenbrock, Zoltán Szlávik, and Hayley Hung. 2017. Estimating verbal expressions of task and social cohesion in meetings by quantifying paralinguistic mimicry. In The 19th International Conference on Multimodal Interaction (ICMI 2017). (accepted).

25. Aran Nayebi and Matt Vitelli. 2015. GRUV: Algorithmic Music Generation using Recurrent Neural Networks. (2015).

26. Matt Pearson. 2016. GRUV. (2016). https://github.com/mattpearson/GRUV

27. Steven Peterson and Albert Flanagan. 2009. Neural network hedonic pricing models in mass real estate appraisal. Journal of Real Estate Research 31, 2 (2009), 147–164.

28. Robert R. Provine. 1992. Contagious laughter: Laughter is a sufficient stimulus for laughs and smiles. Bulletin of the Psychonomic Society 30, 1 (1992), 1–4.

29. Andrew Pulver and Siwei Lyu. 2016. LSTM with Working Memory. arXiv preprint arXiv:1605.01988 (2016).

30. Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).

31. David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, and others. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5, 3 (1988), 1.

32. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems. 2234–2242.

33. Jan Schlüter and Thomas Grill. 2015. Exploring data augmentation for improved singing voice detection with neural networks. In ISMIR. 121–126.

34. Aditya A. Shinde, Sharad N. Kale, Rahul M. Samant, Atharva S. Naik, and Shubham A. Ghorpade. 2017. Heart disease prediction system using multilayered feed forward neural network and back propagation neural network. International Journal of Computer Applications 166, 7 (2017).

35. Julius O. Smith III. 2011. Spectral Audio Signal Processing. W3K Publishing.

36. Ilya Sutskever, James Martens, and Geoffrey E. Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 1017–1024.

37. Jérôme Urbain. 2011. AVLaughterCycle Database. (2011). http://www.tcts.fpms.ac.be/~urbain/#AVLaughterCycleDatabase

38. Xiaolong Wang and Abhinav Gupta. 2016. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision. Springer, 318–335.

39. Mathias Weber and Oliver Quiring. 2017. Is it really that funny? Laughter, emotional contagion, and heuristic processing during shared media use. Media Psychology (2017), 1–23.

40. Eric W. Weisstein. 2004. Affine transformation. (2004).

41. Zhizheng Wu and Simon King. 2016. Investigating gated recurrent networks for speech synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 5140–5144.

42. Jianwei Yang, Anitha Kannan, Dhruv Batra, and Devi Parikh. 2017. LR-GAN: Layered recursive generative adversarial networks for image generation. arXiv preprint arXiv:1703.01560 (2017).

43. Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2016. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473 (2016).

44. Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).

45. Heiga Zen and Haşim Sak. 2015. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 4470–4474.
