
Distinguishing rhythmic classes by using autoencoders


Layout: typeset by the author using LaTeX.


Distinguishing rhythmic classes by using autoencoders

Dries Fransen
11041250
Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dhr. Dr. T.O. Lentz

Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam

Contents

1 Introduction
2 Background
   2.1 Prosody
   2.2 MFCC
   2.3 Autoencoder
   2.4 Convolutional Neural Networks
   2.5 Long Short-Term Memory
3 Method
   3.1 Data
   3.2 Data Preprocessing
       3.2.1 Selection and Split
       3.2.2 Low-pass Filter
       3.2.3 MFCC
       3.2.4 Zero-padding
       3.2.5 Outliers
   3.3 Algorithm
       3.3.1 Goal
       3.3.2 Architecture
4 Results
   4.1 CNN autoencoder
   4.2 LSTM autoencoder
5 Discussion
   5.1 Interpretation of the results
   5.2 Possible reasons
       5.2.1 Complexity
       5.2.2 Zero-padding
       5.2.3 Accents
6 Conclusion


Chapter 1

Introduction

Infants can, to a certain extent, tell apart different rhythmic classes. Researching this can provide us with more insight into the process of learning a language, as well as into the development of infants.

It was Mehler et al. (1988) who first started research on the ability of infants to distinguish rhythmic classes. They did this by exposing infants to tapes in both their mother tongue and in a foreign language, after which they found that the infants reacted more strongly to the recordings in their native language. Several others have recreated this experiment with similar results (Nazzi, Bertoncini, & Mehler, 1998; Moon, Cooper, & Fifer, 1993). These experiments have all been done with actual infants as subjects, who usually were no more than two weeks old. However, rather than about infants, this thesis will be about the way in which Artificial Intelligence (AI) can learn to distinguish languages in a similar fashion. For this thesis, I will aim to replicate one of the experiments conducted by Nazzi et al. (1998). The replication will differ in the sense that I will be replacing infants with AI. The experiment by Nazzi et al. (1998) involves the exposure of infants to clips in four different languages: English, Dutch, Italian and Spanish. The conclusion of this experiment was that infants had difficulties distinguishing English from Dutch and Italian from Spanish, but that it was easier for them to separate Dutch and English from Italian and Spanish. The theory upon which this experiment was based is that Dutch and English belong to the same rhythmic class and thus have many aspects in common. The same is true for Italian and Spanish.

The goal of this thesis is to create four "infants", or Neural Networks (NNs), each representing one of the above-mentioned languages, which will be done by training each NN with one of the languages. The specific NN that will be used in this experiment is an autoencoder, an NN capable of compressing and decompressing data. The plan is to feed processed audio data to each of the four autoencoders, where every autoencoder corresponds to a specific language. The hypothesis is that this will ensure the autoencoders learn something about the prosodic information embedded in the languages, information similar to what infants hear when they are still in the womb. The next step is to check whether the autoencoders are able to distinguish the four languages in the same way the infants in the experiment by Nazzi et al. (1998) were able to do.

If teaching AI to tell apart rhythmic classes works out well, this will shed more light on the way in which the infants were able to differentiate between languages and rhythmic classes. The reason AI is well suited to explaining the learning process is that with real human infants, it is difficult to probe them for the type of information they actually extract. With neural networks, on the other hand, the process is in principle fully accessible. In this way, teaching an AI to discriminate between languages will benefit the advancement of linguistic science.


Chapter 2

Background

2.1 Prosody

Fujisaki (1997) defines prosody as follows:

Prosody is the systematic organization of various linguistic units into an utterance or a coherent group of utterances in the process of speech production. Its realization involves both segmental and suprasegmental features of speech, and serves to convey not only linguistic information, but also paralinguistic and non-linguistic information.

In other words, prosody refers to how linguistic units, e.g. syllables, are organised within a sentence when it comes to the timing between those units. Segmental and suprasegmental features of speech refer to the smallest audible fragment of speech, and a combination of those fragments, respectively.

Infants are able to hear the difference between languages belonging to different rhythmic classes such as English and Spanish. Rhythmic classes refer to the way in which timing is used in between units of speech (Nazzi et al., 1998), and they are an element of the prosodic make-up of languages. Infants do, however, have difficulties distinguishing languages from the same rhythmic class, such as Spanish and Italian. It is probable that this is due to prosody. Some important features that compose prosody are intensity, speed and pitch.

Of these three properties, intensity will be filtered out during preprocessing, since the data will be normalised, which discards any information regarding intensity. However, of the three, intensity can be seen as the least important for distinguishing languages from each other. Intensity is used in speech to convey emotion, among other things (Abelin & Allwood, 2000); however, this should not play a role in the dataset used for this project (section 3.1): while all speech contains emotion to at least a certain extent, the emotion within the dataset used in this project should be relatively similar across all clips. In addition, intensity is an unreliable cue when trying to model rhythm.

Pitch relates directly to the frequency of sound: a high-pitched sound is a sound with a high frequency. Prosodic information can be found in how the pitch of sentences or other units of speech changes over time. This change differs between languages (Roach, 1982).

Speed, or rhythm, does play a large role in the distinction between rhythmic classes. In this case, the two rhythmic classes that are being distinguished are different when it comes to the timing of the syllables in the sentences. English and Dutch are generally regarded as being stress-timed, while Italian and Spanish are usually said to be syllable-timed. It should be noted however that these groupings are not universally agreed upon, and that arguments can be made for assigning Spanish to be stress-timed, for example (Roach, 1982).

Nespor, Shukla, and Mehler (2011) state the difference between stress-timed and syllable-timed languages as follows:

[...] languages would differ according to which chunks of speech must have similar durations, i.e. must be isochronous. The requirement of isochrony would hold between syllables in Spanish, and between interstress intervals in English. This proposal accounted for the fact that the syllables of Spanish or Italian, but not those of English or Dutch, are similar in quantity.

As such, Spanish and Italian were referred to as syllable-timed, since the syllables of those languages are approximately equal distances apart. Dutch and English were referred to as stress-timed, as they both have equal timing between the stressed parts of sentences. This information should be extracted by the algorithms, and can be used to tell the rhythmic groups apart.

As an example of pitch differences between languages: English and Spanish both end sentences with a descending pitch in the last part of the sentence. However, a difference between them is that Spanish has a regular, flat pitch contour before the final drop that marks the sentence ending, whereas English usually has a low and falling pitch before the ending (Delattre, Olsen, & Poenack, 1962).

2.2 MFCC

MFCC stands for Mel Frequency Cepstral Coefficients. MFCCs are commonly used in sound processing as representations of auditory input. In summary, the MFCC is a way to represent how humans perceive frequency. The frequency cepstrum is computed by first taking a Fourier transform of the frequency spectrum.

Figure 2.1: A sound signal and its Fourier transform (from learn.adafruit.com).

Figure 2.2: A visualised example of an MFCC (from medium.com).

A Fourier transform is a way of decomposing a signal, in this case a sound signal, into its constituent frequencies. This can be seen in Figure 2.1, where a sound signal is shown on the left and its Fourier transform on the right. Every bar in the Fourier transform represents one of the frequencies that make up the sound signal (Nair, 2018). The length of a bar correlates with how strongly the respective frequency is present in the sound signal: longer bars mean the frequency is more prominent.

With the Fourier transform calculated, the next step towards the MFCC is to take the logarithm of the magnitude of this transform, and then to take the spectrum of this log via a cosine transform. The resulting spectrum is called the cepstrum.

The "Mel" in MFCC refers to the Mel scale, a scale for adjusting frequency relative to human interpretation of the sound. Since frequency is perceived on a logarithmic scale, a difference of 100 Hz is not the same to the human ear at every pitch level. For example, a difference of 100 Hz constitutes an entire octave when it is the difference between 100 and 200 Hz, but it is about a twelfth of an octave when it constitutes the difference between 1300 and 1400 Hz. The Mel scale aims to capture these changes and relate the measured frequency to the perceived frequency. Frequency in Hz is converted to the Mel scale by using Equation 2.1, where m and f stand for mels and frequency in Hz respectively, mels being the unit of measure for the Mel scale.

\[
m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (2.1)
\]

As can be seen in Figure 2.2, the MFCC has three dimensions. The x-axis relates to the time of an audio fragment, the y-axis relates to the MFCC coefficients, and the z-axis tells us how intense the sound is for that coefficient at that time. An MFCC coefficient is nothing more than a range of Mel frequencies. In Figure 2.2, every coefficient could represent a range of 200 Mel. This would mean that the first coefficient represents the range of 0-200 Mel (roughly 0-136 Hz), the second coefficient represents 200-400 Mel (roughly 136-298 Hz), the third coefficient 400-600 Mel (roughly 298-492 Hz), and so on.
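To make the conversion concrete, here is a small Python sketch of Equation 2.1 and its inverse, used to compute the approximate Hz edges of the hypothetical 200-Mel-wide bands mentioned above (the function names are mine and not part of the thesis):

```python
import numpy as np

def hz_to_mel(f):
    """Equation 2.1: convert frequency in Hz to mels."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of Equation 2.1: convert mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Approximate Hz edges of hypothetical 200-Mel-wide coefficient bands.
edges_mel = np.arange(0, 1001, 200)       # 0, 200, ..., 1000 Mel
print(np.round(mel_to_hz(edges_mel)))     # approx. [0, 136, 298, 492, 724, 1000] Hz
```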

The colour of the time frames for each coefficient represents the intensity of that frequency range at that time. A dark red colour in the fifth frequency range at time frame 200 means that the intensity of 740-1000 Hz frequencies is high. In other words, there is relatively much auditory information present at those frequencies. MFCCs store the frequency, or pitch, of audio files. The speed of speaking can be recovered by looking at the change over time in the MFCC. In this project, MFCCs will be used as input representations of the audio files.

2.3 Autoencoder

The goal of this project is to replicate the way in which unborn infants learn to distinguish languages. Since infants learn to speak a language by hearing it repeatedly, it is plausible that auditory information leaves its mark in the infant brain in the form of basic patterns for a language, and that these patterns are in turn used for reproducing language. A possible connection between these patterns and the goal of the project is that it should be theoretically possible to teach an AI these patterns and have it reproduce sentences of a language. This could be done with autoencoders, which is what this project will attempt.

Autoencoders are NNs that specialise in encoding data automatically. In theory, they can encode any type of numeric data. In this sense, "encoding" means compressing the data to make it as small as possible, while retaining as much information as possible. The goal of autoencoders is to make their output resemble their input as closely as possible, which forces them to learn a compressed representation of the input. They consist of two parts: an encoding part and a decoding part.

Figure 2.3: A representation of an autoencoder (from towardsdatascience.com).

The encoding part of the autoencoder is a regular neural network of any type. For example, to encode pictures of cats with an autoencoder, a Convolutional Neural Network (CNN, section 2.4) would be the most useful choice for the encoding part. The general structure of the encoding part is that of a bottleneck. It starts off with as many neurons as there are inputs, and then has fewer and fewer neurons in each subsequent layer, until a size is reached where the information loss is minimal while the compression, i.e. the ratio between the input layer and the smallest layer, is as large as possible.

Next up in the autoencoder structure is the decoding part. In essence, the decoding part of the autoencoder is a mirrored version of the encoding part. Its input is the smallest and last layer of the encoding part, after which the layers become progressively larger, mirroring the encoding NN, until finally the input size of the encoding part has been reached. This way, the structure of an autoencoder resembles an hourglass, with equal input and output layer shapes, while the middle is much smaller so as to allow for compression of the input, as can be seen in Figure 2.3.

The way autoencoders are trained is by providing a file as input, and setting the same file as the target output. The autoencoder should then automatically learn the underlying patterns in the input file and use those to encode the input as well as possible. Like other types of NNs, this requires large amounts of data to train well.
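To make the hourglass structure and the input-equals-target training concrete, the following is a minimal Keras sketch of a generic fully-connected autoencoder. This is purely illustrative: the layer sizes are arbitrary and this is not the architecture used later in this thesis (see section 3.3.2).

```python
from tensorflow.keras import layers, models

# A toy fully-connected autoencoder: 64 inputs squeezed to 8 values and back.
autoencoder = models.Sequential([
    layers.Input(shape=(64,)),
    layers.Dense(32, activation='relu'),    # encoder
    layers.Dense(8, activation='relu'),     # bottleneck (compressed representation)
    layers.Dense(32, activation='relu'),    # decoder mirrors the encoder
    layers.Dense(64, activation='linear'),  # output has the same shape as the input
])
autoencoder.compile(optimizer='adam', loss='mse')

# Training: the input is also the target output.
# autoencoder.fit(x_train, x_train, epochs=10, validation_data=(x_val, x_val))
```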

2.4 Convolutional Neural Networks

Figure 2.4: An example of the first step of a convolution (from towardsdatascience.com).

By using a Long Short-Term Memory (LSTM, section 2.5), the data is interpreted as a series of data points with time intervals between them. However, it is also possible to interpret the data in the MFCC as if it were an image rather than a time series, containing several visual features. One such feature could be a pitch accent in someone's voice, represented by a sort of parabola in the MFCC, going from the lower frequency bands to the higher frequencies and then down again. If such features could be captured and encoded, the autoencoder might be able to determine which features are unique to a language or rhythmic class.

One way to learn patterns in image data is by using Convolutional Neural Networks (CNNs). These are specialized NNs that can use convolutions to detect patterns in images.

CNNs work with different layers. The first layer used is a convolution layer. This is the layer that applies convolutions to the input image. A convolution is essentially the summation of the result of a pointwise matrix multiplication between the input image and a convolution kernel, which is another matrix. Figure 2.4 contains a visualisation of a convolution with a small matrix. The convolution kernel is the core of the CNN. A CNN contains many such kernels in a layer. One kernel in the first layer could represent a line going from the lower left of a picture to the upper right, another could represent a semi-circle, and the next convolution layer could combine the two into a parabola.
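As a small illustration of a single convolution step, the sketch below multiplies a made-up 3x3 image patch pointwise with a made-up 3x3 kernel and sums the result; the values are arbitrary and only serve to show the operation.

```python
import numpy as np

patch = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])   # a 3x3 region of the input image
kernel = np.array([[0, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0]])  # a 3x3 convolution kernel

# One convolution step: multiply pointwise, then sum the products.
value = np.sum(patch * kernel)
print(value)  # 1, the output pixel for this kernel position
```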

Convolution layers are usually paired with a max-pooling layer, which is a layer that takes the maximum value in an n*m neighbourhood. This is done to reduce the dimensions of the image, making it easier to extract features relevant to the identification task at hand. Such a max-pooling layer will then be followed by another convolution layer, then another max-pooling layer, and so on.

The expectation of using CNNs on MFCCs is that the CNNs will uncover some of the underlying structures of the MFCCs related to the prosody of the languages studied. If this succeeds, it may become possible to group languages based on what rhythmic class they belong to by using the structures extracted by the CNNs.

2.5 Long Short-Term Memory

Figure 2.5: A representation of an LSTM cell. The numbers are referred to and explained in section 2.5 (from colah.github.io).

Since the data for this project is speech data, the autoencoder will have to use NNs that can capture change in a signal over time. This is because speech data, unlike for example visual data, is sequential, i.e. it has a specific order that is important to the meaning of the data. As such, it is crucial for an NN to be able to cope with sequential information. A type of NN that is able to do this is the Recurrent Neural Network (RNN).

Traditional, basic NNs work by mapping input to output by fitting a complex function to the data. RNNs, on the other hand, not only take into account the current input, but also the previous input. In other words, rather than having connections only between the layers like basic NNs do, RNNs have connections within the layers as well as between the layers. As such, to produce an output for a node, it considers the input for that node as well as the output of the previous node. In this way, RNNs become quite good at handling sequential data.

A problem with RNNs is that only nodes in the same layer that are directly next to one another are connected. In practice, this means RNNs have the capacity to handle sequential data, provided it is not too complex. However, when it comes to longer data sequences, RNNs do poorly: as the gradient that is passed back through the network shrinks, RNNs become unable to learn from distant timesteps. As such, they are not able to capture long-term relations in the data. This is where Long Short-Term Memory (LSTM) networks come into play.

LSTMs were conceived by Hochreiter and Schmidhuber (1997) in an attempt to create a neural network that can process longer time series than regular RNNs are capable of. LSTMs are essentially RNNs with long-term memory capabilities. Where the RNN node has two inputs, namely the data input and the previous node input, an LSTM cell has three inputs, as can be seen in Figure 2.5. One of the inputs is the original data input, while the other two inputs come from the previous cell.

When looking at Figure 2.5, the upper input handles the cell state. Depending on the new data input and the lower previous cell input, the upper input can pass information from cell to cell. In this way, the upper line acts as the long-term memory of an LSTM layer that can be modified by both other inputs.

The lower left input of Figure 2.5 handles input that comes from the previous cell in a layer, much like a traditional RNN would. It traverses the LSTM cell together with the bottom input, which comes from either the raw data or from a previous layer. Together, they can modify the long-term memory if this is necessary. In addition, they are modified by the long-term memory as well.

The numbered parts in Figure 2.5 represent gates of the LSTM cell. Gate 1 is the forget gate of the cell. The input from the previous hidden state, together with the current input, passes through a sigmoid function, which scales the data between 0 and 1. The resulting matrix then meets the cell state and modifies it, essentially deciding which parts of the long-term memory are relevant and which parts should be forgotten.

Gate 2 is the input gate. The input from the previous hidden state and the current input are both passed through a sigmoid function and a tanh function, the latter of which scales the data between -1 and 1. The results of both functions are then combined via pointwise multiplication in the input gate. After this, the input gate data meets the forget gate data at gate 3, updating the cell state via pointwise addition.

The fourth gate is the output gate, which determines what information from this cell will be passed on to the next layer as well as to the next hidden state in the current layer. It is acquired by squeezing the previous hidden state input and the current input with a sigmoid function and then doing a pointwise multiplication with the cell state, which will at that point have been squeezed by a tanh function. The result of this is the output of the LSTM cell.
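For reference, the standard LSTM update equations corresponding to the four gates described above can be written as follows (this notation follows common presentations of Hochreiter and Schmidhuber (1997) and does not appear in the thesis itself); \(\sigma\) denotes the sigmoid function and \(\odot\) pointwise multiplication:

\[
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(1) forget gate} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right), \quad \tilde{C}_t = \tanh\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{(2) input gate} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(3) cell-state update} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right), \quad h_t = o_t \odot \tanh(C_t) && \text{(4) output gate}
\end{aligned}
\]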


Chapter 3

Method

For this thesis, four different autoencoders have been created, one for each language: English, Dutch, Spanish and Italian. The reason for this is that according to previous research (Nazzi et al., 1998), Dutch and English are difficult for infants to tell apart, as are Spanish and Italian. This means that, first of all, auditory data has to be acquired in each of the four languages, and it has to be processed to ensure it can be used as input for learning algorithms. Secondly, learning algorithms must be trained with the gathered data, one for each of the languages. This means there will be four algorithms, each representing an infant whose mother tongue is one of the four languages. Finally, data from all of the languages will have to be put through all of the networks. For example, the network trained on English will be exposed to Italian, Spanish and Dutch, to simulate an infant hearing a foreign language, as well as to English itself, to create a baseline.

3.1 Data

The data used in this project has been obtained by downloading the English, Dutch, Spanish and Italian datasets from the Common Voice dataset. These datasets differ considerably in size. The English dataset is 38 GB, whereas the others are much smaller: 5 GB for Spanish, 3 GB for Italian and only 884 MB for the Dutch dataset.

The datasets consist of .mp3 speech fragments, with each entry comprising one sentence. In addition, each dataset contains two files, which list all validated sentences and all invalidated sentences respectively. For every clip, these files contain the following:


• The number of upvotes the entry has received: an upvote means a contributor to the dataset has validated the clip as correct and representative of the written sentence. A clip needs at least two upvotes in order to be considered validated.

• The number of downvotes: if the number of downvotes is equal to or greater than the number of upvotes, the clip is invalidated.

• The age group of the speaker, in increments of ten years, i.e. twenties, thirties, etc.

• The gender of the speaker.

• The accent of the speaker.

Note that the last three properties can be empty, because it is not required to specify age, gender or accent in order to be able to contribute to the dataset. For this project, only validated data was used.

3.2 Data Preprocessing

Having acquired the data, it needs to be processed to be usable by the NNs. There are several steps to this: first, selecting and splitting the data to balance the datasets; next, applying a low-pass filter to simulate a womb-like environment; third, creating MFCCs from the clips; after that, zero-padding the data to mitigate the difference in length between clips; and finally, removing outliers from the data.

3.2.1 Selection and Split

Firstly, the datasets had to be made equal in size, which was necessary due to the large differences between the original datasets. For this, a Python function was written that analysed the datasets and tried to balance the gender and age data, ensuring the division in gender was about 50/50 and the division across age groups was equal as well. Having done this, all datasets were scaled down to around 4000 entries each. They were also split into train, test and validation sets with a ratio of 70/15/15.
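A sketch of how such a selection and split could look is given below. The exact balancing logic of the original function is not described in detail, so the grouping by gender and age, the helper name, and the use of pandas and scikit-learn are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def select_and_split(tsv_path, n_total=4000, seed=42):
    """Sample a roughly gender/age-balanced subset and split it 70/15/15."""
    df = pd.read_csv(tsv_path, sep='\t')
    # Draw an equal number of clips from every (gender, age) group where possible.
    groups = df.groupby(['gender', 'age'], dropna=False)
    per_group = max(1, n_total // len(groups))
    balanced = groups.apply(lambda g: g.sample(min(len(g), per_group), random_state=seed))
    balanced = balanced.reset_index(drop=True)

    # 70% train, then split the remaining 30% evenly into validation and test.
    train, rest = train_test_split(balanced, test_size=0.30, random_state=seed)
    val, test = train_test_split(rest, test_size=0.50, random_state=seed)
    return train, val, test
```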

3.2.2 Low-pass Filter

After the selection and splitting of the data, all clips had to be converted to .wav files in order for them to be compatible with the Python library SciPy (scipy.org), which was used to process the files further. SciPy was employed to apply a low-pass filter to the data. Such filters eliminate all frequencies above a certain threshold, allowing only low frequencies to pass through. Low-pass filtering is essential for what this project is trying to achieve, which is simulating infants learning to recognise languages while in the womb. It is generally accepted that the womb acts as a low-pass filter with a cutoff frequency of around 400 Hz (Burnham, Kitamura, & Lancuba, 1999; Nazzi et al., 1998). This frequency will therefore be used to filter the speech clips, ensuring they mimic what an unborn child would hear. For applying this filter to the clips, the Butterworth filter from the scipy.signal library will be used.
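A minimal sketch of this filtering step is shown below. The use of scipy.signal.butter and the 400 Hz cutoff follow the description above; the filter order and the use of filtfilt (zero-phase filtering) are assumptions, and the clip is assumed to be mono.

```python
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

def lowpass_400hz(wav_path, out_path, cutoff=400, order=5):
    """Apply a Butterworth low-pass filter to roughly mimic the womb environment."""
    sr, audio = wavfile.read(wav_path)              # assumes a mono .wav clip
    b, a = butter(order, cutoff, btype='low', fs=sr)
    filtered = filtfilt(b, a, audio.astype(float))  # forward-backward filtering
    wavfile.write(out_path, sr, filtered.astype('int16'))
```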

3.2.3 MFCC

Next up, the filtered data has to be turned into a format that is easily interpretable by a machine learning algorithm. For this, the raw .wav format will not do, since it is too large to allow for fast computation. An MFCC (section 2.2) provides a manageable number of relevant features. The Python library librosa (librosa.org) was used to extract the MFCCs from the audio files. In this case, the MFCC contained 12 frequency bands, or MFCC coefficients. The standard sampling rate of 22050 Hz was used. As such, the MFCCs produced by the function librosa.feature.mfcc had a dimension of (file_length x 12). For perspective, the largest MFCC in the English dataset had a size of (315 x 12).
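A minimal sketch of this extraction step, under the settings described above (12 coefficients, 22050 Hz sampling rate), could look as follows; the transpose to a (frames x 12) layout matches the dimensions reported in this section.

```python
import librosa

def clip_to_mfcc(wav_path, n_mfcc=12):
    """Load a clip at 22050 Hz and return its MFCC as a (frames x 12) array."""
    y, sr = librosa.load(wav_path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (12, n_frames)
    return mfcc.T                                           # shape (n_frames, 12)
```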

3.2.4 Zero-padding

The algorithms chosen require that all sentences be equal in length. In the case of the chosen dataset, audio files differ in length, and thus so do their respective MFCCs. To mitigate this mismatch, all audio files were padded with zeroes at the front until they had the same length as the longest audio clip. However, it was later discovered that the autoencoders would require all clips of all languages to be the same length, since all languages had to fit all autoencoders. As such, the longest clip over all languages was determined to have a length of 350 timesteps, and so all clips were padded to that length.

This zero-padding takes place at the beginning of sentences. The beginning of sentences does not contain as much distinctive prosodic information, while the ending usually contains properties that are unique to the rhythmic class.
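A sketch of this front-padding step, assuming clips are already represented as (frames x 12) MFCC arrays and are at most 350 frames long:

```python
import numpy as np

def pad_front(mfcc, target_len=350):
    """Zero-pad an MFCC of shape (frames, 12) at the front to a fixed length."""
    n_missing = target_len - mfcc.shape[0]   # assumes frames <= target_len
    return np.pad(mfcc, ((n_missing, 0), (0, 0)), mode='constant')
```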

3.2.5 Outliers

Outliers are considered to be any data point of which there was only one at the current timestep, or that had no data points at the neighbouring timesteps. When it comes to training the NNs in the autoencoder, it is crucial that the lengths of the data are as equal as possible. A Python function was written that discards outliers. Using this function, it was possible to reduce the maximum data length by 14 timesteps while keeping 98% of the data, and at the same time removing what can reasonably be considered all outliers.
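The rule above leaves some room for interpretation. One possible reading, sketched below, treats the distribution of clip lengths as a histogram and drops clips whose length either occurs only once or has no neighbouring lengths in the dataset; the function and the exact criterion are assumptions for illustration.

```python
from collections import Counter

def drop_length_outliers(clips):
    """Drop clips whose length is unique, or isolated in the length histogram.

    `clips` is a list of MFCC arrays of shape (frames, 12).
    """
    counts = Counter(c.shape[0] for c in clips)

    def is_outlier(length):
        # Only one clip of this length, or no clips with a neighbouring length.
        return counts[length] == 1 or (counts[length - 1] == 0 and counts[length + 1] == 0)

    return [c for c in clips if not is_outlier(c.shape[0])]
```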

3.3 Algorithm

3.3.1 Goal

As stated before, the autoencoders that will be created each represent an infant with one of the four aforementioned languages as its mother tongue. The goal is to train all of them on their respective languages, and then to use each of the autoencoders to predict languages they have not been trained with. After this, the results will be compared and a conclusion can be drawn. There will be two different sets of autoencoders, one made with LSTMs and one with CNNs, for eight autoencoders in total. This also means there will be two different sets of results, since the two types of autoencoders cannot be compared directly due to the difference in architecture.

3.3.2 Architecture

The autoencoders will be created with the Keras (keras.io) Python library, which is a wrapper around TensorFlow (tensorflow.org). First up is the explanation of the autoencoder that uses CNNs, followed by the autoencoder that employs LSTMs.

CNN autoencoder

The autoencoder that uses CNNs (section 2.4) has four layers in the encoding part and five layers in the decoding part. Keras’ convolution layers require the data to be three-dimensional, colour being the third dimension. This is because CNNs are used to deal with coloured pictures. In order to satisfy this requirement, the numpy (numpy.org) function np.repeat was used to place all individual values into separate arrays, reshaping the originally two-dimensional array into a three-dimensional array. As a result, the shape of the data is (350 x 12 x 1).

The encoding part of the CNN autoencoder begins with a 2D convolution layer that has a kernel size of (10 x 3). Its activation is the relu function, a function that allows positive values to pass through unrestricted while converting negative values to zero. To ensure the data keeps the same shape after the convolution layer, the padding of the layer is set to 'same', which means zero-padding takes place. This does slightly distort the meaning of the data, as the zeroes are actually fake data points, but in this way the dataset becomes much easier for the autoencoder to manage.

The second layer of the CNN encoder is a max-pooling layer. This is a layer that simply takes the maximum value in a specified neighbourhood. In this case, the neighbourhood is set to (2 x 2). As such, this layer reduces the data to a quarter of its original size. After this layer, there is another convolutional layer with the same specifications as the first convolutional layer, and then another max-pooling layer with its neighbourhood set to (5 x 2). This layer concludes the encoding part of the autoencoder. As such, the encoded size of the data is (35, 3, 8).

The decoding part of the autoencoder starts with a convolutional layer that is the same as both other convolutional layers. It is followed by an upsampling layer, which does the opposite of what a max-pooling layer does: rather than reducing the data in size, it repeats every data point over a specified neighbourhood. In this case, the neighbourhood is (5 x 2), which means that every value in the fully encoded layer of the autoencoder appears 10 times after this layer.

The upsampling layer is followed by another convolutional layer that is the same as every other. This layer is followed by an upsampling layer that upsamples in a neighbourhood of (2 x 2), quadrupling the size of the data resulting from the previous layer. The final layer of the decoding part is a convolutional layer. Rather than the relu function, this layer has a tanh as activation, which scales data between -1 and 1. The result of this layer is another (350 x 12 x 1) array. Having created the network architecture, it was possible to create four different autoencoders with this architecture, one for each language used, i.e. English, Dutch, Italian and Spanish. These were subsequently evaluated by passing the data of all languages through all autoencoders.
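A minimal Keras sketch consistent with the CNN autoencoder described above is given below. The number of filters per layer is not stated explicitly in the text, so eight filters per convolution layer are assumed (matching the reported encoded shape of (35, 3, 8)); the optimizer is also an assumption, while the MSE loss follows section 4.1.

```python
from tensorflow.keras import layers, models

def build_cnn_autoencoder():
    model = models.Sequential([
        layers.Input(shape=(350, 12, 1)),
        # Encoder: conv, pool, conv, pool.
        layers.Conv2D(8, (10, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(8, (10, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((5, 2)),                                   # encoded shape: (35, 3, 8)
        # Decoder: conv, upsample, conv, upsample, conv.
        layers.Conv2D(8, (10, 3), activation='relu', padding='same'),
        layers.UpSampling2D((5, 2)),
        layers.Conv2D(8, (10, 3), activation='relu', padding='same'),
        layers.UpSampling2D((2, 2)),
        layers.Conv2D(1, (10, 3), activation='tanh', padding='same'),  # back to (350, 12, 1)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model
```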

LSTM autoencoder

The autoencoder that uses LSTMs (section 2.5) has six layers in total, all of which are Keras’ LSTM layers. LSTM layers require a two-dimensional input. This suits the data, since it has a shape of (350 x 12), and thus it was possible to train the LSTM autoencoders without having to modify the shape of the data.

The first LSTM layer reduces the shape of the data from (350 x 12) to (350 x 10). It uses a tanh activation, to ensure the data is between -1 and 1. The next layer reduces the y-dimension of the data to 8, also with a tanh activation. The layer after that reduces the y-dimension to 6. This concludes the encoding part of the LSTM autoencoder. It should be noted that, as the LSTM layers are succeeded by other LSTM layers, their variable "return_sequences" has been set to True, so as to allow the timesteps to be passed on by every LSTM layer as well. The decoding part of the autoencoder mirrors the encoding part. All activations are tanh, and the layer sizes are reversed. The first decoding layer upscales the data shape to (350 x 8), the second upscales the y-dimension to 10, and the last layer of the LSTM autoencoder ensures the data shape is (350 x 12) again. This results in a network whose input and output both have a data shape of (350 x 12).

All LSTM autoencoders were trained for 30 epochs, i.e. passes through the dataset. Beyond this point, additional training yielded no significant improvement.
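A minimal Keras sketch consistent with the LSTM autoencoder described above follows; the optimizer is an assumption, the MSE loss follows section 4.1, and the 30 training epochs follow the paragraph above.

```python
from tensorflow.keras import layers, models

def build_lstm_autoencoder():
    model = models.Sequential([
        layers.Input(shape=(350, 12)),
        # Encoder: y-dimension 12 -> 10 -> 8 -> 6.
        layers.LSTM(10, activation='tanh', return_sequences=True),
        layers.LSTM(8, activation='tanh', return_sequences=True),
        layers.LSTM(6, activation='tanh', return_sequences=True),   # encoded shape: (350, 6)
        # Decoder mirrors the encoder: 6 -> 8 -> 10 -> 12.
        layers.LSTM(8, activation='tanh', return_sequences=True),
        layers.LSTM(10, activation='tanh', return_sequences=True),
        layers.LSTM(12, activation='tanh', return_sequences=True),  # back to (350, 12)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Training on one language's padded MFCCs, with the input also used as the target:
# model.fit(x_train, x_train, epochs=30, validation_data=(x_val, x_val))
```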


Chapter 4

Results

4.1 CNN autoencoder

Since all data was the same length due to zero-padding (subsection 3.2.4), it could be passed through every autoencoder without any problems. This was done by using the model.evaluate function provided by Keras. This function returns the loss acquired by running data through the algorithm. In this case, the loss function used was the Mean Squared Error (MSE) function, the loss representing the difference between the desired result and the actual result. As such, it is desirable to have a loss that is as low as possible.
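A sketch of this cross-evaluation loop is shown below; `autoencoders` and `test_sets` are hypothetical dictionaries mapping language codes to the trained models and to the preprocessed test data, respectively.

```python
# Hypothetical containers: autoencoders = {'EN': model_en, ...}, test_sets = {'EN': x_en, ...}
losses = {}
for model_lang, model in autoencoders.items():
    for data_lang, x_test in test_sets.items():
        # MSE loss of reconstructing data_lang clips with the model_lang autoencoder.
        losses[(model_lang, data_lang)] = model.evaluate(x_test, x_test, verbose=0)
```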

                      Autoencoder
Input language      EN     DU     IT     SP
EN                1.00   0.78   0.99   1.05
DU                0.95   0.68   0.87   0.96
IT                1.37   0.91   1.18   1.21
SP                1.39   0.95   1.21   1.23

Table 4.1: The MSE loss of passing all data through all CNN autoencoders (values × 10⁻³).

The results of this first experiment can be seen in Table 4.1. Two things are striking. The first is that, in all cases, English and Dutch seem to do better than Spanish and Italian. In other words, even the infants that have been trained on Spanish or Italian are better at recognising English and Dutch than they are at recognising their own mother tongue or the other language of their own rhythmic class. This is not in line with the predictions of this thesis, which were based on the observations by Nazzi et al. (1998).

The second striking observation is that Dutch appears to be both the language that is easiest for every autoencoder to recognise, as well as the autoencoder that has the lowest loss on all of the input languages. In terms of infants, this means that Dutch infants would have learned Italian prosody better than Italian infants would have learned Italian prosody, while the Dutch infants have never been exposed to Italian speech before. This seems like an unlikely thing to happen in practice.

4.2 LSTM autoencoder

The results of the LSTM autoencoders were gathered in the same way as was done with the CNN autoencoders. The overall picture of the LSTM results is similar with respect to Dutch performing well on every language's autoencoder, but slightly different when it comes to the performance of language groups on autoencoders. As can be seen in Table 4.2, English language input now performs about as well as Italian and Spanish when run through any autoencoder. This is in contrast with the performance of English input on the CNN autoencoders (section 4.1), where it performed slightly better than both Italian and Spanish, but worse than Dutch.

                      Autoencoder
Input language      EN     DU     IT     SP
EN                0.96   0.70   1.06   1.00
DU                0.82   0.51   0.86   0.82
IT                1.00   0.68   1.00   1.00
SP                1.05   0.73   1.05   1.04

Table 4.2: The MSE loss of passing all data through all LSTM autoencoders (values × 10⁻³).


Chapter 5

Discussion

The expectation of both the CNN and the LSTM autoencoders was that there would be some visible distinction between how they processed languages of different rhythmic classes, such as having all languages perform well on their own autoencoders, but poorly on those of other rhythmic classes. In reality, such a distinction did occur with the CNN autoencoder, at least for half of the languages. For the LSTM autoencoder, the only significant distinction was made by the Dutch language as well as the Dutch autoencoder.

5.1 Interpretation of the results

For the CNN autoencoders, a separation in the results occurred between the rhythmic class of English and Dutch on one side, and the class containing Italian and Spanish on the other. All of the autoencoders did worse when it came to encoding and decoding Italian and Spanish than they did for Dutch and English. For the English autoencoder, the average loss on English and Dutch was 0.975, whereas the average loss on Italian and Spanish was 1.38. This means that on average, English and Dutch had 70% of the loss Italian and Spanish had after being passed through the English autoencoder. For the Dutch autoencoder, the average loss of English and Dutch compared to Italian and Spanish was 79%; for Italian it was 78%, and for Spanish 82%. From this, it seems that the English autoencoder is the best at telling the two language groups apart, whereas the others are almost equally adept at this.

The LSTM autoencoders had a quite different outcome compared to the CNN autoencoders, in the sense that only the Dutch autoencoder and language differed from the others by a significant margin.


(a) A prediction of the CNN autoencoder. (b) A prediction of the LSTM autoencoder.

Figure 5.1: Two predictions of the same MFCC frequency band done by the two English autoencoders. The predicted data is also English. The blue line represents the original data while the orange line represents the prediction.

5.2 Possible reasons

5.2.1 Complexity

The expectation was that there would be a clear distinction in the loss of both rhythmic groups, with Italian and Spanish performing well on the Italian and Spanish autoencoders while doing poorly on the Dutch and English autoencoders, and vice versa for Dutch and English. There are several possible reasons why the results differ from this expectation. The first possibility is that the model throws away too much information. Looking at Figure 5.1a, this does seem to be the case for the CNN autoencoder. There is a lot of oscillation in the actual data, but the prediction does not model this oscillation nearly as much as it should. The autoencoder has not learned any patterns important to the prosody of the language it is trying to predict, and only models the general curvature of the pitch in the data, as can be seen in Figure 5.1a. This could be because the model is not complex enough, and as such does not have enough layers to model more complex features in the data. This could be solved by adding more layers; however, that falls outside the scope of this project.

5.2.2 Zero-padding

Figure 5.1b looks markedly different from Figure 5.1a. The main difference is that the prediction made by the English LSTM autoencoder consists of two straight lines, one modelling the part where the data is zero and one modelling the part where it is not. For the non-zero part of the data, the autoencoder seems to approximate the average value of the data, rather than any specific features. It looks like the LSTM autoencoders have learned to model the presence of sound, no matter what kind of sound there is.

A reason for this behaviour could be the difference in length of the data. The average length in recorded time frames is 105 units for the Dutch dataset, 140 for the English dataset, 157 for the Spanish dataset and 169 for the Italian dataset. There is a large discrepancy between the average clip length of the Dutch dataset and all others. This could very well account for the difference in performance of the Dutch data relative to the others: due to zero-padding, the Dutch dataset contains on average 245 zeroes per clip, whereas the Italian dataset contains only 181. As can be seen in Figure 5.1, zeroes are easy for the autoencoders to predict. As such, the first part of every clip in the Dutch dataset, being all zeroes, is much easier to model than in the other datasets. An easy fix to this problem could be to not employ zero-padding on the data; however, both CNNs and LSTMs require the data to be of a predefined size, which means it is impossible to train them without zero-padding. It would be possible to create a different autoencoder for every unique data length, but this would be a very arduous process. In addition, a lot more data would be needed, because many more autoencoders would have to be trained. Considering the small size of the Dutch dataset relative to the other datasets, this would exclude Dutch from the experiment, which is not an acceptable outcome.

Another option that avoids zero-padding would be to truncate the beginning of the data until a specified common length has been reached. The objection to this is that it removes data that is potentially important to retain, since the project is about learning prosody, which requires a complete sentence structure.

5.2.3 Accents

A final reason the experiment turned out the way it did could be the accents of the speakers, although this is unlikely. The dataset contains data from speakers all over the world. As such, the accents present in the data are not from just one country. It is possible that certain accents of the languages used have prosodic elements that differ from other accents within the same language. This would mean that the autoencoders get mixed signals from the data, almost as if they were learning two languages at once. The dataset does record the accent of the speaker, so it would in theory be possible to exclude certain accents from training. However, in practice, I think the effect of accents on the overall performance of the autoencoders is negligible, and I therefore decided not to treat different accents separately.


Chapter 6

Conclusion

The original objective of this thesis was to replicate the experiments done by Nazzi et al. (1998) with autoencoders rather than infants. The reason for this was that it would be interesting to see whether autoencoders could learn to recognise a specific rhythmic class when trained on audio clips similar to the sound infants hear while in the womb. In addition, using autoencoders is interesting because they can be decomposed to learn which parts of speech are important for making such distinctions. Data similar to speech heard in the womb was created by applying low-pass filters to sound fragments of people speaking sentences in English, Dutch, Italian and Spanish. After several other preprocessing steps, including padding the audio with zeroes to ensure all clips were the same length, the autoencoders were trained. There were two autoencoders for each language, one that employed CNNs and one that employed LSTMs.

Having trained them, they were evaluated by attempting to encode and decode clips from every language with every autoencoder. This was done to simulate infants hearing both their mother tongue and a foreign language. The CNN autoencoders all learned to follow the overall pitch of the sentences quite well, but failed to learn anything specific to the prosody of the languages they were trained with. As such, they were determined to be unable to tell rhythmic classes apart. One of the probable reasons for this is that they were simply not complex enough to model the nuances of the prosody of languages.

The LSTM autoencoders also failed to model the languages' prosodic features. Instead, they modelled the zero-padding at the beginning of the sentence, and an average of the overall pitch for the non-zero values in the audio files. One of the probable reasons for this is that there were too many zeroes, and thus the LSTM autoencoders considered them to be among the most defining features of the data. While the CNNs did model the general curvature of the data, elements of following the zero-padded parts of the data could also be observed there, as the CNN predictions did not oscillate as much as the non-zero data did. The problems concerning zero-padding could not be prevented, as zero-padding was necessary to allow the data to be used with LSTMs and CNNs, which require a fixed data size.

To conclude, this project has not achieved its original goal, which was to train autoencoders to tell apart languages belonging to different rhythmic groups. However, the problems standing in the way of this goal have been identified, and these can be taken into account in future research.


References

Abelin, Å., & Allwood, J. (2000). Cross linguistic interpretation of emotional prosody. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion.

Burnham, D., Kitamura, C., & Lancuba, V. (1999). The development of linguistic attention in early infancy: The role of prosodic and phonetic information. In Proceedings of the 14th International Congress of Phonetic Sciences (pp. 1197-1200).

Delattre, P., Olsen, C., & Poenack, E. (1962). A comparative study of declarative intonation in American English and Spanish. Hispania, 45(2), 233-241.

Fujisaki, H. (1997). Prosody, models, and spontaneous speech. In Y. Sagisaka, N. Campbell, & N. Higuchi (Eds.), Computing prosody: Computational models for processing spontaneous speech (pp. 27-42). New York, NY: Springer US. doi: 10.1007/978-1-4612-2258-3_3

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., & Amiel-Tison, C. (1988). A precursor of language acquisition in young infants. Cognition, 29(2), 143-178.

Moon, C., Cooper, R. P., & Fifer, W. P. (1993). Two-day-olds prefer their native language. Infant Behavior and Development, 16(4), 495-500.

Nair, P. (2018). The dummy's guide to MFCC. Retrieved 2020-06-16, from https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd

Nazzi, T., Bertoncini, J., & Mehler, J. (1998). Language discrimination by newborns: Toward an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance, 24(3), 756.

Nespor, M., Shukla, M., & Mehler, J. (2011). Stress-timed vs. syllable-timed languages. The Blackwell Companion to Phonology, 1-13.

Roach, P. (1982). On the distinction between 'stress-timed' and 'syllable-timed' languages. Linguistic Controversies, 73, 79.
