

Using auto-encoders to classify languages based on prosody

Replicating language classification in infants

Remco F. Jacobs
11860316

Bachelor thesis
Credits: 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: T.O. Lentz PhD
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Abstract

Prior research has shown that infants are able to discriminate between languages based on prosodic cues just after leaving the womb. This thesis looks at the possibility of replicating the prosodic grouping of spoken language as done by infants. Auto-encoders are trained to represent speech, which should in theory lead to a representation that also incorporates prosody. This representation is used to categorize English, Dutch, Spanish and Italian. The reconstruction accuracies of the auto-encoders provide no significant evidence of the language classification that infants display. It is concluded that auto-encoders do not contain the necessary architecture to replicate mental processes with regard to spoken language.


Contents

1 Introduction
2 Literature Review
  2.1 Prosodic language classification
    2.1.1 Prosody classification by infants
    2.1.2 Prosody
  2.2 Technical background
    2.2.1 MFCC
    2.2.2 Auto-encoders
    2.2.3 LSTMs
3 Methodology
  3.1 Data
  3.2 Preprocessing
  3.3 Modelling
4 Results
5 Discussion
  5.1 Interpretations of the results
  5.2 Limitations
  5.3 Future research
  5.4 Conclusion


1 Introduction

Before birth, infants are already exposed to sounds such as language. Before these sounds reach the child, they have to pass through the womb. This filters out some of the information, but prosodic elements, such as intonation, tone, stress and rhythm, still make it through.

Previous research has established that infants are able to learn these prosodic elements before birth, by listening to their parents. Other studies have found that newborn brains use prosodic information to group sounds and lay a foundation for language development. Mehler et al. (1988) observed that languages can be classified based on their rhythm, which is called prosodic grouping. Although the phenomenon itself is well established, much remains unknown about what causes prosodic grouping.

To shed more light on how infants might extract prosodic information, this study will try to model infants' exposure to language in the prenatal phase. This will be done by using auto-encoders to try to capture the prosodic elements of spoken language without human supervision. The research question is: "Can we replicate the distinctions infants make within linguistic classes and, by doing so, gain more insight into the patterns that they may use to distinguish languages?"

The languages being used are Dutch, English, Spanish and Italian, as Nazzi et al. (1998) have shown that infants are able to distinguish English and Dutch from Spanish and Italian. Auto-encoders will be used to represent utterances from each language in a simplified way and to reconstruct them from this representation. The auto-encoders will be made up of multiple LSTM layers. Each trained auto-encoder will then be applied to the other languages to test whether or not the encoders have captured prosodic features of their original languages. It is expected that reconstruction will work on languages that belong to the same prosodic group but will not work for languages that belong to different prosodic groups.

The remainder of this thesis is divided into four chapters. First, a literature review will be provided in which background information on prosody, prior research, auto-encoders and LSTMs is presented. After that, the methodology will give an overview of the involved techniques and the framework of the experiment. The results will then highlight the main findings of the implemented models. These findings will be evaluated based on the validation loss and will then be further examined in the discussion.


2 Literature Review

2.1 Prosodic language classification

2.1.1 Prosody classification by infants

The earliest literature focuses on infants' abilities to distinguish different languages shortly after leaving the womb. For the infant, a critical task is distinguishing between noises in its surroundings and human speech. Furthermore, in order to discriminate languages, they have to deal with variations within the human voice as a result of changes in speaking rate, accents and distinct speakers. It was already known that infants have remarkable capacities in perceiving speech, such as discriminating phonetic contrasts and dealing with variations in intonation and different voices. Being able to differentiate between languages is key to learning their underlying structures, which is necessary for learning a language. It was discovered that infants are able to distinguish languages as long as they have some degree of familiarity with one of them (Mehler et al., 1988). Additionally, experiments with low-pass-filtered versions of the samples showed the same results. This suggests that prosodic cues may be at the basis of infant language discrimination.

Associating sounds with meaning has been found to be a critical aspect of language acquisition (Jusczyk, Cutler, & Redanz, 1993). Preceding this, the infant first has to learn how to segment utterances into individual words. Experiments suggested that infants' preference for the predominant stress patterns of a language develops as a result of increasing familiarity with its prosodic features. Further experimentation with low-pass filtered input showed that the preference lies in the prosodic structures of words. Infants were shown to prefer the stress patterns that occur with higher probability, which is a prerequisite for lexical development. Later experiments, also using low-pass filtered sound, suggest that isochrony plays a significant role in infants' ability to discriminate between languages (Nazzi, Bertoncini, & Mehler, 1998). English and Dutch were contrasted with Spanish and Italian, while the within-group pairs were not, because English and Dutch are both stress-timed while Spanish and Italian are both syllable-timed. This shows that prosody, specifically rhythmic information, plays a role in classifying language utterances and that sentence discrimination is class-specific rather than language-specific.

More recent research showed, using NIRS experiments, that newborn brains use prosodic information to perceive and organize utterances (Abboub, Nazzi, & Gervain, 2016). These prosodic biases are, however, only present for patterns found in their native languages. These are the first results to show prosodic grouping right after birth, implying that prosodic grouping is already present before birth, laying a foundation for language development.

2.1.2 Prosody

Prosody is the part of linguistics that concerns itself with the continuity of spoken language. It consists of features such as intonation, tone, stress and rhythm, called suprasegmentals. A number of attributes contribute to composing these suprasegmentals, the most important ones being frequency, intensity and duration (Sluijter, van Heuven, & Pacilly, 1997). It is known that infants are able to learn prosody through the womb, which acts as a low-pass filter with a cut-off frequency of around 400 Hz, meaning that frequencies above it are removed. Isochrony is an aspect of prosody that encompasses the rhythmic division of a language. There are three ways in which a language can be categorised according to isochrony (Roach, 1982):

• Syllable-timed, in which the duration of every syllable is equal
• Mora-timed, in which the duration of every mora is equal
• Stress-timed, in which the interval between two stressed syllables is equal

In this thesis, syllable-timed (Spanish and Italian) and stress-timed (English and Dutch) languages are compared. Mora-timed languages are not considered.


2.2 Technical background

2.2.1 MFCC

The Mel-Frequency Cepstrum (MFC) is a representation of sound that captures how sound is physically processed by the human ear. An MFC consists of a collection of Mel-Frequency Cepstrum Coefficients (MFCCs), which can be acquired by applying a non-linear transformation to the frequency spectrum. These MFCCs can be used as features in speech recognition, as the frequency bands are spaced evenly on the Mel scale, which scales frequencies to more closely resemble the way in which humans perceive sound.
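In practice, MFCC extraction is usually delegated to an audio library. The following is a minimal sketch assuming the librosa library; the file name is hypothetical, and the choice of 12 coefficients matches the preprocessing described later in this thesis.

```python
# Minimal MFCC extraction sketch, assuming librosa; "sentence.wav" is a
# hypothetical file name.
import librosa

# Load the audio at the sampling rate used in this thesis (22,000 Hz).
signal, sr = librosa.load("sentence.wav", sr=22000)

# Compute 12 Mel-frequency cepstral coefficients per timestep.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12)
print(mfcc.shape)  # (12, number_of_timesteps)
```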

2.2.2 Auto-encoders

An auto-encoder is a type of unsupervised artificial neural network that is used to generate a compressed representation of data and to recreate the input from that representation with as little error as possible. This is done by taking the input, creating an encoded representation of it and then reconstructing the original input using only that encoding. Auto-encoders are commonly used to reduce the dimensionality of data, to denoise data, as generative models or as a means of translation (Badr, 2019).

The architecture of an auto-encoder consists of three parts, as seen in figure 1 and in the sketch after this list:

• The encoder, a neural network that compresses the input
• The code, the compressed representation of the input
• The decoder, a neural network that recreates the input using solely the code
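As an illustration of these three parts, below is a minimal sketch of a fully connected auto-encoder in Keras; the layer sizes are illustrative and are not the ones used in this thesis.

```python
# Minimal auto-encoder sketch in Keras; layer sizes are illustrative.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))                        # the input
code = layers.Dense(32, activation="relu")(inputs)        # encoder -> code
outputs = layers.Dense(784, activation="sigmoid")(code)   # decoder

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="rmsprop", loss="mse")
# An auto-encoder is trained to reproduce its own input: fit(x, x).
```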


Figure 1: The architecture of an auto-encoder. (Gad, 2020)

2.2.3 LSTMs

A Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that is used to account for temporal structure within an input sequence. RNNs are able to do this by making every computation dependent on the previous computation. This is done using feedback loops, which feed part of the hidden state back in with the input of the next timestep, as seen in figure 2. What separates LSTMs from other RNNs is the inclusion of a memory cell. This cell is used to maintain information from previous steps better and for longer periods of time without adding more complexity (Hochreiter & Schmidhuber, 1997). As prosodic elements are made up of sequential structures, LSTMs have the potential to work well for prosody extraction.

Figure 2: An overview of the feedback loop in RNNs. The loop on the left can be "unrolled" as on the right to see that every timestep is dependent on the previous one. (Mittal, 2019)


3 Methodology

3.1 Data

To answer the question whether infant language classification can be replicated, data from different languages was used. This data consisted of English, Dutch, Spanish and Italian datasets, with every datapoint containing one sentence of spoken language. Each sentence was labelled with several variables, mainly age, gender and accent. These three variables were used when downsampling the data into datasets of equal and workable sizes while maintaining an even distribution of the variables, to prevent biases that could have been present in the original dataset. These downsampled datasets were split into train, validation and test sets with ratios of 70%, 15% and 15% respectively.
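A minimal sketch of such a split, assuming scikit-learn; the `sentences` list is a hypothetical stand-in for the downsampled datapoints.

```python
# 70/15/15 split sketch, assuming scikit-learn; `sentences` is a
# hypothetical placeholder for the downsampled datapoints.
from sklearn.model_selection import train_test_split

sentences = list(range(100))  # placeholder data
train, rest = train_test_split(sentences, test_size=0.30, random_state=0)
val, test = train_test_split(rest, test_size=0.50, random_state=0)
# -> 70% train, 15% validation, 15% test
```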

The data was converted from .mp3 format into .wav format to access sampling rates and individual timesteps. The sampling rate was kept at 22,000 Hz to maintain as much information as possible. In theory, a sampling rate above 1,000 Hz is needed to preserve individual syllables, which is necessary for distinguishing between syllable-timed and stress-timed isochrony.
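This conversion and loading step might look as follows; a minimal sketch assuming pydub (with ffmpeg installed) and librosa, with hypothetical file names.

```python
# .mp3 -> .wav conversion and loading sketch; file names are hypothetical.
from pydub import AudioSegment
import librosa

AudioSegment.from_mp3("sentence.mp3").export("sentence.wav", format="wav")
signal, sr = librosa.load("sentence.wav", sr=22000)  # resample to 22 kHz
```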

An interesting thing to note is that there was a significant difference in sentence length across the languages. As seen in table 1, the syllable-timed languages have a longer maximum sentence length than the stress-timed languages. During preprocessing, it was found that the same holds for average sentence length.

Language   Maximum length
English               245
Dutch                 186
Spanish               255
Italian               307

Table 1: The maximum sentence length of each language in terms of the number of MFCC timesteps


3.2 Preprocessing

The data had to be preprocessed before it could be used to train the auto-encoders. The first step of preprocessing involved low-pass filtering. This simulates the filtering of sound by maternal tissue and thus approximates the input to an unborn child in the womb. As seen in figure 3, high frequencies have been filtered out while low frequencies remain.
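A minimal sketch of such a filter, assuming SciPy; the cut-off of 400 Hz follows section 2.1.2, while the filter order and the placeholder signal are illustrative assumptions.

```python
# 400 Hz low-pass filter sketch, assuming SciPy; order is illustrative.
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(signal, sr, cutoff=400, order=5):
    # Zero-phase Butterworth low-pass filter at `cutoff` Hz.
    b, a = butter(order, cutoff / (sr / 2), btype="low")
    return filtfilt(b, a, signal)

sr = 22000
signal = np.random.randn(sr)   # placeholder for one second of speech
filtered = lowpass(signal, sr)
```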

Figure 3: The partial result of low-pass filtering one of the sentences.

After low-pass filtering had been applied, MFCCs were calculated for every sentence. This was done to simulate human auditory perception, which is necessary in order to produce the same results as an infant. The MFCs consist of 12 coefficients per timestep, with a variable number of timesteps per sentence. A normalized version of such an MFC can be seen in figure 4.

(14)

Figure 4: A visualization of normalized MFCC values as used in the input, with time in seconds on the x-axis and the 12 MFCCs on the y-axis.

3.3 Modelling

To replicate the experiment done by Nazzi et al. (1998), a separate auto-encoder was used for every language, with the encoder and decoder each consisting of two LSTM layers with connected hidden layers. Separate auto-encoders were used to simulate children with a single native language. Two LSTM layers were chosen to increase the accuracy while maintaining a training time that fitted within the timeframe of the experiment.

To account for issues that occurred when training on data with variable lengths, the data was padded. For every language, the maximum sentence length was used as the length to which every other sentence in the set was padded with zero values. The motivation for this was to conserve information while making minimal adjustments to the data. While padding the data, it was discovered that there was a significant difference in sentence length across the languages, with English and Dutch having shorter sentences than Spanish and Italian.
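A minimal sketch of this padding step, assuming NumPy; padding is applied at the front of each sentence, in line with the leading bar of zeros discussed in section 5.2, and the placeholder lengths are illustrative.

```python
# Zero-padding sketch, assuming NumPy; placeholder lengths are illustrative.
import numpy as np

def pad_front(sequences, max_len):
    # Zero-pad each (timesteps x 12) MFCC array at the front to max_len.
    return np.stack([
        np.pad(s, ((max_len - s.shape[0], 0), (0, 0))) for s in sequences
    ])

sequences = [np.random.rand(n, 12) for n in (120, 180, 245)]  # placeholders
padded = pad_front(sequences, max_len=245)  # shape: (3, 245, 12)
```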

To prevent the datasets from consisting of a majority of zeros due to padding, outliers in terms of sentence length were removed. A distribution of sentence lengths was used to determine the outlying 10% of the data, which can be seen in figure 5. This data was removed to significantly reduce the amount of padding and prevent the auto-encoders from learning only zeros.

Figure 5: A visualisation of the removal of outliers in terms of sentence length
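This trimming step might look as follows; a minimal sketch assuming NumPy, with placeholder data.

```python
# Outlier-trimming sketch, assuming NumPy; placeholder data.
import numpy as np

sequences = [np.random.rand(n, 12) for n in (100, 150, 200, 400)]  # placeholders
lengths = np.array([s.shape[0] for s in sequences])
cutoff = np.percentile(lengths, 90)              # keep the shortest 90%
kept = [s for s in sequences if s.shape[0] <= cutoff]
```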

Experimentation with the latent dimension, the length of the code vector, showed that a value of 500 yielded the most compression without significant loss in reconstruction accuracy. This amounts to an average compression factor of about 8. For the intermediate dimension, the dimension of the vector between the LSTMs, the average of the input and latent dimensions was chosen to distribute the compression and decompression evenly between the two LSTMs.

For the optimizer, a choice had to be made between Adadelta, Adagrad and RMSprop. RMSprop was chosen because it is the best option for learning with small batch sizes. On top of that, Adagrad and Adadelta are better suited for use with sparse data, which is not the case here (Ruder, 2016).

As activation function for the LSTM layers, the sigmoid function was used, since it was the only available activation function with a finite output range. This was necessary to match the range of the normalized input values, a normalization that could be done while maintaining the distribution of the data.
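Putting these choices together, a per-language auto-encoder might be built as sketched below, assuming Keras. The exact wiring of the layers is an assumption, as the thesis does not specify it, and "the average of the input and latent dimension" is interpreted here as the average of the flattened input size and the latent dimension.

```python
# Sketch of one per-language LSTM auto-encoder, assuming Keras; the layer
# wiring and the interpretation of the intermediate dimension are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

n_mfcc, latent_dim = 12, 500
max_len = 245  # e.g. English; this value differs per language
intermediate_dim = (max_len * n_mfcc + latent_dim) // 2

inputs = keras.Input(shape=(max_len, n_mfcc))
# Encoder: two LSTM layers compressing the sequence into the code vector.
x = layers.LSTM(intermediate_dim, activation="sigmoid",
                return_sequences=True)(inputs)
code = layers.LSTM(latent_dim, activation="sigmoid")(x)
# Decoder: the code is repeated per timestep and decompressed again.
x = layers.RepeatVector(max_len)(code)
x = layers.LSTM(intermediate_dim, activation="sigmoid",
                return_sequences=True)(x)
outputs = layers.LSTM(n_mfcc, activation="sigmoid",
                      return_sequences=True)(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="rmsprop", loss="mse")
```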

Each auto-encoder was trained on its respective language until the validation loss showed no more improvement, which was in each case after two to three epochs. A batch size of 10 was used to decrease runtime without significant impact on the accuracy.
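With the model sketched above, this training regime could be expressed as follows; the data arrays are illustrative placeholders.

```python
# Training sketch with early stopping on the validation loss; the arrays
# are placeholders for the padded MFCC sets.
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

x_train = np.random.rand(70, 245, 12)  # placeholder train set
x_val = np.random.rand(15, 245, 12)    # placeholder validation set

autoencoder.fit(
    x_train, x_train,                  # reconstruct the input itself
    validation_data=(x_val, x_val),
    epochs=10,                         # stopped after 2-3 epochs in practice
    batch_size=10,
    callbacks=[EarlyStopping(monitor="val_loss", patience=1)],
)
```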

Validating the models using the test sets uncovered a problem regarding the length of the data. Each model was trained on a set of predetermined length, which corresponded to that specific language. These unique values per language meant that each test set had to be filtered according to the length of the auto-encoder it was going to be tested on. As seen in table 2, the test sets of the syllable-timed languages contain significantly fewer sentences. At first sight, it seems incorrect that the number of sentences in a test set can rise above the number used with its own auto-encoder. However, this can be explained by the trimming of the test sets during preprocessing. Sentences that were filtered out during training could not be used on their own auto-encoder, but could be used on auto-encoders that allowed for longer sentences.

                    Auto-encoders
Test sets    English   Dutch   Spanish   Italian
English          505     408       514       539
Dutch            649     617       651       656
Spanish          438     316       449       481
Italian          352     242       366       409

Table 2: The number of usable sentences in the test sets of each language across different auto-encoders


4 Results

Looking at a visualization of a data point before and after reconstruction, as seen in figures 4 and 6, reveals that the input has irregularities whereas the output is completely smooth. In addition, in the input sequence one can identify a gradient on the vertical axis, whereas in the output sequence a gradient seems to lie on the horizontal axis. It can also be seen that a bar of zero values is present in the input as well as in the output.

Figure 6: A visualization of normalized MFCC values after reconstruction.

Converting the output of the auto-encoders back to a listenable format revealed that the reconstructed data contained significantly less intelligible information than the input. The generated sound resembled a continuous low tone with virtually no audible fluctuations.

The performance of the auto-encoders was measured using the mean squared error between the input sequences and the output sequences of each auto-encoder on each language. As seen in table 3, all results lie close to 0.0675. An interesting thing to note is that there is no notable difference in performance across either the auto-encoders or the languages.
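The cross-evaluation behind table 3 might be computed as follows; a minimal sketch assuming Keras and NumPy, where `autoencoders` and `test_sets` are hypothetical dictionaries keyed by language.

```python
# Cross-evaluation sketch; `autoencoders` and `test_sets` are hypothetical
# dicts keyed by language name.
import numpy as np

languages = ["English", "Dutch", "Spanish", "Italian"]
for test_lang in languages:
    for model_lang in languages:
        x = test_sets[test_lang][model_lang]  # filtered to fit this model
        recon = autoencoders[model_lang].predict(x)
        mse = np.mean((x - recon) ** 2)
        print(f"{test_lang} on {model_lang} auto-encoder: {mse:.4f}")
```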


                    Auto-encoders
Languages    English    Dutch   Spanish   Italian
English       0.0671   0.0799    0.0679    0.0614
Dutch         0.0588   0.0580    0.0592    0.0625
Spanish       0.0685   0.0817    0.0682    0.0581
Italian       0.0736   0.0779    0.0748    0.0616

Table 3: The results of validating each auto-encoder with the test sets of each language. The numbers represent the mean squared errors between the input and the output.


5 Discussion

5.1 Interpretations of the results

When inspecting figures 4 and 6, it is clear that the input contains far more fluctuation than the reconstruction. This is probably because the model is not elaborate enough to capture prosody. It could be that LSTMs or auto-encoders are not the optimal choice, but a more reasonable conclusion is that more architectures have to be tried out before LSTMs or auto-encoders can be ruled out.

Comparing figures 4 and 6, one of the differences is the direction of the gradient that is present. There is currently no explanation as to why this is the case.

Looking at table 3, one of the first things that stands out is its uniformity. All loss values are very similar. This indicates that the auto-encoders have learned something that seems to be universal across every language. This could be a general characteristic of prosody that is present in all four languages; it may even be a characteristic of language in general.

Another possible explanation is that the auto-encoders are trying to minimize the loss without regard for maintaining prosodic features. The choice of loss function could even have been detrimental, since maintaining prosodic features simply does not contribute much to minimizing it: a slight shift in the reconstruction could have a big impact on the loss while barely affecting prosody. Discriminating spoken language from other sounds is something humans excel at, and it could be that the auto-encoders have each separately learned to do this task instead of learning prosody. This would explain the low loss values with no distinction between the different languages.

In either case, the results suggest that the auto-encoders have not achieved the same representation as infant brains.


5.2 Limitations

It is important to note that some test sets had to be shortened to be usable on the auto-encoders. Removing sentences that are too long could have removed certain aspects of prosody that are more present in longer utterances. This would mean that the proportion of some aspects of prosody was lower in the test sets than it would have been in the train sets.

Additionally, the choice of padding could have influenced the training of the auto-encoders. Having a certain amount of constant values in front of every data point can bias a network into treating it as a common case. This can be seen in the results as a black bar at the beginning of the reconstructed MFCC values. The padding also causes the length of all sentences to be the same. This removes the uniqueness of sentence length, which is also thought to be connected to prosody.

During testing, the test sets had to be sampled based on the input length that the respective auto-encoder was trained on. This means that the test set of one language could be slightly different across different auto-encoders. It is a possibility that the accuracies are slightly shifted as a result of this. However, it is highly unlikely that this difference has any significant consequences.

To make training possible, the input sequence, which consisted of 2-dimensional data, was internally transformed into a 1-dimensional vector representing the code vector. This conversion, and its reversal, could potentially have created a bottleneck for learning prosodic elements. In this model, the auto-encoder is expected to both reduce the dimensionality and learn prosodic elements, while dimensionality reduction is a process that could be regarded as a separate step. Reducing the dimensionality of the data before training could yield results that tell us more about prosodic grouping.


5.3 Future research

Future research would be needed to establish whether or not it is possible to replicate prosodic grouping as done by infants. This study looked at using auto-encoders, while it may well be possible to use other AI techniques to succeed in prosodic grouping. One option for future research would be to implement classification algorithms to try to group languages according to their prosodic class. This does, however, require predefined classes, which would be the different languages.

An alternative would be to make use of clustering algorithms instead. This could be done by feeding features such as MFCCs into clustering algorithms to classify individual sentences, as in the sketch below. With a method like this it would even be possible to place utterances on a gradient instead of having a fixed number of classes, which would also make it possible to differentiate between dialects as sub-classes. This could open up a new perspective on the similarity between languages.
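A minimal sketch of this idea, assuming scikit-learn; averaging the MFCCs over time is an illustrative way to obtain fixed-size features, and the data is a hypothetical placeholder.

```python
# Clustering sketch, assuming scikit-learn; placeholder data and the
# time-averaged features are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

sequences = [np.random.rand(n, 12) for n in (120, 180, 245, 300)]  # placeholders
features = np.stack([s.mean(axis=0) for s in sequences])  # one vector per sentence
labels = KMeans(n_clusters=2, random_state=0).fit_predict(features)
# e.g. two clusters for the stress-timed and syllable-timed groups
```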

5.4 Conclusion

This thesis looked at the possibility of using auto-encoders to replicate prosodic language classification as done by infants. Four auto-encoders were used to test whether it is possible to classify English, Dutch, Spanish and Italian as observed by Nazzi et al. (1998). It was found that the current model does something that does not correspond to the way infants seem to process human speech. The fact that the architecture in this study fails to successfully extract prosody does not mean this is impossible for another architecture.


References

Abboub, N., Nazzi, T., & Gervain, J. (2016, November). Prosodic grouping at birth. Brain and Language, 162, 46-59. doi: https://doi.org/10.1016/j.bandl.2016.08.002

Badr, W. (2019, April). Auto-encoder: What is it? And what is it used for? (Part 1). Retrieved from https://towardsdatascience.com/auto-encoder-what-is-it-and-what-is-it-used-for-part-1-3e5c6f017726

Gad, A. (2020, January). Image compression using autoencoders in Keras. Retrieved from https://blog.paperspace.com/autoencoder-image-compression-keras/

Hochreiter, S., & Schmidhuber, J. (1997, December). Long short-term memory. Neural Computation, 9(8), 1735-1780. doi: https://doi.org/10.1162/neco.1997.9.8.1735

Jusczyk, P., Cutler, A., & Redanz, N. (1993, June). Infants' preference for the predominant stress patterns of English words. Child Development, 64(3), 675-687. doi: https://doi.org/10.2307/1131210

Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., & Amiel-Tison, C. (1988, July). A precursor of language acquisition in young infants. Cognition, 29(2), 143-178. doi: https://doi.org/10.1016/0010-0277(88)90035-2

Mittal, A. (2019, October). Understanding RNN and LSTM. Retrieved from https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e

Nazzi, T., Bertoncini, J., & Mehler, J. (1998, July). Language discrimination by newborns: Toward an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception & Performance, 24(3), 756-766. doi: https://doi.org/10.1037/0096-1523.24.3.756

Roach, P. (1982). On the distinction between 'stress-timed' and 'syllable-timed' languages. Linguistic Controversies, 73-79.

Ruder, S. (2016, January). An overview of gradient descent optimization algorithms. Retrieved from https://ruder.io/optimizing-gradient-descent/

Sluijter, A., van Heuven, V., & Pacilly, J. (1997, August). Spectral balance as a cue in the perception of linguistic stress. Journal of the Acoustical Society of America, 101(1), 503-513. doi: https://doi.org/10.1121/1.417994
