
Sequence Learning and Prosody:
Predicting prosodic boundaries using time-series pitch data

Giovanni G. Kastanja 10467149

Bachelor thesis
Credits: 18 EC

Bachelor's Programme in Artificial Intelligence (Bachelor Opleiding Kunstmatige Intelligentie)

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. T.O. Lentz

Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

Prosody research suffers from a lack of high-quality prosodically annotated speech data. One proposed way to generate more data is to use machine learning techniques to annotate speech data algorithmically. For this approach to succeed, a better understanding is needed of how acoustic cues (pitch, energy, duration) relate to different prosodic structures. In this paper we focus on how pitch relates to prosodic boundaries. Two datasets are built from speech data in the Corpus Gesproken Nederlands: one containing pitch data within the scope of a word and one within the scope of a phrase. Several Long Short-Term Memory (LSTM) networks are trained on these datasets and their performance is compared. The results show that the LSTMs performed better on the word-scope dataset than on the phrase-scope dataset. However, the results also show that pitch data has little to no predictive value for prosodic boundaries when training at these scopes.


Contents

1 Introduction
2 Theoretical Foundation
  2.1 Prosody
  2.2 Pitch
  2.3 Prosodic Pause
  2.4 Prosodic Annotation
  2.5 Long Short-Term Memory
  2.6 Previous research
3 Materials
  3.1 Data Preparation
  3.2 Method
  3.3 Performance Metrics
4 Experiment 1: Word Scope
5 Experiment 2: Phrase Scope
6 Conclusion
7 Discussion
A Experiments: Word Scope
  A.1 Experiments: 1 Hidden Layer
  A.2 Experiments: 2 Hidden Layers - Word Scope
  A.3 Experiments: 5 Hidden Layers
B Experiments: Phrase Scope
  B.1 Experiments: 1 Hidden Layer - Phrase Scope
  B.2 Experiments: 2 Hidden Layers - Phrase Scope
  B.3 Experiments: 5 Hidden Layers - Phrase Scope


1 Introduction

In this study we research the relation between pitch values in human speech and the occurrence of prosodic boundaries in a sentence. This research falls within the domain of Natural Language Processing (NLP), a field that is becoming increasingly prominent in society today. NLP research concerns itself with understanding the mechanisms that underlie language, communication and speech.

A particular area of interest within NLP is prosody. The literature does not agree on a precise definition of prosody: the definitions range from one extreme, "those phenomena that involve the acoustic parameters of pitch, duration, and intensity", to the other extreme of "any phenomena that involve phonological organization at levels above the segment" [21]. In this paper we use the loose definition of prosody: "every part of speech that is above the level of phonemes".

That covers what prosody is, but what is its function in speech? Prosody plays an important part in our communication. A sentence is more than just a collection of words: the prosodic part of a sentence allows the listener to decode important information that cannot be conveyed through words alone, such as the emotional state of the speaker, irony and sarcasm. It also groups related information together, allowing the speaker to disambiguate an otherwise ambiguous sentence.

To illustrate the importance of prosodic information, consider the sentence "I didn't do that". This sentence can mean different things depending on where the speaker places emphasis. With emphasis on "I" ("I didn't do that"), the sentence implies that the speaker did not do it but someone else did; with emphasis on "that" ("I didn't do that"), it implies that the speaker did not do the thing referred to, but did do something else. Emphasizing different words is a function of one of the main features of prosodic structure. The main structures are melody, lexical stress, timing, and pauses & breaks; the acoustic cues that underlie them are pitch, energy and duration [22]. For example, the "I" in the first example can be given prominence by a relatively higher pitch, more energy and a longer duration than the surrounding words, with a pause as an extreme case indicating a break in the normal flow of speech. What is less clear is how variations of these acoustic cues give rise to different prosodic structures and events. Understanding this relation remains one of the challenges of prosody research [31].

The current problem for research in this area is that not enough high-quality prosodically annotated data is available [31]. Right now the only way to generate new data is to annotate speech data by hand, a process that requires a lot of time and money: partly for training the annotators and partly for the annotation work itself. Proposed ways of increasing the amount of high-quality annotated data are sharing more data, reducing the cost of annotation, and annotating speech data algorithmically [31]. Machine learning, and specifically neural networks, might be a good candidate for the last approach: neural networks have played an important role in other areas of NLP such as machine translation, sequence labeling and sentence classification [11].

Previous research that used machine learning techniques to predict prosodic events was done by Jeon & Liu [16]. They researched how well a neural network and other machine learning algorithms performed in predicting phrase boundaries, achieving an accuracy of 91%, about ten percentage points higher than chance. Qian et al. [30] added to that research by using a Conditional Random Field and found that its performance was comparable to the neural network of Jeon & Liu. The research of Bijl [2] showed that the method employed by Jeon & Liu and Qian et al. could also be used to predict phrase boundaries in Dutch. These studies used all of the acoustic cues (pitch, duration, energy) to predict the prosodic structure of a sentence. An interesting line of questioning is how strong the predictive power of each of these variables is individually. Bijl also attempted to answer this question in her study, but was unsuccessful with the multilayer perceptron that she used. Understanding the individual relations between acoustic cues and prosodic events might give more insight into the relationship between these cues and prosodic structure. The focus of this study is the relation between pitch and prosodic boundaries. Pitch is one of the cues that might predict a phrase boundary: the pitch moves in a characteristic way before a phrase boundary occurs, a movement called a boundary tone. There is not just one way a boundary tone is formed; because there is variation, there might be an underlying structure that these variations have in common. Several studies have shown that all features of speech above the level of phonemes are patterns in time [20][22].

Recurrent neural networks (RNNs) are excellent candidates for solving these types of problems: their architecture allows them to learn temporal relations in the data. In this paper a type of RNN called the Long Short-Term Memory (LSTM) [15] will be used. LSTMs are well suited for problems involving time-series data with longer temporal dependencies [15], and they do not suffer from the vanishing- and exploding-gradient problems [14][1] that 'regular' RNNs are prone to. At the time of writing, no research has employed an LSTM to predict the prosodic structure of a sentence, so it is also unclear with which parameters an LSTM performs optimally. Part of this research problem is therefore to find the parameters with which the LSTM performs best.

Speech data from the Corpus Gesproken Nederlands (CGN) [28] will be used. The CGN is currently the only Dutch corpus with prosodically annotated speech data. It contains 900 hours of speech fragments, annotated by a Dutch and a Flemish annotator; over 120,000 words are annotated in the Dutch part of the corpus. Using Praat [3], this part of the CGN will be used to extract the pitch data.

The main question of this research will be:

”Can prosodic boundaries be predicted using time-series pitch data?”.

The sub-questions are:

• ”How well does pitch-information predict prosodic boundaries within the scope of one word?”

• ”How well does pitch-information predict prosodic boundaries within the scope of a phrase?”

• ”What parameters allow the LSTM to perform optimally when predicting prosodic boundaries with pitch information?”

In the next section the theory on which this research builds will be presented. After that the research method will be outlined, followed by the results of the experiments. The paper ends with a conclusion and a discussion of the results.


2 Theoretical Foundation

In the following section a few terms will be explained: first what prosody entails, then pitch, then the role of prosodic pauses in a sentence, and finally the Long Short-Term Memory model. The section ends with a summary of relevant previous research on algorithmically predicting prosodic structure and events.

2.1 Prosody

Successful communication depends on more than just the sequence of words, syllables and phonemes. Important information such as the emotional state of the speaker, sarcasm and irony, and ambiguities cannot be accurately resolved from this sequence alone. The parts of speech that cannot be defined in terms of the word sequence are called prosody. In this research the loose definition "every part of speech that is not at the level of phonemes and below" is used. The main features of prosody [25] are:

• Melody: using variations in pitch to make certain parts of a sentence more prominent.

• Lexical Stress: differentiating between words that are phonetically equivalent.

• Timing: the lengthening and shortening of words and syllables.

• Pauses & Breaks: interrupting the normal flow of speech to make a point or to group information together.

The way these features are expressed differs between modes of speech. The two modes are planned and spontaneous speech, and the CGN contains fragments of both: planned speech in the form of lectures and news bulletins, spontaneous speech in the form of, among others, telephone conversations. The prosodic features are influenced by three acoustic cues: pitch, energy and duration [22]. In this study the focus will be on pitch.

2.2 Pitch

What is referred to as pitch is actually the fundamental frequency, or F0, but the two terms are used interchangeably in the literature; hereafter it will simply be called pitch. Speech sounds are produced by modulating the airflow through the vocal folds, making them vibrate and producing pitch. The frequency of these vibrations is measured in Hertz. During speech the pitch is modulated by the speaker; the range within which it can be modulated is limited by physical features of the speaker, mainly the vocal tract and the size of the vocal folds. The way a speaker modulates the pitch can give a different meaning to a sentence: in American English, for example, a high-pitched end of a sentence indicates that a question is being asked.

Pitch values can be extracted from speech and analyzed using programs such as Praat [3]. These programs extract pitch data in the form of a pitch contour: a graph of the pitch values along a time axis (see Figure 1). As can be seen in Figure 1, the pitch contour is discontinuous. These interruptions of the contour occur because of voiceless consonants such as /p/, /k/ and /t/: the vocal folds do not vibrate when these consonants are spoken. Although the pitch contour is discontinuous, a listener still perceives it as continuous. Only interruptions longer than 250 ms are registered; these are called pauses or breaks [26].


Figure 1: Pitch contour with phrase boundaries (taken from [29])
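As an aside, pitch contours like the one in Figure 1 can also be extracted from Python. Below is a minimal sketch using the parselmouth library, a Python interface to Praat; the library choice, file name and time step are illustrative assumptions, since this study only mentions Praat itself.

    import parselmouth

    # Load a speech fragment and extract its pitch contour,
    # one pitch estimate per 10 ms.
    snd = parselmouth.Sound("fragment.wav")
    pitch = snd.to_pitch(time_step=0.01)

    f0 = pitch.selected_array['frequency']  # pitch in Hz; 0.0 in unvoiced frames
    times = pitch.xs()                      # corresponding time axis (seconds)

The zeros in unvoiced frames correspond to the interruptions of the contour discussed above.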

2.3 Prosodic Pause

A pause is an interruption of the flow of speech that lasts more than approximately 250 ms. A prosodic pause differs from a normal pause in that it generally consists of one or more of the following elements: a pause in the sentence, a boundary tone preceding the pause, and a lengthening of the word before the pause [19]. In Figure 1 the prosodic pauses are denoted with a '%' and the boundary tones are indicated with red rectangles. The function of a prosodic pause is to provide information on the syntactic structure of a sentence. For example, when a prosodic pause occurs together with a syntactic boundary (e.g. between a clause and a phrase), it shows that certain information in the sentence belongs together. In the following example the sentences have different meanings depending on where the prosodic pause (indicated by '||') is located:

1. ”Ugly men and women.”

2. ”Ugly men || and women.”

The first sentence is interpreted to mean that both the men and the women are ugly. The second sentence is interpreted to mean that only the men are ugly and the women are not (adapted from [4]).

2.4 Prosodic Annotation

To make real progress in understanding the mechanisms that govern prosody in speech, it is important to have prosodically labeled speech data. There are different conventions for annotating speech data. The studies that inspired this one used corpora annotated with the Tones and Break Indices (ToBI) system, a tiered annotation system whose tiers are tones, words, breaks and comments; it distinguishes four types of breaks. The system used in the corpus of this study, the Corpus Gesproken Nederlands (CGN), only distinguishes two types of breaks: strong breaks and weak breaks. Strong breaks, indicated with '||', are defined as severe disruptions of the normal flow of speech [5]. An example of a strong break can be found in the sentence "I did it || and so did you." Weak breaks, indicated with '|', are defined as clearly audible interruptions of speech from which it is clear that the words (or parts of a word) around the break are not connected as one would expect them to be [5]. An example of a weak break can be heard in the sentence: "you know | this was a|ma|zing." (examples adapted from [5])


Figure 2: CGN annotation of a sentence

2.5 Long Short-Term Memory

The Long Short-Term Memory (LSTM) model is a modified version of a Recurrent Neural Network (RNN). RNNs differ from regular neural networks in that there are connections among hidden units that can represent a time delay [10]. These connections allow the model to 'remember' information about past data, which lets it learn relations between data points that are separated in time. These capabilities make RNNs promising candidates for tasks that are sequential in nature, such as speech processing, sequence labeling and sentence segmentation [11]. Despite their success and advantages on sequence-based tasks, conventional RNNs have an important limitation: they have a difficult time learning long-term temporal dependencies. This problem occurs because RNNs use Backpropagation Through Time (BPTT) or Real Time Recurrent Learning (RTRL) to update the weights in the network. The error at the end of the network is backpropagated through the network; due to the chain rule, the error signal can become increasingly small and finally vanish to 0 when the relevant factors are smaller than 1, while factors larger than 1 can make it grow so large that it causes numeric overflow and crashes the model [14]. These problems are called the vanishing- and exploding-gradient problems respectively; they are further elaborated upon in [14]. One of the proposed solutions was the Long Short-Term Memory, introduced by Hochreiter and Schmidhuber in 1997 [15]. The LSTM solved the problem of learning long-term dependencies, and thereby the exploding- and vanishing-gradient problems, by introducing a memory unit containing three gates that control the information vectors travelling through the network: the forget gate, the input gate and the output gate. In Figure 3 the architecture of a memory unit can be seen. The forget gate (colored blue) is responsible for telling the network which information to discard when new information is presented. The input gate (colored green) is responsible for deciding what new information gets encoded into the cell. The output gate (colored red) is responsible for deciding which of the information encoded in the cell gets propagated to the cell in the next time step [15]. Together these gates prevent the vanishing- and exploding-gradient problems by enforcing a constant error flow through each memory unit, not allowing the gradient to grow or shrink uncontrollably.
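For reference, the computations of one memory unit can be written out as follows. This is the standard modern formulation of the LSTM (the forget gate was a later addition to [15]), not notation taken from this thesis; σ is the sigmoid function and ⊙ denotes element-wise multiplication:

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(updated cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state / output)}
\end{aligned}
\]

The additive update of the cell state c_t is what allows the error to flow back through time without being repeatedly multiplied by weight matrices.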


2.6 Previous research

In 1993 Hirschberg [13] used a decision-tree-based system that achieved 82.4% accuracy in predicting prosodic structure. They conditioned their model on word-level prosody and tried to predict whether a break should be marked or not. This development was important because it showed that it was possible to accurately predict prosody from syntax. This study will also attempt to mark breaks correctly, but pitch data will be used instead of syntactic information. Subsequent attempts were made by Wightman and Ostendorf, who built a system for the automatic labeling of prosodic events [36]. Their approach was novel because earlier researchers did not use all acoustic cues to predict the prosodic labels. Their model, a decision tree combined with a Markov model, was used to build a system that used pattern recognition to label a sequence of words; with it they achieved an accuracy of 71% for detecting where boundaries occurred. In 1996 Ross and Ostendorf used a model to predict prosodic labels for text that was tagged with part-of-speech data [32]. They used a multi-level prosodic hierarchy and, for each level, a decision tree with Markov assumptions. One of the prosodic events they attempted to classify was the boundary tone; as said before, a boundary tone can indicate the occurrence of a prosodic pause. With this approach they correctly classified 82.5% of the data. Chen et al. [8] built an acoustic-prosodic model based on Gaussian mixture models that conditioned on both acoustic and syntactic models, coupling them together as a maximum-likelihood recognizer. They built on the finding that conditioning prosody on the syntactic representation of a string can reduce the entropy of syntactic prosodic models [7]. In 2008 Sridhar et al. [34] used a maximum entropy framework to algorithmically label prosodic events. They labeled prosodic events at the word level, as opposed to the syllable level used by Chen et al., and correctly labeled boundary tones in 84% of occurrences; with word-level conditioning they achieved an accuracy of 82%. In 2009 Jeon & Liu [16] trained different machine learning algorithms and compared their performance in predicting prosodic events at the syllable level, comparing neural networks, decision trees, maximum entropy models and support vector machines. They trained on the Boston Radio Corpus; the task was to predict tones, phrase boundaries and breaks. The neural network performed best, with an accuracy of 91% when trained on acoustic information. In 2010 Qian et al. [30] used the same method as Jeon & Liu to test the performance of a Conditional Random Field (CRF) model; the results were comparable, with the CRF achieving an accuracy of 92.11% for predicting prosodic breaks. In 2018 Bijl [2] researched whether these results and methods could also be used to predict prosodic events in Dutch, using a multilayer perceptron and the pitch, energy and duration of each word. She achieved an accuracy of 85%, showing that the method could be employed for the Dutch language.

3 Materials

3.1 Data Preparation

There are two types of breaks annotated in the CGN: strong breaks and weak breaks. For this research no distinction will be made between them; we are only interested in whether a break occurs, not in what kind of break it is. Of the two categories of speech data in the CGN, only planned speech will be used. Using Praat [3], a computer program for analyzing, synthesizing and manipulating speech, the pitch contour will be extracted for each of the planned-speech fragments. From the resulting set of pitch contours, two different time-series datasets will be built: one containing the time-series pitch data within the scope of a word, the other within the scope of a phrase of about 1 second.

The time-series training sets will be built by dividing each pitch contour into 50 equal time slices. This yields a sequence of values, with each entry in the sequence being a pitch value. The output associated with each sequence is a '0' or a '1': a '0' means that no pause occurred at the end of the sequence, a '1' that a pause did occur.

The resulting datasets will be very imbalanced, and a model trained on an imbalanced dataset has a hard time learning the structure of the positive examples. To deal with this, the Synthetic Minority Oversampling Technique (SMOTE) will be used, a re-balancing technique for imbalanced datasets that improves minority-class recognition [6].
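A minimal sketch of this preparation step, assuming the extracted pitch contours are already available as arrays; the variable names (contours, labels), the averaging per slice, and the use of the imbalanced-learn implementation of SMOTE are illustrative assumptions:

    import numpy as np
    from imblearn.over_sampling import SMOTE

    def to_sequence(contour, n_slices=50):
        """Divide a pitch contour into n_slices equal time slices,
        returning one (mean) pitch value per slice."""
        parts = np.array_split(np.asarray(contour, dtype=float), n_slices)
        return np.array([p.mean() for p in parts])

    # X: one 50-value sequence per fragment; y: 1 if a pause follows, else 0.
    X = np.stack([to_sequence(c) for c in contours])
    y = np.array(labels)

    # Oversample the minority ("pause") class; SMOTE expects 2-D input,
    # so resample before reshaping for the LSTM.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    X_res = X_res.reshape(-1, 50, 1)   # (samples, timesteps, features)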

3.2 Method

It is unclear which parameters allow the LSTM to perform optimally. The parameters chosen for testing were selected based on the available hardware capabilities. For the same reason, early stopping will be implemented, ensuring that models whose loss function does not improve after N iterations stop training prematurely. The parameters that will be varied are:

Neurons:

Influences the learning capacity of the network. In general, more neurons allow the LSTM to learn more complex structure from the data. This comes at the price of longer training times and might also make it easier for the model to overfit.

Batch-size:

The batch size is the size of the subset of the data that the model uses during each iteration. A larger batch size might reduce computational cost, while potentially sacrificing performance.

Layers:

More layers allow a model to learn more complex relations within the data. More layers is not necessarily better, however, because too many layers might lead to overfitting. Other parameters that are relevant but that will not vary between configurations are:

Activation functions:

Two different activation functions will be used in our model. The hidden layers will use a ReLU activation function; several studies have shown that ReLU activation functions speed up training [9][12][24]. The output layer of the model will have a sigmoid activation function, which outputs a number between 0 and 1. An output higher than 0.5 signifies that the model predicts a pause.
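Concretely, the two activation functions are (standard definitions, not notation from this thesis):

\[
\mathrm{ReLU}(x) = \max(0, x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}
\]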

Optimization Algorithm:

Optimization algorithms allow models to reduce the error rate more efficiently. Two metrics are important when choosing an optimization algorithm: speed of convergence and generalization [33]. These influence the training time and the error rate of the model. In this study lower training times are deemed more important. Among the most popular optimization algorithms that reduce training time is the Adam optimization algorithm [18]. This algorithm is computationally efficient and has low memory requirements, which makes it a suitable candidate given the limited hardware capabilities available.
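For reference, Adam's update rule from [18] keeps exponentially decaying averages of the gradient g_t and its element-wise square, with bias correction (α is the learning rate; β_1, β_2 and ε are hyperparameters); this is the standard formulation, not notation from this thesis:

\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,
\]
\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
\]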

Dropout:


With dropout, random nodes in the network are 'dropped' along with their connections, which prevents these nodes from conditioning too heavily on the data [35]. The configurations of our model will use a dropout of 20%, meaning that during training a random 20% of the nodes in a layer are dropped.

Epochs:

The number of iterations the model runs over the training data before stopping. Since early stopping is used, varying the number of epochs is irrelevant.

Loss-function:

The loss function that will be used is binary cross-entropy. This loss function is generally used for binary classification with a sigmoid activation function at the output, and it accurately measures the loss on binary categorical data.
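For a single example with true label y ∈ {0, 1} and predicted probability p, binary cross-entropy takes the standard form

\[
\mathcal{L}(y, p) = -\bigl(y \log p + (1 - y)\log(1 - p)\bigr),
\]

which is averaged over the batch during training.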

The LSTM will be built with the Python library Keras, using Python 3.5, on a 2018 MacBook Pro (2.3 GHz Intel Core i5, 8 GB 2133 MHz LPDDR3 memory).
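A minimal sketch of one such configuration (one hidden layer, 50 neurons, batch size 25) in Keras, reusing the resampled data from the preparation sketch above; the early-stopping patience and validation split are illustrative assumptions, as they are not specified here:

    from keras.models import Sequential
    from keras.layers import LSTM, Dense, Dropout
    from keras.callbacks import EarlyStopping

    model = Sequential()
    # Hidden LSTM layer with a ReLU activation over sequences of 50 pitch
    # values (one feature per time slice). Stacked configurations would set
    # return_sequences=True on all but the last LSTM layer.
    model.add(LSTM(50, activation='relu', input_shape=(50, 1)))
    model.add(Dropout(0.2))                    # drop a random 20% of nodes
    model.add(Dense(1, activation='sigmoid'))  # output > 0.5 predicts a pause

    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Stop training prematurely when the validation loss stops improving.
    early_stopping = EarlyStopping(monitor='val_loss', patience=5)
    model.fit(X_res, y_res, epochs=50, batch_size=25,
              validation_split=0.1, callbacks=[early_stopping])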

3.3 Performance Metrics

The dataset used is imbalanced, so the accuracy of a model may give a skewed view of its performance. For example, when 90 of 100 data points are negative and a model predicts only negative examples, the accuracy of the model is 90%, which tells us little about its actual performance. To evaluate the models we will therefore mainly look at precision, recall and F1-score. Recall is the percentage of positive examples that are returned by the model and is calculated as follows:

\[
\text{Recall} = \frac{\text{positive examples returned}}{\text{positive examples returned} + \text{positive examples not returned}}
\]

The precision of a model is the likelihood that, when the model predicts a positive example, this prediction is correct. It is calculated as follows:

\[
\text{Precision} = \frac{\text{positive examples correctly returned}}{\text{total positive examples returned}}
\]

The F1-score is the harmonic mean of precision and recall and summarizes both scores in a single value. It is calculated as follows:

\[
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
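To make the accuracy pitfall concrete, the 90-negatives example above can be checked directly (using scikit-learn here, an illustrative choice; the thesis does not name a metrics implementation):

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    # 90 negatives, 10 positives; a model that only ever predicts "no pause".
    y_true = [0] * 90 + [1] * 10
    y_pred = [0] * 100

    print(accuracy_score(y_true, y_pred))    # 0.9 -- deceptively high
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='binary', zero_division=0)
    print(p, r, f1)                          # 0.0 0.0 0.0 -- no positives found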


4 Experiment 1: Word Scope

In Figure 4 the results of the experiments are shown. The best performing configuration was an LSTM with 2 layers, 50 neurons in the first layer, 50 neurons in the second layer, and a batch size of 25. The F1-score of this configuration was 13.87, still very low. The most variation occurred when the number of neurons changed; more layers did not increase the performance of the LSTM. For LSTMs with 1 and 2 layers, performance was best with 50 neurons in each layer. With 5 layers the F1-score dropped to 0.00 for all configurations. In Appendix A the loss curves for each configuration can be seen.

Layers  Epochs  Neurons 1  Neurons 2-5  Batch size  Chance  Accuracy  Precision  Recall  F1-score
1       50      100        n/a          10          89.52   89.52      0.00      0.00     0.00
1       50      100        n/a          25          89.52   89.44      0.00      0.00     0.00
1       50      100        n/a          50          89.52   89.52      0.00      0.00     0.00
1       50      100        n/a          100         89.52   88.28      8.11      1.15     2.01
1       50      50         n/a          25          89.52   89.52      0.00      0.00     0.00
1       50      50         n/a          100         89.52   89.36     25.00      0.76     1.48
1       50      50         n/a          50          89.52   86.72     10.23      3.44     5.14
1       50      50         n/a          100         89.52   87.80     17.91      4.58     7.29
2       50      50         25           25          88.24   88.12     20.00      0.34     0.67
2       50      50         25           50          88.24   88.24      0.00      0.00     0.00
2       50      50         25           100         88.24   88.24      0.00      0.00     0.00
2       50      50         50           25          88.24   78.64     13.19     14.63    13.87
2       50      50         50           50          88.24   81.44      8.33      5.78     6.83
2       50      50         50           100         88.24   88.20      0.00      0.00     0.00
5       50      50         50           50          88.52   88.52      0.00      0.00     0.00
5       50      50         50           50          88.52   88.52      0.00      0.00     0.00
5       50      50         25           50          88.52   88.52      0.00      0.00     0.00
5       50      50         25           50          88.52   88.52      0.00      0.00     0.00

Figure 4: Results of Experiment 1 (word scope), per LSTM configuration

5 Experiment 2: Phrase Scope

The results of this experiment are shown in Figure 5. The LSTMs using the pitch data within the scope of a phrase performed worse overall than the LSTMs using the pitch data within the scope of a word. The only configuration that achieved a positive F1-score was an LSTM with 1 layer, a batch size of 10 and 50 neurons; it achieved an F1-score of 0.27, also very low. In Appendix B the loss curves for each configuration can be seen.


Layers  Epochs  Neurons 1  Neurons 2-5  Batch size  Chance  Accuracy  Precision  Recall  F1-score
1       50      100        n/a          10          89.25   89.24      0.00      0.00     0.00
1       50      100        n/a          25          89.25   89.25      0.00      0.00     0.00
1       50      100        n/a          50          89.25   89.25      0.00      0.00     0.00
1       50      100        n/a          100         89.25   89.27      0.00      0.00     0.00
1       50      50         n/a          10          89.25   89.27    100.00      0.14     0.27
1       50      50         n/a          25          89.25   89.25      0.00      0.00     0.00
1       50      50         n/a          50          89.25   89.25      0.00      0.00     0.00
1       50      50         n/a          100         89.25   89.25      0.00      0.00     0.00
2       50      50         25           25          89.25   89.25      0.00      0.00     0.00
2       50      50         25           50          89.25   89.25      0.00      0.00     0.00
2       50      50         25           100         89.25   89.25      0.00      0.00     0.00
2       50      50         50           25          89.25   89.25      0.00      0.00     0.00
2       50      50         50           50          89.25   89.25      0.00      0.00     0.00
2       50      50         50           100         89.25   89.25      0.00      0.00     0.00
5       50      50         50           50          89.43   89.43      0.00      0.00     0.00
5       50      50         50           50          89.43   89.43      0.00      0.00     0.00
5       50      50         25           50          89.43   89.43      0.00      0.00     0.00
5       50      50         25           50          89.43   89.43      0.00      0.00     0.00

Figure 5: Results of Experiment 2 (phrase scope), per LSTM configuration

6 Conclusion

In this paper a method was proposed that uses an LSTM to gain insight into the predictive value of pitch data for predicting prosodic pauses. Understanding the relations between the individual acoustic cues and prosodic structures might result in a better understanding of how prosodic structures are formed. To carry out this research, speech data from the Corpus Gesproken Nederlands was converted to time-series pitch data, from which two datasets were built: one using the pitch data within the scope of a word and one within the scope of a phrase. An LSTM was trained on these datasets. Since it was not known beforehand which parameters would let an LSTM perform optimally, multiple configurations were experimented with; there were two rounds of testing for each dataset, and the best configuration was tested more extensively. The results show that the LSTM performed equal to or worse than chance on both datasets. Even though performance was low on both, the LSTMs using the word-scope dataset performed better than the LSTMs using the phrase-scope dataset. Three sub-questions were formulated to help answer the main question. The first sub-question was:

• ”How well does pitch-information predict prosodic boundaries within the scope of one word?”

Based on the results of the first experiment, it can be said that there is low predictive value in the pitch data within the scope of one word. The second sub-question was:

• "How well does pitch-information predict prosodic boundaries within the scope of a phrase?"

Based on the results of the second experiment, it can be said that there is almost no predictive value in the pitch data within the scope of a phrase. The third sub-question was:

• "What parameters allow the LSTM to perform optimally when predicting prosodic boundaries with pitch information?"


Considering the results of the experiments, the parameter that introduced the most variation in the F1-score was the number of neurons used; the other parameters did not influence the performance of the LSTM with the same magnitude.

Our main question was whether prosodic boundaries can be predicted using time-series pitch data. Considering the answers to the sub-questions, it can be said that there might be some predictive value in pitch data for predicting phrase boundaries, even though the performance of the LSTM was low on both datasets. The LSTMs that used the pitch data within the scope of a word performed better than the LSTMs that used the pitch data within the scope of a phrase.

7 Discussion

The results of our experiments indicate that there might be predictive value in pitch data when predicting prosodic boundaries. The performance of the LSTMs using the pitch data within the scope of a word was overall better than that of the LSTMs using the pitch data within the scope of a phrase, which also suggests that different scopes hold different predictive value. For future studies it might be interesting to see how well LSTMs perform when pitch data within the scope of a syllable or a phoneme is used.

Previous studies used all acoustic cues to predict different prosodic structures. Even though our results were inconclusive, they suggest that there might be potential in researching the individual relation between an acoustic cue and prosodic structure using a sequence learner.

The overall low performance of the LSTM might also suggest that there is an interdependence between the acoustic cues and prosodic structure. As said before, a prosodic pause consists of a pause, a boundary tone before the pause, and a lengthening of the word before the pause. For future studies it might be interesting to see how different combinations of these variables are able to predict prosodic boundaries; time-series pitch data could, for example, be combined with a variable signifying the lengthening of the word, which might lead to better results.

As can be seen in Appendices A and B, many of the configurations were prone to overfitting. Several parameters of the LSTM, other than the ones we varied, could be changed to address this problem. One of these is the optimization algorithm. In our configurations Adam was used; alternatives are Stochastic Gradient Descent (SGD) [33] or the more recent AdaBound [23]. SGD yields a model that generalizes better, but at the cost of longer training times; AdaBound offers a middle way between the lower training times of Adam and the better generalization of SGD. For future studies it might be interesting to compare performance with a different optimization algorithm. Another parameter that could counteract overfitting is the dropout rate: the dropout used in this study was set to 0.2, and a higher dropout might have improved the performance of certain configurations. Not all overfitting can be attributed to the choice of parameters, however; another potential source was the relatively small size of the available training set, and using a larger training set might improve performance in future studies. A different step towards LSTM optimization is offered by Karpathy, Johnson & Fei-Fei [17], who demonstrate that there exist interpretable neurons inside LSTM networks that hold information on patterns and dependencies in the training set. For future studies these neurons might help identify relations between different acoustic cues and prosodic structure when training LSTMs.


References

[1] Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[2] D. Bijl. Automatic prosodic boundary annotation in Dutch speech based on acoustic features using supervised learning. 2018.

[3] Paul Boersma et al. Praat, a system for doing phonetics by computer. Glot international, 5, 2002.

[4] Sara Bögels, Herbert Schriefers, Wietske Vonk, and Dorothee J. Chwilla. Prosodic breaks in sentence processing investigated by event-related potentials. Language and Linguistics Compass, 5(7):424–440, 2011.

[5] Jeska Buhmann, Johanneke Caspers, Vincent J. van Heuven, Heleen Hoekstra, Jean-Pierre Martens, and Marc Swerts. Annotation of prominent words, prosodic boundaries and segmental lengthening by non-expert transcribers in the Spoken Dutch Corpus. In LREC, 2002.

[6] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

[7] Ken Chen and Mark Hasegawa-Johnson. Improving the robustness of prosody dependent language modeling based on prosody syntax dependence. In 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721), pages 435–440. IEEE, 2003.

[8] Ken Chen, Mark Hasegawa-Johnson, and Aaron Cohen. An automatic prosody labeling system using ANN-based syntactic-prosodic model and GMM-based acoustic-prosodic model. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–509. IEEE, 2004.

[9] George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8609–8613. IEEE, 2013.

[10] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

[11] Kilian Evang, Valerio Basile, Grzegorz Chrupała, and Johan Bos. Elephant: Sequence labeling for word and sentence segmentation. In EMNLP 2013, 2013.

[12] Kazuyuki Hara, Daisuke Saito, and Hayaru Shouno. Analysis of function of rectified linear unit used in deep learning. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2015.

[13] Julia Hirschberg. Pitch accent in context: predicting intonational prominence from text. Artificial Intelligence, 63(1-2):305–340, 1993.

[14] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

[15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.


[16] Je Hun Jeon and Yang Liu. Automatic prosodic events detection using syllable-based acoustic and syntactic features. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4565–4568. IEEE, 2009.

[17] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078, 2015.

[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] Margaret M Kjelgaard and Shari R Speer. Prosodic facilitation and interference in the resolution of temporary syntactic closure ambiguity. Journal of Memory and Language, 40(2):153–194, 1999.

[20] Valerii Kozhevnikov. Speech: Articulation and perception.

[21] D. Robert Ladd and Anne Cutler. Introduction: models and measurements in the study of prosody. In Prosody: Models and Measurements, pages 1–10. Springer, 1983.

[22] Ilse Lehiste. Temporal organization of spoken language. 1970.

[23] Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. CoRR, abs/1902.09843, 2019.

[24] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.

[25] Leena Mary and Bayya Yegnanarayana. Extraction and representation of prosodic features for language and speaker recognition. Speech communication, 50(10):782–796, 2008.

[26] Sieb Nooteboom. The prosody of speech: melody and rhythm. The handbook of phonetic sciences, 5:640–673, 1997.

[27] Chris Olah. Understanding LSTM networks. https://colah.github.io/posts/2015-08-Understanding-LSTMs/, August 2015.

[28] NHJ Oostdijk. Het corpus gesproken Nederlands. 2000.

[29] Janet Breckenridge Pierrehumbert. The phonology and phonetics of English intonation. PhD thesis, Massachusetts Institute of Technology, 1980.

[30] Yao Qian, Zhizheng Wu, Xuezhe Ma, and Frank Soong. Automatic prosody prediction and detection with conditional random field (CRF) models. In 2010 7th International Symposium on Chinese Spoken Language Processing, pages 135–138. IEEE, 2010.

[31] Andrew Rosenberg. Speech, prosody, and machines: Nine challenges for prosody research. In Proceedings of the International Conference on Speech Prosody, pages 784–793, 2018.

[32] Ken Ross and Mari Ostendorf. Prediction of abstract prosodic labels for speech synthesis. Computer Speech & Language, 10(3):155–185, 1996.

[33] Sebastian Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.

[34] Vivek Kumar Rangarajan Sridhar, Srinivas Bangalore, and Shrikanth S Narayanan. Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing, 16(4):797–811, 2008.


[35] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[36] Colin W. Wightman and Mari Ostendorf. Automatic labeling of prosodic patterns. IEEE Transactions on Speech and Audio Processing, 2(4):469–481, 1994.


A Experiments: Word Scope

A.1 Experiments: 1 Hidden Layer

Figure 6: Parameters: Epochs: 50, Neurons: 100, Batch size: 10


Figure 8: Parameters: Epochs: 50, Neurons: 100, Batch size: 50


Figure 10: Parameters: Epochs: 50, Neurons: 50, Batch size: 10


Figure 12: Parameters: Epochs: 50, Neurons: 50, Batch size: 50


A.2 Experiments: 2 Hidden Layers - Word Scope

Figure 14: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2: 25, Batch size: 25


Figure 16: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2: 25, Batch size: 100


Figure 18: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2: 50, Batch size: 50


A.3 Experiments: 5 Hidden Layers

Figure 20: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2-5: 25, Batch size: 50


Figure 22: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2-5: 50, Batch size: 50


B Experiments: Phrase Scope

B.1 Experiments: 1 Hidden Layer - Phrase Scope

Figure 24: Parameters: Epochs: 50, Neurons: 100, Batch size: 10


Figure 26: Parameters: Epochs: 50, Neurons: 100, Batch size: 50


Figure 28: Parameters: Epochs: 50, Neurons: 50, Batch size: 10


Figure 30: Parameters: Epochs: 50, Neurons: 50, Batch size: 50


B.2 Experiments: 2 Hidden Layers - Phrase Scope

Figure 32: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2: 25, Batch size: 25


Figure 34: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2: 25, Batch size: 100


Figure 36: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2: 50, Batch size: 50


B.3 Experiments: 5 Hidden Layers - Phrase Scope

Figure 38: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2-5: 25, Batch size: 50


Figure 40: Parameters: Epochs: 50, Neurons 1: 50, Neurons 2-5: 50, Batch size: 50
