
Pitch Prediction using LSTM and Pitch Classification using MLP-Classifier

Mara D. Fennema

11020164

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dhr. dr. T.O. Lentz

Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 107
1098 XG Amsterdam

Abstract

A common problem found in intelligent personal assistants such as Alexa, Google Assistant and Siri is that the intonation of the spoken words does not sound very natural. In an attempt to find a way to improve such incorrect pitch, this research looks at the possibility of predicting pitch based on the length of the sentence with a Long Short-Term Memory (LSTM) Recurrent Neural Network. Besides that, this research also looks at the effectiveness of using the hidden states of this LSTM to classify types of focus. This classification was done using a Multi-Layer Perceptron Neural Network. It was discovered that this approach to pitch prediction did not work, as the resulting pitches all stay near zero. Similarly, the approach used for pitch classification was not successful either, as the accuracy of the algorithm was equal to chance.

Contents

1 Introduction
2 Literary Background
  2.1 Linguistics
    2.1.1 Pitch
    2.1.2 Focus
  2.2 Neural networks
    2.2.1 Multi-layer Perceptron Neural Networks
    2.2.2 Recurrent Neural Networks
    2.2.3 Long Short-Term Memory Networks
  2.3 LSTM with Linguistics
3 Experimental Setup
  3.1 Pitch Prediction
    3.1.1 Data acquisition
    3.1.2 Data preprocessing
    3.1.3 LSTM
  3.2 Pitch Classification
    3.2.1 Data acquisition
    3.2.2 Data preprocessing
    3.2.3 MLP Classifier
4 Results
  4.1 Pitch Prediction
  4.2 Pitch Classification
5 Conclusion and Discussion
  5.1 Pitch Prediction
  5.2 Pitch Classification
6 Future Work
References

Acknowledgements

First of all, I would like to thank my supervisor, Tom Lentz, for his continuous support during this research and the writing of this thesis.

I would also like to thank Tessa Wagenaar, for her help with brainstorming and support when I would get stuck on certain parts.

Noa Visser deserves my thanks as well, for her continued support during the research and her help brainstorming new ideas. Lastly, I would like to thank Jelte Fennema, Naomi Spaans, Noa Visser and Tessa Wagenaar for proofreading my thesis.

1 Introduction

Over the last few years, the use of dialogue systems, also known as chatbots or Intelligent Personal Assistants (IPAs), has increased considerably. These IPAs, with famous examples such as Apple's Siri, Google's Google Assistant, and Amazon's Alexa, face four different computational problems: speech recognition, text comprehension, the actual computation asked for by the user, and speech synthesis. In this last step, a problem commonly reported by users is that the synthesised speech does not always sound very natural. Intonation goes up in places where it should go down, or stays very flat where it should move around (Matsui et al., 1991).

Pitch is a linguistic term that is closely related to intonation. Just like with musical instruments, the pitch in spoken language is the degree of highness or lowness of a tone. Violante et al. (2013) have created an algorithm to improve the naturalness of unnatural-sounding pitch. This was done by lowering the pitch peaks, thereby reducing the largest offsets in the pitch and increasing the naturalness of the sound, while keeping the intelligibility of the spoken sentence the same. In other words, the adjusted sentence was just as easily understood as the original one, while sounding more natural to human ears. Pitch is often visualised as a sequence of frequencies. Not all spoken sounds have a pitch, for it is only created when the vocal cords are in use. Sounds where the vocal cords are in use are called voiced sounds (Smith, 2000). Examples of voiced sounds are the m, as in 'mom', all the vowel sounds, and the b as in 'brother'. The opposite of voiced sounds are voiceless sounds. These sounds do not have a pitch, for they are created solely by the way the mouth is moved while exhaling. Examples of voiceless sounds are the p, as in 'parent', the s as in 'silence', and the k as in 'cookie' (Jurafsky and Martin, 2000). Because these sounds do not have pitch, the pitch sequence of a sentence is not always continuous, for there are parts that have no data. This can be either because of a pause made by the speaker, where nothing is said, or because a pitchless sound was made.

Vallduví and Engdahl (1996) state that pitch is heavily influenced by the focus of the sentence. Focus is the part of the sentence that is emphasised. More in-depth information on focus will be given in Section 2.1.2. The pitch tends to go up when there is contrasting or new information within a sentence, for that is the part that the speaker wants to emphasise. Hirschberg (2007) and Lentz (2019) have described three different types of pitch, Low, High and Neutral, depending on the focus. For low focus, the pitch goes up a little bit in a specific part of the sentence; for high focus it goes up a lot, whereas for neutral focus the pitch is very consistent, staying around the same frequency. Lentz (2019) has created an LSTM Recurrent Neural Network that classifies these three different types of pitch in a dataset of spoken Dutch with an accuracy of 95%. However, that same research also showed that this is not an easy task: two other algorithms, a Support Vector Machine (SVM) and k-Nearest Neighbour (kNN), were also tested, and did not reach an accuracy higher than 66%.

Instead of improving the pitch of already existing audio where the pitch sounds unnatural, as Violante et al. (2013) have done, this research focuses on predicting pitch based on sentence length, disregarding the content of the sentence. The research question can be split into two separate questions:

1. Can a general pitch of a Dutch sentence be predicted using an LSTM Recurrent Neural Network without the specific words as input?

2. In what way can the hidden layer of the pitch-prediction neural network be used to classify pitches into three separate classes?

Exploring whether a general pitch can be predicted without knowing the meaning of the sentence has not been done before, and if it works, it could be applied to chatbots and Intelligent Personal Assistants (IPAs) to improve the naturalness of the spoken language. By calculating the pitch at the same time as the proper response the IPA has to give, instead of doing so subsequently, computational times could decrease, and the IPA could thus respond more quickly.

Along with a new approach to pitch prediction, the approach to pitch classification is also new. In contrast to what Lentz (2019) has done, classifying three types of pitch directly with an LSTM, this research tries to do that in combination with predicting pitch. If first predicting pitch and subsequently using the hidden layers of this prediction for classification works, it could indicate that the three types of pitch are not merely theorised, but actually exist. Even if the LSTM does not manage to predict a pitch, there is still a chance of the neural network finding a logic in the pitch contours it sees while training. The hidden layer within the neural network might be able to assist in classifying the three different types of pitch.

As discussed above, the current state of pitch within state-of-the-art IPAs is not very natural. Big companies such as Google, Apple and Amazon have not yet succeeded in predicting a proper pitch for a sentence. Because of this, and because the scope of a Bachelor's thesis is necessarily limited given the time available, this research is aimed to be exploratory. It is unlikely that it will result in a properly functioning algorithm that predicts pitch perfectly, but it is still important to explore. Even with possible negative results, it is a good approach to test, on which other research can continue to build.

Even though the hypothesis regarding the pitch prediction itself is not very positive, there is still a chance of the neural network finding a logic in the pitch contours it sees while training. These hidden layers might be able to assist in classifying the three different types of pitch. As has been discussed before, Lentz (2019) has previously created an algorithm that correctly classified the three different types of pitch using an LSTM. This research will also use an LSTM, but now to predict the pitch itself. It is possible that the hidden layers of another LSTM trained on pitch will help classification.

This thesis will explore the possibility of predicting pitch and classifying different types of pitch. This will be done by first looking at previous research in Section 2. Subsequently, the experimental setup will be explained in Section 3. Afterwards, the results will be described in Section 4, followed by the conclusion and discussion in Section 5. Lastly, possible future work will be described in Section 6.

2 Literary Background

As has been discussed in Section 1, the goal of this research is to discover whether or not pitch can be predicted and classified by a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN), and to classify three different types of pitch using information gained by this network. This means two different fields, linguistics and machine learning, come together. Therefore, this literature section has been split into three subsections. Section 2.1 focuses on the knowledge of linguistics necessary to fully understand this research, while Section 2.2 focuses on the machine learning background needed. Lastly, in Section 2.3 these two sections are combined, so that previous research on the combination of the two fields can be properly discussed as well.

2.1 Linguistics

In this section, information will be given about the linguistic terms and concepts discussed in this research. In Section 2.1.1 the concept of pitch will be discussed in detail. In Section 2.1.2 it will be explained what focus is, and how the three different types of focus can be distinguished.

2.1.1 Pitch

The pitch in spoken language is the intonation of the sentence. Variation within a pitch allows speakers to emphasise certain words, thereby conveying more meaning than could be taken from the actual words alone. An example of this is the rise in tone at the end of a question, which in some languages, such as Korean or French, can be the only difference between a statement and a question (Brown and Yeon, 2019; Baker and Hengeveld, 2012). Pitch can vary a lot between languages: pitch in Korean is very muted, with the pitch peaks not much higher than the valleys in the pitch track, whereas in French the pitch goes up and down a lot more (Baker and Hengeveld, 2012).

Not all parts of speech have pitch. As has been described in Section 1, this depends on whether the sound is voiced or voiceless (Baker and Hengeveld, 2012). Voiced sounds use the vocal cords, and therefore have pitch, whereas voiceless sounds do not use the vocal cords and are thus pitchless. In Figure 1a, the audio of a spoken sentence is visualised, and in Figure 1b, the pitch of the same sentence is visualised. Even though the sound continues, there are gaps in the pitch.

Figure 1: The sound (a) and the pitch (b) of the same spoken sentence visualised.

Sometimes when speech is synthesised by computers, the audio can sound quite flat and alien to human ears. This might be partially due to the lack of a correct pitch in the spoken language, as has been hypothesised by Violante et al. (2013). Violante et al. looked at how pitch could be adjusted to sound more natural to humans. The audio whose pitch was improved was not synthesised by a computer, but came from professional public speakers, who all have pronounced pitch peaks. While pronounced pitch peaks are used by a speaker to emphasise their points, they make the spoken language sound less natural and more rehearsed. Thus, the goal of the research was to remove these pronounced pitch peaks. This was done by scaling down the original pitch track by 40% above 200 Hz. The resulting audio sounded a lot more natural, and the intelligibility of the sentence stayed the same.
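To make the transform concrete, the following is a minimal numpy sketch of this peak-reduction idea; the function name and the exact scaling rule are illustrative assumptions, not the actual procedure of Violante et al. (2013).

```python
import numpy as np

def compress_pitch_peaks(pitch_hz, threshold=200.0, factor=0.4):
    """Scale down the part of each pitch value that exceeds the threshold,
    flattening pronounced peaks (a sketch of the idea, not the exact method)."""
    pitch = np.array(pitch_hz, dtype=float)  # copy, so the input stays untouched
    above = pitch > threshold
    # Keep the contour below the threshold; compress the excess above it by 40%.
    pitch[above] = threshold + (1.0 - factor) * (pitch[above] - threshold)
    return pitch

# Example: a peak at 320 Hz is pulled down to 200 + 0.6 * 120 = 272 Hz.
print(compress_pitch_peaks([180.0, 320.0, 250.0]))  # [180. 272. 230.]
```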

2.1.2 Focus

One specific element of linguistics is the focus of a sentence. Focus is defined by Vallduví and Engdahl (1996) as the most salient discourse entity at a given time point; that is, the one at the top of the focus stack. In other words, focus is the new information within a sentence, the part where emphasis is desirable. Focus is a linguistic feature that is usually absent in written language, although there are some ways to write it down. It does, however, come into play in spoken language, and it is believed to heavily influence the pitch of a sentence (Vallduví and Engdahl, 1996). This is because in spoken language the pitch tends to go up when new information is uttered, thereby emphasising the focus of the sentence.

Lentz and Terband (2018) have inferred three types of focus: High, Low and Neutral (usually denoted as HI, LO and NEU). The kind of focus of a sentence depends on what part of the sentence contains the new information, and how this influences the pitch.

For NEU focus, there is not really one specific part of the pitch that differs significantly from the rest. This is usually the case when the entire sentence contains new information. For example, the sentence 'Anna is drinking tea' would have a neutral focus when it is an answer to the question 'What is going on over there?'. The person asking the question knows nothing about the situation, and therefore the entire sentence is new information. Hence, nothing in the sentence gets stressed in the pronunciation, and the pitch does not rise in a specific part of the sentence.

For LO focus, the new information tends to be the subject or object of a sentence. The pitch goes up a little bit, but not a lot, hence the name low. This happens when the question 'What is Anna drinking?' is asked and the response is 'Anna is drinking tea': the new information is the tea, which is therefore in focus. Similarly, the answer to the question 'Who is drinking tea?' would be 'Anna is drinking tea' with Anna being in focus.

For HI focus, the pitch goes up quite a bit, hence the term high. This is the case when a sentence is contrastive, in other words, when someone corrects old information by adding new information to the common knowledge. This could occur when the question 'Is Ben drinking tea?' is answered with 'No, Anna is drinking tea.' Anna is both new information and a correction of old information. In cases such as these, the pitch goes up significantly more than it does with low focus.

2.2 Neural networks

A Neural Network (NN) is a structure within machine learning consisting of nodes, alternatively called neurons (Alpaydin, 2014). There are three types of nodes: the input nodes, which receive the input from the dataset; the output nodes, which return the output; and the hidden nodes, which do most of the calculations. The input depends on the type of dataset, and can range from words or sentences to numbers or even images and documents. The output also depends on the type of problem, and can range from a classification to a prediction or a probability. The output for this research will be a pitch sequence, and the input will be the length of this pitch.

The output is calculated as follows. All the nodes in a neural network have weights, which are multiplied with the input in a way befitting the type of input and output. An image would usually be turned into a matrix based on its pixels, and text classification could be done by vectorising the input sentences (Alpaydin, 2014).

The aforementioned weights that determine the calculation are not simply thought up, but are learned during the training phase of the algorithm. Because there is a limited number of input states (dependent on the input of the use case) and a limited number of output states, not a lot of weights can be applied to those alone. Therefore, there are the hidden nodes, usually grouped together in subsequent layers. This allows for many more possible computations, thereby increasing the possible accuracy of the network. The weights within all the layers are adjusted during the training phase in such a way that the loss, or the incorrectness, of the algorithm is minimised, thus resulting in as good a prediction or classification as possible.
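In standard notation (not specific to this thesis), a single layer with weight matrix W and bias b maps its input x to an activation h through a nonlinear function f:

```latex
h = f(Wx + b)
```

Stacking several such layers yields the hidden layers described above; training adjusts the entries of each W and b to minimise the loss.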

2.2.1 Multi-layer Perceptron Neural Networks

Multi-layer Perceptron (MLP) neural networks are a type of feed-forward neural network (Hastie et al., 2009). Similarly to a standard neural network, an MLP consists of three layers: an input layer, a hidden layer, and an output layer. All nodes, except for the input nodes, utilise a nonlinear activation function, similar to an RNN, as will be explained in Section 2.2.2.

An MLP learns through supervised learning. This means that the input of the neural network is labelled: for a classification problem, such as the one in this research, the network does not have to discover the classes itself, but knows which classes to choose between. The specific technique used by an MLP is backpropagation, a way to minimise the loss by changing the weights of the nodes in the network. Because an MLP uses nonlinear activation functions and consists of multiple layers, it can distinguish data that is not linearly separable. This is exceptionally useful for a problem like this one, as the three types of pitch cannot be split into their categories by two straight lines.

2.2.2 Recurrent Neural Networks

Traditional neural networks as described in Section 2.2 are structured in such a way that they can learn how to predict or recognise certain concepts, dependent on the dataset and implementation. However, there is one shortcoming in these traditional neural networks: their structure is sequential. Information goes from the input, through all the hidden states in order, and then results in an output. These hidden states get updated to minimise the loss, and therefore recognise or predict an output that is as correct as possible. When the data within a dataset is dependent on previous data (e.g. a number sequence, a sentence, etc.), a traditional neural network cannot properly take in the previous information and use it for the output. The solution for this problem is a recurrent neural network (RNN). RNNs have loops built into their structures, allowing previous information to be taken into account for future predictions. This is not necessary for every problem, but is very useful for certain cases, for example determining the meaning of what is being said by a speaker, for speakers tend to refer back to things that were said before, such as in the previous sentence. In problems such as this, the important information is not always given right before the current timestep, but can also be given minutes or even hours earlier. While an RNN is good at retrieving important information from previous instances, the larger the step between the current and the important previous information gets, the more difficult it becomes for the RNN to connect the two pieces of information. In theory, RNNs are able to deal with such situations, but in practice this does not occur: research shows that when the distance between two dependencies grows, the difficulty for gradient-based learning algorithms such as RNNs grows with it (Bengio et al., 1994).

The way an RNN works seems, initially, very similar to a traditional neural network, in the sense that it consists of the input, the hidden layer and the output. There are, however, multiple differences, the first being that the hidden layers of the previous timesteps are also taken into account when calculating the output (Alpaydin, 2014). Because of this, a substantial amount of additional numbers are used in the calculation, resulting in very large numbers, with many of these numbers being the same or similar to each other, as an RNN is a looped neural network. If somewhere in the network a specific value were multiplied by two, then through looping, the number would be 2^n times larger after n iterations of the neural network, thus increasing the necessary computational power tremendously. To counteract this and give more effect to the smaller numbers, a tanh function is used. A tanh function is a hyperbolic tangent function, which, as opposed to the normal tangent function, is modelled not on a circle but, as the name suggests, on a hyperbola. Through the tanh function, all values are squashed so that they lie between -1 and 1.

Two other ways to minimise computing power are element-by-element addition and element-by-element multiplication. These allow separate vectors to be combined into a single vector with just one computation, as opposed to calculating all the values in the vectors one by one, which is significantly heavier computationally. One advantage of using element-by-element multiplication is that it allows for gating. Gating is a way to keep unnecessary values out of future computations, and to give crucial values more importance. By multiplying the key values with 1, and the irrelevant values with 0, the irrelevant ones disappear while the key values keep their value for future computations. To make the gating more effective, it is useful to prevent the use of negative numbers, for otherwise additions might push other values to zero, accidentally gating where no gating is desirable. To do this, the logistic sigmoid function is used. Similarly to the tanh function, the logistic sigmoid function takes all the values and changes them so that everything falls between 0 and 1, preventing inadvertent gating.
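The two squashing functions mentioned above have the standard definitions:

```latex
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \in (-1, 1),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1)
```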

By using gating and the data of the previous timesteps, the RNN manages to forget the redundant information and predict the proper output. Optimising the output is done in the same way as in a traditional neural network, by minimising the specified loss function.

2.2.3 Long Short-Term Memory Networks

Long short-term memory (LSTM) is an architectural structure used in recur-rent neural networks, which is able to deal with the long-term dependencies, as opposed to a normal RNN, as described in Section 2.2.2. This subset of RNNs was created by Hochreiter and Schmidhuber (1997), and has since been used in a multitude of papers. (Graves et al., 2013) (Soutner and M¨uller, 2013) The way that an RNN deals with long-term dependencies, is by adding another gate to the RNN structure, that allows possible predictions to be ignored. By doing this, the prediction can exclude certain timesteps from the past iteration, while still being able to use them in the future. This extra gate has its own set of neurons, that learns when to ignore the information, and when to let it through. Through doing this, an LSTM has very high accuracy scores when dealing with sequences, for it can use both the information from right before, and the in-formation from earlier timesteps. This makes it very useful for usage within Natural Language Processing, such as with speech recognition problems, as was done by Graves et al. (2013), with language modelling, done by Soutner and M¨uller (2013), both explained in more detail in Section 2.3, and possibly for pitch prediction as well, which is the goal of this research.
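For reference, the gating described above is usually written as follows (the standard LSTM formulation, not taken from the thesis; sigma is the logistic sigmoid, the circled dot denotes element-by-element multiplication, c_t is the cell state and h_t the hidden state):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &\quad&\text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```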

2.3 LSTM with Linguistics

As has been described in Section 2.2, a lot of research has been done using neural networks, and there are a lot of applications in linguistics, or Natural Language Processing (NLP).

Both Graves et al. (2013) and Soutner and Müller (2013) looked at ways to use an LSTM for speech recognition. Graves et al. (2013) used a Deep Bidirectional LSTM, which is an LSTM that also looks at all the sequences backwards, and predicts the output both from front to back and from back to front. They found that their implementation, combining this LSTM with Hidden Markov Models, achieved results equal to previous work, now on larger datasets. Soutner and Müller (2013) extended an LSTM on a statistical language modelling task, using a dataset of spontaneous Czech phone calls. They used different extensions in order to gain the best results, adding Latent Dirichlet Allocation (LDA) to improve the modelling of longer context, and they tried to classify the conversations based on topic. This resulted in a significant improvement in speech recognition, compared to the basic model.

One specific piece of research that combines pitch, focus and machine learning looked at classifying different types of pitch by using different types of machine learning algorithms (Lentz, 2019). These algorithms were k-Nearest Neighbour (kNN), Support Vector Machines (SVM), and an LSTM. Lentz collected the dataset by having 40 participants say 36 sentences, labelled with LO, HI and NEU focus. The SVM and kNN implementations were performed by both him and 15 groups of 4 students, all resulting in an accuracy of about 66%. The LSTM, however, with 128 nodes in its hidden layer, had an accuracy of 95%.

3 Experimental Setup

To be able to properly answer the two research questions described in Section 1, two separate experiments have been done. The first is the actual pitch prediction, done by an LSTM, using the length of the pitch as input. The input is chosen this way in order to be able to predict a pitch regardless of the meaning of the sentence. The second experiment uses the hidden layer of the LSTM trained for pitch prediction to classify the three different types of pitch, LO, NEU and HI, using an MLP Classifier for the classification. An overview of this implementation can be seen in Figure 2.

Figure 2: Flowchart of the implementation: the pitch length is the input to the LSTM, which outputs a pitch; the LSTM's hidden layers feed the MLP Classifier, which produces the LO/NEU/HI classification.

3.1 Pitch Prediction

In this section, the first experiment will be explained. This first experiment focuses solely on the prediction of pitch using an LSTM. In Section 3.1.1 the dataset used, the Corpus Gesproken Nederlands, will be explained. In Section 3.1.2, it will be explained how this data was preprocessed, in order to be able to use it in the LSTM. In Section 3.1.3, the way the LSTM is implemented will be explained.

3.1.1 Data acquisition

To attempt pitch prediction using an LSTM Neural Network, the Corpus Gesproken Nederlands (CGN) was used. This dataset is a collection of 900 hours of spoken Dutch created by a consortium of Dutch universities, and supplied by the Instituut voor de Nederlandse Taal (Institute for the Dutch Language) (DLI, 2014). It was created in order to be able to apply the current scientific research on speech analysis to Dutch. The goal of the CGN was to have a large, high-quality dataset of spoken Dutch, with multiple different types of annotation. The reason this corpus is used is that the labelled dataset used for pitch classification is also in Dutch. If the dataset used for pitch prediction were in a different language than the one used for classification, the chances of success could be a lot smaller, for pitch differs between languages, as has been described in Section 2.1.1.

Component                                                              Dutch files   Pitch files
Interviews with teachers of Dutch                                               80        33,092
Spontaneous telephone dialogues (recorded via a switchboard)                   358        78,870
Spontaneous telephone dialogues (recorded on MD with local interface)          304        54,477
Simulated business negotiations                                                 67        14,360
Interviews/discussions/debates (broadcast)                                     428        75,945
(Political) discussions/debates/meetings (non-broadcast)                        90        42,058
Lessons recorded in the classroom                                              207        47,486
Live (e.g. sports) commentaries (broadcast)                                    172         6,916
News reports/reportages (broadcast)                                            327        13,735
News (broadcast)                                                             5,089        53,335
Commentaries/columns/reviews (broadcast)                                       218        13,008
Ceremonious speeches/sermons                                                     6         1,270
Lectures/seminars                                                               25        13,109
Read speech                                                                    561       139,582
Total                                                                        7,932       587,243

Table 1: Overview of the number of files per component in the used part of the CGN, and the number of extracted pitch files.

Figure 3: (a) PitchTier of a sentence as shown in Praat; (b) the same PitchTier, interpolated quadratically to make the intervals consistent.

3.1.2 Data preprocessing

The audio files in the CGN cover a lot of different types of speech, including but not limited to conversations, church services, news items, and monologues; see Table 1. Because of this, a significant number of the files are longer than ten minutes. Since it is not desirable that the predicted pitch tracks are ten minutes long, the audio files first needed to be split, preferably into separate sentences. However, there is no readily available algorithm to do this, so it was assumed that silences indicate the end of a sentence, and the audio files were split based on those. This was done because it seemed the most human-like, and because a pitch stops when a pause is taken in speech; therefore, it seemed like the most practical solution. The splitting was done using the programme Praat (Boersma and Weenink, 2018). To ensure that the tiniest pause in the middle of a sentence would not be counted as the end of the sentence, a threshold of 0.2 seconds was set for something to be counted as a silence. Similarly, the threshold for something to be counted as a sounding part was also 0.2 seconds.

After splitting, the pitch tracks of these separated audio files were extracted and converted down to PitchTiers. This is because the pitch tracks that Praat extracts are not continuous sounds, but rather a collection of possible pitches at certain timesteps, in this case with a step size of 0.01 seconds. When converting the pitch track down to a PitchTier, only the pitch value that is deemed most likely to be correct remains, and all the other values are removed.

Even after converting the pitch track to a PitchTier, the intervals between datapoints were still very inconsistent, as can be seen in Figure 3a. There are parts where the blue datapoints are very close together, and parts where there is a lot of space between two separate points. This is in part due to the fact that not every sound in speech has a pitch, such as the 'p' in parent, as has been explained in Section 2.1.1. These inconsistent interval lengths are not good for the LSTM, which requires consistent timesteps to be able to properly predict the sequences. Therefore, the created PitchTier was interpolated quadratically within Praat, with 4 points per parabola, which returned an approximation of the probable shape in consistent timesteps. The new PitchTier was then saved as a headerless spreadsheet file, so that both the timesteps and the pitch values are stored in a simple table. The number of resulting files can be seen in Table 1.
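The thesis performed these steps with a Praat script (splitAudioAndGetInterpolatedPitch.praat, see the Appendix); the sketch below only illustrates the same pipeline from Python via the praat-parselmouth package, with hypothetical file names and some assumed argument values.

```python
import parselmouth
from parselmouth.praat import call

sound = parselmouth.Sound("fragment.wav")  # hypothetical input file

# Mark silences: minimum pitch 100 Hz (assumed), automatic time step,
# -25 dB silence threshold (assumed), and the 0.2 s minimum silent and
# sounding intervals from Section 3.1.2. Cutting the sound into pieces
# at the silent intervals is omitted here for brevity.
textgrid = call(sound, "To TextGrid (silences)",
                100, 0.0, -25.0, 0.2, 0.2, "silent", "sounding")

# Extract the pitch track (0.01 s timestep, assumed 75-600 Hz range),
# reduce it to a PitchTier, and interpolate quadratically with 4 points
# per parabola, as described above.
pitch = call(sound, "To Pitch", 0.01, 75, 600)
pitch_tier = call(pitch, "Down to PitchTier")
call(pitch_tier, "Interpolate quadratically", 4, "Semitones")

# Save as a headerless spreadsheet file: one (time, frequency) row per step.
call(pitch_tier, "Save as headerless spreadsheet file", "fragment.PitchTier.txt")
```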

Two parts of the dataset were excluded from this research, and have therefore not been added to the data in Table 1. The first is all of the files spoken with a Flemish accent. Flemish has more influences from French than the so-called standard Dutch accent, due to the fact that in Belgium both French and Flemish are official languages. Therefore, to increase the chances of success for this project, Flemish was excluded from the dataset. The second part left out is the entirety of Component A (as it is called in the documentation of the CGN), which contains spontaneous, face-to-face conversations. In this component, (almost) all dialogues contain moments where two people talk at the same time. When this happens, the pitch extraction is influenced by two voices at once, and is therefore unreliable for either training or testing pitch prediction.

In the annotation of the audio files in the CGN, only the starting time has been specified for a specific part of the spoken audio, not an ending time. This kind of annotation is not sufficient to extract the parts where the two speakers speak at the same time. Because it is not feasible to go through the entire dataset to annotate where two speakers are speaking at the same time, it was decided to exclude the entirety of Component A, for that is the only component of the CGN that has two speakers in one audio file. Excluding both the Flemish files and Component A brings the total number of audio files used for this project to 7,932.

| 3 3 3 3 3 |
| 2 2 2 2 2 |
| 5 5 5 5 5 |

(a) Input matrix of 3×l, where l = 5

|   0   0 200 205 210 |
|   0   0   0 300 300 |
| 200 205 210 215 210 |

(b) Pitch matrix of 3×l, where l = 5

Figure 4: Examples of the input matrix, and the matrix of the actual pitches.

Subsequently, the dataset was split into a training set and a test set with an 80/20 ratio. This ensures that the LSTM has a large enough amount of data to properly train on the pitches, preventing overfitting, while still maintaining enough data for testing to make the results statistically significant.

Splitting the dataset was done by extracting the first 20 percent of the files in each component described in Table 1 and calling these the test set; the remaining files together formed the training set. A minimal sketch of this split is shown below.
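The sketch assumes a `components` dict mapping each component name from Table 1 to its (ordered) list of pitch files; all names are hypothetical, not the author's code.

```python
def split_dataset(components):
    """Per-component 80/20 split: the first 20% of each component's files
    become the test set, the rest the training set."""
    train_set, test_set = [], []
    for files in components.values():
        cut = len(files) // 5          # the first 20% of each component
        test_set.extend(files[:cut])
        train_set.extend(files[cut:])
    return train_set, test_set
```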

3.1.3 LSTM

For the LSTM, the PyTorch package in Python 3.7 was used (Rossum and Drake, 2010; Paszke et al., 2017). The simplest implementation would be to give one pitch length as input, and have the algorithm return the corresponding pitch. However, this results in a very long computational time, so another solution was found: a batch of 128 lengths was used as input, returning a batch of 128 pitch files. Both the input and the output are matrices. If the input matrix were 128×1, the output matrix would have the same shape, therefore not returning a pitch file, but just a single number per input.

To prevent this, the input matrix was 128×l, where l is the length of the longest pitch file in the batch. All the values in a single row are the same, namely the length of that original pitch file. In Figure 4a, an example of such a matrix can be found.

In order for the LSTM to learn what is a correct prediction and what is not, the actual pitches needed to be extracted as well. When using matrices as input and output in an LSTM, each needs to be an m×n matrix, so all the pitches need to be padded to the same length. This was done by putting zeros in front of the shorter pitch files, as can be seen in Algorithm 1, with an example of the output in Figure 4b, which is the accompanying matrix for the input matrix shown in Figure 4a.

Algorithm 1: Get next batch of data
Data: Pitch files
Result: pitches, pitchLengths

Get longest pitch length
for file in batch do
    Get length of pitch
    Make list of longest length filled with current pitch length
    Add list to list of pitch lengths
    Extract pitch data from file
    Put 0's in front until current pitch is same length as longest pitch
    Add padded pitch to pitch list
end
return pitches, pitchLengths
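For concreteness, here is a minimal Python sketch of Algorithm 1, assuming each pitch file has already been parsed into a list of frequency values (function and variable names are illustrative, not the author's code):

```python
def get_next_batch(batch):
    """Build the padded pitch matrix and the matching length matrix."""
    longest = max(len(pitch) for pitch in batch)
    pitches, pitch_lengths = [], []
    for pitch in batch:
        # Each row of the input matrix repeats that file's own pitch length.
        pitch_lengths.append([len(pitch)] * longest)
        # Pad with zeros in front until the pitch matches the longest one.
        pitches.append([0.0] * (longest - len(pitch)) + pitch)
    return pitches, pitch_lengths
```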

For the LSTM to learn, the loss needs to be optimised. In this case, the loss function of the LSTM was the mean absolute distance between the output matrix that the LSTM calculated and the matrix of the actual pitches of the files in the current batch. This was computed by subtracting the output matrix from the matrix containing the actual pitches, taking the absolute value, and then calculating the mean over the whole matrix. This mean was the loss that the LSTM had to minimise. The loss was saved every 50 iterations, as the average over those 50 iterations.
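Written out (a reconstruction from the description above, with B the batch size of 128, l the padded length, y-hat the LSTM's output and y the actual pitch):

```latex
\mathcal{L} = \frac{1}{B\,l}\sum_{i=1}^{B}\sum_{t=1}^{l}\left|\hat{y}_{i,t} - y_{i,t}\right|
```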

To give an idea of what a 'good' loss value is, the average pitch over the whole dataset was calculated. This average pitch was then also predicted by the LSTM, and its loss is treated as something akin to chance level: since a prediction algorithm does not have a chance level the way the classification algorithm in Section 3.2 does, this value is used as a benchmark.

3.2 Pitch Classification

The second experiment in this research is classifying the different types of focus within the pitch. This was done with an MLP Classifier, using the same dataset used by Lentz (2019) in his research. The dataset consists of 2171 audio files, each containing one single spoken Dutch sentence with either high, low or neutral focus. All files are labelled with their focus, and as is visible in Table 2, the distribution over the three focus types is nearly perfectly uniform.

Classification   Number of files
Low              725
Neutral          724
High             722
Total            2171

Table 2: Distribution of the three focus classes in the dataset.

3.2.1 Data acquisition

As has been described in Section 2.1.2, the files with NEU focus do not have a specific part of the sentence that contains new information. This dataset contains NEU sentences such as 'We zijn aan het inpakken' ('We are packing'), which could be the answer to 'What are you doing?'. The LO focus files have a specific part of the sentence that is new information. An example from the dataset of a sentence with LO focus is 'Ze verkopen noedelsoep uit bakken' ('They sell noodle soup from containers'), where the focus is on the noodle soup. This could be an answer to the question 'What are they selling from containers?'. The HI focused sentences are contrastive, with the focus on the part that is being corrected. A HI focused sentence from this dataset is 'Nee, ik haal het juist uit pakken' ('No, I'm getting it out of cartons'), which could be a reply to the question 'Are you putting it in cartons?'; the pitch in the sentence goes up on the word 'out'.

3.2.2 Data preprocessing

The preprocessing of this data was similar to that described in Section 3.1.2. In contrast to the CGN, this dataset consists of audio files of separate sentences. Therefore, they do not need to be split, and the pitch can be extracted immediately. Afterwards, the PitchTier was created, followed by quadratic interpolation, and then saved as a separate text file.

After the pitch had been extracted, every pitch was put through the last iteration of the LSTM. For every pitch predicted, the hidden layer was also saved. The hidden layer consists of a tuple of two vectors. The first vector stays the same for all the files, but the second one, known as the cell, is unique to the file whose pitch is predicted. These cells were saved and, together with the type of focus of each file, exported to a CSV file. This CSV file was then used as the dataframe for the Multi-Layer Perceptron (MLP) classifier.
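A hedged sketch of this step in PyTorch follows; `lstm` stands for the trained pitch-prediction model, and the tensor shapes and function name are assumptions, not the author's exact code.

```python
import torch

def extract_cell_features(lstm, pitch):
    # pitch: tensor of shape (seq_len, 1, input_size)
    with torch.no_grad():
        _, (h_n, c_n) = lstm(pitch)
    # c_n has shape (num_layers, 1, hidden_size); take the last layer's cell.
    cell = c_n[-1, 0]
    # The classifier later uses only this vector's mean and sum (Section 3.2.3).
    return cell.numpy()
```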

3.2.3 MLP Classifier

The MLP classifier used is from the scikit-learn library for Python (Pedregosa et al., 2011). It is a network with a hidden layer of 100 nodes, using stochastic gradient descent and a learning rate of 0.0001.

The input for the MLP classifier is extracted from the CSV described in Section 3.2.2, using the pandas library for Python (McKinney, 2010). As the MLP classifier cannot work with variable-length lists directly, both the mean and the sum of each cell were calculated. Subsequently, the dataset was split randomly into a training and a test set, with an 80/20 ratio. The ratio of each focus class within the training and test sets, for both the mean and the sum sets, can be found in Table 3. The MLP classifier was then trained, trying to minimise its loss.
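A minimal sketch of this classification step is given below; it assumes the CSV has one row per file with a focus label and a summarised cell feature, and the file and column names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("cells.csv")          # hypothetical file name
X = df[["cell_mean"]]                  # or ["cell_sum"] for the sum-based run
y = df["focus"]                        # LO / NEU / HI labels

# Random 80/20 split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = MLPClassifier(hidden_layer_sizes=(100,), solver="sgd",
                    learning_rate_init=0.0001)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))       # accuracy on the test set
```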

       Training (sum)   Test (sum)   Training (mean)   Test (mean)
LO     586              157          591               134
NEU    598              126          576               148
HI     590              152          569               153

Table 3: Distribution of the focus classes over the training and test sets, for both the sum-based and the mean-based sets.

4 Results

4.1 Pitch Prediction

Test set   Average pitch
35.41      22.02

Table 4: Loss of pitch prediction on the test set and on the average pitch.

As has been discussed in Section 3.1.3, the accuracy of the output of the LSTM is determined by the loss. As can be seen in Figure 5, the loss starts at around 30, fluctuates around that point for roughly the first 2000 iterations, and then drops very steeply. After the drop, it goes back up again, with an all-time high at around 2750 iterations. A second drop happens around iteration 3400, going back up again as quickly as after the previous drop at 2000. After around 3700 iterations, the loss goes down and seems to stay around 10-15, which is below the loss obtained when the average pitch is predicted. In Figure 6, the average pitch over all of the pitch files is shown. The predicted pitch stays a straight line at around zero, while the average pitch goes up. This predicted average pitch has a loss of 22.02, as can be seen in Table 4, while the average loss on the test set is significantly higher at 35.41.

Figure 6: The average pitch and the average pitch predicted by the model.

The loss on the test set has also been visualised, in Figure 7. This visualisation shows that the loss fluctuates a lot between 10 and 80, straying quite far from the average of 35.41.

In Figure 8, four pitches from different components of the dataset are plotted, together with their predicted pitch. Similarly to the predicted pitch in Figure 6, the predicted pitch stays near zero, whereas the actual pitch goes up.

Figure 8: Graphs of the actual pitch and the predicted pitch for four components: (a) interviews with teachers of Dutch; (b) spontaneous telephone dialogues (recorded via a switchboard); (c) spontaneous telephone dialogues (recorded on MD with local interface); (d) simulated business negotiations.

4.2 Pitch Classification

As has been described in Section 3.2, classification has been done twice, once based on the mean of the cell of the LSTM, and once based on the sum of the cell of the LSTM. In Table 5, the accuracies of both classification systems can be seen. Both are near 0.33, the value of chance. An accuracy of 0.33 would occur if the algorithm classified everything as the same class.

Sum Mean Chance

0.29 0.31 0.33

Table 5: Accuracy scores of classification based on the sum and the mean of the cell in the LSTM

In Table 6, the results of the pitch classification based on the sum are shown. The classifier labels almost everything as NEU, with only one file classified as HI, whose actual class was NEU, and one classified as LO, which was actually HI. A similar thing happened when classifying based on the mean; the results are shown in Table 7. With the mean-based classification, however, almost everything is classified as LO. One pitch track was classified as NEU, which was supposed to be LO, and one HI pitch track was correctly classified as HI.

               Classified as LO   Classified as NEU   Classified as HI
Actually LO    0                  157                 0
Actually NEU   0                  125                 1
Actually HI    1                  151                 0

Table 6: Classifications based on the sum.

               Classified as LO   Classified as NEU   Classified as HI
Actually LO    133                1                   0
Actually NEU   148                0                   0
Actually HI    152                0                   1

Table 7: Classifications based on the mean.

5 Conclusion and Discussion

In this thesis, the possibility of predicting pitch and classifying pitch has been explored. The hypothesis was that while the pitch prediction part would not yield great results, the pitch classification might. The reason for the negative hypothesis for the pitch prediction was the fact that the state-of-the-art Intelligent Personal Assistants do not speak with perfect pitch, even though they are developed by big companies such as Google, Amazon and Apple. However, the hidden states within the neural network used for this prediction might contain information that could help with classifying different types of focus.

5.1 Pitch Prediction

For the pitch prediction, an LSTM Recurrent Neural Network has been implemented, which takes the length of the pitch as input and outputs the pitch. Unfortunately, due to various factors, the LSTM has not been trained as long as would have been ideal. These factors were the time limit, the fact that training the LSTM is computationally very heavy, and the fact that some pitches returned by Praat were very long, resulting in a very memory-intensive batch and eventually the 'Killed' message from Python, as the computer could not continue the calculation. Due to this, the program had to be restarted from the last saved point, taking up more and more time, for it could not run indefinitely. This also meant that, due to the sheer size of some of the files, some batches were simply unusable and were skipped.

These long files were sometimes more than 500,000 timesteps of 0.001 seconds long, which means that the pitch extracted by Praat was longer than 8 minutes. While they were excluded from this research, this indicates that the approach of splitting the audio files based on silence was not the best solution. The reason for this might be that some silences were shorter than 0.2 seconds, which was used as the threshold for something to be seen as a silence.

The results over time, as shown in Figure 5 in Section 4.1, do not follow an easily discernible path, but have deep valleys followed by high peaks. A probable explanation for this is that the dataset is ordered by type of audio, as has been explained in Section 3.1.1. It is probable that the neural network got overfitted to a certain type of audio file, for example the spontaneous telephone dialogues, resulting in a low loss for a while, until the next batches contained the next type of audio, for example the simulated business negotiations, which might have very different pitches.

By the end of training the neural network, the loss seems to stay consistently below the loss of the average pitch. From looking only at the training set's loss function, it is unclear whether this is because the LSTM is increasing in accuracy, or because it is simply overfitted on one type of data.

However, when including the loss of the test set, it can be inferred that the LSTM probably overfitted on one type of data. This is because the loss on the test set fluctuates a lot, as shown in Figure 7. Furthermore, the average loss on the test set is significantly higher than the loss on the predicted average pitch, as is shown in Table 4.

Additionally, another thing that supports the overfitting theory is the fact that the loss on the test set, while fluctuating heavily, does stay above or below the average line for multiple files in a row. This shows that certain parts of the dataset consistently get predicted correctly, whereas others consistently get predicted incorrectly.

However, another explanation for the fluctuating loss is the fact that the predicted pitch only consists of very small numbers around zero, as can be seen in Figures 6 and 8. Here, the predicted pitch stays around zero, never going up or down. Thus, it could be the case that the pitches where the loss is very low are simply pitches with a low frequency.

Furthermore, the reason that the loss of the average pitch is significantly lower than that of the test set is that a large part of the average pitch is around zero, making the total average loss low. As the predicted pitch is near zero, and the first half of the average pitch is as well, the prediction is close to correct in this first half. The average pitch is near zero for so long because not all the files are of the same length, and the shorter files are padded with zeros to make them all the same length.

In conclusion, the neural network at this point does not yield good results. The loss on the test set is significantly higher than that on the average pitch. This indicates that the low loss of the neural network at the last iteration is either due to overfitting or, more likely, simply dependent on the frequency height of the actual pitches. A way to determine whether this is the case is to let the neural network train longer. This and other solutions will be discussed in Section 6.

5.2 Pitch Classification

For pitch classification, it was hypothesised that it would be possible to classify the type of focus in the dataset created by Lentz (2019). However, given the results in Section 4.2, this was not the case. The accuracy is a little lower than 0.33, which is the value of chance.

The reason why the accuracy is not exactly chance is that, as is visible in Table 3, the classes are not perfectly balanced between the training and test sets, because the split was made randomly.

As can be seen in Table 6, the class that (nearly) everything got classified as is NEU when using the sum as input, whereas, looking at Table 7, everything gets classified as LO when using the mean as input. This has nothing to do with the values of the mean and the sum, but with the distribution of the classes within the training set. Because the training and test sets were created randomly, the three classes were not split up perfectly; some classes were more prominent in the training set than others, as can be seen in Table 3. In the sum training set NEU was most prominent (and least prominent in the test set), while for the mean this was the case with LO. Thus, the MLP classifier was overfitted: it could not find a proper predictor within the input, and proceeded to classify nearly everything as the class that was most prominent in the training set.

It could be argued that the mean or the sum of the cell of the LSTM does not carry enough information to base the classification on, so other ways of classifying, using the entire matrix, would be useful. This has not been done in this research due to time constraints on the project. It would, however, be interesting as future work, as discussed in Section 6.

Thus, it can be concluded that the current implementation of this classification model does not work. The implementation that uses either the sum or the mean of the cell of the LSTM used for pitch prediction does not actually classify focus into different classes; it simply assigns everything to the same class.

6 Future Work

As has been discussed in Section 5, a lot of improvements are necessary for both pitch prediction and classification. For example, to ensure that the LSTM is not overfitted, it could be trained longer. However, as the graphs in Figures 6 and 8 show, the resulting pitch seems to always be zero. It is possible that this might change, but since it has not done so after more than 4000 iterations, the chance of this happening seems small.

Besides training the network longer, it could also be explored whether more hidden nodes would be useful. The research of Lentz (2019), which used an LSTM with 128 hidden nodes, yielded very positive results. In contrast, this research used only 8, because 128 hidden nodes is significantly more computationally heavy. It is possible that with more hidden nodes, the network would find better pitches.

Additionally, it could be explored whether padding the files with values other than 0 (such as -1) could improve the prediction. As has been explained in Section 2.2.2, zeros in neural networks can be used for gating, for anything multiplied with zero stays zero. It is possible that by padding the shorter files with zeros, accidental gating occurred. To prevent this, a negative number could be used for padding, or no padding at all. In the case of no padding whatsoever, the input would have to be a single pitch length instead of a matrix of multiple pitch lengths, thus increasing the computation time.

Another solution for better pitch prediction would be to have the LSTM predict the next pitch value given the previous pitch value, instead of predicting the entire pitch in one go. The reason this research chose not to do that is the following: a pitch track as a whole might fluctuate a lot, as can be seen in Figure 3a, but the difference between two timesteps is usually very small. Therefore, the chance of the prediction resulting in a straight line seemed quite large, so this approach was not used. However, it might still garner better results than the results of this research, so trying it would be interesting.

Trying a different way of splitting the audio files of the CGN than the approach used in this research might also help. As has been discussed in Section 5.1, some of the audio files were incredibly long, indicating that the Praat script written was not perfect for this dataset. While simply excluding these long files seemed to work, another way of splitting sentences could be explored, to prevent the need to exclude parts of the dataset.

As for the pitch classification, as has already been noted in Section 5.2, using all the data in the cell of the LSTM could result in a higher accuracy than that of the current implementation. As the current implementation uses the mean and the sum of the cell, a lot of raw data has been excluded, compressing the matrix to one single number.

Another way to possibly increase the accuracy of pitch classification would be to improve the pitch prediction. As the input of the classification algorithm comes from the LSTM, improving this LSTM could also improve the classification.

In conclusion, future research could go in multiple directions in order to improve the current results. By training the network longer, changing the number of hidden nodes, and changing the padding of the pitches, or even foregoing the padding altogether, the pitch prediction, and by extension the pitch classification, might be improved. To improve the pitch classification itself, training the classifier with the raw data of the cells of the neural network, instead of the sum or the mean of these cells, could improve accuracy.

References

Alpaydin, E. (2014). Introduction to Machine Learning. MIT Press.

Baker, A. E. and Hengeveld, K. (2012). Linguistics, volume 15. John Wiley & Sons. p. 299.

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

Boersma, P. and Weenink, D. (2018). Praat: doing phonetics by computer [Computer program]. Version 6.0.37, retrieved 3 February 2018 from http://www.praat.org/.

Brown, L. and Yeon, J. (2019). The Handbook of Korean Linguistics. Blackwell Handbooks in Linguistics. Wiley. p. 41.

DLI, D. L. I. (2014). Corpus Gesproken Nederlands - CGN (version 2.0.3). Data set.

Graves, A., Jaitly, N., and Mohamed, A. (2013). Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 273–278.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.

Hirschberg, J. (2007). Pragmatics and intonation. Pragmatics, pages 1–18.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation.

Jurafsky, D. and Martin, J. (2000). Speech & Language Processing.

Lentz, T. O. (2019). Machine learning as a tool for linguistic pattern comparisons.

Lentz, T. O. and Terband, H. R. (2018). Articulatory strategies to mark prominence in consonants. In Cho, T., Kim, S., Choi, J., Kim, J. J., Kim, S. Y., and Lee, K.-J., editors, Linguistic and cognitive functions of phonetic granularity in speech production and/or perception in L1 and L2, volume 1, pages 42–43. Hanyang International Symposium on Phonetics and Cognitive Sciences of Language, Hanyang Institute for Phonetic and Cognitive Sciences of Language.

Matsui, K., Pearson, S. D., Hata, K., and Kamai, T. (1991). Improving naturalness in text-to-speech synthesis using natural glottal source. In Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 769–772.

McKinney, W. (2010). Data structures for statistical computing in Python. In van der Walt, S. and Millman, J., editors, Proceedings of the 9th Python in Science Conference, pages 51–56.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Rossum, G. V. and Drake, F. L. (2010). Python Tutorial.

Smith, C. L. (2000). Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet (1999). Phonology.

Soutner, D. and Müller, L. (2013). Application of LSTM neural networks in language modelling. In International Conference on Text, Speech and Dialogue, pages 105–112. Springer.

Vallduví, E. and Engdahl, E. (1996). The linguistic realisation of information packaging. Linguistics, 34.

Violante, L., Rodríguez Zivic, P., and Gravano, A. (2013). Improving speech synthesis quality by reducing pitch peaks in the source recordings. pages 502–506.

Appendix

The code used in this thesis can be found on a public GitHub repository.¹ As both datasets are private, only the code is on the repository. A request to access the CGN can be made at the website of the Instituut voor de Nederlandse Taal.²

The sequence to run the code in the repository is as follows:

1. splitAudioAndGetInterpolatedPitch.praat
2. train.py
3. test.py
4. makeCSV.py
5. classify.ipynb

Besides these 5 files, a collection of other files is in this repository as well. These files have been used to visualise some of the data or to gain other knowledge surrounding the research and data. Two of the files, models.py and dataloader.py, contain classes used in the files noted above. The most recent model is also on the repository, under the name PitchLSTM iteration 4050.pt. This is so that future training can pick up where this research ended, and the first 4050 iterations do not have to be repeated.

¹ Code for this thesis: https://github.com/maradf/pitchprediction
² CGN: https://ivdnt.org/downloads/tstc-corpus-gesproken-nederlands/
