Geo-locating based on the pronunciation of words within the Dutch Low Saxon area

N/A
N/A
Protected

Academic year: 2021

Share "Geo-locating based on the pronunciation of words within the Dutch Low Saxon area"

Copied!
27
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

Geo-locating based on the pronunciation of words within the Dutch Low Saxon area

Martijn E.N.F.L. Schendstok

Bachelor thesis

Information Science

Martijn E.N.F.L. Schendstok S2688174


abstract

This research focuses on predicting where someone is from based on their pronunciation of words. The research is applied to Dutch Low Saxon, and thus makes predictions on locations within the Dutch Low Saxon area in the Netherlands. It focuses on the research question: Can we predict the geographical location of people within the Dutch Low Saxon area based on their pronunciation of words? With the sub-questions: How well will the model perform on two different levels of location, province and municipality? Which spectral features can improve the model the most?

Wieling et al. (2007) did research on the contemporary Dutch dialects, based on the Goeman-Taeldeman-Van Reenen-Project (GTRP) data. The dialects were compared using the Levenshtein distance, a measure of pronunciation difference. The GTRP data of the Netherlands, reduced to its three most important dimensions via MDS, shows two distinct dialect areas, namely the Frisian and Low Saxon areas. If it is possible to realise a reliable model, it would be useful for forensic police investigations and would give insight into the use of spectral features for dialect classification.

The dataset that is used for this research is the Stemmen1 dataset, which includes the pronunciation of the same 10 words for each participant. These words are transcribed into SAMPA script using the Munich Automatic Segmentation System2 (MAUS), and multiple spectral features are computed from the audio files. These spectral features are maximum frequency, minimum frequency, bandwidth, centroid, MFCC, Chroma, zero crossing rate, and RMS.

The classification task is applied to both province labels and municipality labels, implementing four classification algorithms: Random Forests, Support Vector Machines (SVM), Linear SVM, and K-nearest neighbours. These classifiers are first trained and tested using the transcribed words only as features, to obtain a baseline for the increase in accuracy when implementing the spectral features. With a grid search, the best combination of spectral features is found for each classifier, on both municipality classification and province classification.

The classification algorithm that performed best in the final model is the Random Forests classifier. With Centroid, MFCC, and RMS as spectral features for municipality classification, it achieved an accuracy of 32%. For province classification, the spectral features implemented to achieve an accuracy of 60% are minimum frequency and MFCC.

It was concluded that we can predict the geographical location of people within the Dutch Low Saxon area based on their pronunciation of words. The performance on classifying municipalities is rather decent as well, especially as it is a multi-class classification task with 54 different labels. The spectral feature that improved the model the most is MFCC: it shows the highest improvement in accuracy and returned the best results in almost all cases. It occurs in 6 out of 8 best spectral feature combinations, including the highest accuracy for both municipality and province.

Unfortunately, the dataset was too small for really good and consistent results, especially because it is really imbalanced. It would therefore be interesting to repeat this research on a larger dataset; it is expected that this would yield better results. Furthermore, much higher accuracies could be achieved if the classifiers were improved by setting more optimal parameters. For this research, the intention was to focus on the improvement gained by implementing spectral features and to compare the classifiers on this task.

1 https://woordwaark.nl/stemmen/


contents

Abstract i
Preface iii
1 Introduction 1
2 Background 2
3 Data 3
3.1 Data Description 3
3.2 Processing 3
4 Method 5
4.1 Textual Feature 5
4.2 Spectral Features 5
4.2.1 Pitch 5
4.2.2 Mel-Frequency Cepstral Coefficients 6
4.2.3 Chromagram 6
4.2.4 Zero Crossing Rate 7
4.2.5 Root-Mean-Square 7
4.3 Classifiers 7
4.3.1 Random Forests 7
4.3.2 Support Vector Machine 7
4.3.3 K-Nearest Neighbours 8
4.4 Approach 8
5 Results 10
5.1 Development 10
5.2 Final Model 11
6 Discussion 12
7 Conclusion 13
References 14
Appendix 17
a Large Tables 17

list of tables

Table 1 Participants per province. 4

Table 2 Accuracy on transcribed words (without spectral features) per classifier. 10

Table 3 Accuracy of the individual implementation per spectral feature for each classifier. 10

Table 4 Spectral feature combination with the highest accuracy score per location level and classifier. 10

Table 5 Accuracy, Precision, Recall, and F-score of the final model per classifier and location level. 11

Table 6 Participants per municipality. 17

Table 7 Accuracy for all combinations of spectral features, per classifier and location level. 18


preface

I want to start off by saying that when I started this thesis I was completely new to analysing audio signals. It is the first time I implemented spectral features. While this made it sometimes overwhelming when I started, I did learn a great deal. And I must say that I struggled a lot working on this thesis. The main subject of our supervisor, Dutch Low Saxon, was not my cup of tea. I'm not originally from the Low Saxon area in the Netherlands and not greatly interested in dialects in general. But the implementation of spectral features did interest me, once I got the basic understanding.

I really want to thank my thesis supervisor Martijn Bartelds. He was lenient with certain deadlines due to circumstances, but he did set the bar high. I may not have loved this when I received feedback, but looking back it did result in a better thesis.

Unfortunately, the final thesis is not on the level I wanted; mainly the discussion and conclusion are too rushed. Due to covid-19 the time to write a thesis was not optimal to start with, but in the last week before the deadline I (luckily) found two mistakes in my code, one of which I found on the day of the deadline. But I'm really proud of my results and feel like I came close to the bar that Mr. Bartelds set for the Introduction, Data Description, Method, and Results.

1 introduction

With most people using their mobile phone for social media, with location tracking enabled for most applications, their location is embedded in the metadata of most of their posts or messages. This information is used by companies for recommender systems (Guy et al., 2010), marketing (Sundsøy et al., 2014), and more. Emergency services (Power et al., 2014) and rapid disaster response (Ashktorab et al., 2014) use the geo-location data, and so does law enforcement during forensic investigations (Brunty and Helenek, 2014).

In forensic casework, if there is evidence in the form of recordings, forensic phonetics (Jessen, 2008) can be applied. Using forensic phonetics, investigators might find the suspect by analysing the suspect's voice (Baldwin and French, 1990). Finding out where a suspect is from could be very helpful in an investigation; in other words, geo-locating based on forensic phonetics.

Wieling et al. (2007) did research on the contemporary Dutch dialects, based on the Goeman-Taeldeman-Van Reenen-Project (GTRP) data. The dialects were compared using the Levenshtein distance, a measure of pronunciation difference. The GTRP data of the Netherlands, reduced to its three most important dimensions via MDS, shows two distinct dialect areas, namely the Frisian and Low Saxon areas. Their map of the average Levenshtein distance between the GTRP varieties for the Netherlands shows strong connections among the Dutch Low Saxon dialects, especially within the provinces of Groningen and Drenthe. The dialects of Gelderland and western Overijssel (both Low Saxon) can be identified as well, and there is a clear boundary between Low Saxon (northeastern dialects) and Low Franconian (western, southwestern and southern dialects).

The results give us an indication that predicting location on a province level might be possible based on pronunciation, although the dialect landscape could have changed, as the recordings in the GTRP were gathered during the period 1980-1995. At least the possibility to predict someone's location seems to be higher in the Frisian and Dutch Low Saxon areas compared to the Low Franconian part of the Netherlands. On the other hand, these closely unified varieties in the Dutch Low Saxon provinces might make it harder to predict the town or municipality someone is from.

This research focuses on predicting where someone is from based on their pronunciation of words. The research is applied to Dutch Low Saxon, and thus makes predictions on locations within the Dutch Low Saxon area in the Netherlands. The Dutch Low Saxon area is selected based on the findings by Wieling et al. (2007). Furthermore, since 2018 Dutch Low Saxon has been recognised as an official language by the Dutch government3. If it is possible to realise a reliable model, it would be useful for forensic police investigations and would give insight into the use of spectral features for dialect classification.

This research will focus on the question: Can we predict the geographical location of people within the Dutch Low Saxon area based on their pronunciation of words? With the sub-questions: How well will the model perform on two different levels of location, province and municipality? Which spectral features can improve the model the most?

2 background

Leemann et al. (2018) created the English Dialects App (EDA), which features a dialect quiz and dialect recordings. Earlier apps were developed for Swiss German (Leemann et al., 2015) and German (Kolly and Leemann, 2015). EDA gathers data on a wide number of linguistic variables (lexical, phonological and grammatical) and predicts the user's dialect based on their answers. The acoustic-phonetic corpus obtained from EDA was automatically transcribed using MAUS4 (Kisler et al., 2016). The prediction was done using 26 variables (each with 2 to 10 variants) of different types: phonetic and phonolexical (73%), lexical (12%), morphological (12%), and syntactic (3%). Variables each showing different geographical distributions were chosen so that small areas could be distinguished from each other on the basis of a unique combination of variants across the set of variables. This data is relatively similar to the data used for this research (although the data in this research is 100% phonetic and phonolexical, with 10 variables instead of 26), and shows that phonetic and phonolexical variables contain a high amount of information useful for geo-locating.

In Leemann et al. (2018) no classification model was implemented. The dialect prediction works off a table that contains a row for each locality and a column for each pronunciation variant. This lookup table works similarly to a basic decision tree. Unfortunately, the paper by Leemann et al. (2018) does not contain data on the accuracy of their geo-locating model.

Johnson (2019) researched emotion detection through speech analysis, creating a multi-label classification model which identifies the emotion from speech samples. In the paper, K-Nearest Neighbours, Support Vector Machine (SVM), and Deep Neural Network classifiers are compared. To the classifiers multiple spectral features are applied, namely Mel Frequency Cepstral Coefficients (MFCC), Chromagram from the waveform, Mel scale Spectrogram, Spectral contrast of waveform, and Tonal Centroid features. The research by Johnson (2019) showed that the Deep Neural Network (DNN) classifier performed best, followed by SVM.

Jain et al. (2020) researched emotion detection through speech analysis as well. They implemented the SVM classifier due to its simple and efficient classification algorithm, used for classification and pattern recognition. Their research implements spectral features such as pitch, energy, and speech rate, but mainly compares the Mel Frequency Cepstral Coefficients (MFCC) to the Linear Prediction Cepstral Coefficients (LPCC) spectral features. Their research shows the best results for MFCC, with an overall 11.96% higher accuracy than LPCC.

4 Munich Automatic Segmentation System

3 data

3.1 Data Description

The dataset that is used for this research is the Stemmen5 dataset. The data is obtained from a citizen science project, using an online quiz, focused on the Dutch Low Saxon population. The project is a collaboration between CGTC6, the University of Groningen and Lân fan taal7. This dataset contains 4343 items with 10 variables.

During the quiz the participant is shown 10 words; for each word the participant is asked to select, from a list, which transcribed pronunciation is most similar to how the participant pronounces it. Once the participant has made his or her choice, the participant is asked to pronounce that word, which is recorded. After the participant finishes the quiz for all 10 words, he/she is asked to fill out some personal information. This information is saved in the dataset as a separate entry per word, with the following variables:

1. ID: the ID of the participant

2. Word: the word for which pronunciations were included

3. Choice: selected phonetic word from the app

4. MAUS: spoken word transcribed in IPA style

5. Longitude: longitude of the participant's location

6. Latitude: latitude of the participant's location

7. Location: city or town of the participant's location

8. Province: province of the participant's location

9. Gender: gender of the participant

10. Age: age of the participant

Besides these variables, the audio for each entry is available as well, in the form of a wav-file. In total the dataset contains the entries of 460 participants. Each participant was asked to pronounce 10 words, which have been automatically transcribed with the Munich Automatic Segmentation System8 (MAUS) (Kisler et al., 2016) into SAMPA script. Those 10 words are:

aarde, bed, briefje, eitje, huis, kinderen, muis, twee, vier, wij kloppen.

Unfortunately, not all participants pronounced all 10 words. Due to this, the dataset only contains 4343 items instead of 4600 (= 460 × 10).

3.2 Processing

The dataset contains some entries without any location information (longitude, latitude, location, nor province); these cannot be used in the development of the model. Furthermore, there are entries with a location outside the Dutch Low Saxon area. Only the entries with a location within the Dutch Low Saxon area are selected. This selection is primarily done using a list of towns within the Dutch Low Saxon area from www.taal.phileon.nl9, a website that focuses on the 'other' original languages of the Netherlands. The towns within the provinces of Groningen, Drenthe, and Overijssel are automatically selected, as these provinces lie completely within

5 https://woordwaark.nl/stemmen/

6 Centrum voor Groninger Taal & Cultuur:https://www.cgtc.nl

7 https://lanfantaal.com

8 https://www.bas.uni-muenchen.de/Bas/BasMAUS.html


the Dutch Low Saxon area. The towns within the provinces of Fryslân, Gelderland, and Utrecht that are not selected by the previously mentioned list are checked by hand.

To the selected dataset a new location-oriented variable is added, the municipality, based on the town/city (location variable, variable 7). The municipality variable is added to lower the number of categories for the machine learning model. This is done according to data on towns and cities in the Netherlands for 202010 from the CBS11. The municipality is added by looking up which municipality the town/city of each data entry belongs to according to the data from the CBS.

The dataset contains some instances where a participant ID does not occur 10 times, once for each word. In these cases data on all 10 words is not available for these participants. To impute the missing data, random imputation of a single variable using information from related observations (Gelman and Hill, 2006, page 533) is applied. This is only applied to participants with 5 words or more, so that the data of each participant is at least 50% original. The word imputation is realised by randomly assigning the missing words from other participants within the same municipality.

Table 1: Participants per province.

Province     Participants
Drenthe      60
Groningen    145
Overijssel   31
Gelderland   22
Fryslân      1

The code used to process the data is available on the GitHub repository12. The final dataset contains the entries of 259 participants, from 137 different cities, 54 municipalities (Table 6, page 17), and 5 provinces (Table 1). To the data the corresponding audio file is added as well, for ease of use. The final dataset thus has two extra variables, totalling 12:

11. Municipality: municipality of the participant's location

12. Audio file: the corresponding wav-file to the word

10 Data 2020: https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS&tableId=84734NED&_theme=233

11 Central Agency for Statistics of The Netherlands (Centraal Bureau voor de Statistiek)

4 method

4.1 Textual Feature

The textual feature is the MAUS-transcribed word. To prepare the MAUS-transcribed words, one-hot encoding is applied, as the classifiers do not work with string data. With one-hot encoding, the IPA string of the transcribed word is encoded into a binary array. Each location in the array corresponds to an IPA-style character, and if the character occurs in the string it is represented with a 1. This way no information is lost concerning which transcribed sounds occur.
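As a sketch of this encoding step (the character inventory shown is hypothetical, not the actual SAMPA alphabet of the dataset), the binary presence array can be built as follows:

```python
def one_hot_encode(transcription, alphabet):
    # 1 if the character occurs anywhere in the transcribed word, else 0;
    # character counts and ordering are deliberately discarded.
    return [1 if ch in transcription else 0 for ch in alphabet]

alphabet = list("abehimnstuy")          # hypothetical character inventory
vec = one_hot_encode("huis", alphabet)  # marks the positions of h, u, i and s
```

Note that repeated characters collapse to a single 1, so the encoding only records the presence of each transcribed sound.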

4.2 Spectral Features

To extract the spectral features, the LibROSA13 package by McFee et al. (2015) is applied to the audio files in the dataset, and the features are added to the corresponding data entry. When loading the audio files (wav-files) with librosa, the audio is automatically resampled as a waveform with the sampling rate (the number of samples of audio carried per second, measured in Hz). By default, all audio is mixed to mono and resampled to 22050 Hz at load time. Using the waveform and sampling rate, the spectral features can be extracted with the librosa package. With the librosa package each feature is computed per frame; the waveform is split up into frames using the sampling rate. To keep the feature input the same size for each audio file, the average over these frames is taken.
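The frame-then-average pattern can be illustrated without librosa itself. The sketch below frames a synthetic one-second tone using librosa's default frame and hop lengths (2048 and 512 samples) and averages a per-frame feature into a single fixed-size value:

```python
import numpy as np

def frame_signal(y, frame_length=2048, hop_length=512):
    # Split a waveform into overlapping frames, librosa-style (no padding).
    n_frames = 1 + (len(y) - frame_length) // hop_length
    return np.stack([y[i * hop_length:i * hop_length + frame_length]
                     for i in range(n_frames)])

sr = 22050                                # librosa's default sampling rate
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)           # one second of a 440 Hz tone
frames = frame_signal(y)                  # shape: (n_frames, 2048)
per_frame = np.mean(frames ** 2, axis=1)  # any per-frame feature value
feature = float(per_frame.mean())         # one number per audio file
```

Averaging over frames is what keeps the feature vector the same length regardless of how long each recording is.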

4.2.1 Pitch

Pitch is used in speech recognition (Ghahremani et al., 2014) and is one of the most important features in speech emotion recognition (El Ayadi et al., 2011). Pitch frequency is the vibration rate of the vocal folds. The features are computed based on the complete utterance. The pitch features are computed using the Fast Fourier Transform (FFT) bin centre frequency per frame, which is the default in librosa. This centre frequency, or centroid, at frame t is defined in equation 1 (Klapuri and Davy, 2007).

centroid[t] = (Σ_k S[k, t] · freq[k]) / (Σ_j S[j, t])    (1)

In this equation, S is a magnitude spectrogram and freq is the array of frequencies (FFT frequencies in Hz) of the rows of S.

The specific pitch features applied in this research are:

• Approximate maximum frequency

• Approximate minimum frequency

• Bandwidth, deviation in pitch

• Centroid, the mean frequency

The maximum and minimum frequency are estimated using spectral rolloff. Spectral rolloff is the centre frequency below which a specified percentage of the total spectral energy lies. For the maximum this percentage is 85% and for the minimum 10%.
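A minimal numpy sketch of these four pitch features, computed from a magnitude spectrogram S and its bin frequencies (this mirrors equation 1 and the rolloff definition above, not librosa's exact implementation):

```python
import numpy as np

def pitch_features(S, freqs, max_pct=0.85, min_pct=0.10):
    # S: magnitude spectrogram (bins x frames); freqs: FFT bin centre frequencies.
    total = S.sum(axis=0)
    centroid = (S * freqs[:, None]).sum(axis=0) / total             # eq. (1)
    bandwidth = np.sqrt((S * (freqs[:, None] - centroid) ** 2).sum(axis=0) / total)
    cum = np.cumsum(S, axis=0) / total
    max_freq = freqs[np.argmax(cum >= max_pct, axis=0)]             # 85% rolloff
    min_freq = freqs[np.argmax(cum >= min_pct, axis=0)]             # 10% rolloff
    # Average over frames so every audio file yields the same feature size.
    return {k: float(v.mean()) for k, v in
            dict(max_freq=max_freq, min_freq=min_freq,
                 bandwidth=bandwidth, centroid=centroid).items()}
```

For a spectrum concentrated in a single bin, all four values collapse onto that bin's frequency (with zero bandwidth), which is a quick sanity check on the definitions.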


4.2.2 Mel-Frequency Cepstral Coefficients

The Mel-Frequency Cepstral Coefficients (MFCCs) spectral feature is widely used as a representation of phonetic information in automatic speech recognition and speech emotion recognition (Schuller et al., 2003). The process of calculating MFCCs is shown in Figure 1. The following explanation of how MFCCs are computed is based on Bartelds et al. (2020).

Figure 1: MFCCs extraction from speech signal (Jain et al., 2020).

In this experiment the framing includes the full utterance, thus the whole audio signal is analysed. Windowing divides the speech sample into short frames; with the default in the librosa package, the audio signal is divided into 20 windows. Windowing is applied as the characteristics of an audio signal are relatively stable within a short frame of time (Zhu and Alwan, 2000).

The FFT is then taken of each of these windowed frames to transform the audio signal from the time domain to the frequency domain (Zheng et al., 2001). FFT is an efficient algorithm for computing the Discrete Fourier Transform (DFT) (Sevgi, 2007). By applying the FFT, the process of how sound is perceived within the human auditory system is simulated (Dave, 2013).

After the FFT is taken of the windowed frames, the Mel frequency warping is applied. The FFT-transformed audio signal is passed through a collection of filters, the Mel-filter bank. Each filter processes frequencies that fall within a certain range; frequencies outside that range are discarded (Muda et al., 2010). The Mel-filter bank provides information on the energy that is present near certain frequency regions (Rao and Manjunath, 2017). From the energies that are returned by the Mel-filter bank, the logarithm is taken. This is in accordance with the human auditory system, as humans do not perceive the loudness of an incoming audio signal linearly. This produces a signal that is represented in the cepstral domain (Oppenheim and Schafer, 2004).

However, the logarithmically transformed filter bank energy representations overlap. To solve the overlap, the Discrete Cosine Transform (DCT) is computed. The DCT results in a set of cepstral coefficients. By default, DCT-II is used in the librosa package. For the final MFCC feature representation, the mean of each MFCC segment is computed; this is mainly done to keep the same array shape for all data entries, which results in an array of 20 mean MFCCs.
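The last two steps (DCT-II over the log mel energies, then the per-coefficient mean across frames) can be sketched as follows. The filter-bank output is assumed given, and a plain unnormalised DCT-II basis is used, so the absolute scale differs from librosa's normalised output:

```python
import numpy as np

def mean_mfcc(log_mel, n_mfcc=20):
    # log_mel: log mel-filterbank energies, shape (n_mels, frames).
    n_mels = log_mel.shape[0]
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)[:, None]
    dct_ii = 2 * np.cos(np.pi * k * (n + 0.5) / n_mels)  # unnormalised DCT-II basis
    mfcc = dct_ii @ log_mel                              # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)                             # array of 20 mean MFCCs
```

The per-coefficient mean is what reduces a variable number of frames to a fixed-length vector of 20 values per audio file.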

4.2.3 Chromagram

Chroma features are a powerful representation for music audio. The entire spectrum is represented by the 12 distinct semitones (or chroma) of the musical octave. It takes the audio input and computes a sequence of short-time chroma frames; the frames are based on the sample rate. This is achieved by mapping each Short-Time Fourier Transform (STFT) bin directly to a chroma, after selecting only spectral peaks (Ellis, 2007). For each chroma the mean over all frames is computed; this is mainly done to keep the same array shape for all data entries. This results in a chromagram, an array of 12 mean chromas. The chromagram provides information about the occurrence of semitones within the speech of a participant, which could improve the classification model.
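The bin-to-chroma folding can be sketched as below, using a simplified mapping by semitone distance from a 440 Hz reference; librosa's chroma implementation additionally applies tuning estimation and peak selection:

```python
import numpy as np

def mean_chroma(S, freqs, ref=440.0):
    # Fold each STFT bin onto one of 12 semitone classes, then average over frames.
    chroma = np.zeros((12, S.shape[1]))
    for k, f in enumerate(freqs):
        if f <= 0:
            continue                                   # skip the DC bin
        pc = int(round(12 * np.log2(f / ref))) % 12    # semitone (pitch) class
        chroma[pc] += S[k]
    return chroma.mean(axis=1)                         # array of 12 mean chromas
```

Because classes repeat every octave, 440 Hz and 880 Hz land in the same chroma bin, which is exactly the octave-folding the text describes.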


4.2.4 Zero Crossing Rate

Zero crossing rate is an indicator of the frequency at which energy is concentrated in the signal spectrum, and is often implemented in the front-end processing of automatic speech recognition. The rate at which zero crossings occur is a simple measure of the frequency content of a signal: the zero crossing rate is a measure of the number of times, in a given time frame, that the amplitude of the speech signal passes through a value of zero (Bachu et al., 2008). The length of the time frame is 2048 samples by default in librosa. To compute the centre, the signal y is padded so that frame D[:, t] is centred at y[t * hop_length], where hop_length = 512 by default.
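Per frame, this measure amounts to counting sign changes between adjacent samples; a direct sketch (librosa computes the same quantity per frame across the whole signal):

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose amplitudes differ in sign.
    neg = np.signbit(frame)
    return float(np.mean(neg[1:] != neg[:-1]))
```

A rapidly alternating frame gives a rate of 1.0, while a frame that never changes sign gives 0.0.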

4.2.5 Root-Mean-Square

Root-mean-square (RMS) is a representation of energy in signal processing. The energy of a signal corresponds to the total magnitude of the signal, which roughly corresponds to how loud the signal is (Boashash, 2016). Energy is an important feature in speech analysis (Schuller et al., 2004).

The RMS is computed per frame; by default the frame length is 2048 in librosa. The RMS definition is an integral over the signal period, as in equation 2, where u is the signal amplitude and T stands for the period, in this case one frame (Nastase, 2015).

u_RMS = √( (1/T) ∫_0^T u(t)² dt )    (2)

4.3 Classifiers

Multiple classifiers are implemented to see which one performs best for the task of geo-locating based on transcriptions in IPA style and spectral features. In this research, only the default implementations of the classifiers are applied. The implementation of the classifiers is realised with the SciKit Learn14 package by Pedregosa et al. (2011). The focus in this research is on the improvement based on spectral features, and due to time limitations the decision was made to use the default classifiers, for a fair comparison between the different classifiers.

4.3.1 Random Forests

Random Forest (RF) is a combination of multiple decision tree classifiers; it uses averaging to improve the predictive accuracy and control over-fitting (Breiman, 2001), and can be applied to multi-class classification (Fernández-Delgado et al., 2014). The RF classifier is expected to perform best compared to the other classifiers, as it is robust to overfitting (Liaw et al., 2002) and works well with imbalanced datasets (Brown and Mues, 2012).

4.3.2 Support Vector Machine

Support Vector Machines (SVMs) construct hyperplanes, or boundaries, which are used to split data points into different categories (Jain et al., 2020). SVM is a binary 'one-versus-one' classifier, but can be automatically implemented as a multi-class 'one-versus-rest' classifier, as SciKit Learn allows to monotonically transform the results. LinearSVM, on the other hand, implements the 'one-versus-rest' multi-class strategy natively. The standard SVM with the Radial basis function (RBF) kernel and LinearSVM with the linear kernel are implemented to see whether a linear or non-linear kernel performs best. The linear kernel splits the data points into different categories with flat hyperplanes only, whereas the RBF kernel allows for more fluid boundaries.


The advantage of SVMs is their effectiveness in high-dimensional spaces, even when the number of dimensions is greater than the number of samples (Pedregosa et al., 2011). This makes them robust when implementing a large combination of features. Furthermore, SVM has the advantage of being very easy to train (Jain et al., 2020).

4.3.3 K-Nearest Neighbours

K-Nearest Neighbours (KNN) is a straightforward, computationally quick, and commonly used classifier. It is straightforward, as classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query (Cunningham and Delany, 2020). Thus, the classification is computed from a majority vote of the k nearest neighbours of each point; this vote is defined in equation 3, by Cunningham and Delany (2020). In equation 3, q is the data point for which the vote is determined, x_c is the neighbour, d(q, x_c) is the distance, y_c is the class of the neighbour, and p normally is 1 but can be increased to reduce the influence of more distant neighbours.

Vote(y_j) = Σ_{c=1}^{k} (1 / d(q, x_c)^p) · 1(y_j, y_c)    (3)

Therefore, the vote assigned to class y_j by neighbour x_c is 1 divided by the distance to that neighbour, where 1(y_j, y_c) returns 1 if the class labels match and 0 otherwise (Cunningham and Delany, 2020). In SciKit Learn the default of k is 5.
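The distance-weighted vote of equation 3 can be sketched directly; the distances and labels below are illustrative values, with p = 1 as in the text:

```python
from collections import defaultdict

def knn_vote(distances, labels, p=1):
    # Each of the k neighbours adds 1/d^p to the vote for its own class (eq. 3).
    votes = defaultdict(float)
    for d, y in zip(distances, labels):
        votes[y] += 1.0 / d ** p
    return max(votes, key=votes.get)

# Two 'B' neighbours at distance 2 together outvote one 'A' neighbour at distance 2.
winner = knn_vote([2.0, 2.0, 2.0], ["B", "B", "A"])
```

Raising p shrinks the contribution of distant neighbours faster, which is exactly the role the text assigns to it.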

4.4 Approach

Firstly, the dataset is randomly split into two parts, based on randomising the participants: a train-development set (90%) and a test set (10%). The train-development set is used during development of the model. The participant data is randomised and 10-fold Cross Validation is applied to justify the model's robustness. The entire train-development set is divided into ten folds. In each cross validation, 9 out of 10 folds are used for training, and the remaining fold is used to validate the model. The mean accuracy of the 10 validations is computed as the overall accuracy.
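A sketch of this participant-level split (the participant IDs are placeholders; the 259 participants and the 234/25 split match the counts reported elsewhere in this thesis):

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, for reproducibility
participants = rng.permutation(259)      # randomise the participant IDs
test = participants[:25]                 # ~10% held-out test set
train_dev = participants[25:]            # 90% train-development set
folds = np.array_split(train_dev, 10)    # folds for 10-fold cross validation
```

In each round one fold validates while the other nine train; splitting by participant (rather than by word) keeps all 10 words of a speaker on the same side of the split.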

The classification location labelling is applied to each word separately. Therefore each participant receives 10 location predictions (on province or municipality level). The final predicted location is derived using majority voting: the most occurring location label for a participant. Majority voting is applied to increase the amount of training data, which becomes ten times as large by training the model per word instead of per participant. It also makes the implementation of spectral features easier, as there is no need to combine the spectral feature data from 10 different audio files (one per word).
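The voting step itself is a most-common count over a participant's per-word predictions (the labels shown are illustrative):

```python
from collections import Counter

def majority_vote(word_predictions):
    # Final location = most frequent label among the participant's
    # 10 word-level predictions.
    return Counter(word_predictions).most_common(1)[0][0]

location = majority_vote(["Groningen"] * 6 + ["Drenthe"] * 4)
```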

With the one-hot encoded transcribed words, the basic model, without spectral features, is trained and tested using 10-fold Cross Validation to obtain a baseline, so we can see which spectral features improve the model the most. It also gives us insight into the base efficiency of the classifiers.

Once the model is finished based on the transcription of words alone, an ablation study is conducted to assess which spectral features improve the classification models. To find the best resulting combination of spectral features, a grid search is conducted over all possible combinations. This is done for all 4 classifiers, on both municipality classification and province classification. The spectral features that improve the accuracy of the model most will be applied to the classification model. Therefore, to each classifier different spectral feature combinations are applied, which can differ between municipality and province classification as well.
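Enumerating all possible combinations of the 8 spectral features yields 255 non-empty candidate sets; the grid over which the search runs can be sketched as:

```python
from itertools import combinations

# Hypothetical short names for the 8 spectral features described above.
FEATURES = ["max_freq", "min_freq", "bandwidth", "centroid",
            "mfcc", "chroma", "zcr", "rms"]

def all_feature_combinations(features=FEATURES):
    # Every non-empty subset: 2^8 - 1 = 255 combinations per classifier.
    return [combo for r in range(1, len(features) + 1)
            for combo in combinations(features, r)]

combos = all_feature_combinations()
```

Each combination is then evaluated with the cross-validation procedure above, and the best-scoring subset per classifier and location level is kept.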


The model is evaluated, during development and testing, using accuracy. Accuracy shows how well the model performs as the percentage of correct classifications. The final model, when testing, is evaluated on precision (eq. 4), recall (eq. 5), and F-score (eq. 6) as well. Both precision and recall have a natural interpretation in terms of probability.

As the data is imbalanced and multi-class classification is applied, precision and recall are computed using macro averaging. This results in each class having the same weight when computing these values. The full code of the model is available on the GitHub repository15.

Precision = TruePositive / (TruePositive + FalsePositive)    (4)

Recall = TruePositive / (TruePositive + FalseNegative)    (5)

F-Score = 2 · (Precision · Recall) / (Precision + Recall)    (6)
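Macro averaging per equations 4-6 can be sketched without external libraries (sklearn's metrics with average='macro' compute the same per-class-then-mean quantities):

```python
def macro_scores(y_true, y_pred, labels):
    # Per-class precision and recall (eqs. 4-5), then an unweighted mean
    # so that every class carries the same weight despite class imbalance.
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    prec = sum(precisions) / len(labels)
    rec = sum(recalls) / len(labels)
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0   # eq. (6)
    return prec, rec, f
```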

5 results

5.1 Development

Table 2 shows the accuracy per classifier for municipality and province classification without the implementation of spectral features. These accuracies serve as a baseline for the improvement of the classification model when implementing the spectral features.

Table 2: Accuracy on transcribed words (without spectral features) per classifier.

              RF     LinearSVM  SVM    KNN
Municipality  0.161  0.162      0.158  0.103
Province      0.554  0.556      0.555  0.556

Table 3 shows the accuracy with the implementation of each spectral feature individually, per classifier, for both municipality and province classification. From Table 3 we see that there is a noticeable difference in improvement, with the use of spectral features, not only between the classifiers, but also between classification on municipality and province for the same classifier.

Table 3: Accuracy of the individual implementation per spectral feature for each classifier.

                    Municipality                      Province
                    RF     LinearSVM  SVM    KNN     RF     LinearSVM  SVM    KNN
Max frequency       0.184  0.150      0.155  0.137   0.531  0.555      0.557  0.436
Min frequency       0.232  0.137      0.149  0.124   0.555  0.552      0.556  0.414
Bandwidth           0.197  0.145      0.158  0.133   0.526  0.555      0.555  0.457
Centroid            0.219  0.158      0.146  0.124   0.582  0.556      0.556  0.405
MFCC                0.388  0.175      0.162  0.180   0.658  0.565      0.565  0.470
Chroma              0.355  0.175      0.132  0.167   0.642  0.555      0.555  0.595
Zero crossing rate  0.239  0.146      0.150  0.171   0.534  0.557      0.557  0.572
RMS                 0.235  0.163      0.150  0.128   0.559  0.555      0.555  0.568

To find the best combination of spectral features a grid search is applied over all combinations; the results are shown in Table 7 in the appendix. The combinations with the highest accuracy for each classifier, on both municipalities and provinces, can be seen in Table 4.

Table 4: Spectral feature combination with the highest accuracy score per location level and classifier.

Location Level Classifier Spectral features Accuracy

Municipality RF Centroid, MFCC, RMS 0.418

LinearSVM Chroma, Zero crossing rate 0.231

SVM Bandwidth, MFCC, Chroma, Zero crossing rate 0.179

KNN Max frequency, MFCC, Zero crossing rate 0.201

Province RF Min frequency, MFCC 0.692

LinearSVM Min frequency, MFCC, Chroma 0.578

SVM MFCC, Zero crossing rate 0.565


5.2 Final Model

The final model for each classifier implements the spectral feature combination that resulted in the highest accuracy, as seen in Table 4. The models are then trained on the full development dataset (234 participants) and tested on the separate test dataset (25 participants) for the final evaluation of the models. The results of the final model are shown in Table 5, which displays the accuracy, precision, recall, and F-score per classifier and location level.

Table 5: Accuracy, Precision, Recall, and F-score of the final model per classifier and location level.

                          RF     LinearSVM  SVM    KNN
Municipality  Accuracy    0.320  0.080      0.120  0.280
              Precision   0.274  0.015      0.026  0.250
              Recall      0.278  0.079      0.118  0.229
              F-score     0.276  0.026      0.043  0.239
Province      Accuracy    0.600  0.520      0.480  0.480
              Precision   0.643  0.269      0.120  0.120
              Recall      0.458  0.267      0.250  0.250
              F-score     0.535  0.268      0.162  0.162


6 discussion

As expected beforehand, the RF classifier showed the best overall results, although the results for all classifiers on both municipalities and provinces were very similar on the transcribed words alone (Table 2). When implementing the spectral features, RF showed the highest accuracies during both development and testing.

KNN performed surprisingly well: it has the second highest accuracy on province classification during development (Table 4) and the second highest accuracy on municipality classification during testing (Table 5). It is also the only classifier that performed better on the classification of municipalities during testing than during development. These results illustrate why KNN is a commonly used technique (Cunningham and Delany, 2020).

During development I implemented cost-sensitive learning. As the dataset is imbalanced, especially for the municipality classes, cost-sensitive learning (Weiss et al., 2007) can be applied by setting class weights to improve the accuracy of the classifiers. The class weights are determined from the ratio between the amount of data in a class and the class with the most data (Ahamed, 2018). Applying cost-sensitive learning to the classifiers actually lowered the results and was therefore not implemented in the model. The reason why cost-sensitive learning decreased the results is unknown; the implementation might have been too simple for the large number of labels. Due to time restrictions I was unable to look into this further.
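One plausible reading of this weighting scheme, sketched on hypothetical toy counts (the exact formula in the thesis code may differ):

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by the ratio between the largest class size and
    its own size, so that rare municipalities count more heavily."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {c: largest / n for c, n in counts.items()}

# Toy distribution mimicking the municipality imbalance in Table 6.
weights = class_weights(["Groningen"] * 37 + ["Emmen"] * 14 + ["Ommen"])
```

A dict like this can be passed as the `class_weight` parameter of scikit-learn classifiers, which also offer a built-in `class_weight="balanced"` heuristic as an alternative.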

The final results are mostly an extension of the results during the development of the model, although a bit lower. An important factor is the size of the dataset, which was on the smaller side. There are multiple instances where a municipality is represented by only one participant. If one or more of these instances occur within the testing data, they automatically lower the accuracy, as the municipality was not represented during training of the model. This effect is amplified by there being only 25 participants in the testing data, so each such instance lowers the accuracy by 4%. That the dataset was too small was evident during development as well: the results could differ between two runs of a classifier with the same implemented features.


7 conclusion

We can predict the geographical location of people within the Dutch Low Saxon area based on their pronunciation of words. The accuracy during development for the RF classifier of 0.692 is adequate, especially given the smaller dataset and the use of only the default classifier settings. The performance on classifying municipalities is rather decent as well, especially as it is a multi-label classification task with 54 different labels. While it would be a stretch to conclude that the model performs well on two different levels of location, it does show promising results; enough to substantiate further research.

The spectral feature that improved the model the most is MFCC. MFCC shows the highest improvement in accuracy and returned the best results in almost all cases, as seen in Table 3. It occurs in 6 out of 8 spectral feature combinations in Table 4, including those with the highest accuracy for both municipality and province. This result is in line with the results of Jain et al. (2020) and supports why this spectral feature is widely used as a representation of phonetic information in automatic speech recognition and speech emotion recognition (Schuller et al., 2003).

As stated before, a big limitation in this research was the size of the dataset; it would therefore be interesting to repeat this research on a larger dataset, which is expected to yield higher results. Furthermore, much higher accuracies could be achieved if the classifiers were improved by setting more optimal parameters. For this research the intention was to focus on the improvement from implementing spectral features and to compare the classifiers on this task. The results show the highest accuracies for RF, but KNN performed well on the extremely imbalanced municipality classification. Especially KNN could improve quite a bit by finding the optimal size of k, which is 5 by default.
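The k search suggested here can be illustrated with a minimal sketch: a plain-Python nearest-neighbour classifier (not the scikit-learn implementation used in the thesis) scored on hypothetical held-out toy data for several values of k.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def best_k(train_X, train_y, dev_X, dev_y, k_values):
    """Return the k with the highest held-out accuracy, plus all scores."""
    scores = {}
    for k in k_values:
        preds = [knn_predict(train_X, train_y, x, k) for x in dev_X]
        scores[k] = sum(p == t for p, t in zip(preds, dev_y)) / len(dev_y)
    return max(scores, key=scores.get), scores

# Hypothetical 1-D toy data: two clusters, with one mislabelled point (9.7)
# that fools k=1 but is outvoted for larger k.
train_X = [(0,), (1,), (2,), (9.7,), (10,), (11,), (12,)]
train_y = ["a", "a", "a", "a", "b", "b", "b"]
dev_X, dev_y = [(1.5,), (9.8,)], ["a", "b"]
best, scores = best_k(train_X, train_y, dev_X, dev_y, [1, 3, 5])
```

In practice the same search would be run with cross-validation rather than a single held-out split, for example via scikit-learn's GridSearchCV over `n_neighbors`.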


references

Sabber Ahamed. 2018. Important three techniques to improve machine learning model performance with imbalance datasets. https://towardsdatascience.com/working-with-highly-imbalanced-datasets-in-machine-learning-projects-c70c5f2a7b16 accessed: 31-03-2020.

Zahra Ashktorab, Christopher Brown, Manojit Nandi, and Aron Culotta. 2014. Tweedr: Mining twitter to inform disaster response. In ISCRAM.

RG Bachu, S Kopparthi, B Adapa, and BD Barkana. 2008. Separation of voiced and unvoiced using zero crossing rate and energy of the speech signal. In American Society for Engineering Education (ASEE) Zone Conference Proceedings, pages 1–7.

John R Baldwin and Peter French. 1990. Forensic phonetics. Pinter Publishers, London, UK.

Martijn Bartelds, Caitlin Richter, Mark Liberman, and Martijn Wieling. 2020. A new acoustic-based pronunciation distance measure. Frontiers in Artificial Intelligence, 3:39.

Boualem Boashash. 2016. Time-frequency signal analysis and processing: a comprehensive reference, second edition. Academic Press.

Leo Breiman. 2001. Random forests. Machine learning, 45(1):5–32.

Iain Brown and Christophe Mues. 2012. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39(3):3446–3453.

Joshua Brunty and Katherine Helenek. 2014. Social media investigation for law enforcement, chapter 3.3. Routledge.

Padraig Cunningham and Sarah Jane Delany. 2020. k-nearest neighbour classifiers. arXiv preprint arXiv:2004.04523.

Namrata Dave. 2013. Feature extraction methods LPC, PLP and MFCC in speech recognition. International Journal for Advance Research in Engineering and Technology, 1(6):1–4.

Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. 2011. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3):572–587.

Dan Ellis. 2007. Chroma feature analysis and synthesis. https://labrosa.ee.columbia.edu/matlab/chroma-ansyn/ accessed: 19-05-2020.

Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do we need hundreds of classifiers to solve real world classification problems? The journal of machine learning research, 15(1):3133–3181.

Andrew Gelman and Jennifer Hill. 2006. Data analysis using regression and multilevel/hierarchical models, chapter 25. Cambridge University Press.

Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. 2014. A pitch extraction algorithm tuned for automatic speech recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2494–2498. IEEE.

Ido Guy, Naama Zwerdling, Inbal Ronen, David Carmel, and Erel Uziel. 2010. Social media recommendation based on people and tags. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 194–201.


Manas Jain, Shruthi Narayan, Pratibha Balaji, Abhijit Bhowmick, Rajesh Kumar Muthu, et al. 2020. Speech emotion recognition using support vector machine. arXiv preprint arXiv:2002.07590.

Michael Jessen. 2008. Forensic phonetics. Language and Linguistics Compass, 2(4):671–711.

Alfred Johnson. 2019. Emotion detection through speech analysis. Master's thesis, National College of Ireland, Dublin.

Thomas Kisler, Uwe D Reichel, Florian Schiel, Christoph Draxler, and Bernhard Jackl. 2016. BAS speech science web services - an update of current developments.

Anssi Klapuri and Manuel Davy. 2007. Signal processing methods for music transcription. Springer Science & Business Media.

Marie-José Kolly and Adrian Leemann. 2015. Dialäkt Äpp: Communicating dialectology to the public—crowdsourcing dialects from the public. Trends in phonetics and phonology. Studies from German-speaking Europe, pages 271–285.

Adrian Leemann, Marie-José Kolly, and David Britain. 2018. The english dialects app: The creation of a crowdsourced dialect corpus. Ampersand, 5:1–17.

Adrian Leemann, Marie-José Kolly, Jean-Philippe Goldman, Volker Dellwo, Ingrid Hove, Ibrahim Almajai, Sarah Grimm, Sylvain Robert, and Daniel Wanitsch. 2015. Voice äpp: a mobile app for crowdsourcing swiss german dialect data. In Sixteenth Annual Conference of the International Speech Communication Association.

Andy Liaw, Matthew Wiener, et al. 2002. Classification and regression by randomForest. R News, 2(3):18–22.

Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8.

Lindasalwa Muda, Mumtaj Begam, and Irraivan Elamvazuthi. 2010. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083.

Adrian S Nastase. 2015. How to derive the RMS value of pulse and square waveforms. MasteringElectronicsDesign.com. https://masteringelectronicsdesign.com/how-to-derive-the-rms-value-of-pulse-and-square-waveforms/ accessed: 07-06-2020.

Alan V Oppenheim and Ronald W Schafer. 2004. From frequency to quefrency: A history of the cepstrum. IEEE signal processing Magazine, 21(5):95–106.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 1.4. Support vector machines — scikit-learn 0.23.0 documentation. https://scikit-learn.org/stable/modules/svm.html accessed: 18-05-2020.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Robert Power, Bella Robinson, John Colton, and Mark Cameron. 2014. Emergency situation awareness: Twitter case studies. In International conference on information systems for crisis response and management in mediterranean countries, pages 218–231. Springer.


K Sreenivasa Rao and KE Manjunath. 2017. Speech recognition using articulatory and excitation source features. Springer.

Björn Schuller, Gerhard Rigoll, and Manfred Lang. 2003. Hidden Markov model-based speech emotion recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (ICASSP'03), volume 2, pages II–1. IEEE.

Björn Schuller, Gerhard Rigoll, and Manfred Lang. 2004. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–577. IEEE.

Levent Sevgi. 2007. Numerical Fourier transforms: DFT and FFT. IEEE Antennas and Propagation Magazine, 49(3):238–243.

Pål Sundsøy, Johannes Bjelland, Asif M Iqbal, Yves-Alexandre de Montjoye, et al. 2014. Big data-driven marketing: how machine learning outperforms marketers' gut-feeling. In International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, pages 367–374. Springer.

Gary M Weiss, Kate McCarthy, and Bibi Zabar. 2007. Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? Dmin, 7(35-41):24.

Martijn Wieling, Wilbert Heeringa, and John Nerbonne. 2007. An aggregate analysis of pronunciation in the Goeman-Taeldeman-Van Reenen-Project data. Taal en Tongval, 59(1):84–116.

Fang Zheng, Guoliang Zhang, and Zhanjiang Song. 2001. Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6):582–589.

Qifeng Zhu and Abeer Alwan. 2000. On the use of variable frame rate analysis in speech recognition. In 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), volume 3, pages 1783–1786. IEEE.


a large tables

Table 6: Participants per municipality.

Municipality        Participants
Coevorden           9
Aa en Hunze         5
Midden-Drenthe      11
Groningen           37
Westerkwartier      18
Hellendoorn         2
Leeuwarden          1
Pekela              3
Emmen               14
De Wolden           3
Epe                 1
Westerveld          2
Midden-Groningen    16
Het Hogeland        20
Steenwijkerland     2
Tynaarlo            4
Dinkelland          2
Kampen              2
Meppel              2
Stadskanaal         7
Putten              2
Westerwolde         18
Bronckhorst         2
Borger-Odoorn       7
Lochem              2
Delfzijl            4
Zwolle              2
Loppersum           7
Ommen               1
Noordenveld         3
Oldambt             9
Borne               1
Appingedam          3
Enschede            6
Hoogeveen           1
Ede                 2
Oldebroek           2
Deventer            2
Montferland         1
Ooststellingwerf    1
Barneveld           3
Rijssen-Holten      1
Wierden             5
Zutphen             1
Almelo              2
Apeldoorn           1
Laren               1
Olst-Wijhe          1
Berkelland          1
Heerde              1
Arnhem              2
Veendam             1
Oldenzaal           1
Twenterand          1


Table 7: Accuracy for all combinations of spectral features, per classifier and location level.
(Per combination: RF Municipality/Province | LinearSVM Mun/Prov | SVM Mun/Prov | KNN Mun/Prov.)

Max frequency, Min frequency: 0.252 0.5551 | 0.1415 0.5511 | 0.1627 0.5551 | 0.1364 0.4386
Max frequency, Min frequency, Bandwidth: 0.2525 0.5609 | 0.1493 0.5598 | 0.1487 0.5551 | 0.1462 0.475
Max frequency, Min frequency, Bandwidth, Centroid: 0.2399 0.5475 | 0.1585 0.56 | 0.1618 0.5556 | 0.1417 0.4451
Max frequency, Min frequency, Bandwidth, Centroid, MFCC: 0.3889 0.6549 | 0.1929 0.5178 | 0.1627 0.5565 | 0.1451 0.4574
Max frequency, Min frequency, Bandwidth, Centroid, MFCC, Chroma: 0.3982 0.658 | 0.188 0.5594 | 0.1587 0.5571 | 0.171 0.4712
Max frequency, Min frequency, Bandwidth, Centroid, MFCC, Chroma, Zero crossing rate: 0.3634 0.6661 | 0.1969 0.56 | 0.1538 0.556 | 0.1703 0.4962
Max frequency, Min frequency, Bandwidth, Centroid, MFCC, Chroma, Zero crossing rate, RMS: 0.3341 0.6457 | 0.192 0.5687 | 0.1444 0.5553 | 0.1589 0.4665
Max frequency, Min frequency, Bandwidth, Centroid, MFCC, Chroma, RMS: 0.402 0.6714 | 0.1755 0.5418 | 0.1547 0.5569 | 0.1629 0.4692
Max frequency, Min frequency, Bandwidth, Centroid, MFCC, Zero crossing rate: 0.3714 0.6708 | 0.2016 0.5466 | 0.1545 0.5571 | 0.1571 0.4792
Max frequency, Min frequency, Bandwidth, Centroid, MFCC, Zero crossing rate, RMS: 0.3594 0.6411 | 0.162 0.5558 | 0.158 0.5563 | 0.1705 0.469
Max frequency, Min frequency, Bandwidth, Centroid, MFCC, RMS: 0.3708 0.6353 | 0.1833 0.556 | 0.1585 0.5549 | 0.1534 0.4486
Max frequency, Min frequency, Bandwidth, Centroid, Chroma: 0.3714 0.6411 | 0.1572 0.5556 | 0.158 0.5554 | 0.1455 0.4534
Max frequency, Min frequency, Bandwidth, Centroid, Chroma, Zero crossing rate: 0.3455 0.6232 | 0.1505 0.5641 | 0.1583 0.5554 | 0.1493 0.4601
Max frequency, Min frequency, Bandwidth, Centroid, Chroma, Zero crossing rate, RMS: 0.3509 0.6455 | 0.1674 0.556 | 0.1625 0.5551 | 0.15 0.4518
Max frequency, Min frequency, Bandwidth, Centroid, Chroma, RMS: 0.3551 0.6625 | 0.1658 0.56 | 0.1495 0.5558 | 0.1319 0.4226
Max frequency, Min frequency, Bandwidth, Centroid, Zero crossing rate: 0.2554 0.5938 | 0.1493 0.5683 | 0.1496 0.5554 | 0.1616 0.4658
Max frequency, Min frequency, Bandwidth, Centroid, Zero crossing rate, RMS: 0.269 0.5975 | 0.1457 0.5558 | 0.1582 0.5563 | 0.1585 0.4533
Max frequency, Min frequency, Bandwidth, Centroid, RMS: 0.2223 0.5855 | 0.1457 0.5556 | 0.1574 0.5466 | 0.154 0.4404
Max frequency, Min frequency, Bandwidth, MFCC: 0.3764 0.6196 | 0.171 0.5074 | 0.1542 0.5562 | 0.1534 0.4614
Max frequency, Min frequency, Bandwidth, MFCC, Chroma: 0.3942 0.6364 | 0.1839 0.5511 | 0.1409 0.5551 | 0.1496 0.4449
Max frequency, Min frequency, Bandwidth, MFCC, Chroma, Zero crossing rate: 0.4025 0.654 | 0.1627 0.5638 | 0.1533 0.5549 | 0.1621 0.4529
Max frequency, Min frequency, Bandwidth, MFCC, Chroma, Zero crossing rate, RMS: 0.3678 0.6531 | 0.1795 0.5332 | 0.1536 0.5562 | 0.1464 0.4797
Max frequency, Min frequency, Bandwidth, MFCC, Chroma, RMS: 0.4109 0.6513 | 0.1752 0.5777 | 0.1542 0.5569 | 0.1527 0.4431
Max frequency, Min frequency, Bandwidth, MFCC, Zero crossing rate: 0.3685 0.6714 | 0.1922 0.5207 | 0.1533 0.5553 | 0.1464 0.4844
Max frequency, Min frequency, Bandwidth, MFCC, Zero crossing rate, RMS: 0.3667 0.6696 | 0.171 0.5475 | 0.1583 0.5562 | 0.1496 0.4399
Max frequency, Min frequency, Bandwidth, MFCC, RMS: 0.3679 0.6457 | 0.1848 0.5339 | 0.1495 0.5551 | 0.1491 0.479
Max frequency, Min frequency, Bandwidth, Chroma: 0.3752 0.6411 | 0.1629 0.5476 | 0.1583 0.5553 | 0.1408 0.4444
Max frequency, Min frequency, Bandwidth, Chroma, Zero crossing rate: 0.3415 0.6359 | 0.1547 0.5429 | 0.1585 0.5551 | 0.1455 0.4607
Max frequency, Min frequency, Bandwidth, Chroma, Zero crossing rate, RMS: 0.3245 0.662 | 0.1496 0.56 | 0.1543 0.5554 | 0.1319 0.4393
Max frequency, Min frequency, Bandwidth, Chroma, RMS: 0.3466 0.6582 | 0.1759 0.5478 | 0.1496 0.5554 | 0.1587 0.4788
Max frequency, Min frequency, Bandwidth, Zero crossing rate: 0.2351 0.5982 | 0.1529 0.5551 | 0.1585 0.5571 | 0.1408 0.4645
Continued on next page


Table 7 – continued: Accuracy for all combinations of spectral features, per classifier and location level.
(Per combination: RF Municipality/Province | LinearSVM Mun/Prov | SVM Mun/Prov | KNN Mun/Prov.)

Max frequency, Min frequency, Bandwidth, Zero crossing rate, RMS: 0.2649 0.5728 | 0.1509 0.5514 | 0.162 0.5556 | 0.1424 0.4621
Max frequency, Min frequency, Bandwidth, RMS: 0.2475 0.5824 | 0.154 0.556 | 0.158 0.5554 | 0.1404 0.4649
Max frequency, Min frequency, Centroid: 0.2745 0.59 | 0.1359 0.5507 | 0.1582 0.5553 | 0.1623 0.4476
Max frequency, Min frequency, Centroid, MFCC: 0.3542 0.6707 | 0.1668 0.5636 | 0.1582 0.5562 | 0.1527 0.4594
Max frequency, Min frequency, Centroid, MFCC, Chroma: 0.3891 0.6591 | 0.1708 0.5475 | 0.1661 0.5554 | 0.1621 0.4567
Max frequency, Min frequency, Centroid, MFCC, Chroma, Zero crossing rate: 0.3627 0.642 | 0.1663 0.5717 | 0.1618 0.5558 | 0.1534 0.4404
Max frequency, Min frequency, Centroid, MFCC, Chroma, Zero crossing rate, RMS: 0.3978 0.6328 | 0.1875 0.5426 | 0.1625 0.5565 | 0.1495 0.4478
Max frequency, Min frequency, Centroid, MFCC, Chroma, RMS: 0.3746 0.6585 | 0.1797 0.5609 | 0.1536 0.5562 | 0.1618 0.4667
Max frequency, Min frequency, Centroid, MFCC, Zero crossing rate: 0.3632 0.6538 | 0.1667 0.5603 | 0.162 0.5551 | 0.162 0.4623
Max frequency, Min frequency, Centroid, MFCC, Zero crossing rate, RMS: 0.3681 0.6585 | 0.1788 0.56 | 0.1491 0.5565 | 0.1504 0.4366
Max frequency, Min frequency, Centroid, MFCC, RMS: 0.3554 0.6504 | 0.1716 0.5683 | 0.15 0.5549 | 0.1661 0.4609
Max frequency, Min frequency, Centroid, Chroma: 0.3493 0.6699 | 0.1572 0.5587 | 0.1534 0.5554 | 0.1763 0.4453
Max frequency, Min frequency, Centroid, Chroma, Zero crossing rate: 0.3594 0.6451 | 0.1837 0.5504 | 0.1486 0.5551 | 0.1585 0.4361
Max frequency, Min frequency, Centroid, Chroma, Zero crossing rate, RMS: 0.3641 0.6672 | 0.1672 0.5645 | 0.1618 0.5547 | 0.158 0.4529
Max frequency, Min frequency, Centroid, Chroma, RMS: 0.3451 0.6754 | 0.1969 0.5513 | 0.1495 0.5563 | 0.1665 0.4536
Max frequency, Min frequency, Centroid, Zero crossing rate: 0.2652 0.594 | 0.1458 0.5554 | 0.1661 0.5551 | 0.1493 0.4313
Max frequency, Min frequency, Centroid, Zero crossing rate, RMS: 0.2473 0.6187 | 0.1243 0.5571 | 0.1627 0.5556 | 0.1458 0.4197
Max frequency, Min frequency, Centroid, RMS: 0.2444 0.5855 | 0.1065 0.5592 | 0.158 0.5563 | 0.1534 0.4234
Max frequency, Min frequency, MFCC: 0.3681 0.6582 | 0.2013 0.5426 | 0.1509 0.5571 | 0.1538 0.4866
Max frequency, Min frequency, MFCC, Chroma: 0.3884 0.6534 | 0.1786 0.5558 | 0.1627 0.5556 | 0.154 0.4614
Max frequency, Min frequency, MFCC, Chroma, Zero crossing rate: 0.3797 0.6625 | 0.1799 0.5458 | 0.1536 0.5553 | 0.15 0.4821
Max frequency, Min frequency, MFCC, Chroma, Zero crossing rate, RMS: 0.3759 0.6716 | 0.1888 0.5591 | 0.1545 0.5556 | 0.1587 0.4705
Max frequency, Min frequency, MFCC, Chroma, RMS: 0.3717 0.6545 | 0.1924 0.5386 | 0.1658 0.5554 | 0.1502 0.4487
Max frequency, Min frequency, MFCC, Zero crossing rate: 0.3761 0.6449 | 0.1632 0.5248 | 0.1493 0.5554 | 0.1629 0.479
Max frequency, Min frequency, MFCC, Zero crossing rate, RMS: 0.3846 0.6587 | 0.1924 0.5217 | 0.1574 0.5551 | 0.1587 0.4864
Max frequency, Min frequency, MFCC, RMS: 0.3462 0.6328 | 0.1879 0.5433 | 0.1583 0.556 | 0.1665 0.4775
Max frequency, Min frequency, Chroma: 0.3714 0.6402 | 0.1661 0.5505 | 0.1489 0.5547 | 0.1319 0.4264
Max frequency, Min frequency, Chroma, Zero crossing rate: 0.3551 0.6505 | 0.1493 0.5513 | 0.1625 0.5553 | 0.1668 0.4355
Max frequency, Min frequency, Chroma, Zero crossing rate, RMS: 0.3364 0.6621 | 0.1833 0.5641 | 0.1578 0.5551 | 0.1455 0.4315
Max frequency, Min frequency, Chroma, RMS: 0.3464 0.6589 | 0.1833 0.5601 | 0.1621 0.5551 | 0.1538 0.4397
Max frequency, Min frequency, Zero crossing rate: 0.2732 0.5971 | 0.1245 0.556 | 0.1489 0.5542 | 0.1409 0.4399
Max frequency, Min frequency, Zero crossing rate, RMS: 0.2567 0.6103 | 0.1408 0.5554 | 0.1576 0.5556 | 0.1366 0.4226
Max frequency, Min frequency, RMS: 0.2312 0.5859 | 0.1411 0.5513 | 0.1576 0.5558 | 0.1322 0.4357
Max frequency, Bandwidth: 0.2529 0.556 | 0.1543 0.5558 | 0.1534 0.5562 | 0.1583 0.4739
Max frequency, Bandwidth, Centroid: 0.2701 0.5764 | 0.1616 0.5504 | 0.1578 0.5553 | 0.1587 0.4888
Max frequency, Bandwidth, Centroid, MFCC: 0.337 0.6447 | 0.2009 0.5382 | 0.1493 0.5556 | 0.1589 0.4833
Max frequency, Bandwidth, Centroid, MFCC, Chroma: 0.3942 0.6585 | 0.1793 0.5507 | 0.162 0.556 | 0.1658 0.5043
Continued on next page


Table 7 – continued: Accuracy for all combinations of spectral features, per classifier and location level.
(Per combination: RF Municipality/Province | LinearSVM Mun/Prov | SVM Mun/Prov | KNN Mun/Prov.)

Max frequency, Bandwidth, Centroid, MFCC, Chroma, Zero crossing rate: 0.3842 0.662 | 0.1926 0.5641 | 0.1578 0.5551 | 0.144 0.4514
Max frequency, Bandwidth, Centroid, MFCC, Chroma, Zero crossing rate, RMS: 0.3759 0.6451 | 0.1714 0.5545 | 0.1587 0.5558 | 0.1543 0.4612
Max frequency, Bandwidth, Centroid, MFCC, Chroma, RMS: 0.3716 0.6406 | 0.1795 0.5549 | 0.142 0.5565 | 0.1582 0.4918
Max frequency, Bandwidth, Centroid, MFCC, Zero crossing rate: 0.3806 0.667 | 0.1746 0.5391 | 0.1627 0.5565 | 0.1625 0.4839
Max frequency, Bandwidth, Centroid, MFCC, Zero crossing rate, RMS: 0.3795 0.6353 | 0.1712 0.5342 | 0.1587 0.5558 | 0.1411 0.4841
Max frequency, Bandwidth, Centroid, MFCC, RMS: 0.3549 0.6574 | 0.1761 0.5415 | 0.1708 0.556 | 0.1538 0.4611
Max frequency, Bandwidth, Centroid, Chroma: 0.3545 0.6531 | 0.1792 0.5464 | 0.1444 0.5547 | 0.15 0.5054
Max frequency, Bandwidth, Centroid, Chroma, Zero crossing rate: 0.3498 0.6453 | 0.1835 0.5513 | 0.1582 0.5562 | 0.1496 0.4861
Max frequency, Bandwidth, Centroid, Chroma, Zero crossing rate, RMS: 0.3629 0.6453 | 0.1578 0.5674 | 0.1583 0.5551 | 0.1493 0.483
Max frequency, Bandwidth, Centroid, Chroma, RMS: 0.3592 0.6668 | 0.1621 0.5554 | 0.1417 0.5567 | 0.1413 0.4697
Max frequency, Bandwidth, Centroid, Zero crossing rate: 0.2737 0.5846 | 0.1547 0.5476 | 0.162 0.5549 | 0.1457 0.4658
Max frequency, Bandwidth, Centroid, Zero crossing rate, RMS: 0.2612 0.6036 | 0.1453 0.552 | 0.1578 0.5543 | 0.1411 0.4826
Max frequency, Bandwidth, Centroid, RMS: 0.2176 0.5984 | 0.1328 0.5473 | 0.1498 0.5553 | 0.1362 0.462
Max frequency, Bandwidth, MFCC: 0.3712 0.6442 | 0.1752 0.5257 | 0.1489 0.556 | 0.1714 0.4786
Max frequency, Bandwidth, MFCC, Chroma: 0.3629 0.6328 | 0.1922 0.5386 | 0.1533 0.5547 | 0.1578 0.4734
Max frequency, Bandwidth, MFCC, Chroma, Zero crossing rate: 0.3938 0.6529 | 0.1451 0.5259 | 0.1533 0.554 | 0.154 0.4832
Max frequency, Bandwidth, MFCC, Chroma, Zero crossing rate, RMS: 0.3888 0.6585 | 0.1879 0.5516 | 0.1585 0.5562 | 0.1533 0.4743
Max frequency, Bandwidth, MFCC, Chroma, RMS: 0.3763 0.6152 | 0.196 0.5462 | 0.1587 0.5551 | 0.1627 0.4793
Max frequency, Bandwidth, MFCC, Zero crossing rate: 0.3764 0.6364 | 0.1746 0.5168 | 0.1585 0.5562 | 0.1582 0.4701
Max frequency, Bandwidth, MFCC, Zero crossing rate, RMS: 0.3763 0.6792 | 0.1928 0.5134 | 0.1529 0.5558 | 0.1462 0.4672
Max frequency, Bandwidth, MFCC, RMS: 0.3601 0.6587 | 0.1953 0.513 | 0.1534 0.5558 | 0.1618 0.4694
Max frequency, Bandwidth, Chroma: 0.346 0.6101 | 0.1884 0.556 | 0.154 0.5549 | 0.1491 0.4792
Max frequency, Bandwidth, Chroma, Zero crossing rate: 0.3375 0.6313 | 0.163 0.56 | 0.1668 0.5556 | 0.1618 0.473
Max frequency, Bandwidth, Chroma, Zero crossing rate, RMS: 0.3212 0.6585 | 0.1833 0.5654 | 0.1538 0.556 | 0.158 0.4486
Max frequency, Bandwidth, Chroma, RMS: 0.3547 0.6103 | 0.1792 0.5522 | 0.158 0.5556 | 0.1531 0.4736
Max frequency, Bandwidth, Zero crossing rate: 0.2431 0.5857 | 0.1705 0.5545 | 0.1491 0.5554 | 0.158 0.4576
Max frequency, Bandwidth, Zero crossing rate, RMS: 0.2442 0.5817 | 0.1659 0.5547 | 0.154 0.5562 | 0.1451 0.4576
Max frequency, Bandwidth, RMS: 0.2264 0.5989 | 0.1665 0.556 | 0.1495 0.556 | 0.1572 0.4871
Max frequency, Centroid: 0.2438 0.56 | 0.1406 0.5554 | 0.1623 0.5562 | 0.1375 0.4661
Max frequency, Centroid, MFCC: 0.3804 0.6322 | 0.1752 0.5567 | 0.1571 0.5545 | 0.1582 0.492
Max frequency, Centroid, MFCC, Chroma: 0.3712 0.6574 | 0.1833 0.5389 | 0.1486 0.5551 | 0.1629 0.4995
Max frequency, Centroid, MFCC, Chroma, Zero crossing rate: 0.3937 0.6547 | 0.1754 0.5607 | 0.1531 0.556 | 0.1627 0.5013
Max frequency, Centroid, MFCC, Chroma, Zero crossing rate, RMS: 0.3639 0.6406 | 0.2049 0.5556 | 0.167 0.5554 | 0.1712 0.4955
Max frequency, Centroid, MFCC, Chroma, RMS: 0.3801 0.6411 | 0.1667 0.5596 | 0.1663 0.5547 | 0.1538 0.4745
Max frequency, Centroid, MFCC, Zero crossing rate: 0.3877 0.6217 | 0.1837 0.5507 | 0.162 0.5549 | 0.1498 0.5007
Max frequency, Centroid, MFCC, Zero crossing rate, RMS: 0.3504 0.6245 | 0.1837 0.5393 | 0.154 0.5556 | 0.171 0.4993
Max frequency, Centroid, MFCC, RMS: 0.3763 0.6545 | 0.188 0.5339 | 0.1672 0.5554 | 0.1538 0.4783
Max frequency, Centroid, Chroma: 0.3498 0.6572 | 0.1835 0.5475 | 0.1536 0.5547 | 0.1759 0.4705
Max frequency, Centroid, Chroma, Zero crossing rate: 0.3333 0.6362 | 0.1748 0.5592 | 0.1627 0.5554 | 0.1457 0.4326
Max frequency, Centroid, Chroma, Zero crossing rate, RMS: 0.3681 0.6665 | 0.158 0.5598 | 0.1621 0.5543 | 0.1585 0.4489
Continued on next page
