Academic year: 2021

Automatic breath detection in monophonic song recordings

An exploratory study into the detection of phrase boundaries in global datasets


Layout: typeset by the author using LaTeX.


Automatic breath detection in monophonic song recordings

An exploratory study into the detection of phrase boundaries in global datasets

Aafje Kapteijns
11857153

Bachelor thesis
Credits: 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisors: dr. W.H. Zuidema, B.J.M. Cornelissen MSc
Institute for Logic, Language and Computation
Faculty of Science, University of Amsterdam
Science Park 907, 1098 XG Amsterdam


Abstract

There has been renewed interest in the quantitative analysis of musical universals. One proposed universal is the predominance of arched and descending contours in vocal music, known as the “melodic arch hypothesis”. An explanation for this phenomenon is that the melodic arch in musical phrases reflects the efficient use of the vocal motor production system. Previous work has shown that the tendency towards arch-shaped contours generalizes cross-culturally, but the hypothesis has only been tested on a few cultural sources. Collecting global data is a challenge in the field of computational musicology, due to the absence of sheet music in non-Western datasets. Therefore, with the purpose of extracting melodic contour samples from audio recordings, an automatic breath detection algorithm was implemented on a global dataset (the Garland Encyclopedia of World Music) and a Dutch dataset (the Meertens Tune Collection). Using a Decision Tree and a Random Forest classifier trained on key features for breath detection, such as the Mel Frequency Cepstrum Coefficients, pitch, intensity and duration, the algorithm was able to detect breaths in the Meertens Tune Collection with a precision of 59.2% and a recall of 98.7%. Next, a quantitative contour analysis was performed on the melodic phrases from the two datasets. The average contour of the phrases from the Garland Encyclopedia of World Music showed a clear arch shape, but the average contour of the Meertens Tune Collection was horizontal. However, the most frequent contour types in both datasets were the arched and descending types, in line with the melodic arch hypothesis. Additionally, the unsupervised clustering method KMeans was applied to the melodic contours, showing the extent to which the contours form clusters without assigning contour types.


Acknowledgements

First and foremost, I would like to thank my supervisor Bas Cornelissen for giving me the opportunity to write my thesis closely related to my interests at the ILLC. He provided me with great guidance during our Zoom meetings and e-mail sessions and he gave me valuable feedback throughout my entire thesis. I would also like to thank dr. W.H. Zuidema for his feedback and enthusiasm during my project.

Secondly, I would like to express my gratitude to P. E. Savage, A. T. Tierney and A. D. Patel for making their annotations of the Garland Encyclopedia of World Music available to me. Without their annotations my thesis project would not have been the same.

Lastly, thanks to Marianne de Heer Kloots for connecting me with Bas Cornelissen and letting me use the beautiful illustration she made for the ILLC blog on my cover.


Contents

1 Introduction
  1.1 Empirical support for the melodic arch hypothesis
  1.2 Automatic breath detection
  1.3 Project goal

2 Automatic breath detection
  2.1 Theoretical foundation
    2.1.1 Key features
    2.1.2 Classification methods
  2.2 Approach
    2.2.1 An overview of the algorithm
    2.2.2 The creation of the used datasets
    2.2.3 Breath feature analysis
    2.2.4 Baseline: the Praat detection algorithm
    2.2.5 Feature extraction
    2.2.6 Tuning of the chosen classifiers
  2.3 Results for the automatic breath detection

3 The analysis of melodic contours
  3.1 Theoretical foundation
    3.1.1 Huron's dual approach
    3.1.2 Clustering methods
  3.2 Procedure for the contour analysis
    3.2.1 Extraction and manipulation of the contours
    3.2.2 Average and typological contour analysis
    3.2.3 Clustering method
  3.3 Results for the contour analysis
    3.3.1 Results for the average contour analysis
    3.3.2 Results for the typological contour analysis

4 Conclusion
  4.1 Review of the automatic breath detection algorithm
  4.2 Review of the contour analysis
  4.3 Future work

References

Appendices


Chapter 1

Introduction

Song can be found in every human culture (Mehr et al., 2019). While songs can exhibit great structural diversity, certain aspects of melodic shape have been found cross-culturally (Tierney, Russo & Patel, 2011). Research into musical universals contributes to debates about the biological and evolutionary origins of music. One important musical universal is the predominance of arched and descending melodic contours in vocal music (Sachs, in Savage, Tierney & Patel, 2017). This is called the “melodic arch hypothesis”. Tierney et al. (2011) explain the phenomenon with the “motor constraint hypothesis”: “This hypothesis proposes that certain widespread features of music reflect energetically efficient use of the vocal motor production system rather than evolutionary adaptations specific to human music.” (Savage et al., 2017). The predominance of arched and descending contours in music is reflected by this hypothesis, because “[...] [this] may reflect the tendency for air pressure beneath the vocal folds (“subglottal pressure") to increase rapidly at the start of a continuous vocalization and then decline gradually over the course of the vocalization.” (Savage et al., 2017). Higher pitches are therefore easier to produce at the beginning of a phrase, due to the higher subglottal pressure, so arched and descending pitch contours are more efficient in terms of energy expended than contours with opposite shapes, like U-shaped and ascending contours (Savage et al., 2017). Figure 1.1 shows a melody from Tierney et al. (2011) with its pitch sequence in semitones over time, illustrating the arched and descending contours in musical phrases.

1.1 Empirical support for the melodic arch hypothesis

The first empirical support for the melodic arch hypothesis was provided by Huron (1996). He performed a quantitative analysis on the Essen Folksong Collection (Selfridge-Field, 1995), a dataset with over 6000 Western folk songs. The collection includes the pitch sequences and rhythmic durations as well as phrase boundaries for all songs. Huron described nine different categories of melodic contours: ascending, descending, concave, convex (arch-shaped), horizontal-ascending, horizontal-descending, ascending-horizontal, descending-horizontal and horizontal. The convex and descending contour types proved to be the most frequent in the Essen Folksong Collection. Additionally, Huron concluded that the average melodic contour was arch-shaped. Figure 1.2 shows Huron's findings and illustrates the melodic arch as found in the average contour of the Essen Folksong Collection.

Figure 1.1: An example of a melody from a song, taken from Tierney et al. (2011) and originating from the Essen Folksong Collection, illustrating the arched and descending pitch contours described by the melodic arch hypothesis. A shows the melody in Western music notation, consisting of three phrases. B represents the melody as a pitch-time sequence, with pitch in semitones from the tonic. The blue lines represent the notes, the dashed lines the phrase boundaries and the red dots the mean pitches of the first, second and third part of each phrase. C shows the contour shapes of the phrases: ascending, arched and descending. Plot D shows the notes of B randomly reordered in time, to illustrate the larger jumps the reordered melody makes compared with the smooth melody in plot B.

Just like Huron, Tierney et al. (2011) used the Essen Folksong Collection. A dataset with bird sounds from different songbird families was added in order to explore the similarities and differences between bird music and human music. Tierney et al. (2011) hypothesize that animals with similar motor constraints will produce sounds with similar melodic contours. This prediction was confirmed by empirical analysis of diverse human and avian song samples (Tierney et al., 2011).


Figure 1.2: The average contour of the Essen Folksong Collection as found by Huron (1996). Huron (1996) ordered the phrases by length and plotted them separately. In this plot the 5-note, 6-note, 7-note and 8-note phrases are shown. His findings were consistent with the melodic arch hypothesis.

In addition to the Essen Folksong Collection, Savage et al. (2017) used a global dataset of field recordings from the Garland Encyclopedia of World Music (Nettl, Stone, Porter & Rice, 1998). 387 phrase contours were manually extracted from 35 songs from the Garland Encyclopedia of World Music. The results confirmed the earlier findings from Huron (1996) and Tierney et al. (2011), namely, that the average phrase contours of the different corpora were significantly arched and descending in both Western and non-Western examples (Savage et al., 2017).

It appears that the tendency towards arch-shaped contours generalizes cross-culturally, but because the global dataset from the Garland Encyclopedia of World Music is small and the larger Essen Folksong Collection is limited to Western music, the conclusions should be tested in other global corpus studies (Savage et al., 2017). Cornelissen, Zuidema and Burgoyne (2020) show, with a novel corpus of Gregorian chant, that support for the melodic arch hypothesis strongly depends on the formulation and typology used. This thesis uses their viewpoints as a foundation.


Collecting global data is a challenge in the field of computational musicology. Non-Western music corpora often consist of audio files only, due to the absence of sheet music. To extract melodic phrase contours from audio files, annotations of phrase boundaries are necessary. Savage et al. (2017) annotated the 387 phrases from the Garland Encyclopedia of World Music by hand, and defined a phrase as follows: “[...] periods of continuous vocalizing separated by breaths.”. This definition corresponds with the motor constraint hypothesis, which states that a continuous vocalization (between two breaths) is characterized by a rapid increase and gradual decline of subglottal pressure over its course, which could explain the predominance of arched and descending melodic contours. Defining phrases in audio as periods of continuous vocalizing separated by breaths results in phrase boundary annotations that coincide with breath pauses. Therefore, detecting breaths can be a useful method for extracting melodic phrases from audio files.

1.2 Automatic breath detection

Nakano, Ogata, Goto and Hiraga (2008) describe a method for automatically detecting breaths in monophonic vocal music with the purpose of removing unwanted sounds. Monophonic music is the simplest form of music, produced by a single source without accompaniment (Shashirekha, 2014). Three Hidden Markov Models were used to detect breath pauses, singing voices and silent sections in 27 song recordings from the RWC Music Database (Nakano et al., 2008). The overall precision and recall rates were 77.7% and 97.5%, respectively. The precision rate describes the percentage of detected breaths that are correct, while the recall rate describes the percentage of actual breaths that are detected by the system. A second study presenting an algorithm for automatic breath detection in vocal music was carried out by Ruinsky and Lavner (2007). A template-matching procedure was performed using breath features such as the Mel Frequency Cepstral Coefficients (MFCC), “[...] which are known for their ability to distinguish between different types of audio data.” (Ruinsky & Lavner, 2007). This study reported a recall rate of 97.6% and a precision rate of 95.7% on 22 song recordings from different datasets.

1.3 Project goal

This thesis project explores the extraction of melodic contours from monophonic vocal music, using algorithms for breath detection to annotate the phrase boundaries in audio files. The relevance of this thesis arises from the development of new samples of melodic phrases, which allows cross-cultural research in the field of computational musicology to proceed, in order to create a better understanding of variation in melodic contours in vocal music.

The purpose of this thesis is twofold. Firstly, an automatic breath detection algorithm was used to detect phrase boundaries in a selection of audio files. This answers the first research question: “How can melodic phrases be extracted from audio recordings of monophonic vocal music?”. Secondly, a quantitative melodic contour analysis was performed on the phrases from the selected audio files, to confirm and broaden the state-of-the-art description of cross-cultural variation in melodic contours. This addresses the second research question: “Is the quantitative analysis of the extracted melodic contours consistent with the melodic arch hypothesis?”. Additionally, an unsupervised clustering method is used to find clusters in the melodic contours resembling the typological categories defined by Huron.

This thesis project contributes to the field of computational musicology by accomplishing the following:

• the creation of new annotated musical data

• the reproduction of earlier work, by performing a quantitative analysis on the same dataset used by Savage et al. (2017)

• the extension of the quantitative analysis performed by Savage et al. (2017), by performing the typological approach in addition to the average approach

• the analysis of a novel dataset, for further evaluation of the melodic arch hypothesis

• the exploration of an automatic breath detection algorithm, with the purpose of collecting more non-Western samples of melodic phrases

• the exploration of an unsupervised clustering method on melodic contours

The layout of this thesis is as follows: in the next chapter the automatic breath detection is presented. The theoretical foundation is provided and the approach and implementation necessary for the project execution are described. Next, the obtained results of the automatic breath detection are presented. The third chapter consists of the analysis of the melodic contours. The theoretical foundation and the approach for this part of the thesis are given and the results are set out. The last chapter covers the discussion and the conclusion of the findings.


Chapter 2

Automatic breath detection

2.1 Theoretical foundation

This section presents the theoretical foundation for the automatic breath detection. The approach is motivated by examining related work and defining key concepts and algorithms used in this thesis.

2.1.1 Key features

As stated in the introduction, two studies described methods for automatic breath detection in vocal music (Nakano et al., 2008; Ruinsky & Lavner, 2007). In this section, a few other relevant papers are discussed, in order to create a broad understanding of the key features used in automatic breath detection.

Nakano et al. (2008) and Ruinsky and Lavner (2007) both used the Mel Frequency Cepstrum Coefficients (MFCCs) as key features. Automatic breath detection methods using cepstral coefficients are also presented by Ostendorf, Price, Bear and Wightman (1990) and Wightman and Ostendorf (1991), who applied their methods to speech signals.

MFCCs are short-term spectral features, known for their dominance in speech recognition (Logan, 2000). Logan (2000) investigated the value of MFCCs in music modeling and described their calculation. First, the signal is divided into small frames. For each frame, a Fourier transformation is applied and the logarithm of the amplitude spectrum is warped onto the Mel scale. The Mel scale is based on the human auditory system and is linear below 1 kHz and logarithmic above that limit, because humans do not perceive pitch linearly (Logan, 2000). The last step is transforming the resulting vectors with the Discrete Cosine Transform (DCT) to acquire 13 features per frame; together, these cepstral feature vectors describe the spectral envelope of the signal. Aside from the MFCCs, Ruinsky and Lavner (2007) use silence duration as a feature for automatic breath detection.
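To make these steps concrete, the MFCC pipeline can be sketched in plain NumPy/SciPy. This is an illustrative re-implementation, not the Praat/Parselmouth code used in this thesis; the frame length, hop size and filter count are arbitrary example values.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, frame_len=400, hop=160, n_filters=26, n_coeffs=13):
    """Compute MFCCs following the steps described by Logan (2000)."""
    # 1) Divide the signal into short overlapping frames and apply a window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2) Fourier transform: amplitude spectrum of each frame.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # 3) Warp onto the Mel scale with a triangular filter bank.
    freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2))
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        fbank[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                      (hi - freqs) / (hi - mid)), 0, None)
    mel_energies = spectrum @ fbank.T
    # 4) Take the logarithm and apply the DCT; keep the first 13 coefficients.
    log_mel = np.log(mel_energies + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_coeffs]

# Example on a synthetic 1 s tone at 16 kHz.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
coeffs = mfcc(np.sin(2 * np.pi * 220 * t), sr)
print(coeffs.shape)  # (98, 13): one 13-dimensional vector per frame
```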

Shashirekha (2014) also proposes a method using the thirteen MFCC features. The mean and variance of these features are used to train a linear classifier that classifies monophonic music into a category or genre. To calculate the MFCC features, this thesis makes use of Praat (Boersma & Weenink, 2018), a system for performing phonetic research by computer, and Parselmouth, a Python library for the Praat software (Jadoul, Thompson & de Boer, 2018).

Another key feature used for automatic breath detection is the intensity of a signal. Praat contains a built-in algorithm for detecting silences, available via “Annotate to TextGrid (silences)”. The algorithm computes a TextGrid file containing all intervals with annotations (silent or sounding), based on a silence intensity threshold and a minimum silence duration. The algorithm is based on a script created by de Jong and Wempe (2009) for finding syllable nuclei. Since pauses tend to have lower energy than the vowels within syllables, the intensity was used to find peaks in the energy contour and thus locate the nuclei (de Jong & Wempe, 2009). First, the sound is bandpass filtered between 80 and 8000 Hz, so that low-frequency noise is removed without audibly changing the sound, which improves the intensity measurement (Boersma & Weenink, 2018). The intensity contour is then evaluated with respect to the intensity threshold. Sounding and silent intervals are removed if their duration is shorter than the minimum sounding or silence duration, and neighbouring intervals are merged whenever an interval is removed (Boersma & Weenink, 2018).

2.1.2 Classification methods

Previous work shows that breaths can be detected using several key features of audio signals: the MFCCs, duration and intensity. In this section, a few classification methods used in related work are cited and explained.

In some papers, Hidden Markov Models (HMMs) were applied in breath detection experiments (Wightman & Ostendorf, 1991; Nakano et al., 2008). The usage of HMMs in music or speech analysis is valuable, because “[...] they have a preferable nature of being able to deal with variant time-length events, and are able to track data variance along time.” (Nakano et al., 2008). In other words, HMMs tend to perform well on time series. HMMs compute probabilities of sequences of variables with observed events and hidden events (Jurafsky & Martin, 2008). Hidden events cannot be observed directly, but are still important for the probabilistic model. In the papers by Wightman and Ostendorf (1991) and Nakano et al. (2008), the HMMs were used to classify a signal as a breath pause or a silent section.


Lavner and Ruinsky (2009) present an algorithm to classify audio signals as speech or music. The classifier chosen for their algorithm was the Decision Tree, used in combination with an extensive feature selection: “[...] the decision tree uses simple heuristics, which are easy to implement in a variety of hardware and software frameworks.” (Lavner & Ruinsky, 2009). The Decision Tree is a supervised learning method. It builds a model by learning basic decision rules and predicts the outcome for a target variable (Pedregosa et al., 2011). The trees are simple to interpret and can be applied to small datasets while the results remain understandable.

An extension of the Decision Tree classifier is the Random Forest: a supervised ensemble learning algorithm. It fits multiple Decision Trees on different parts of the dataset and combines the predictions of all trees, for example by majority vote. The usefulness of the Random Forest classifier is that it often improves the accuracy of the model and reduces overfitting.

2.2 Approach

This section describes the system developed to detect breaths in audio signals. Furthermore, the datasets used and the preprocessing and feature extraction are described.

2.2.1 An overview of the algorithm

In this thesis an algorithm is presented that automatically detects breath intervals in audio signals. An overview of the pipeline of the created algorithm is shown in Figure 2.1. The key objective is the extraction of musical phrases, to be able to analyse the phrase contour plots in the audio recordings. Considering this goal, the algorithm aims to meet the following requirements:

1) The algorithm should take a dataset of audio recordings as an input and it should output the breath and phrase intervals.

2) The algorithm should correctly identify most phrases, so contour plots can be derived from audio recordings.

3) The algorithm should use the key features and classification methods, as seen in related research in automatic breath detection, to contribute to the state of the art techniques.

4) The algorithm should be evaluated with precision and recall rates, to compare the performance with the achievements of previous work.


Figure 2.1: An overview of the algorithm for the automatic breath detection. The datasets used (the MTC Phrase detection subset and the Garland Encyclopedia subset) are divided into training and test sets. A baseline is created by running the Praat "To TextGrid (silences)" algorithm. Next, all audio is segmented into 0.1-second fragments and the intensity, MFCC matrix and pitch values are calculated. A Decision Tree and a Random Forest classifier are trained with the baseline predictions and these features, and tested on the test sets. The predicted breath pauses are evaluated against the annotated breaths.


2.2.2 The creation of the used datasets

The datasets used in this project are a subset of the Meertens Tune Collection (van Kranenburg & de Bruin, 2019) and a subset of the Garland Encyclopedia of World Music (Nettl, Stone, Porter & Rice, 1998). The former was selected for its accessibility, while the latter, also used by Savage et al. (2017), was selected for its diversity and global coverage.

The Meertens Tune Collection provides several collections of the Dutch Song Database (Nederlandse Liederenbank), a database consisting of Dutch and Flemish songs dating from the 16th until the 21st century. The collection “Onder de Groene Linde” (MTC-OGLAUDIO-1.0) contains circa 7000 audio recordings collected by Dutch field workers (van Kranenburg & de Bruin, 2019). These audio recordings were filtered on singer ID. Audio recordings marked with multiple singer IDs, indicating polyphonic music, were excluded. In order to avoid selecting multiple songs from one singer, duplicate singer IDs were dropped. Finally, 100 audio recordings were randomly selected.

Breath pauses in 20 audio recordings were annotated by hand and saved in TextGrid files. To minimize the subjectivity of the annotations, an annotation scheme was established (see Table 2.1). To give every recording the same weight in the system, despite differences in duration, a maximum of 15 phrases was extracted from each recording. The audio recordings were trimmed to cover those 15 phrases. A total of 233 phrases was obtained. The annotated subset of the Meertens Tune Collection used in this thesis is named the MTC Phrase detection subset.


Table 2.1: The annotation scheme used to annotate 233 phrases from 20 audio recordings in the Meertens Tune Collection. Audio recordings were discarded if parts contained polyphonic music, background music or too much noise. To obtain a diverse dataset, a maximum of 15 phrases was annotated per recording, under the assumption that phrases from the same recording are similar.

Step 1  Listen to the audio file; discard if:
        • polyphonic or background music
        • too much noise, resulting in gaps in the pitch contour
Step 2  Annotate the beginning (“begin”)
        Annotate breath pauses throughout the song as “1”
        Annotate musical pauses as “pause”
        Annotate the ending (“end”) after annotating 15 phrases
Step 3  Save as TextGrid

The second dataset used in this thesis is the Garland Encyclopedia of World Music. This dataset consists of nine volumes, containing circa 300 audio recordings, and represents a diverse mix of genres from different parts of the world. Savage et al. (2017) annotated 208 phrases from 16 recordings in European languages and 179 phrases from 19 recordings in non-European languages. These annotations were made available for this project. 10 recordings without parts containing instrumental or group music were used as a test set for the automatic breath detection; this subset will be called the Garland Encyclopedia subset. For the contour analysis in Chapter 3, the MTC phrase detection subset will be used, as well as the whole annotated dataset of Savage et al. (2017), which will be referred to as the Garland Encyclopedia of World Music.


2.2.3 Breath feature analysis

To get a grasp of the important features of breath and phrase sounds, the annotated silence and phrase intervals of the MTC phrase detection subset and the Garland Encyclopedia subset were analysed. The duration and the mean of the pitch values were derived from each sound object. Tables 2.2 and 2.3 show the resulting statistics.

Table 2.2: Breath statistics for the two datasets: the MTC Phrase detection subset and the Garland Encyclopedia subset. Note that the two datasets differ in breath duration: the breaths of the Garland Encyclopedia subset are substantially longer than those of the MTC Phrase detection subset. The average pitch of the Garland breaths is also higher, but both values are very low, which indicates the near-absence of pitched content in the breaths.

Dataset                       Recordings  Breaths  Avg pitch (Hz)  Avg dur (s)  Min (s)  Max (s)  Stdev (s)
MTC Phrase detection subset   20          210      35.77           0.55         0.11     1.90     0.27
Garland Encyclopedia subset   10          135      69.42           1.75         0.031    7.70     2.10

Table 2.3: Phrase statistics for the two datasets: the MTC Phrase detection subset and the Garland Encyclopedia subset. The latter shows a relatively high average phrase duration compared to the former.

Dataset                       Recordings  Phrases  Avg pitch (Hz)  Avg dur (s)  Min (s)  Max (s)  Stdev (s)
MTC Phrase detection subset   20          233      220.61          3.85         0.36     9.55     1.79
Garland Encyclopedia subset   10          135      243.28          6.99         1.63     22.80    4.41


2.2.4 Baseline: The Praat detection algorithm

The automatic breath detection system is based on the key detection features and classification algorithms described in previous work. The Praat built-in detection algorithm, created by de Jong and Wempe (2009), was selected as a first preprocessing step. The motivation for this choice is twofold. Firstly, the algorithm is based on simple parameters, namely the intensity curve of a sound interval and the minimum silence duration: when the intensity drops below a chosen threshold for at least the chosen amount of time, the interval is identified as a silence. Secondly, the algorithm takes the temporal aspect of a song recording into account when identifying silences, due to this dual-parameter detection. The predictions of this algorithm form the baseline for the automatic breath detection algorithm used in this thesis; more features are added later to increase its performance.
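The dual-parameter idea can be illustrated with a short sketch. This is a simplified stand-in for Praat's detection, not its actual implementation; among other things, Praat specifies the threshold relative to the maximum intensity, which is omitted here.

```python
import numpy as np

def detect_silences(intensity_db, frame_dt, threshold_db=-25.0, min_dur=0.25):
    """Flag silence intervals: intensity below a threshold for at least min_dur seconds.

    intensity_db is a sampled intensity contour, frame_dt the spacing between
    samples in seconds. Returns (start, end) times of detected silences.
    """
    below = intensity_db < threshold_db
    intervals, start = [], None
    for i, b in enumerate(below):
        if b and start is None:
            start = i                       # a quiet run begins
        elif not b and start is not None:
            if (i - start) * frame_dt >= min_dur:
                intervals.append((start * frame_dt, i * frame_dt))
            start = None                    # run too short: discard
    if start is not None and (len(below) - start) * frame_dt >= min_dur:
        intervals.append((start * frame_dt, len(below) * frame_dt))
    return intervals

# Toy contour: 0.5 s of sound, 0.3 s of silence, 0.5 s of sound (10 ms frames).
contour = np.array([-10.0] * 50 + [-40.0] * 30 + [-10.0] * 50)
print(detect_silences(contour, 0.01))  # [(0.5, 0.8)]
```

Both knobs matter: lowering the threshold or raising the minimum duration trades recall for precision, exactly the trade-off tuned by the Grid Search below.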

The audio recordings from the MTC Phrase detection subset are randomly divided into a training set (16 recordings) and a test set (4 recordings). With the Parselmouth library (Jadoul, Thompson & de Boer, 2018), TextGrid files containing the predicted silence intervals are created with the “Annotate to TextGrid (silences)” algorithm. The first step in implementing the Praat algorithm is choosing the best parameters. A Grid Search over the two hyperparameters, the intensity threshold and the minimum silence duration, is performed on the training set. The parameter set yielding the highest F-score on the training set, when evaluated against the annotated silence intervals, is used for predicting the silences in the test set.

Analysis of the breath sounds revealed that breath pitch contours consist predominantly of zero values, but that some nonzero pitch values are still present. Therefore, the pitch contour of every silence predicted by the Praat algorithm is checked: when more than 20 percent of its pitch values are nonzero, the prediction is discarded.
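This filtering step amounts to a one-line check. The sketch below assumes the pitch contour has already been extracted (Praat reports 0 Hz for unvoiced frames); `is_breath_candidate` is a hypothetical helper name, not code from the thesis.

```python
import numpy as np

def is_breath_candidate(pitch_contour, max_voiced_fraction=0.2):
    """Keep a predicted silence only if its pitch contour is almost entirely unvoiced.

    Praat reports 0 Hz for unvoiced frames, so a genuine breath should consist
    predominantly of zeros; mostly-voiced intervals are discarded as singing.
    """
    voiced = np.count_nonzero(pitch_contour) / len(pitch_contour)
    return voiced <= max_voiced_fraction

print(is_breath_candidate(np.array([0.0, 0.0, 0.0, 0.0, 210.0])))  # True  (1/5 voiced)
print(is_breath_candidate(np.array([0.0, 200.0, 205.0, 0.0])))     # False (1/2 voiced)
```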

The evaluation metrics for the predicted silences of the Praat algorithm are shown in Table 2.4. Evaluating the predicted silences is based on the presence of overlap between the annotated and predicted silences. A predicted silence is considered a true positive (tp) if its interval overlaps with an annotated silence interval, and a false positive (fp) if it does not. False negatives (fn) are annotated intervals that are not predicted by the algorithm. Predicted intervals that correspond with annotated beginnings, pauses or endings are not counted as incorrect.
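The overlap-based evaluation can be sketched as follows. This is an illustrative version that omits the special handling of annotated beginnings, pauses and endings described above.

```python
def overlaps(a, b):
    """True if the half-open intervals a=(start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def evaluate(predicted, annotated):
    """Overlap-based precision and recall for predicted breath intervals.

    A prediction is a true positive if it overlaps any annotated breath;
    annotated breaths with no overlapping prediction are false negatives.
    """
    tp = sum(any(overlaps(p, a) for a in annotated) for p in predicted)
    fp = len(predicted) - tp
    fn = sum(not any(overlaps(p, a) for p in predicted) for a in annotated)
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if annotated else 0.0
    return precision, recall

pred = [(0.5, 0.8), (2.0, 2.2), (5.0, 5.3)]   # three predicted silences (s)
gold = [(0.55, 0.75), (5.1, 5.4)]             # two annotated breaths (s)
print(evaluate(pred, gold))  # (0.6666666666666666, 1.0)
```

The F-score used for the Grid Search is then the harmonic mean of these two rates.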


Table 2.4: The evaluation metrics for the Praat algorithm. The best parameters for the algorithm are found by performing a Grid Search on the training set. The F-score, precision and recall scores are calculated for the test set of the MTC Phrase detection subset. The Praat algorithm is also applied to 8 recordings of the Garland Encyclopedia subset and tested on 2 recordings. The precision rates for both datasets are high. The recall rate is slightly higher for the Garland Encyclopedia subset.

Dataset                       Recordings  Best parameter set                        tp  fp  fn  Precision  Recall  F-score
MTC Phrase detection subset   4           threshold: -25 dB, min duration: 0.25 s   33  4   32  89.2%      50.8%   64.7%
Garland Encyclopedia subset   2           threshold: -30 dB, min duration: 0.25 s   13  5   8   72.2%      61.9%   66.7%

2.2.5 Feature extraction

The shortest breath pause annotated in the dataset lasts circa 0.1 second (see Table 2.2). In order to extract features from intervals of equal length, longer breaths as well as phrase intervals are divided into 0.1-second fragments. The silence intervals predicted by the Praat algorithm are used to determine a prediction for each fragment (1 for predicted silence, 0 for predicted sounding). A fragment is labeled 1 if it has any overlap with a predicted silence interval. If an annotated breath pause overlapped with more than one predicted breath pause, the predicted breath pauses were counted as incorrect.

The following features are calculated for each fragment: (1) the minimum, maximum and mean of the intensity curve, (2) the mean and variance of the pitch values and (3) the mean and variance of the MFCC matrix. The features are extracted with Parselmouth. A Parselmouth sound object is derived with filename.to_sound(). The intensity (1) and pitch values (2) are obtained with sound.to_intensity().values and sound.to_pitch().selected_array['frequency'] respectively. The MFCC matrix (3) for a sound object is extracted using sound.to_mfcc().to_matrix_features(). This method divides the sound signal into 200 samples with a window length of 0.025 seconds and calculates the 13 MFCC features. The mean and variance of the MFCC matrix describe the power spectrum of the fragment, based on all samples.
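Given the intensity, pitch and MFCC arrays produced by the Parselmouth calls above, the per-fragment feature vector can be sketched as (the function and key names are hypothetical):

```python
import numpy as np

def fragment_features(intensity, pitch, mfcc):
    """Summarize one 0.1-second fragment: intensity min/max/mean,
    pitch mean/variance, and MFCC mean/variance."""
    intensity = np.asarray(intensity, dtype=float)
    pitch = np.asarray(pitch, dtype=float)
    mfcc = np.asarray(mfcc, dtype=float)
    return {
        "intensity_min": intensity.min(),
        "intensity_max": intensity.max(),
        "intensity_mean": intensity.mean(),
        "pitch_mean": pitch.mean(),
        "pitch_var": pitch.var(),
        "mfcc_mean": mfcc.mean(),
        "mfcc_var": mfcc.var(),
    }

feats = fragment_features([50.0, 60.0, 70.0], [220.0, 220.0], np.zeros((13, 4)))
```

Each fragment thus yields seven numeric features, which together with the Praat prediction form one row of the training DataFrame.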


The extracted features and the Praat prediction for each fragment form two Pandas DataFrames (McKinney, 2010): the training set (16 audio recordings) and the test set (4 audio recordings) of the MTC Phrase detection subset. These DataFrames are used to train and validate the classifiers.

2.2.6 Tuning of the chosen classifiers

The classifiers chosen for this project are the Decision Tree and the Random Forest classifier. Because the algorithm is applied to small datasets, the ability to interpret the outcome of a classifier is important. The two classifiers are based on basic decision rules and are easy to implement due to their use of simple heuristics (Lavner & Ruinsky, 2009). The classifiers are implemented with scikit-learn (Pedregosa et al., 2011). They require a number of hyperparameters, so a Randomized Search is performed. The Randomized Search draws random combinations from a grid of specified hyperparameters to optimize the parameters of the classifier. In contrast to the Grid Search used to optimize the two Praat algorithm parameters, not all parameter combinations are tested; instead, a fixed number of random samples is drawn. Tables 2.5 and 2.6 show the Randomized Search settings for the Decision Tree and the Random Forest respectively. The remaining parameters are left at their default values. For a visual representation of the importance of the different hyperparameters for the Randomized Search, see Appendix 1.


Table 2.5: Decision Tree parameter settings for the Randomized Search. The parameter param_distributions takes a dictionary consisting of parameter names of the classifier and lists of values. For the Decision Tree, the hyperparameters max_depth, min_samples_split, min_samples_leaf and max_leaf_nodes are used. The max_depth limits the depth of the tree and decreases the training time. The parameter min_samples_split contains the minimum number of samples a node should have to be nominated for a split. The parameter min_samples_leaf contains the number of samples in a node to become a leaf node. Lastly, max_leaf_nodes limits the number of leaf nodes.

Parameter            Grid
estimator            DecisionTreeClassifier(random_state=0)
param_distributions  "max_depth": [x for x in range(1, 1000)],
                     "min_samples_split": [x for x in range(2, 1000)],
                     "min_samples_leaf": [x for x in range(1, 400)],
                     "max_leaf_nodes": [x for x in range(2, 1000)]
n_iter               500
random_state         0
scoring              ["f1", "precision", "recall", "accuracy"]
refit                "f1"
cv                   model_selection.StratifiedKFold(n_splits=5)

The Randomized Search takes param_distributions with the purpose of preventing the classifier from overfitting on the training set and helping regularize the trees. The parameter n_iter takes the number of parameter sets to be randomly chosen. The scoring parameter, together with refit, determines which evaluation metric decides which parameter setting performs best. Randomized Search also provides a method to perform cross-validation. The parameter cv determines the cross-validation splitting strategy (Pedregosa et al., 2011). The training data is split into five stratified folds: the proportions of the target classes (silence and phrase) in the training dataset are preserved in each fold, so the classes are equally represented in each fold.
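The search setup can be sketched with scikit-learn as follows. This is a sketch, not the thesis code: it runs on a small synthetic stand-in for the fragment DataFrame and uses a reduced n_iter so it finishes quickly, while the thesis uses n_iter=500 and the full grids from Tables 2.5 and 2.6:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.datasets import make_classification

# Synthetic stand-in for the 7 per-fragment features and silence/phrase labels.
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

param_distributions = {
    "max_depth": list(range(1, 1000)),
    "min_samples_split": list(range(2, 1000)),
    "min_samples_leaf": list(range(1, 400)),
    "max_leaf_nodes": list(range(2, 1000)),
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=20,                       # thesis setting: 500
    scoring=["f1", "precision", "recall", "accuracy"],
    refit="f1",                      # best parameter set is chosen by F-score
    cv=StratifiedKFold(n_splits=5),  # stratified 5-fold cross-validation
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
```

With multi-metric scoring, refit="f1" tells scikit-learn to refit the estimator on the whole training set using the parameter set with the highest mean cross-validated F-score.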


Table 2.6: Random Forest parameter settings for the Randomized Search. In addition to the hyperparameters for the Decision Tree, n_estimators, max_features, min_weight_fraction_leaf and bootstrap are used for the Random Forest. The parameter n_estimators contains the number of Decision Trees created, while max_features determines the number of features to consider when looking for a split. min_weight_fraction_leaf contains the fraction of the sum of the weights of all training samples that is required to be at a leaf node. The bootstrap parameter takes a boolean value. When set to True, the training samples for each tree are drawn with replacement. When set to False, the whole training set is used for each tree.

Parameter            Grid
estimator            RandomForestClassifier(random_state=0)
param_distributions  "n_estimators": [x for x in range(1, 1000)],
                     "max_depth": [x for x in range(1, 1000)],
                     "min_samples_split": [x for x in range(2, 1000)],
                     "min_samples_leaf": [x for x in range(2, 600)],
                     "max_leaf_nodes": [x for x in range(2, 1000)],
                     "max_features": [0, 1, 2, 3, 4, 5, 6, 7, 8],
                     "min_weight_fraction_leaf": [0.0, 0.1],
                     "bootstrap": [True, False]
n_iter               500
random_state         0
scoring              ["f1", "precision", "recall", "accuracy"]
refit                "f1"
cv                   model_selection.StratifiedKFold(n_splits=5)

The classifiers are fitted on the training data from the MTC Phrase detection subset. The parameter set that produced the highest F-score on the training set is used to predict the target values of the test set. The best parameter sets for the Decision Tree classifier and the Random Forest classifier are presented in Table 2.7.

The predicted target values are compared with the annotated values of the test set, after which the precision, recall and F-score are calculated for both classifiers. Lastly, the automatic breath detection algorithm is tested on the other dataset: the Garland Encyclopedia subset. To get a clearer picture of the performance of the algorithm, it is also trained and tested on the Garland Encyclopedia subset, as well as trained on the Garland Encyclopedia subset and tested on the MTC Phrase detection subset.


Table 2.7: Best parameter sets for the Decision Tree and the Random Forest for the MTC Phrase detection subset. These sets produced the highest F-score on the training set and are used to predict the test set. For each 0.1-second segment, it is decided whether the segment is a silence (1) or a phrase (0) interval.

Classifier     Parameter set
Decision Tree  "min_samples_split": 89, "min_samples_leaf": 44,
               "max_leaf_nodes": 343, "max_depth": 714
Random Forest  "n_estimators": 883, "min_samples_split": 61,
               "min_samples_leaf": 50, "max_leaf_nodes": 785,
               "max_features": 8, "max_depth": 298, "bootstrap": False

2.3 Results for the automatic breath detection

This paragraph presents the results for the automatic breath detection on the MTC Phrase detection subset and the Garland Encyclopedia subset. Once again, the calculation of the true positives, false positives and false negatives is based on the presence of overlap between the annotated and predicted silences. Table 2.8 lists the evaluation metrics for the Decision Tree classifier and the Random Forest classifier. The F-score of the baseline, the Praat algorithm, is shown for comparison. The precision and recall rates for the algorithm trained and tested on the MTC Phrase detection subset are 52.4% and 98.9% respectively using the Decision Tree classifier, and 59.2% and 98.7% using the Random Forest classifier. So, the algorithm finds 98.7% of the annotated breaths in the MTC Phrase detection subset, and 59.2% of the predicted breaths are correct.
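The reported rates follow directly from the overlap-based counts; for instance, the Random Forest row (tp = 74, fp = 51, fn = 1) reproduces the percentages quoted above:

```python
def interval_metrics(tp, fp, fn):
    """Precision, recall and F-score from overlap-based interval counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

p, r, f = interval_metrics(74, 51, 1)
print(round(100 * p, 1), round(100 * r, 1), round(100 * f, 1))  # 59.2 98.7 74.0
```

The same function applied to the baseline counts (tp = 33, fp = 4, fn = 32) yields the 89.2% / 50.8% / 64.7% of Table 2.4.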

The F-score for the algorithm decreases from 74.0% to 57.1% when trained and tested on the other dataset, the Garland Encyclopedia subset. However, the precision rate using the Random Forest classifier increases from 59.2% to 80.0% on this dataset. The evaluation metrics when training on one dataset and testing on the other are significantly lower. This suggests that computational models do not generalize across different musical traditions. Thus, this finding stresses the importance of testing hypotheses about musical universals on both Western and non-Western datasets.

In comparison to the baseline, the F-score shows an increase for the MTC Phrase detection subset, namely from 64.7% to 74.0% when using the Random Forest classifier. However, for the Garland Encyclopedia subset, the F-score shows a significant decrease. So, with relatively simple features (intensity and duration), the baseline algorithm performed better on the recordings of the Garland Encyclopedia subset.

Table 2.8: Evaluation metrics for the automatic breath detection algorithm. The algorithm is trained with 16 recordings from the MTC Phrase detection subset and 8 recordings from the Garland Encyclopedia subset, and tested with 4 and 2 recordings respectively. Important to note is that, compared to the baseline, the F-score increases for the MTC Phrase detection subset but decreases for the Garland Encyclopedia subset. The low rates when training and testing on different datasets are worth mentioning.

Classifier     Training set  Test set  tp  fp  fn  Precision  Recall  F-score
Baseline       MTC           MTC       33   4  32  89.2%      50.8%   64.7%
Decision Tree  MTC           MTC       89  81   1  52.4%      98.9%   68.5%
Random Forest  MTC           MTC       74  51   1  59.2%      98.7%   74.0%
Baseline       Garland       Garland   13   5   8  72.2%      61.9%   66.7%
Decision Tree  Garland       Garland    8   3  10  72.7%      44.4%   55.2%
Random Forest  Garland       Garland    8   2  10  80.0%      44.4%   57.1%
Decision Tree  MTC           Garland    2  27  16   6.9%      11.1%    8.5%
Random Forest  MTC           Garland    2  27  16   6.9%      11.1%    8.5%
Decision Tree  Garland       MTC        0   8  55   0%         0%      0%
Random Forest  Garland       MTC        0   8  55   0%         0%      0%


In Figure 2.2, the amplitude of one audio recording from the MTC Phrase detection subset is plotted. The annotated silences, the silences predicted by the baseline, and the silences predicted by the Random Forest (trained and tested on the MTC Phrase detection subset) are indicated to show the performance visually.

Figure 2.2: Plot of the annotated and predicted silences. The amplitude of one audio recording in the MTC Phrase detection subset is plotted against time. A shows the hand-annotated silences in red. In plot B the yellow intervals represent the silences predicted by the baseline: the Praat algorithm. C shows in grey the silence intervals as predicted by the Random Forest classifier. These silence intervals are similar to the annotated silence intervals. The baseline predictions in B miss many of the annotated intervals in A.


Chapter 3

The analysis of melodic contours

3.1 Theoretical foundation

In this paragraph the theoretical foundation for the analysis of melodic contours is provided. As stated in the introduction, Huron (1996) provided the first empirical support for the melodic arch hypothesis. His procedure and the approaches taken by Tierney et al. (2011) and Savage et al. (2017) to analyse melodic contours are described in this paragraph.

3.1.1 Huron’s dual approach

Huron (1996) performed his quantitative analysis on the Essen Folksong Collection (Selfridge-Field, 1995). He translated the pitches of all phrases in the collection to numerical values: the semitone pitch distance from middle C (C4). Firstly, all phrases of a given length in notes were averaged together. This resulted in average contour plots of the phrases in the Essen Folksong Collection, which showed a predominance of arch-shaped contours. Huron addresses one problem: averaging together a dataset of phrases with descending contours and ascending contours "[...] will result in an arch shape that fails to characterize accurately the contour of any of the original phrases." (Huron, 1996). The resulting average will be a horizontal contour, which is not present in the original dataset. Therefore, Huron (1996) also classifies each phrase in the collection into one of nine simple categories: ascending, descending, concave, convex (arch-shaped), horizontal ascending, horizontal descending, ascending horizontal, descending horizontal and horizontal. The first and last pitch were extracted, and the average of the pitches in the remaining part of the phrase was calculated (Huron, 1996). The contour type was determined by observing the three pitches. For instance, when the first and last pitch are lower than the middle pitch, the contour type is arch-shaped.

Tierney et al. (2011) followed Huron's typological approach and classified all songs in the Essen Folksong Collection into the nine categories. Their method was slightly different: Tierney et al. (2011) converted all phrases into continuous pitch-time functions, from which 50 equally spaced time samples were extracted. To determine the contour type, each melodic phrase was divided into three equal time frames and the mean pitch of each frame was calculated. Pitches were considered equal when the absolute difference did not exceed 0.2 semitones. Tierney et al. (2011) confirmed Huron's earlier findings.

Savage et al. (2017) compared the earlier findings of Tierney et al. (2011) with a novel dataset: the Garland Encyclopedia of World Music. They tracked the pitch of the audio recordings with Praat and converted it from frequency to semitones with formula 3.1, "where F is the frequency of the data point in Hz and mean(F) is the mean frequency of the contour" (Savage et al., 2017).

ST = 12 · log2(F / mean(F))    (3.1)

They follow the method of Tierney et al. (2011) to convert the phrases into continuous functions, but take Huron's (1996) average contour approach instead of the typological approach. They tested their average findings with 1000 random permutations, to confirm that the observed values were significantly different from the random distribution (Savage et al., 2017).
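Formula 3.1 is straightforward to apply to a Praat pitch track; a minimal sketch (variable and function names are illustrative):

```python
import numpy as np

def to_semitones(freqs_hz):
    """Convert a frequency contour (Hz) to semitones relative to the
    contour's mean frequency (formula 3.1)."""
    f = np.asarray(freqs_hz, dtype=float)
    return 12.0 * np.log2(f / f.mean())

st = to_semitones([220.0, 440.0])
```

Because the reference is the contour's own mean, the resulting values are relative: a constant contour maps to all zeros, and two points an octave apart always end up 12 semitones apart.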

3.1.2 Clustering methods

Huron's (1996) typological approach classifies melodic contours into nine contour types by determining the type of each contour individually. Clustering methods might be able to group the contours into contour types in an unsupervised manner, given their demonstrated relevance in music computation. For instance, Turnbull and Elkan (2005) used the clustering method KMeans on pitch features of audio sounds to determine the musical genre of the audio files. KMeans is an unsupervised machine learning algorithm that clusters data by separating the samples into groups of equal variance. It aims to minimize the inertia of the clusters, a measure of how coherent the clusters are, derived with formula 3.2 (Pedregosa et al., 2011).

Σ_{i=0}^{n} min_{μ_j ∈ C} ||x_i − μ_j||^2    (3.2)


3.2 Procedure for the contour analysis

This paragraph describes the procedure followed in this project to perform a quantitative melodic contour analysis. Firstly, the extraction and manipulation of the pitch contours is described. Secondly, the procedures for the average and the typological contour analysis are set out. Lastly, the unsupervised clustering method used for an additional analysis of the contours is explained.

3.2.1 Extraction and manipulation of the contours

This project follows the procedures of Huron (1996), Tierney et al. (2011) and Savage et al. (2017) for contour extraction and manipulation. The annotated phrase intervals are used to extract phrase sound objects for the Garland Encyclopedia of World Music and the MTC Phrase detection subset. Following the method of Savage et al. (2017), the pitch contours were tracked using Praat. The audio files are split into phrases using Parselmouth.

All phrases were converted into continuous pitch-time functions. The pitch values were extracted using Parselmouth and translated to numerical semitone values with formula 3.1. The pitch times were extracted using the method .to_pitch().xs. 50 equally spaced time samples were taken using the SciPy class interpolate.interp1d. 1-D interpolation creates a function based on a fixed set of data points, the time samples, and uses linear interpolation to derive the pitch values. The pitch contours were normalized to have mean zero and a duration of 1 second. Significant outliers, caused by incorrect pitch tracking, were aligned with the rest of the contour by replacing the value of the outlier with the average of the pitch values before and after it.

3.2.2 Average and typological contour analysis

The procedure for the average contour analysis was as follows: the mean of all pitch contours was taken for each of the 50 time samples. This was carried out for both datasets and compared to former research. For the typological approach, the procedure demonstrated by Tierney et al. (2011) is followed. All phrases were divided into three equal time frames and the mean pitch of each part was calculated. Based on the three pitches, all phrases were classified into one of Huron's nine contour categories, and the number of phrases belonging to each contour category was counted.
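The resampling described in the previous section and the averaging step above can be sketched together. This is a sketch with illustrative names; interp1d's default is linear interpolation, so np.interp is used here as a dependency-light equivalent:

```python
import numpy as np

def resample_contour(times, semitones, n_samples=50):
    """Linearly resample a pitch contour to 50 equally spaced points and
    normalize it to mean zero (the time axis is implicitly rescaled)."""
    grid = np.linspace(times[0], times[-1], n_samples)
    values = np.interp(grid, times, semitones)
    return values - values.mean()

# The average contour is the mean over all phrases at each of the 50 samples.
contours = np.stack([
    resample_contour([0.0, 0.4, 1.1], [-2.0, 3.0, -1.0]),
    resample_contour([0.0, 0.8, 1.5], [-1.0, 2.0, -2.0]),
])
average_contour = contours.mean(axis=0)  # shape (50,)
```

Resampling every phrase to the same 50-point grid is what makes the point-wise averaging (and later the KMeans clustering) well defined, regardless of the original phrase durations.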

Pitches were marked as equal if the absolute difference did not exceed the 0.2 semitones tolerance. A tolerance of 1.0 semitone was also tested, to observe the impact of this parameter setting (Cornelissen et al., 2020). Figure 3.1 shows plots of the nine contour categories using the pitch contours of the Garland Encyclopedia of World Music.
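The classification rule can be sketched as follows; the three-frame comparison and the tolerance follow the description above, while the mapping from comparison patterns to the nine type names is an assumption about how the categories combine:

```python
import numpy as np

def contour_type(contour, tol=0.2):
    """Classify a contour into one of Huron's nine types from the mean
    pitches of its three equal time frames (tol in semitones)."""
    a, b, c = (part.mean() for part in np.array_split(np.asarray(contour), 3))

    def rel(x, y):
        if abs(x - y) <= tol:
            return "="
        return "<" if x < y else ">"

    names = {
        "<<": "ascending", ">>": "descending",
        "<>": "convex", "><": "concave", "==": "horizontal",
        "=<": "horizontal-ascending", "=>": "horizontal-descending",
        "<=": "ascending-horizontal", ">=": "descending-horizontal",
    }
    return names[rel(a, b) + rel(b, c)]
```

An arch-shaped phrase such as 3 * np.sin(np.linspace(0, np.pi, 50)) is classified as convex, since its middle frame's mean pitch exceeds those of the first and last frames by more than the tolerance.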

Figure 3.1: The nine contour categories with corresponding pitch sequences. The blue lines represent the average contour of the phrases in the Garland Encyclopedia of World Music with the particular contour type. The standard deviation is also shown. The black lines illustrate the corresponding pitch sequence of the pitch means of the first, second and third part of the phrase. This sequence determines the contour type for the phrase.


3.2.3 Clustering method

This project uses sklearn.cluster.KMeans (Pedregosa et al., 2011), which takes a few significant parameters, presented and explained in Table 3.1. The method is called nine times, to take Huron's (1996) nine contour categories into consideration and to observe the clustering for different numbers of clusters K. KMeans is fitted to the pitch contours of the Garland Encyclopedia of World Music, each consisting of 50 semitone-time points. Each centroid is thus a point in a 50-dimensional space, whose coordinates can be plotted as a contour. The centroid contours are observed for each value of K. Next, a quantitative analysis follows for nine clusters, to detect the contours inside the clusters and count the number of phrase contours of each type in a cluster. The result is compared with Huron's nine contour categories.
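With the Table 3.1 settings, the clustering loop can be sketched as follows (random contours stand in for the 485 phrase contours of the Garland Encyclopedia of World Music):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
contours = rng.normal(size=(40, 50))   # stand-in: (n_phrases, 50) semitone contours

centroid_contours = {}
for k in range(1, 10):                 # K = 1 .. 9, one run per value
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                max_iter=300, tol=1e-4, random_state=0).fit(contours)
    centroid_contours[k] = km.cluster_centers_  # (k, 50): each row is a contour
```

km.labels_ additionally gives the cluster assignment of each phrase, which is what the nine-cluster analysis below uses to count contour types per cluster.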

Table 3.1: The KMeans parameter settings. KMeans is performed on the Garland Encyclopedia of World Music with different values for n_clusters, namely in the range of 1 to 9. The rest of the chosen values are default settings.

Parameter     Explanation                                           Chosen value
n_clusters    number of clusters to form                            1-9
init          initialization method: how to select the initial      k-means++
              clusters
n_init        number of times to run the algorithm with different   10
              initial clusters
max_iter      maximum iterations for a single run of the algorithm  300
tol           tolerance value to declare convergence                0.0001
random_state  determines random number generation                   0


3.3 Results for the contour analysis

This paragraph presents the results for the analysis of the melodic contours of the 485 phrases of the Garland Encyclopedia of World Music and the 233 phrases of the MTC Phrase detection subset. The following three sections show the results for the average contour analysis, the results for the typological approach and the results for the KMeans clustering method respectively.

3.3.1 Results for the average contour analysis

In Figure 3.2 the average contours of the melodic phrases in the MTC Phrase detection subset and the Garland Encyclopedia of World Music are shown. Because the analysis is performed on the same audio recordings of the Garland Encyclopedia of World Music and with the same method, the average contour can be compared to the findings of Savage et al. (2017) (see Figure 3.3). Both plots clearly show the melodic arch. However, the plot by Savage et al. (2017) begins at a lower value and descends further at the end of the phrase. Also, the peak in the plot by Savage et al. (2017) has a different form than the peak of the Garland Encyclopedia of World Music contour in Figure 3.2. This difference could be explained by different settings of the pitch tracking algorithm. The findings are consistent with the melodic arch hypothesis.

The average contour plot for the MTC Phrase detection subset shows a significant dip in the middle part of the contour. As the mean pitches of the first, middle and last part are 0.60, 0.45 and 0.37 semitones, the contour is classified as a horizontal contour type, although the pitch means form a descending sequence. Thus, the average contour is inconsistent with the melodic arch hypothesis. Classifying this average contour into the nine types defined by Huron (1996) is therefore of limited value, because information about the form of the contour is lost.


Figure 3.2: The average contour plots of the MTC Phrase detection subset (A) and the Garland Encyclopedia of World Music (B). The standard deviation (0.25σ) is also shown. A shows a significant dip in the middle. When classifying this contour as one of the nine contour types defined by Huron (1996), it is defined as a horizontal contour. B is defined as an arch-shaped contour.

Figure 3.3: The average contour plot of the Garland Encyclopedia of World Music by Savage et al. (2017). This average contour shows a clear arch shape. When comparing this figure with the reproduced Garland Encyclopedia of World Music average contour in Figure 3.2, only a couple of differences can be noted: this plot begins at a lower value and the peak has a different form than the contour in Figure 3.2.


3.3.2 Results for the typological contour analysis

All phrases in the datasets were classified as one of the nine contour types defined by Huron (1996). Figure 3.4 shows the frequencies of the contour types for the Garland Encyclopedia of World Music and the MTC Phrase detection subset, with 0.2 and 1.0 semitones as tolerance values. In contrast to the distribution with a tolerance of 0.2 semitones, for which the descending, convex and concave contour types are dominant, the distribution with a tolerance of 1.0 semitone is more even.

With a tolerance parameter of 0.2 semitones, the most frequent contour types in the 485 phrases of the Garland Encyclopedia of World Music are the convex and descending types. With a tolerance parameter of 1.0 semitone, the frequencies of the contour types are more equally distributed; the descending, convex, horizontal and horizontal-descending contour types are the most frequent. The high occurrence of the horizontal type with a tolerance parameter of 1.0 semitone is not consistent with the melodic arch hypothesis.

For the 233 phrases in the MTC Phrase detection subset, with a tolerance parameter of 0.2 semitones, the concave, convex and descending contour types occur most frequently. When adjusting this parameter to 1.0, the concave, convex and horizontal types have the highest frequencies. In line with the melodic arch hypothesis, the ascending contour types (ascending, horizontal-ascending, ascending-horizontal) are among the least frequent contour types in the datasets, regardless of the tolerance parameter setting. On the other hand, the concave and horizontal types are frequent in the MTC Phrase detection subset, which is not consistent with the melodic arch hypothesis.


Figure 3.4: The frequencies of the contour types for tolerance ε = 0.2 and ε = 1.0. The descending, convex and concave contour types are predominant in the distribution with ε = 0.2. The plot is more evenly distributed with ε = 1.0. The highest bars in the plots correspond with the most frequent types in the MTC Phrase detection subset and the Garland Encyclopedia of World Music.

3.3.3 Results for the KMeans clustering method

KMeans was fitted on the phrase contours of the Garland Encyclopedia of World Music. The method was called nine times with a different value of K, in the range of one to nine. In Figure 3.5 the contours of the centroid coordinates are plotted in a 50-dimensional space for the different values of K.

For one cluster, a horizontal contour appears. For two clusters, slightly descending and ascending contours are shown. For three clusters, a horizontal contour is added for one of the centroids. In the plot for four clusters, the descending and ascending contours are more clearly visible, and an arched type is added. The plots for five and six clusters are alike, but not every cluster contour can be defined as one of Huron's (1996) nine types; some contours have forms that can be described as sinusoids. In the plots for seven and nine clusters, a contour appears that begins high and ends relatively low. These clusters contain only one contour each, so they can be seen as outliers. Lastly, for eight clusters, all contours mentioned above are shown.


Figure 3.5: Contour plots of the cluster coordinates of different values of K. The centroids that were created by the KMeans method are plotted in a 50-dimensional space. The contours that are shown cannot all be defined as one of the nine types defined by Huron (1996), although the descending, ascending and arched contours do appear.

Considering Huron's (1996) nine types, the contours were also clustered into nine clusters. Figure 3.6 shows the distribution of the contour types in each cluster, using a tolerance parameter of 1.0 for a more even distribution of the contour types. To be consistent with the types defined by Huron (1996), each cluster should contain only one type.


Apart from the horizontal type, which grouped in cluster 8, the contour types are distributed over the nine clusters. Thus, the nine centroids derived with unsupervised clustering do not correspond with the nine types of Huron (1996).

Figure 3.6: The distribution of contour types in the nine KMeans clusters. Each bar represents the number of contours with a certain contour type within the cluster. In clusters 6 and 7 the dominant contour types are the descending, horizontal-descending and descending-horizontal contours. The ascending and ascending-horizontal contours are clustered in cluster 5. These five contour types are roughly absent from the other clusters. The convex contours are primarily clustered in cluster 3, while occurring on a small scale in other clusters. The horizontal contours are clustered in cluster 8; although cluster 8 contains the largest number of contours, the horizontal type outweighs the other contour types present.


Chapter 4

Conclusion

This thesis has presented an automatic breath detection algorithm with the purpose of extracting melodic phrases, and a quantitative analysis to confirm and broaden the description of variation in pitch contours, specifically to examine the melodic arch hypothesis, one of the most important musical universals. In this thesis project, new annotated musical data was created and analysed, the average contour of the Garland Encyclopedia of World Music by Savage et al. (2017) was reproduced, and the quantitative analysis was extended by performing the typological approach on the data. Also, an automatic breath detection algorithm was explored with the purpose of collecting more non-Western samples of melodic phrases, and an unsupervised clustering algorithm was performed in order to compare the clusters to Huron's (1996) contour types.

Based on the findings, this chapter attempts to answer the two research questions: "How can melodic phrases be extracted from audio recordings of monophonic music?" and "Is the quantitative analysis of the extracted melodic contours consistent with the melodic arch hypothesis?". Firstly, the results for the automatic breath detection algorithm are discussed in order to answer the first research question. Secondly, the results for the quantitative contour analysis are discussed to respond to the second research question. Lastly, the findings are concluded and suggestions for future work are made.

4.1 Review of the automatic breath detection algorithm

To extract melodic phrases from the audio recordings of the Garland Encyclopedia of World Music and the MTC Phrase detection subset, an automatic breath detection algorithm was used. A Praat algorithm was chosen as a baseline, and the Decision Tree and Random Forest as classifiers. The overall precision and recall rates on the MTC Phrase detection subset were 52.4% and 98.9% respectively using the Decision Tree classifier, and 59.2% and 98.7% using the Random Forest classifier. Using relatively simple features, intensity and duration, the Praat algorithm was able to detect breaths in the MTC Phrase detection subset with an F-score of 64.7%. By adding extra features, like the MFCC matrix and pitch, the classifier found 98.7% of the annotated breaths in the MTC Phrase detection subset, and 59.2% of the predicted breaths were correct.

The resulting breath and phrase interval approximations make it possible to extract melodic phrases from audio recordings, but the phrases need to be checked due to the relatively low precision rate. Phrases could be interrupted by incorrectly predicted breath pauses, so an inspection of phrase length could be useful.

Evaluating the algorithm on a subset of the Garland Encyclopedia of World Music resulted in an increased precision rate, namely 80.0% using the Random Forest classifier, but an overall lower F-score of 57.1%. The Praat algorithm on this dataset, however, achieved an F-score of 66.7%. The use of the classifier with MFCC features and pitch values led to an unexpected decrease in the F-score when applied to the Garland Encyclopedia of World Music. A possible cause for this effect is that Savage et al. (2017) constructed their annotated dataset to extract phrase boundaries. Thus, the breath pauses themselves were not annotated; the assumption was made that breaths would occur in between annotated phrases. The lower performance of the automatic breath detection on the Garland Encyclopedia of World Music compared to the MTC Phrase detection subset could be explained by the fact that not all phrases were allocated between breaths in the annotations of Savage et al. (2017).

Training and testing the algorithm on different datasets led to significantly lower results. This shows that different musical traditions cannot be generalized over when constructing algorithms or performing quantitative analyses. This fact emphasizes the importance of research into different musical traditions.

Furthermore, the automatic breath detection in this thesis was performed on monophonic vocal music. The purpose of the detection algorithm was to extract melodic phrases from audio recordings, which often contain background or instrumental music, polyphonic sections or speech. To use the algorithm to create more samples for cross-cultural research in computational musicology, it should be able to handle a variety of audio recordings.


4.2 Review of the contour analysis

The quantitative analysis of the melodic contours of the Garland Encyclopedia of World Music and the MTC Phrase detection subset was twofold. For the average contour analysis, the findings of Savage et al. (2017) were reproduced. The average contour plot of the phrases in the Garland Encyclopedia of World Music shows a clear arch or convex shape. Surprisingly, the average contour of the MTC Phrase detection subset has a horizontal shape instead of an arch, when defined as one of Huron's contour types. However, the mean pitches of the three equally divided parts of the average contour (0.60, 0.45 and 0.37 semitones) form a descending sequence. It appears that Huron's typology is not sufficient for the classification of the derived average contour.

The difference between the average contour of the Garland Encyclopedia of World Music and the MTC Phrase detection subset could be explained by a different annotation strategy. As previously stated, Savage et al. (2017) focused on annotating phrase intervals, while in this thesis the focus lay on annotating breath pauses without anticipating melodic structure. The annotations may therefore have been derived differently.

Next, the typological contour analysis was performed. All phrases in the Garland Encyclopedia of World Music and the MTC Phrase detection subset were classified as one of the nine contour types of Huron (1996). In line with the melodic arch hypothesis, the convex and descending contour types are among the most frequent types in both the Garland Encyclopedia of World Music and the MTC Phrase detection subset. However, the horizontal and concave contour types also occur frequently, which is not in line with the melodic arch hypothesis.

The tolerance parameter plays a significant part in the distribution of the contour types. The distribution is considerably more uniform for a tolerance of 1.0 semitone, in contrast to the dominance of the descending, convex and concave contour types at 0.2 semitones. This difference in distribution is noteworthy and worth investigating in future research.
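Huron's typology compares the first pitch, the mean of the interior pitches, and the last pitch of a phrase; the tolerance decides when a pitch difference counts as level. A minimal sketch of such a classifier (the function name and type labels are ours, not necessarily Huron's exact terminology):

```python
import numpy as np

def huron_type(contour, tolerance=0.2):
    """Classify a pitch contour (semitones, at least 3 points) into one of
    nine Huron-style types by comparing first pitch, mean of the interior
    pitches, and last pitch; differences within `tolerance` count as level."""
    c = np.asarray(contour, dtype=float)
    first, middle, last = c[0], c[1:-1].mean(), c[-1]

    def step(a, b):
        if b - a > tolerance:
            return "up"
        if a - b > tolerance:
            return "down"
        return "level"

    names = {
        ("up", "down"): "convex",          ("down", "up"): "concave",
        ("up", "up"): "ascending",         ("down", "down"): "descending",
        ("level", "level"): "horizontal",
        ("up", "level"): "ascending-horizontal",
        ("level", "down"): "horizontal-descending",
        ("down", "level"): "descending-horizontal",
        ("level", "up"): "horizontal-ascending",
    }
    return names[(step(first, middle), step(middle, last))]

print(huron_type([0.0, 2.0, 2.5, 2.0, 0.0]))                 # → convex
print(huron_type([0.0, 2.0, 2.5, 2.0, 0.0], tolerance=5.0))  # → horizontal
```

The second call shows the effect discussed above: with a large enough tolerance, even a clearly arched contour is absorbed into the horizontal type, which is why the frequency distribution of types depends so strongly on this parameter.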

Additionally, KMeans clustering was used to cluster the pitch contours of the Garland Encyclopedia of World Music and plot the cluster centres. While some cluster contours can be described as horizontal, descending, ascending or convex, many are more complex. Furthermore, after performing KMeans with nine clusters, the frequency of the contour types within each cluster was explored. Some clustering of contour types was detectable, such as a cluster of horizontal contours, but the contour types were spread over the different clusters; no cluster was restricted to a single contour type.
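The clustering step can be sketched with scikit-learn on synthetic stand-in data (the synthetic arch and descent shapes below are our assumption, used only to make the example self-contained; the real input consists of contours resampled to a fixed length so that each one is a point in the same feature space):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-ins for resampled pitch contours: 50 noisy arches
# and 50 noisy descents, each sampled at 20 fixed points.
n_points = 20
x = np.linspace(0.0, 1.0, n_points)
arches = np.sin(np.pi * x) + 0.1 * rng.standard_normal((50, n_points))
descents = (1.0 - x) + 0.1 * rng.standard_normal((50, n_points))
contours = np.vstack([arches, descents])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(contours)

# The cluster centres are themselves average contours: they can be
# plotted, or classified with the same typology as the raw phrases.
centres = km.cluster_centers_
print(centres.shape)  # → (2, 20)
```

In the thesis the same procedure was run with nine clusters, so that the learned centres could be compared directly against Huron's nine contour types.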

The unsupervised clustering shows that the contours derived from the Garland Encyclopedia of World Music tend to cluster into types other than the nine contour types defined by Huron (1996). This result offers potential for future work.


4.3 Future work

In this thesis we were able to construct an algorithm that meets the predetermined requirements and extracts melodic phrases from audio recordings. By performing a quantitative contour analysis on the phrases from the two datasets, some of the predictions of the melodic arch hypothesis were confirmed. As predicted, the descending and arch-shaped types are among the most frequent types in the datasets, and for the Garland Encyclopedia of World Music the average contour shows a clear arch shape. However, the average contour of the phrases in the MTC Phrase detection subset could only be classified as horizontal.

Thus, the results obtained in this thesis are partly consistent with the hypothesis of the predominance of arched and descending contours in vocal music, but they do not rule out other hypotheses. The Garland Encyclopedia of World Music is highly diverse, but relatively small; the MTC Phrase detection subset contains an even smaller number of phrases. The conclusions should therefore be tested on more cross-cultural data. The method proposed in this thesis to automatically generate more samples from audio recordings using Artificial Intelligence techniques should be expanded, for example with more features or a different classification method. Further research into automatic breath detection will allow cross-cultural research in the field of computational musicology to proceed and create a better understanding of musical universals.

Other candidates for future work include research into the tolerance parameter, which affected the frequency distribution of the nine contour types in the data, and further research into the clustering of the melodic contours explored in this thesis, to discover the extent to which the melodic contours form clusters and whether this could lead to assigning contour types.


References

Boersma, P., & Weenink, D. (2018). Praat: doing phonetics by computer [Computer program]. Retrieved from http://www.praat.org/

Cornelissen, B., Zuidema, W. H., & Burgoyne, J. A. (n.d.). The melodic arch revisited. (unpublished)

De Jong, N. H., & Wempe, T. (2009). Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods, 41(2), 385–390. Retrieved 2020-05-29, from http://link.springer.com/10.3758/BRM.41.2.385 doi: 10/dwtv2n

Huron, D. (1996). The melodic arch in Western folksongs. Computing in musicology, 10, 3–23.

Jadoul, Y., Thompson, B., & de Boer, B. (2018). Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics, 71 , 1–15. doi: https://doi.org/10.1016/j.wocn.2018.07.001

Jurafsky, D., & Martin, J. H. (2008). Speech and language processing: An introduction to speech recognition, computational linguistics and natural language processing. Upper Saddle River, NJ: Prentice Hall.

Lavner, Y., & Ruinskiy, D. (2009). A Decision-Tree-Based Algorithm for Speech/Music Classification and Segmentation. EURASIP Journal on Audio, Speech, and Music Processing, 1–14. Retrieved 2020-06-05, from http://asmp.eurasipjournals.com/content/2009/1/239892 doi: 10/d8kk6d

Logan, B. (2000). Mel frequency cepstral coefficients for music modeling. In Ismir (Vol. 270, pp. 1–11).

McKinney, W. (2010). Data structures for statistical computing in python. In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th python in science conference (pp. 51–56).

Mehr, S. A., Singh, M., Knox, D., Ketter, D. M., Pickens-Jones, D., Atwood, S., . . . Glowacki, L. (2019). Universality and diversity in human song. Science, 366 (6468). Retrieved 2020-06-17, from https://science.sciencemag.org/content/366/6468/eaax0868 doi: 10/ggdvjp


Nakano, T., Ogata, J., Goto, M., & Hiraga, Y. (2008). Analysis and Automatic Detection of Breath Sounds in Unaccompanied Singing Voice. Proc. of ICMPC, 387–390.

Nettl, B., Stone, R. M., Porter, J., & Rice, T. (1998). The Garland encyclopedia of world music. OKS Print.

Ostendorf, M., Price, P. J., Bear, J., & Wightman, C. (1990). The Use of Relative Duration in Syntactic Disambiguation. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley. Retrieved 2020-05-21, from https://www.aclweb.org/anthology/H90-1006 doi: 10/ftwx6c

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Ruinskiy, D., & Lavner, Y. (2007). An Effective Algorithm for Automatic Detection and Exact Demarcation of Breath Sounds in Speech and Song Signals. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 838–850. doi: 10/cc59tx

Savage, P. E., Tierney, A. T., & Patel, A. D. (2017). Global Music Recordings Support the Motor Constraint Hypothesis for Human and Avian Song Contour. Music Perception: An Interdisciplinary Journal, 34(3), 327–334. doi: 10/ggmd4f

Selfridge-Field, E. (1995). Essen musical data package. Center for Computer Assisted Research in the Humanities, Stanford University.

Shashirekha, H. L. (2014). Using MFCC Features for the Classification of Monophonic Music. International Conference on Information and Communication Technologies, 975.

Tierney, A. T., Russo, F. A., & Patel, A. D. (2011). The motor origins of human and avian song structure. Proceedings of the National Academy of Sciences, 108(37), 15510–15515. Retrieved 2020-02-17, from http://www.pnas.org/cgi/doi/10.1073/pnas.1103882108 doi: 10/b65mxj

Turnbull, D., & Elkan, C. (2005). Fast recognition of musical genres using RBF networks. IEEE Transactions on Knowledge and Data Engineering, 17 (4), 580–584. doi: 10/bwdc53

Van Kranenburg, P., & De Bruin, M. (2019). The Meertens Tune Collections: MTC-FS-INST 2.0. Meertens Online Reports. Amsterdam: Meertens Instituut.

Wightman, C., & Ostendorf, M. (1991). Automatic recognition of prosodic phrases. In [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing (Vol. 1, pp. 321–324). doi: 10/fsc8rg
