
Features for harmonic cadence recognition in music of the Classical period

Arianne N. van Nieuwenhuijsen 10618155

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor

dr. J.A. (Ashley) Burgoyne
Institute for Logic, Language & Computation (ILLC)
University of Amsterdam
Science Park 107
1098 XG Amsterdam


Summary

Musical phrases can easily be heard when listening to music. In music of the Classical period, certain chord successions are commonly used to mark the closures of these phrases. These are called cadences and can be seen as punctuation marks in music. Recognising cadences can support research in the field of Music Information Retrieval (MIR), such as segmentation, which makes automatic cadence recognition worth pursuing. Because cadences depend on sequential information, the program IDyOM (Information Dynamics of Music) is used. IDyOM uses multiple viewpoint context models to capture musical sequences and to predict upcoming musical events. In this project, IDyOM was used to predict, and thus recognise, cadences in the bass lines of the Kostka-Payne corpus. This resulted in an overview of the features available in IDyOM that express cadences best. The results show that cadences are multidimensional, but also that IDyOM is not ideal for the classification of musical events. Finally, the musical meaning of the viewpoints and the detected cadences is discussed.


Contents

1 Introduction 4
2 Related work 6
3 Method 8
3.1 Data 8
3.2 IDyOM 11
3.3 IDyOM processing 12
3.4 Evaluation 15
4 Results 18
4.1 Viewpoint selection 18
4.2 Single linked viewpoints 19
4.3 Information Content (IC) 19
4.4 Multiple viewpoints 21
4.5 Found cadences 25
5 Discussion and Conclusion 26
5.1 Evaluation of results 26
5.2 Musical meaning 27
5.2.1 Evaluation of best viewpoints 27
5.2.2 Evaluation of recognised cadences 28
5.3 Limitations and future work 29
5.4 Conclusion 31
Bibliography 32
Musical Terms 34
Acronyms 34
A Cadences and function analysis 35
A.1 Function analysis 35
A.2 Authentic Cadence 36
A.3 Half cadence 36
B Composers in the Kostka-Payne corpus 37

1 Introduction

Many theories have emerged to conceptualise and understand music's many forms. The music of the common practice period (and especially the Classical period) has given rise to musical theories that depend heavily on harmonic analysis. In these analyses, the functions of chords within the music are defined in terms of scale degrees. Some successions of scale degrees are commonly used to mark the closures of musical segments, indicating internal boundaries or the ending of a musical piece (Weihs et al., 2016; Temperley, 2004).

These successions are called cadences and can be seen as punctuation marks within musical phrases that facilitate the recognition of structure in pieces of music. Cadences can therefore contribute to the segmentation of musical pieces, as they define the closure of phrases and thus provide a clear structure in music (Weihs et al., 2016; Pauwels et al., 2013; Van Kranenburg and Karsdorp, 2014). The structure indicated by cadences is in accordance with how humans comprehend musical pieces.

An example of how cadences can be used is the study by Pauwels et al. (2013), who used the probability of chord pairs to calculate the likelihood of having reached the end of a segment. This is based on the idea that musical sentences end in cadences, which are described by only a few chord progressions. Temperley (2004) also mentions the importance of cadences for a sense of closure and how they can aid in finding the key of segments. The key can be found because the most frequently used cadence (the authentic cadence, see Appendix A.2 on page 36) ends on the tonic of the key (Weihs et al., 2016; Temperley, 2004), making cadences a good indicator of the key of musical phrases. Segmentation, chord finding and key finding are major subjects within the field of Music Information Retrieval (MIR) (Weihs et al., 2016; Van Kranenburg and Karsdorp, 2014; Burgoyne and Downie, 2016), so being able to identify cadences within Classical music is worthwhile for these objectives. Likewise, being able to recognise cadences automatically can help in understanding how humans process musical information and gives insight into whether an artificial system is capable of detecting this feature in music.

The expectation is that when the appropriate features are used for the machine learning, the system is likely to be able to recognise cadences. Humans can recognise cadences in music notation, which suggests that cadences should be detectable automatically once the features that describe them are chosen. The difficulty, however, lies in finding features that can reasonably define cadences, while keeping the number of features low enough to make cadence recognition computationally feasible. The research question that arises is therefore: what features are best for describing harmonic cadences in symbolic music of the Classical period for automatic cadence recognition?

As mentioned in the research question, the focus will lie on the automatic detection of cadences, which means that the recognition will be carried out by a machine learning algorithm. Machine learning algorithms "learn" to perform a task without being explicitly programmed for it. In this case, the machine learning algorithm should be capable of learning to recognise cadences from the given data. This data will be music of the common practice period; in this project only symbolic music data will be used. The use of audio data is not feasible, as a complete and correct transcription of the audio would be needed to interpret the cadences correctly. This cannot yet be done automatically, and a manual transcription would take too much time for this project. The symbolic music data will yield the features that the machine learning algorithm can use to learn to recognise cadences.

The following sections first discuss related research concerning cadence recognition in Section 2. Thereafter, Section 3 on page 8 covers the corpus that was used as data set and how it was processed. That section also describes the program used for the recognition of cadences and how this makes it possible to test different musical features. This is followed by the results in Section 4 on page 18. Lastly, the conclusion and discussion that follow from the results are presented in Section 5 on page 26. Please refer to the glossary on page 34 if a term is unclear.

2 Related work

Section 1 indicated the relevance of cadence recognition. However, cadence recognition is not a very prominent research topic in the field of MIR and can mostly be found as a component of MIR research on other topics. An example is the research by Pauwels et al. (2013), who use a form of cadence detection for the structural segmentation of music. They combine a system that looks for peaks in the novelty observations of the cosine similarity matrix with a system that is based on harmonic analyses. The latter system uses the probabilities of chord pairs for the recognition of structural boundaries, based on the idea that these boundaries are implausible places to experiment with new chord combinations. This is in accordance with the existence of cadences in common practice music, as these limit the possible chord combinations with which a musical sentence can end.

Another study that uses chord functions tries to substitute chords with other chords that have a similar function within the music, in order to create new chord progressions (Eppe et al., 2015). One of the substitutions is the blending of cadences, where for example the descending bass line of the Phrygian cadence is used to create a similar chord progression without using the same chords. However, this paper does not mention the recognition of these cadences, which makes cadence detection an interesting feature to add, as new music could be created by detecting and blending cadences.

The studies above show how harmonic analyses (and thus cadences) can contribute to different objectives. Likewise, Temperley (2004) mentions that a harmonic analyser could also profit from the capability of cadence recognition. An example of such an analyser is the Melisma music analyzer (Sleator and Temperley, 2001), which performs the analysis of different musical structures. This program is capable of performing adequate harmonic analyses, but the results are not optimal. The addition of cadence recognition could improve its performance, while the harmonic analyses that the program produces could help cadence identification: the analyses define the functions of the chords in the musical pieces, which can be used as a feature for cadence recognition. Another example of automatic harmonic analysis is discussed by Barthélemy and Bonardi (2001), who presented a method for analysis based on the figured bass that is extracted from a musical piece. This research shows how harmonic analysis can be done after the removal of ornamental notes and the extraction of the figured bass, which can help to simplify the music for the recognition of cadences.

The only research on cadence recognition that could be found at the time of writing aimed to recognise cadences within traditional Western stanzaic songs (Van Kranenburg and Karsdorp, 2014). Their aim was to detect cadences in Dutch oral folk songs, but also to study the importance of different features and parameters for cadence detection. The recognition was based on melodic and lyrical formulas that often indicate the closure of a musical segment. The recognition was performed on symbolic musical data in **kern format with available phrase endings. The **kern format is a representation for symbolic music made especially for music of the common practice period. To make the data suitable for detection, trigrams of successive pitches were created, which were labelled as either cadential or non-cadential. A number of features were then extracted to describe the melodic and textual properties, and these were afterwards used in a Random Forest classifier. However, as each prediction was made independently, all sequential information was lost. To check the importance of the sequential information, Van Kranenburg and Karsdorp (2014) ran another trial with a data set that took neighbouring trigrams into consideration, which improved the precision of the program. However, even with a high precision, the recall remained moderate. They concluded that cadences are multidimensional and sequence-based, which means that cadence recognition requires a broad spectrum of features to describe the cadences, but also requires a data form that preserves the sequential information contained in music.
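As a rough sketch of the kind of classifier described above (not Van Kranenburg and Karsdorp's actual implementation), the trigram approach could look as follows in Python, assuming the melodic and textual features per trigram have already been extracted; the feature matrix below is a random stand-in for those features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data: one feature row per pitch trigram (e.g. three pitches plus
# three derived features), labelled 1 = cadential, 0 = non-cadential.
rng = np.random.default_rng(0)
X = rng.integers(40, 80, size=(200, 6))
y = rng.integers(0, 2, size=200)

# Each trigram is classified on its own, so no sequential information is used.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))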

All in all, cadence recognition can contribute to several other objectives in the field of MIR and could likewise be improved. The research that has been done shows that a harmonic analysis can be carried out with just the figured bass and with the elimination of ornamental notes (Barthélemy and Bonardi, 2001). This insight can be used to simplify the data and facilitate the recognition of cadences in this project. The only other research on cadence recognition stresses the importance of sequential information (Van Kranenburg and Karsdorp, 2014), which is important for choosing the data format and the program to process the data. That research also mentions that cadences are multidimensional, which indicates that at least several features are needed to describe cadences.

3 Method

The first task of the project is to get usable data of music from the Classical period. To address this musical period, the Kostka-Payne corpus was used. This corpus is based on the assignments that are given in the music theory book by Kostka and Payne (1995). In this corpus, 46 excerpts of music from different composers from the Classical period are given with solutions for a harmonic function analysis. As several composers are represented in this corpus and as cadences are part of the theory that is to be learnt, there is a higher possibility of finding enough cadences within this small corpus while still representing the period. A full list of the composers in the Kostka-Payne corpus can be found in Appendix B on page 37.

The next task is to find a way of detecting cadences with a machine learning algorithm. To reach this goal, the program IDyOM (Information Dynamics of Music; https://code.soundsoftware.ac.uk/projects/idyom-project) will be used (Pearce, 2005). IDyOM is a sequence-based framework that should be able to capture how a musical piece progresses over time. How a musical piece develops over time is a crucial feature for the human recognition of cadences, and this sequential information is also of great importance for the automatic detection of cadences, as mentioned in Section 2 (Van Kranenburg and Karsdorp, 2014). This is, however, difficult to capture with n-grams, because small n-grams may not contain enough information to capture cadences, while larger n-grams will have counts that are too sparse given the small data set that is used. These problems are circumvented by using IDyOM for the recognition of cadences. A more detailed description of how IDyOM works is given in Section 3.2 on page 11. Thereafter, Section 3.3 on page 12 discusses how IDyOM was used for this research.

The last task will be to compare the performance of the models created with different features and to evaluate the results of these models. How the performance can be compared will be discussed in Section 3.4 on page 15.

3.1 Data

The original Kostka-Payne corpus is not digital, and the versions that can be found on the Internet have been adapted by other researchers. The adaptation that is most suitable for this project (http://ccm.web.auth.gr/datasetdescription.html) contains the original transcripts of the music, the manually annotated harmonic reduction, the tonality and local key changes per phrase, and a grouping structure to be able to interpret various levels of the musical piece. The harmonic reductions are the most usable option for cadence recognition, as these reductions contain the original excerpts without the ornamental notes that do not have any harmonic function (Barthélemy and Bonardi, 2001). As extra information is omitted, it becomes easier to process the music, while the remaining notes all have a harmonic meaning. The corpus has 46 musical excerpts in MusicXML format and does not include the function analyses of the original corpus.

The corpus also does not contain annotated cadences; these therefore had to be annotated manually. Appendix C on page 38 shows a table containing all the annotated cadences in the corpus for this project.

The first processing step extracts from the MusicXML files only the parts that contain the harmonic reductions. Within the MusicXML files, each staff is a part that can be viewed separately, and the last two staffs of each file contain the harmonic reduction in piano notation. These two staffs were exported together to new MusicXML files with MuseScore (https://musescore.com/), a viewer and editor for music notation. Several attempts were made to export the relevant parts with a Python script. However, these parts could not be extracted with music-related Python packages without transforming the files to fit the two relevant parts into one staff. To be certain that no information would be lost, the option to export the harmonic reductions with MuseScore was chosen.

Thereafter, the parts with the harmonic reductions were processed with the Python package music21 (Cuthbert and Ariza, 2010; http://web.mit.edu/music21/). This package enables the processing of symbolic musical data with Python, while also making it possible to view the data in musical notation with MuseScore. To reduce the data even further, music21 was used to retrieve only the melody and bass line. The melody and bass line are the voices that best define the listener's perception of the music and contain the most crucial notes for the recognition of harmonic structure (Barthélemy and Bonardi, 2001).

The extraction was done by first checking whether the excerpt already contained labels for the different voices, where the voices combine to create the chords. If the excerpt contained voices, the highest and the lowest voice were extracted, representing the melody and the bass respectively. If the file did not contain any labels for voices, the script parsed through each chord and extracted the lowest and highest note to obtain the needed lines. This extraction, however, relies on the assumption that the melody and bass line sound in every chord and are not, for example, implied by the highest or lowest tone heard in a previous chord. These reductions also ignored repetitions, dynamic markings, and other additional information such as fermatas and tempo markings. However, even after checking whether voices were available in the files, there remained several files where the voices were annotated rather inaccurately, which made the script return only a few notes instead of full melodies and bass lines. To still be able to extract the melody and bass line, the following files were manually adapted by changing the annotation of the voices to ensure a correct extraction: 5, 15, 19, 21, 22, 24, 26, 27, 28, 29, 32, 38, 39, 41, 43, 44, 46. Even after the manual adaptation, the extraction of file 22 still did not work correctly, and this file was therefore excluded from the data set. The last problem that arose during extraction was that anacruses could not be preserved. When an excerpt started with an anacrusis, the new files therefore have a shifted placement of chords within measures. This may eventually affect the recognition of cadences, because chords no longer fall on the downbeats of measures. Afterwards, the melody and bass line of each excerpt were written to new MusicXML files.
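To illustrate the core of this extraction step, a simplified music21 sketch is given below. This is not the exact script used in the project: the file name is hypothetical, and the handling of voice labels, ties, and the manual corrections described above is omitted.

from music21 import converter, stream, note, chord

# Load a harmonic-reduction file (hypothetical name) and walk through its notes and chords.
score = converter.parse('ex01_reduction.musicxml')
melody = stream.Part()
bass = stream.Part()

for el in score.flatten().notes:
    if isinstance(el, chord.Chord):
        pitches = sorted(el.pitches, key=lambda p: p.midi)
        low, high = pitches[0], pitches[-1]
    else:
        # a single note is assumed to sound in both lines
        low = high = el.pitch
    # keep the original duration for both extracted lines
    bass.append(note.Note(low, quarterLength=el.quarterLength))
    melody.append(note.Note(high, quarterLength=el.quarterLength))

melody.write('musicxml', 'ex01_melody.musicxml')
bass.write('musicxml', 'ex01_bass.musicxml')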

(a) Original transcript of “Jesu, der du meine Seele”, BWV 78 by J.S. Bach.

(b) Harmonic reduction without ornamental notes.

(c) Extracted melody and bass line.

Figure 1: Example of data processing.

Lastly, the files containing only the melody and bass line were exported to **kern with Humdrum. The **kern representation is another format for symbolic musical data (as mentioned in Section 2 on page 6), which has the additional option of annotating "phrases" in music with curly brackets. This notation for phrases in **kern can also be interpreted by IDyOM, the program that will be used for the machine learning. Therefore, the phrase notation is used to annotate cadences by marking the end of a phrase at the last note of each cadence that can be heard when listening to the original music transcription (see Appendix C on page 38 for a list of all manually annotated cadences in the data set).

An example of how the data is processed can be seen in Figure 1. Figure 1a shows the original transcript of an excerpt as it appears in the Kostka-Payne corpus, while Figure 1b shows the harmonic reduction as it was delivered in the adapted corpus. In Figure 1c the extracted melody and bass line can be seen, without the notation for repetition, the fermata and the tempo marking. This last figure serves as an example of the data both before and after the export to **kern and the annotation of cadences, as the difference in file format and the addition of phrases in the **kern files are not visible in music notation.
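To make the phrase notation concrete, a small invented **kern fragment of a bass line is shown below; the opening curly bracket marks the first note of a phrase and the closing bracket its last note, which is how the cadence endings were annotated here (the pitches and metre are purely illustrative).

**kern
*clefF4
*M4/4
{4C
4G
4E
2C}
*-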


3.2 IDyOM

The goal of this project is to recognise cadences within musical pieces, for which the program IDyOM by Pearce (2005) will be used, as mentioned in Section 3 on page 8. IDyOM uses multiple viewpoint models for the prediction of musical sequences. This makes it possible to use the observation of Van Kranenburg and Karsdorp (2014) that sequential information is essential for the recognition of cadences. IDyOM extends the multiple viewpoint systems by Conklin and Witten (1995), whose goal was to evaluate musical theories by the prediction and generation of music with machine learning. To understand how IDyOM works, the following paragraphs set out the research by Conklin and Witten.

Conklin and Witten (1995) aimed to create a program that constructs a generative music theory through the analysis of existing musical pieces. They assumed that musical theories can be evaluated by the predictions that they make and that their generative ability to create new, acceptable works is based on their predictive power. Predictive power can be measured in terms of entropy, as high information content occurs at musical events where the expectations of the listener do not align with the actual events in the music. These points of high information are places where the music sounds unstable and are interesting for the identification of important structural events. To be able to recognise such events, Conklin and Witten (1995) aimed to create a program capable of forming a predictive theory for the creation of expectations. In this setting, the predictability of events is expressed in entropy, defined by the number of bits needed to capture the musical information with the theory. Unexpected events carry high information content, which can be seen as peaks within the entropy distribution of a musical piece.

The program's performance is improved with context models, a form of machine learning (Conklin and Witten, 1995). Context models consist of a database of sequences containing events, frequency counts for sequences of different lengths, and an inference method which computes the probability of events in a certain context. However, context models encounter the same problem that n-grams have with sequences, as mentioned in Section 3: low-order models are too general, while higher-order models are too specialised to capture the concept. These problems are circumvented by introducing induction into the context models. This makes the models incremental: each new event that is encountered is added to the system and further specialises the model.

The context used to compute the probability of events comes from short-term and long-term models (Conklin and Witten, 1995). The Long Term Model (LTM) is driven by structure and statistics derived from a large corpus of similar data, which can be seen as the training data. The Short Term Model (STM), however, is governed by the individual structure and statistics of the musical piece that is being predicted, thus by all preceding events in the sequence. The STM is dynamic, as it is tailored to the current sequence and is discarded after the prediction of the sequence. The weighted combination of the predictions of the STM and LTM results in the final prediction for each event in the current sequence. The weighting works by giving viewpoints that are very uncertain about the outcome a lower weight.
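Schematically, this combination can be written as a weighted mixture of the two models' distributions. The formula below is only a sketch of the general idea described here, not IDyOM's exact weighting scheme:

p(e \mid \text{context}) = \frac{w_{\mathrm{LTM}}\, p_{\mathrm{LTM}}(e \mid \text{context}) + w_{\mathrm{STM}}\, p_{\mathrm{STM}}(e \mid \text{context})}{w_{\mathrm{LTM}} + w_{\mathrm{STM}}}

where the weight of a model is larger when that model is more certain (i.e. has lower entropy) about the outcome.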

However, due to the internal structure of musical events, context models were previously unsatisfactory. A musical event is described by several attributes, such as pitch, duration, and start time, which a single context model cannot capture. To tackle this problem, Conklin and Witten (1995) suggest the use of multiple viewpoint systems. A viewpoint models a type, which is an abstract property of an event (e.g. pitch, or the melodic interval with its predecessor). A type is modelled by a function that maps the events in a sequence to that type, after which a context model for the type is created. Using several viewpoints together creates a multiple viewpoint system: a musical sequence can then be seen as multiple sequences, one for each type in the system. This also means that the prediction for each event is based on many viewpoints and thus many context models. These multiple viewpoint systems are adaptive, as the model changes while the piece that is being examined progresses over time, but they also look back at what happened beforehand within a musical piece to extract a context.

As mentioned, IDyOM uses these multiple viewpoint models, but differs somewhat from the models by Conklin and Witten (1995). The most important difference is the output that IDyOM generates (Pearce, 2005). IDyOM's most important output is called Information Content (IC), which describes the predictive power of IDyOM: the IC is calculated as the negative base-2 logarithm of the probability assigned to the actual event. This differs from the entropy that IDyOM also outputs, which depicts how certain IDyOM is about its predictions in terms of the number of bits needed to describe the events. The IDyOM documentation also mentions the use of multiple voices and harmonic interpretations to make the models more musically meaningful. This makes IDyOM a good option for cadence recognition, as it is capable of performing a form of machine learning while keeping the sequential information of multiple voices intact.
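Written out, for an event e_i with preceding context e_1, ..., e_{i-1}, the information content is

\mathrm{IC}(e_i) = -\log_2 p(e_i \mid e_1, \ldots, e_{i-1})

while the entropy is the expected information content over all possible continuations at that point,

H = -\sum_{e} p(e \mid e_1, \ldots, e_{i-1}) \log_2 p(e \mid e_1, \ldots, e_{i-1}).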

3.3 IDyOM processing

For this project, IDyOM was used to try to recognise cadences in the Kostka-Payne corpus (see Section 3 on page 8). Firstly, the pre-processed data which only contained the melody and bass line was read into IDyOM. Time signatures that have specific symbols which are depicted as *met(C) in **kern were removed as these were not recognised by IDyOM. After reading the corpus in, the interpretation of the data set by IDyOM is shown in Figure 2 on the next page. This figure shows the different basic viewpoints that IDyOM has extracted when reading in the data. The values indicate which appearances can be found in the whole data set for each viewpoint. For example, the viewpoint cpitch has 49 different appearances in the test set, which range from 30 up to 86.

Figure 2: Depiction by IDyOM of the corpus with melodies and bass lines.

As can be seen, some viewpoints in the data set have only one value or no values (NIL) at all in IDyOM. These were therefore ignored in the feature selection for cadence recognition. It is also remarkable that the viewpoint voice has only one value. This shows that IDyOM did not interpret the melody and bass line at the same time, even though they were defined as different voices within the **kern files. The IDyOM documentation indicates that it should be possible to interpret several voices synchronously, as mentioned in Section 3.2. Several attempts were made to get IDyOM to interpret multiple voices, but after all these attempts failed, it was decided to use only the bass line. The bass line in common practice music is the most important voice for the interpretation of harmony (Barthélemy and Bonardi, 2001). The depiction of the data set above also shows that even when only one voice is perceived by IDyOM, the pitches include those of both the bass line and the melody. Therefore, the data set was processed anew, this time extracting only the bass lines, which resulted in the depiction by IDyOM in Figure 3.

Thereafter, this data set was used in IDyOM for the prediction of the viewpoint phrase. This was done because cadences were annotated by marking them in the **kern files as the endings of phrases (see Section 3.1 on page 8). When IDyOM reads these files, it stores the positions of the cadences in the viewpoint phrase. In this viewpoint, cadences are indicated by "-1", while the beginning and the middle of a phrase are denoted by "1" and "0" respectively (see the appearances of phrase in Figure 3).

Figure 3: Depiction by IDyOM of the corpus with only bass lines.

IDyOM is made for the prediction of viewpoints. It predicts by computing probabilities for each possible outcome a target viewpoint can have. Thus in the case of phrase, the probability for an event being either “-1”, “0”, or “1” will be computed. For the calculation of the probability, IDyOM uses source viewpoints. Source viewpoints are the viewpoints with which IDyOM creates models to predict the target viewpoint. In other words, the source viewpoints can be seen as the features that IDyOM uses for its prediction of the target viewpoint. By looking back on what has occurred beforehand in the current music sequence with the STM and by using the LTM which has been made by learning from the training data, the probability per event can be calculated. The appearance of the target viewpoint with the highest probability is IDyOM’s final prediction.

The code to run IDyOM for prediction requires at least the following parameters: the data set that is to be used, the target viewpoints for which the upcoming events are predicted, and the source viewpoints which are used to predict the target viewpoints. Several other settings can be modified; the full documentation of all parameters can be found on the IDyOM website (https://code.soundsoftware.ac.uk/projects/idyom-project/wiki/Idyom). The settings used that are important for the machine learning background and the interpretation of the results are:

• both+ - a short term model that is trained only on the current sequence is used alongside a long term model that is trained on the training set, with incremental training on the test set.

• k = 10 - the number of resampling folds for cross-validation. This value of k causes the data set to be divided into ten sets; a model is trained ten times on nine of these sets, with the remaining set used as test set. This makes it possible to use small amounts of data effectively.

• output-path - with this option, IDyOM creates a data file which can afterwards be used for the evaluation of its performance. The interpretation of these data files is discussed in Section 3.4.

• detail = 3 - determines how the information content is averaged in the output. With this value, the IC is averaged over the entire data set and over each composition, and the raw IC for every single event in the data set is given.

To check which viewpoints are best for prediction, all viewpoints are first evaluated on their performance separately. However, not every viewpoint can be used as a predictor for another viewpoint; the option that IDyOM provides to still use such viewpoints as predictors is to create linked viewpoints (Conklin and Witten, 1995; Pearce, 2005). A linked viewpoint is the product of two viewpoints. Linked viewpoints are depicted in this project with the symbol ⊗. To be able to use any viewpoint as a predictor for the target viewpoint phrase, the viewpoint to be tested is linked to the target viewpoint by adding another pair of parentheses to the code for running IDyOM. To run IDyOM, the following code was used, with data set 5 and the viewpoint cpitch (chromatic pitch) as an example to test its predictive power:

CL-USER> (idyom:idyom 5 '(phrase) '((phrase cpitch))
                      :output-path "/home/arianne/Documents/results/")

Afterwards, the best viewpoints will be used to make new combinations by linking several viewpoints with phrase. This will result in different combinations and models. Lastly, the interpretation of these results will hopefully give some insight into which features describe cadences.

3.4 Evaluation

After running the program, the results for the different models created with different combinations of viewpoints need to be evaluated. The evaluation is done with the data files created by IDyOM. As mentioned in Section 3.3, IDyOM has the option of an output-path to which it exports the output of the model as a data file. In these data files, every note is listed with its ID and its musical properties (in terms of the basic viewpoints), along with information about the model, the entropy, the information content, and finally the predictions made by IDyOM. (Full documentation of the IDyOM output can be found on the IDyOM website: https://code.soundsoftware.ac.uk/projects/idyom-project/wiki/.)

When IDyOM tries to predict the viewpoint phrase, the predictions are listed as a probability distribution over the current note being the beginning of a phrase (denoted in the viewpoint as "1"), in the middle of a phrase ("0"), or at the end ("-1", in this case the cadences). The entropy and IC are then calculated with the probability assigned to the actual state of the event, which means that the IC is higher when a lower probability was assigned to the actual state. However, in the case of cadence recognition, it is more important to look at the correct classification of cadences by the model. To evaluate the classification, the probabilities assigned by IDyOM to each state of an event are compared, and the state with the highest probability is taken as the model's classification. With these classifications, the numbers of true positives, false positives, false negatives, and true negatives can be determined. These values are subsequently used to compute the precision (Equation 1) and recall (Equation 2), which denote respectively how many of the classifications made are correct and how many of the cadences were actually found.

\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (1)

\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (2)

\text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (3)

Afterwards, the F-measure (Equation 3) is calculated with the precision and recall per model. The F-measure is a harmonic mean between the recall and precision that the system has achieved, making it possible to easily compare the performances between models.
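A minimal sketch of this evaluation step is given below, assuming that for every event the true phrase value and the three assigned probabilities have already been read from the IDyOM output file (the reading of the file itself is omitted, and the variable names are hypothetical).

def classify(probabilities):
    # probabilities maps each phrase value (-1, 0, 1) to the probability IDyOM assigned;
    # the value with the highest probability is taken as the model's classification
    return max(probabilities, key=probabilities.get)

def evaluate(true_values, predicted_values, cadence=-1):
    # count true positives, false positives and false negatives for the cadence class (-1)
    tp = sum(t == cadence and p == cadence for t, p in zip(true_values, predicted_values))
    fp = sum(t != cadence and p == cadence for t, p in zip(true_values, predicted_values))
    fn = sum(t == cadence and p != cadence for t, p in zip(true_values, predicted_values))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure

# example: three events, of which one cadence is classified correctly and one event wrongly
print(evaluate([0, -1, 0], [0, -1, -1]))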

The first evaluation that needs to be done is to review the results of all the viewpoints separately. By comparing the F-measure of each model created with these viewpoints, the best five are selected for creating new models. These models are made by creating linked viewpoints between two viewpoints and the viewpoint phrase, to ensure that the source viewpoints can be used as predictors for the target viewpoint. Even more viewpoints can be combined if IDyOM can handle the complexity of these higher-order models. In turn, these new models with more linked viewpoints are evaluated on performance, which results in an evaluation of whether the addition of several viewpoints, and thus more dimensions, is helpful for the performance of the program.


At the same time, it is interesting to check whether IDyOM's normal output can give an indication of its cadence recognition performance. As mentioned in Section 3.3, IDyOM outputs a weighted Information Content (IC) for each model. Therefore, for each model with a single link, the weighted IC given by IDyOM is tracked to be able to check for a possible correlation. This is done by comparing the F-measures with the ICs and checking whether a correlation can be found between these two measures. If a correlation is found, the IC may suggest using other viewpoints in the follow-up step of linking more viewpoints, as the viewpoints with the best IC may differ from those with the best F-measure.
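The correlation check itself is straightforward; as a small example, using the IC and F-measure values of the four best single-viewpoint models that appear in Table 2:

from scipy.stats import pearsonr

ics = [0.87, 0.86, 0.89, 0.73]         # weighted IC of the keysig, pulses, barlength and cpintfip models
f_measures = [0.27, 0.25, 0.22, 0.20]  # their F-measures, taken from Table 2
r, p_value = pearsonr(ics, f_measures)
print(round(r, 2))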

The last part of the evaluation is the musical interpretation. How are the best source viewpoints related to what cadences are? Do the features capture the properties of cadences that musical theories describe? What could still be lacking if the results are not as high as hoped? Which cadences are recognised can also give some insight, as it might turn out that, for example, the program is not capable of recognising a specific type of cadence while it performs perfectly on another kind. Thus, are all cadences equally recognisable? If the approach works, how have the features captured the cadences, and is it a good start for a model of how humans perceive harmonic cadences? If the program does not work, why was it not able to recognise cadences? In short, the most interesting part of the evaluation is how the results relate back to the music and to the human perception and cognition of music.

4 Results

4.1 Viewpoint selection

The first models created were those that test the performance of a single viewpoint. This was done by linking each possible viewpoint to the viewpoint phrase as source viewpoint. IDyOM contains 55 viewpoints that could be used with the information that it has about the corpus. However, for some viewpoints IDyOM was not able to create a model due to errors. These errors were caused by problems in the programming of IDyOM itself and were not related to the corpus. These seven viewpoints were therefore omitted from the list of viewpoints to test, as solving these problems is outside the scope of this project. By omitting these viewpoints, however, several features that can be found in symbolic music are ignored, which might have been interesting for cadence recognition. The viewpoints that gave errors and were omitted are listed below.

• referent
• cpintfref
• inscale
• fiph
• liph
• thrqu
• thrintfrefliph

Along with the viewpoints mentioned above, the viewpoint lphrase was also ignored. While this viewpoint did perform very well (the highest F-measure found using only lphrase as source viewpoint was 0.97), its musical meaning is very unclear. It should take the length of the last phrase into consideration, but even the IDyOM documentation states that it remains unknown what the feature actually does (Pearce, 2005). Moreover, with an F-measure this high, there is a good chance that the viewpoint is cheating in some way, as the results discussed in the following sections are significantly lower, making this an outlier. It also differs from the other viewpoints in the output it generates: it omits the column in the data file with the probability of an event being in the middle of a phrase, and it remains unclear why. The addition of this viewpoint to the set of viewpoints to test was therefore considered unreliable.

To check whether the linking of viewpoints actually improves performance, a model was created that only uses phrase as source viewpoint, which can be regarded as a baseline. In this model, phrase as source viewpoint describes how the viewpoint behaves in the training data and what has preceded in the current excerpt, while phrase as target viewpoint describes what IDyOM predicts for the current event. The values for this base model are shown in Table 1.


Table 1: Results for the base case with only phrase as source viewpoint, with the true positives (tp), false positives (fp), false negatives (fn), true negatives (tn), and the Information Content (IC).

Source viewpoints tp fp fn tn Precision Recall F-measure IC

phrase 3 3 149 794 0.5 0.02 0.04 0.90

4.2 Single linked viewpoints

The models created by linking phrase to a single other viewpoint as source viewpoints can be seen in Table 2 on the next page. From the remaining 47 viewpoints, the eight viewpoints listed below got zero true positives (which is even worse than the base case of only using phrase in Table 1). With zero true positives, the precision, recall and F-measure for these viewpoints are zero, making their results irrelevant.

• accidental
• cpcint-5
• cpintfiph
• fib
• ioi-contour
• ioi
• thrfiph
• thrliph

These were therefore omitted in Table 2 on the following page, which now only contains the 39 remaining viewpoints that worked without giving errors and could identify at least one cadence correctly.

4.3 Information Content (IC)

As mentioned in Section 3.4, it is interesting to test whether a correlation can be found between the normal IDyOM output, the weighted IC, and the classification performance of the models. Figure 4 shows the IC values plotted against the F-measures of the models for all viewpoints listed in Table 2. As can be seen, the values are very scattered, and an almost straight line can be drawn through the whole graph while the F-measure rises. The calculated correlation between IC and F-measure is 0.05, which also shows that there is no correlation between the two measures. Likewise, the correlation between IC and precision is -0.07, and between IC and recall it is also 0.05. Therefore, no correlation between IC and the classification performance has been found. As the IC thus does not hold any interesting information for the task in this project, it is omitted from the following tables.


Table 2: Single viewpoints linked with phrase, sorted by F-measure.

Source viewpoints tp fp fn tn Precision Recall F-measure IC
phrase ⊗ keysig 31 44 121 753 0.41 0.20 0.27 0.87
phrase ⊗ pulses 27 38 125 759 0.42 0.18 0.25 0.86
phrase ⊗ barlength 23 35 129 762 0.40 0.15 0.22 0.89
phrase ⊗ cpintfip 19 23 133 774 0.45 0.13 0.20 0.73
phrase ⊗ cpcint-2 17 17 135 780 0.50 0.11 0.18 0.75
phrase ⊗ mpitch 16 13 136 784 0.55 0.11 0.18 0.82
phrase ⊗ cpcint 15 13 137 784 0.54 0.10 0.17 0.71
phrase ⊗ cpint 13 17 139 780 0.43 0.09 0.14 0.72
phrase ⊗ thrtactus 12 13 140 784 0.48 0.08 0.14 0.81
phrase ⊗ cpitch-class 11 8 141 789 0.58 0.07 0.13 0.86
phrase ⊗ posinbar 11 12 141 785 0.48 0.07 0.13 0.84
phrase ⊗ newcontour 11 20 141 777 0.35 0.07 0.12 0.85
phrase ⊗ cpitch 11 25 141 772 0.31 0.07 0.12 0.86
phrase ⊗ metaccent 9 10 143 787 0.47 0.06 0.11 0.84
phrase ⊗ thrbar 9 11 143 786 0.45 0.06 0.10 1.31
phrase ⊗ onset 10 44 142 753 0.19 0.07 0.10 0.82
phrase ⊗ mpitch-class 7 4 145 793 0.64 0.05 0.09 0.84
phrase ⊗ dur-ratio 7 5 145 792 0.58 0.05 0.09 0.71
phrase ⊗ cpint-size 7 7 145 790 0.50 0.05 0.08 0.72
phrase ⊗ registral-direction 7 23 145 774 0.23 0.05 0.08 0.86
phrase ⊗ cpcint-4 6 5 146 792 0.55 0.04 0.07 0.73
phrase ⊗ cpcint-3 5 5 147 792 0.50 0.03 0.06 0.75
phrase ⊗ cpcint-size 5 5 147 792 0.50 0.03 0.06 0.71
phrase ⊗ cpintfib 4 3 148 794 0.57 0.03 0.05 1.13
phrase ⊗ dur 4 4 148 793 0.50 0.03 0.05 0.87
phrase ⊗ closure 4 8 148 789 0.33 0.03 0.05 0.84
phrase ⊗ intervallic-difference 4 9 148 788 0.31 0.03 0.05 0.85
phrase ⊗ proximity 3 0 149 797 1.00 0.02 0.04 0.73
phrase ⊗ bioi 3 7 149 790 0.30 0.02 0.04 0.86
phrase ⊗ crotchet 2 0 150 797 1.00 0.01 0.03 0.86
phrase ⊗ deltast 2 1 150 796 0.67 0.01 0.03 0.85
phrase ⊗ tactus 2 1 150 796 0.67 0.01 0.03 0.85
phrase ⊗ tessitura 2 1 150 796 0.67 0.01 0.03 0.86
phrase ⊗ octave 2 4 150 793 0.33 0.01 0.03 0.83
phrase ⊗ registral-return 2 4 150 793 0.33 0.01 0.03 0.80
phrase ⊗ cpcint-6 1 0 151 797 1.00 0.01 0.01 0.73
phrase ⊗ contour 1 5 151 792 0.17 0.01 0.01 0.73
phrase ⊗ ioi-ratio 1 6 151 791 0.14 0.01 0.01 0.78
phrase ⊗ bioi-contour 1 8 151 789 0.11 0.01 0.01 0.81
phrase ⊗ bioi-ratio 1 8 151 789 0.11 0.01 0.01 0.78


Figure 4: IC set out against the F-measure for single linked viewpoints.

4.4 Multiple viewpoints

The follow-up of linking a single viewpoint to phrase as source viewpoint is linking more viewpoints. To make this feasible for this project, models were only made for combinations of the five viewpoints that performed the best on their own. The viewpoints with the highest F-measures are listed below, followed by a description of what the viewpoints capture (Pearce, 2005) along with the F-measure that they reached between parentheses.

• keysig (0.27) defines the key signature of the excerpt with an integer. C major is defined as 0, a positive count gives the number of sharps and a negative count the number of flats. This viewpoint is limited to a minimum of -7 and a maximum of 7 to avoid double accidentals.

• pulses (0.25) is defined by the number of pulses (beats) in a bar. Thus it is essentially the upper numeral in a time signature.

• barlength (0.22) defines the time unit of a beat in a bar. This is basically the lower numeral of the time signature.

• cpintfip (0.20) shows the chromatic interval between the current note and the first note in the piece. Note that the interval of the first note with itself is omitted.

• cpcint-2 (0.18) is the absolute value of the viewpoint cpcint taken modulo 2. cpcint is the modulo 12 of the chromatic pitch interval, which means that descent and ascent in intervals are preserved, while intervals greater than or equal to a perfect octave are reduced. In a musical sense, however, this viewpoint seems of little use: it simply checks whether an interval spans an even or an odd number of semitones, i.e. whether it can be built from whole tones alone.


• mpitch (0.18) stands for Meredith’s morphetic pitch (Meredith, 2006) and defines the note by an integer from the middle C = 35, while ignoring all accidentals.

The definitions of all viewpoints available in IDyOM can be found on the IDyOM website (https://code.soundsoftware.ac.uk/projects/idyom-project/wiki/). As can be seen, the viewpoint cpcint-2 belongs to the best performing viewpoints. However, this viewpoint was omitted, as its musical meaning is non-existent: it simply denotes whether an interval can be divided into whole tones only or whether a half tone is needed. This has no harmonic meaning, and therefore the best performing viewpoint after cpcint-2 was added to the list instead.
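As a toy illustration (not IDyOM code) of how such pitch-derived viewpoints relate to the underlying pitches, the sequences for a short invented bass line of MIDI pitch numbers could be computed as follows:

# hypothetical bass-line pitches as MIDI numbers (cpitch)
cpitch = [60, 64, 65, 67, 60]

# cpint: chromatic interval between each note and its predecessor
cpint = [b - a for a, b in zip(cpitch, cpitch[1:])]

# cpintfip: chromatic interval between each note and the first note of the piece
cpintfip = [p - cpitch[0] for p in cpitch[1:]]

print(cpint)     # [4, 1, 2, -7]
print(cpintfip)  # [4, 5, 7, 0]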

The next models were created with the five selected viewpoints by forming all possible combinations of two of them for linking with phrase. This means that these models have three linked source viewpoints. The results of these models can be seen in Table 3.

Table 3: Results for phrase linked to two other viewpoints, ranked by F-measure.

Source viewpoints tp fp fn tn Precision Recall F-measure
phrase ⊗ keysig ⊗ cpintfip 31 34 121 763 0.48 0.20 0.29
phrase ⊗ cpintfip ⊗ mpitch 30 29 122 768 0.51 0.20 0.28
phrase ⊗ keysig ⊗ mpitch 29 25 123 772 0.54 0.19 0.28
phrase ⊗ keysig ⊗ pulses 35 62 117 735 0.36 0.23 0.28
phrase ⊗ keysig ⊗ barlength 33 56 119 741 0.37 0.22 0.27
phrase ⊗ pulses ⊗ barlength 28 43 124 754 0.39 0.18 0.25
phrase ⊗ barlength ⊗ cpintfip 25 30 127 767 0.46 0.16 0.24
phrase ⊗ barlength ⊗ mpitch 23 24 129 773 0.49 0.15 0.23
phrase ⊗ pulses ⊗ cpintfip 23 33 129 764 0.41 0.15 0.22
phrase ⊗ pulses ⊗ mpitch 18 24 134 773 0.43 0.12 0.19

Table 3 shows an improvement in the F-measure of these more complicated models. It therefore seems plausible that linking even more viewpoints to the source viewpoints would improve the performance of the program further. However, due to the rapid increase in complexity caused by linking viewpoints, IDyOM was not able to run these more complex models without reaching the maximum memory capacity of the computer. IDyOM was, however, able to run a model with three viewpoints linked to phrase, using the viewpoints that have fewer distinct values and do not change within an excerpt, namely keysig, pulses, and barlength. These define the key signature and time signature, which are normally constant throughout an excerpt. The result of this model is shown in Table 4.

Table 4: Results for phrase linked to three other viewpoints as source viewpoints.

Source viewpoints tp fp fn tn Precision Recall F-measure
phrase ⊗ keysig ⊗ pulses ⊗ barlength 36 54 116 743 0.40 0.24 0.30

The results in Table 4 do show an improvement in the F-measure. However, after a look at Table 3, it seems more interesting to combine the harmonic viewpoints that worked well together in pairs, namely keysig, cpintfip, and mpitch.

IDyOM also offers another option for using several features: separate models, each with features linked to phrase, are created and then combined, and this combination is used for the prediction of the target viewpoint. This makes the models easier to compute, but it does make the viewpoint phrase more dominant, which can affect the performance. These models consist of two simpler models that both contain phrase, which makes it a more prevalent viewpoint in the combined model. One of the options tested in this project is the combination of a model with one linked viewpoint and a model with two linked viewpoints. This option was tested only for the four best performing viewpoints; the combinations of models and their results can be seen in Table 5.


Table 5: Results for combining two models.

Source viewpoints tp fp fn tn Precision Recall F-measure
(phrase ⊗ keysig)(phrase ⊗ barlength ⊗ keysig) 29 34 123 763 0.46 0.19 0.27
(phrase ⊗ keysig)(phrase ⊗ keysig ⊗ pulses) 28 34 124 763 0.45 0.18 0.26
(phrase ⊗ pulses)(phrase ⊗ pulses ⊗ barlength) 27 37 125 760 0.42 0.18 0.25
(phrase ⊗ barlength)(phrase ⊗ pulses ⊗ barlength) 26 32 126 765 0.45 0.17 0.25
(phrase ⊗ keysig)(phrase ⊗ barlength ⊗ pulses) 22 15 130 782 0.59 0.14 0.23
(phrase ⊗ pulses)(phrase ⊗ keysig ⊗ barlength) 21 12 131 785 0.64 0.14 0.23
(phrase ⊗ cpintfip)(phrase ⊗ keysig ⊗ cpintfip) 21 14 131 783 0.60 0.14 0.22
(phrase ⊗ pulses)(phrase ⊗ pulses ⊗ keysig) 21 14 131 783 0.60 0.14 0.22
(phrase ⊗ barlength)(phrase ⊗ keysig ⊗ pulses) 20 10 132 787 0.67 0.13 0.22
(phrase ⊗ barlength)(phrase ⊗ keysig ⊗ barlength) 20 11 132 786 0.65 0.13 0.22
(phrase ⊗ pulses)(phrase ⊗ keysig ⊗ cpintfip) 17 8 135 789 0.68 0.11 0.19
(phrase ⊗ keysig)(phrase ⊗ keysig ⊗ cpintfip) 17 10 135 787 0.63 0.11 0.19
(phrase ⊗ barlength)(phrase ⊗ keysig ⊗ cpintfip) 16 5 136 792 0.76 0.11 0.18
(phrase ⊗ keysig)(phrase ⊗ barlength ⊗ cpintfip) 16 6 136 791 0.73 0.11 0.18
(phrase ⊗ keysig)(phrase ⊗ pulses ⊗ cpintfip) 16 6 136 791 0.73 0.11 0.18
(phrase ⊗ cpintfip)(phrase ⊗ keysig ⊗ pulses) 16 13 136 784 0.55 0.11 0.18
(phrase ⊗ cpintfip)(phrase ⊗ barlength ⊗ cpintfip) 16 19 136 778 0.46 0.11 0.17
(phrase ⊗ cpintfip)(phrase ⊗ pulses ⊗ cpintfip) 15 15 137 782 0.50 0.10 0.16
(phrase ⊗ cpintfip)(phrase ⊗ pulses ⊗ barlength) 14 6 138 791 0.70 0.10 0.16
(phrase ⊗ cpintfip)(phrase ⊗ keysig ⊗ barlength) 14 10 138 787 0.58 0.09 0.16
(phrase ⊗ barlength)(phrase ⊗ barlength ⊗ cpintfip) 13 3 139 794 0.81 0.09 0.15
(phrase ⊗ pulses)(phrase ⊗ barlength ⊗ cpintfip) 13 3 139 794 0.81 0.09 0.15
(phrase ⊗ pulses)(phrase ⊗ pulses ⊗ cpintfip) 13 4 139 793 0.76 0.09 0.15
(phrase ⊗ barlength)(phrase ⊗ pulses ⊗ cpintfip) 10 4 142 793 0.71 0.07 0.12


Table 6: Cadences found by all three best performing models with two linked viewpoints.

File no. Composer Note Cadence type
ex04 Beethoven 29 authentic cadence
ex11 Beethoven 10 half cadence
ex12 Beethoven 28 half cadence
ex15 Chopin 13 authentic cadence
ex15 Chopin 19 authentic cadence
ex16 Chopin 16 authentic cadence
ex19 Haydn 7 authentic cadence
ex25 Mozart 9 authentic cadence
ex25 Mozart 15 authentic cadence
ex25 Mozart 27 authentic cadence
ex26 Mozart 7 half cadence
ex26 Mozart 9 authentic cadence
ex26 Mozart 14 authentic cadence
ex29 Mozart 7 half cadence
ex29 Mozart 9 authentic cadence
ex39 Schumann 16 authentic cadence
ex41 Schumann 9 authentic cadence
ex41 Schumann 15 authentic cadence
ex41 Schumann 17 authentic cadence
ex41 Schumann 29 authentic cadence

4.5 Found cadences

Even though the F-measures of the models were not very high (see Section 4.2 on page 19 and Section 4.4 on page 21), it is still interesting to check which cadences were found. To get an impression of the best recognisable cadences, the three best performing models of Table 3 on page 22 are considered, namely those with the following combinations of source viewpoints: phrase ⊗ keysig ⊗ cpintfip, phrase ⊗ cpintfip ⊗ mpitch, and phrase ⊗ keysig ⊗ mpitch. The cadences found by these three models were compared, and the cadences that were detected by all three are listed in Table 6. This table lists the file numbers, the composers of the excerpts, and the type of cadence that was detected at the given note in the file (this information originates from Appendices B on page 37 and C on page 38). Note that these three models draw on the same three viewpoints among them, which can influence which cadences are found. Also, as can be seen in Appendix C, the files ex15, ex25 and ex39 had pre-processing problems, as the downbeat and measures shifted during the export of the bass line.

5 Discussion and Conclusion

5.1 Evaluation of results

The baseline of using only phrase as a source viewpoint could only find three cadences and had an F-measure of 0.04. On the other end, the best model that could be found by linking two viewpoints to phrase had an F-measure of 0.29 and could find 31 cadences (Table 3 on page 22). This is a significant improvement of the performance. Table 4 on page 23 shows that the addition of more linked viewpoints can improve the performance even more. However, this also means that even the best model did not detect cadences in all 45 excerpts in the corpus.

Another option for combining viewpoints examined in this project can be seen in Table 5 on page 24. These models are created by combining the models with one viewpoint linked to phrase (see Table 2 on page 20) with the models in which two viewpoints were linked to phrase (see Table 3 on page 22). This option of combining viewpoints did not have any beneficial effect on the F-measure, which means that it does not improve the overall performance. However, most of these models do show an improvement in precision. This is most likely because this combination of viewpoints puts more emphasis on phrase, as it is linked twice to other viewpoints and thus carries more weight in the models. The viewpoint phrase mostly consists of events that are denoted as being in the middle of a phrase (0). This means that by adding another phrase to the source viewpoints, the resulting models are more conservative and thus more reluctant to classify an event as the ending (or beginning) of a phrase. This leads to fewer false positives, which causes the precision to improve. Even higher precision is reached when the viewpoint cpintfip is combined with either barlength or pulses in Table 5; however, as the recall is very low, this could simply mean that the model is so strict that mistakes are also less likely to be made.

Lastly, Section 3.4 on page 15 mentioned checking whether the Information Content (IC) given by IDyOM could give an indication of the performance of the models. To test this, the F-measures of the single-viewpoint models (see Table 2 on page 20) were given along with the weighted IC per model. However, after plotting the IC against the F-measures of the models (see Figure 4 on page 21), no correlation could be found. The calculated correlations between IC and the F-measure, precision, and recall also did not indicate any relation between the measures. Therefore, there is no link between IC and the performance of IDyOM for cadence recognition. It seems plausible to think that a smaller IC would indicate a better performance. However, even when a lower IC means that IDyOM has less difficulty predicting, the performance of classifying cadences does not improve. This is most likely because IDyOM is not made for the identification of cadences: the IC reflects the correct prediction of all configurations of phrase. Thus, IDyOM does not optimise for cadence recognition, which may also explain the slow rise in F-measure between Tables 3 and 4.


5.2 Musical meaning

5.2.1 Evaluation of best viewpoints

In Section 4.2 all the viewpoints that worked with the data set were tested separately. In Section 4.4 the five best performing viewpoints and their descriptions are given. But it is also interesting to check what these viewpoints musically mean and why they could contribute best to the recognition of cadences. To understand this, it should be known that cadences are dependent on their harmonic meaning as well as their placement within the measures in the music.

Looking at the descriptions of the viewpoints, the harmonic context can be described best by the viewpoints keysig, cpintfip, and mpitch. While the viewpoint cpcint-2 initially also belonged to the best performing viewpoints with a harmonic meaning, it was omitted as it has no musical meaning: it merely captures whether an interval can be divided into whole tones only or not (as mentioned in Section 4.4). Musically speaking, it makes no sense to use this, and even the online IDyOM documentation mentions that the viewpoint was only implemented because it was possible to do so. The other three, however, can be interpreted in terms of their meaning for harmonic cadences. The importance of harmonic features is even clearer in Table 3 on page 22. This table shows that the three models with the best performance were combinations of two harmonic viewpoints. This means that especially when models become more complex, the importance of harmony for cadence recognition is emphasised.

The first of these viewpoints, keysig, describes the key of the excerpt in terms of accidentals. This means that this viewpoint can give an indication of which notes have an important role (see Appendix A). This information becomes far more significant in combination with the pitch of the events. Table 3 shows that combinations of keysig with mpitch or cpintfip have a very good F-measure in comparison to other models. This means that adding a key indicator improves a model that already contains a viewpoint indicating pitch. As mpitch stands for the pitch without any accidentals, it can be interpreted as the plain note name, which makes it easier to compare the different excerpts in the data set. The viewpoint keysig, on the other hand, adds the information about the key and its accidentals. Similarly, cpintfip is also related to the key, as it describes the relation between the current event and the first event. In the common practice period, it was very common to introduce the key with the first note by starting on the tonic or a note related to the tonic. Thus the relation between the first and the last note can give a strong indication of a cadence, as cadences mostly end on a chord related to the key as well (see Appendix A).
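
As an illustration of how such a derived viewpoint works, the sketch below computes cpintfip from a list of MIDI pitch numbers. It follows the viewpoint's textual description rather than IDyOM's actual Lisp implementation, and the bass line used is an invented example.

```python
# Illustrative sketch of the derived viewpoint cpintfip: the chromatic
# interval of each event relative to the first event in the piece.
def cpintfip(cpitch_sequence):
    first = cpitch_sequence[0]
    return [p - first for p in cpitch_sequence]

# Bass line of a hypothetical excerpt in C major ending with a V-I motion:
bass = [48, 55, 53, 55, 48]   # C3, G3, F3, G3, C3
print(cpintfip(bass))         # [0, 7, 5, 7, 0]
# The final 0 signals a return to the opening (tonic-related) pitch, which is
# the kind of regularity that makes cpintfip informative for cadences.
```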

The placement of cadences within measures is captured by the remaining two viewpoints: barlength and pulses. When these viewpoints are combined, they can be used to describe the time signature of a musical piece. They can therefore indicate the placement of cadences within the measure and capture some rhythmic context. This is important for cadences, as they mostly fall on the downbeat of a measure. However, as the extraction of the bass lines resulted in the loss of anacruses, some excerpts no longer display this feature correctly, which can explain the lower results for the rhythmic features. Alternatively, these lower results can be explained by the fact that cadences in Classical music are harmonic cadences, meaning that they are defined by harmonic features and less by rhythmic features.
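
To make this concrete, the following sketch locates an event within the bar under the assumption that barlength gives the bar duration in IDyOM's basic time units and pulses the number of beats per bar, as their descriptions suggest; the timebase of 24 units per quarter note is likewise an assumption chosen for the example.

```python
# Sketch: metric position of an event from barlength and pulses.
def is_downbeat(onset, barlength):
    """True if an onset (in basic time units) falls on the first beat of a bar."""
    return onset % barlength == 0

def beat_in_bar(onset, barlength, pulses):
    """1-based beat number of an onset within its bar."""
    beat_length = barlength // pulses
    return (onset % barlength) // beat_length + 1

# 3/4 time with an assumed 24 units per quarter note: barlength = 72, pulses = 3.
for onset in [0, 24, 48, 72, 96]:
    print(onset, is_downbeat(onset, 72), beat_in_bar(onset, 72, 3))
# Cadential arrival chords typically coincide with is_downbeat == True, which
# is why a shifted anacrusis degrades these rhythmic viewpoints.
```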

Thus, Table 2 shows that both rhythmic and harmonic features are among the best performers. This was to be expected, as together these viewpoints are more likely to capture the several aspects that characterise cadences. This indicates that cadences are multidimensional and that using more viewpoints together does indeed improve the performance (see Tables 3 and 4).

5.2.2 Evaluation of recognised cadences

A second musical interpretation can be made by examining which cadences were recognised by the models. Table 6 on page 25 shows the cadences that were found by all three of the best models with two linked viewpoints. These models contained only harmonic viewpoints, which explains why IDyOM was capable of recognising cadences in the excerpts whose measures were shifted due to problems with the extraction of the bass lines. This can also explain why the rhythmic features performed worse.

When looking at the composers whose excerpts in the corpus contained recognised cadences, it is noticeable that all of these composers have more than one excerpt (see Appendix B on page 37). The composers with only one excerpt in the corpus (Brahms, Campbell, and Grieg) did not have any recognised cadences. This could mean that the usage of cadences varies too much per composer to be identified with the models used in this project. The variance in the use of cadences can also explain why some composers with several excerpts did not have any recognised cadences at all. The first of these, Bach, composed in a much earlier time span than the other composers in the corpus and still shows elements of the Baroque period. Tchaikovsky, on the other hand, is the only composer in the list from Russia, which could have led to a different way of composing. The explanation for Schubert not having any recognised cadences, however, can be found in the excerpts themselves. Schubert's excerpts were the hardest to annotate, and their cadences could not easily be classified as the more common cadence types. Furthermore, Schubert is the only composer in the corpus with pedal points in his excerpts. A pedal point is a note sustained in one voice while the other voices keep moving freely. The sustained voice is most commonly the bass, which makes it harder to recognise cadences when only bass lines are used. Lastly, Schubert was also the only composer with two excerpts from the same musical composition, which leaves less variety among the excerpts that represent him.

It is also remarkable that Mozart had the most cadences recognised by the different models, which may be due to the internal structure of the musical pieces. Table 6 on page 25 shows that once a cadence has been recognised in a sequence, cadences that appear later in the same sequence are much more likely to be recognised as well. This indicates that the models IDyOM has created depend more on the STM, which uses information from the preceding events. It also suggests that IDyOM treats the recognition of phrasal structure as depending more on internal factors of a sequence than on similarity to the cadences in the training set, which feed the LTM.
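
How strongly the STM influences a prediction depends on how the two models are combined for each event. The sketch below is a simplified illustration of an entropy-weighted combination in the spirit of Pearce (2005); it is not IDyOM's exact weighting scheme, and the distributions over the phrase values are invented numbers.

```python
import numpy as np

# Simplified sketch: combine LTM and STM predictions for one event, giving
# more influence to the model that is less uncertain (lower entropy).
def entropy(dist):
    dist = np.asarray(dist, dtype=float)
    return -np.sum(dist * np.log2(dist + 1e-12))

def combine(ltm_dist, stm_dist):
    """Weighted geometric combination of two distributions over the same alphabet."""
    w_ltm = 1.0 / (entropy(ltm_dist) + 1e-12)
    w_stm = 1.0 / (entropy(stm_dist) + 1e-12)
    combined = (np.asarray(ltm_dist) ** w_ltm) * (np.asarray(stm_dist) ** w_stm)
    return combined / combined.sum()

# Predictive distributions over the phrase values (e.g. begin / middle / end)
# for a single event; the numbers are invented.
ltm = [0.10, 0.85, 0.05]   # LTM: confident the phrase continues
stm = [0.40, 0.40, 0.20]   # STM: much less certain
print(combine(ltm, stm))   # result is pulled towards the more confident LTM
```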

Lastly, it is interesting to check which types of cadences were recognised the most. Table 6 shows that authentic cadences were recognised most often. This is easily explained, as authentic cadences are the most prevalent in music of the Classical period (Temperley, 2004). This prevalence can also be seen in the annotations of the Kostka-Payne corpus, as depicted in Appendix C on page 38. The corpus also shows that the second most used cadence, the half cadence, is recognised a few times by the best models, while the cadence types that hardly appear in the corpus do not have enough training data to be recognised.

Thus, although IDyOM achieved only a low F-measure for cadence recognition, the cadences it did detect still reveal some information about the music. While many of these findings may also indicate that the corpus is too small, the results remain musically reasonable, suggesting that the cadence recognition performed by IDyOM is meaningful.

5.3 Limitations and future work

Even with these musically interesting interpretations, several obstacles were encountered during this research that have probably degraded the results. The first obstacle was that multiple adaptations had to be made to make the data usable for recognition in IDyOM (see Section 3.1). Even after these adaptations, anacruses could not be properly extracted. This might have affected the recognition of cadences, as cadences mostly fall on downbeats. When an anacrusis was not correctly identified, the beats might have shifted during extraction, which could also make the rhythmic features less useful for cadence recognition.

The first problem that arose during the experiments with IDyOM was reading in multiple voices. While Pearce (2005) mentions in his dissertation that only monophonic music has been used, the documentation and the code of IDyOM imply that support for multiple voices has been added and should be possible. This was, however, not the case, even after making several adaptations in the **kern files to facilitate the reading of multiple voices. This impediment meant that only the bass lines were used in this project for the recognition of cadences. While the bass line is the most defining feature for the harmonic context, the recognition could likely be improved by adding the melody to the bass line. Ultimately, interpreting the full chords would allow the complete harmonic context to be used. If IDyOM could interpret several voices, a viewpoint could, for example, be created to express the function of each chord, capturing its harmonic meaning. This would align better with the function analyses humans perform when interpreting musical notation. In the end, a program for cadence recognition should be able to detect cadences within the full transcript, which also means that ornamental notes should be ignored automatically, instead of giving the program a reduced version of the transcript. Thus, future work should focus on cadence recognition with more voices for more musical context.
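
As a sketch of what such a reduction to a bass line might look like, the code below uses music21 (Cuthbert and Ariza, 2010) to collapse a multi-voice **kern score into the lowest sounding pitch of each vertical slice. This is an illustration, not necessarily the extraction procedure used in this project, and excerpt01.krn is a hypothetical file name.

```python
from music21 import converter

def extract_bass_line(kern_path):
    """Reduce a multi-voice **kern score to (offset, bass pitch, duration) triples."""
    score = converter.parse(kern_path)
    reduction = score.chordify()  # merge all voices into vertical chords
    bass = []
    for ch in reduction.flatten().getElementsByClass('Chord'):
        # ch.bass() returns the lowest pitch of the chord
        bass.append((ch.offset, ch.bass().nameWithOctave, ch.quarterLength))
    return bass

# Example call on a hypothetical excerpt file:
# for offset, pitch_name, duration in extract_bass_line('excerpt01.krn'):
#     print(offset, pitch_name, duration)
```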

Another complication is the feature selection, in this case deciding which viewpoints should be linked as source viewpoints. Conklin and Witten (1995) already mentioned that the number of models that can be created by linking increases vastly with more viewpoints. Linking viewpoints makes the number of possible systems rise immensely, and in the worst-case scenario an exhaustive search is needed to find the optimal combination of viewpoints. This feature selection problem also severely limited the options for this project. Even among the viewpoints that worked for the corpus used, there were still 39 viewpoints that could be mixed and matched for the creation of models. The number of candidate models was restricted by only considering the viewpoints that performed best on their own and combining these into the more complex models. Eventually, only four viewpoints were used for the combinations of models with one and two linked viewpoints. Even with these restrictions, numerous models were created, while only a fraction of the possibilities was tested. Table 3 also shows that the model with the highest F-measure does not consist of the viewpoints that performed best on their own. This means that even better combinations of viewpoints may exist, which could have been overlooked by considering only this limited set of combinations. A first step in improving the feature selection could be to use IDyOM's built-in feature selection: by passing :select as the source viewpoint instead of manually chosen viewpoints, IDyOM performs viewpoint selection with a hill-climbing procedure (Pearce, 2005). This removes the restriction to only a few of the available viewpoints.
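
To make the size of the search space concrete, the minimal sketch below counts only the simplest case, in which each model links phrase with k of the 39 usable viewpoints; the true number of multiple-viewpoint systems is larger still, because the chosen viewpoints can themselves be grouped into linked viewpoints in many ways.

```python
from math import comb

# Number of models that link phrase with k additional viewpoints, assuming
# 39 usable viewpoints (the count reported for this corpus).
n_viewpoints = 39
for k in range(1, 5):
    print(f"{k} extra viewpoint(s): {comb(n_viewpoints, k)} possible models")
# 1 extra viewpoint(s): 39 possible models
# 2 extra viewpoint(s): 741 possible models
# 3 extra viewpoint(s): 9139 possible models
# 4 extra viewpoint(s): 82251 possible models
```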

Another problem with IDyOM, as noted in Section 4.4 on page 21, is that the addition of more viewpoints greatly increases the complexity of the models. The models with more linked viewpoints were too complex for IDyOM to handle, which made it impossible in this project to test models that capture several musical properties at once; the models were limited to a maximum of two extra viewpoints next to phrase. To tackle this, future research could focus on working with a framework other than IDyOM for cadence recognition.

Lastly, there were also several limitations to the data set that was used. The Kostka-Payne corpus is a very small corpus, especially in comparison with corpora used for language processing. This makes machine learning very difficult, as there is hardly enough data to actually learn from. However, finding musical data for research remains a problem in the field of MIR, and in this case the Kostka-Payne corpus was the most representative data set for music of the Classical period. A bigger data set would probably not have been feasible for this project, as the cadences had to be annotated manually. This leads to the second limitation of the data set in this project. As the data set was annotated manually, the annotation of cadences was subject to my personal interpretation of the music and my available knowledge. This means that these annotations are not general, and different interpretations could lead to different annotations. Especially modulations within the music make it unclear whether a cadence is heard. These modulations give the feeling of an authentic cadence being heard, while essentially no cadence actually took place. To reflect this perceived cadence, such cases were annotated as cadences, but they may be harder to detect. Thus, the feeling that a cadence has occurred is very subjective, which may have affected the results. Another decision that may have affected the results is the choice to annotate cadences by marking their last note as the ending of a phrase. Cadences are progressions of chords, and marking only the last chord as the cadence may not be the optimal way to annotate them.

5.4 Conclusion

This project has investigated whether it is possible to recognise cadences within symbolic music of the common practice period. To reach this goal, the program Information Dynamics of Music (IDyOM) has been used to recognise cadences in the bass lines of the Kostka-Payne corpus. IDyOM was used to create several models from the different features it offers, called viewpoints. By testing these viewpoints individually, the importance of each feature for cadence recognition was assessed. This showed that cadences are described by harmonic as well as rhythmic features. By further combining the best features, the performance of IDyOM could be improved, indicating that cadences are indeed multidimensional, as expected. The contribution of sequential information was not tested in this project, but it could also be of great importance. Thus, the best features for automatic cadence recognition reflect the multidimensional nature of cadences: harmonic features remain the most important, while rhythmic features, as well as sequential information, contribute to cadence recognition.

References

Barthélemy, J. and Bonardi, A. (2001). Figured bass and tonality recognition. In Proceedings of the 2nd International Symposium on Music Information Retrieval, Bloomington (IN), United States.

Burgoyne, J. A., Fujinaga, I., and Downie, J. S. (2016). Music information retrieval. In A New Companion to Digital Humanities, pages 213–228.

Conklin, D. and Witten, I. H. (1995). Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1):51–73.

Cuthbert, M. S. and Ariza, C. (2010). music21: A toolkit for computer-aided musicology and symbolic music data. In Proceedings of the 11th International Society for Music Information Retrieval Conference, pages 637–642, Utrecht, Netherlands.

Eppe, M., Confalonieri, R., Maclean, E., Kaliakatsos-Papakostas, M., Cambouropoulos, E., Schorlemmer, M., Codescu, M., and Kühnberger, K. (2015). Computational invention of cadences and chord progressions by conceptual chord-blending. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 2445–2451, Buenos Aires, Argentina. AAAI Press.

Kostka, S. and Payne, D. (1995). Workbook for tonal harmony. McGraw-Hill, 3rd edition. Data set produced by David Temperley (chord-list file) and Bryan Pardo (MIDI files with chord qualities).

Meredith, D. (2006). The ps13 pitch spelling algorithm. Journal of New Music Research, 35(2):121–159.

Pauwels, J., Kaiser, F., and Peeters, G. (2013). Combining harmony-based and novelty-based approaches for structural segmentation. In Proceedings of the 14th International Society for Music Information Retrieval Conference, pages 601–606, Curitiba, Brazil.

Pearce, M. T. (2005). The construction and evaluation of statistical models of melodic structure in music perception and composition. PhD thesis, Department of Computing, City University, London, UK.

Sleator, D. and Temperley, D. (2001). The Melisma music analyzer. Web: http://www.link.cs.cmu.edu/music-analysis.

Temperley, D. (2004). The cognition of basic musical structures. MIT Press.

Van Kranenburg, P. and Karsdorp, F. (2014). Cadence detection in western traditional stanzaic songs using melodic and textual features. In Proceedings of the 15th International Society for Music Information Retrieval Conference, pages 391–396, Taipei, Taiwan.

Weihs, C., Jannach, D., Vatolkin, I., and Rudolph, G. (2016). Music Data Analysis: Foundations and Applications. CRC Press.

Musical Terms

accidental A note whose pitch does not belong to the tones of the key, indicated by a flat, a sharp, or a natural sign. 21, 22

anacruses See anacrusis. 9, 28, 29, 38

anacrusis A note or a few notes preceding the first complete bar of a musical piece. 9, 29, 34

fermata A musical symbol above a note indicating that the note should sound longer than its normal duration; how much it is prolonged is entirely up to the musician. 9

interval The musical distance between two tones. 12, 21, 22

modulation The change of key within a musical piece. 31

Phrygian cadence A cadence form which is most recognisable by its descending bass line, leading to a musical “comma”. 6

staff Five horizontal lines with spaces in between, where each line and space represents a different pitch. A clef is used to indicate which lines correspond to which pitches. 9

tonic The first scale degree in a function analysis. It is the basic chord of the key and feels the most like “home”. 27, 35

Acronyms

IC Information Content. 12, 15–17, 19, 26

IDyOM Information Dynamics of Music. 2, 8, 11–19, 22, 23, 26–31

LTM Long Term Model. 11, 12, 14, 29

MIR Music Information Retrieval. 2, 4, 6, 7, 30

STM Short Term Model. 11, 12, 14, 29
