
Chord Recognition with Stacked Denoising Autoencoders

Author:

Nikolaas Steenbergen

Supervisors:

Prof. Dr. Theo Gevers

Dr. John Ashley Burgoyne

A thesis submitted in fulfilment of the requirements for the degree of
Master of Science in Artificial Intelligence in the Faculty of Science


Abstract

In this thesis I propose two different approaches for chord recognition based on stacked denoising autoencoders working directly on the FFT. These approaches do not use any intermediate targets such as pitch class profiles/chroma vectors or the Tonnetz, in an attempt to remove any restrictions that might be imposed by such an interpretation. It is shown that these systems can significantly outperform a reference system based on state-of-the-art features. The first approach computes chord probabilities directly from an FFT excerpt of the audio data. In the second approach, two additional inputs, filtered with a median filter over different time spans, are added to the input. Hereafter, in both systems, a hidden Markov model is used to perform a temporal smoothing after pre-classifying chords. It is shown that using several different temporal resolutions can increase the classification ability in terms of weighted chord symbol recall. All algorithms are tested in depth on the Beatles Isophonics and the Billboard datasets, on a restricted chord vocabulary containing major and minor chords and an extended chord vocabulary containing major, minor, 7th and inverted chord symbols. In addition to presenting the weighted chord average recall, a post-hoc Friedman multiple comparison test for statistical significance on performance is also conducted.


Acknowledgements

I would like to thank Theo Gevers and John Ashley Burgoyne for supervising my thesis. Thanks to Ashley Burgoyne for his thorough and helpful advice and guidance. Thanks to Amogh Gudi for all the fruitful discussions about deep learning techniques while lifting weights and sweating in the gym. Special thanks to my parents, Brigitte and Christiaan Steenbergen, and my brothers Alexander and Florian; without their help, support and love, I would not be where I am now.


Contents

1 Introduction 7

2 Musical Background 8

2.1 Notes and Pitch . . . 8

2.2 Chords . . . 10

2.3 Other Structures in Music . . . 11

3 Related Work 12

3.1 Preprocessing / Features . . . 12

3.1.1 PCP / Chroma Vector Calculation . . . 12

3.1.2 Minor Pitch Changes . . . 13

3.1.3 Percussive Noise Reduction . . . 14

3.1.4 Repeating Patterns . . . 14

3.1.5 Harmonic / Enhanced Pitch Class Profile . . . 15

3.1.6 Modelling Human Loudness Perception . . . 15

3.1.7 Tonnetz / Tonal Centroid . . . 15

3.2 Classification . . . 16

3.2.1 Template Approaches . . . 16

3.2.2 Data-Driven Higher Context Models . . . 17

4 Stacked Denoising Autoencoders 20

4.1 Autoencoders . . . 20

4.2 Autoencoders and Denoising . . . 22

4.3 Training Multiple Layers . . . 23

4.4 Dropout . . . 24

5 Chord Recognition Systems 25

5.1 Comparison System . . . 25

5.1.1 Basic Pitch Class Profile Features . . . 26

5.1.2 Comparison System Simplified Harmony Progression Analyzer . . . 28

5.1.3 Harmonic Percussive Sound Separation . . . 28

5.1.4 Tuning and Loudness-Based PCPs . . . 29

5.1.5 HMMs . . . 30

5.2 Stacked Denoising Autoencoders for Chord Recognition . . . 31

5.2.1 Preprocessing of Features for Stacked Denoising Autoencoders . . . 32

5.2.2 Stacked Denoising Autoencoders for Chord Recognition . . . 33

5.2.3 Multi-Resolution Input for Stacked Denoising Autoencoders . . . 34

6 Results 36

6.1 Reduction of Chord Vocabulary . . . 37

6.2 Score Computation . . . 37

6.2.1 Weighted Chord Symbol Recall . . . 37

6.3 Training Systems Setup . . . 38

6.4 Significance Testing . . . 39

6.5 Beatles Dataset . . . 39


6.5.2 Extended Chord Vocabulary . . . 42

6.6 Billboard Dataset . . . 44

6.6.1 Restricted Major-Minor Chord Vocabulary . . . 44

6.6.2 Extended Chord Vocabulary . . . 46

6.7 Weights . . . 47

7 Discussion 49

7.1 Performance on the Different Datasets . . . 49

7.2 SDAE . . . 49

7.3 MR-SDAE . . . 51

7.4 Weights . . . 52

7.5 Extensions . . . 52

8 Conclusion 55

A Joint Optimization 61

A.1 Basic System Outline . . . 61

A.2 Gradient of the Hidden Markov Model . . . 61

A.3 Adjusting Neural Network Parameters . . . 62

A.4 Updating HMM Parameters . . . 62

A.5 Neural Network . . . 63

A.6 Hidden Markov Model . . . 63

A.7 Combined Training . . . 63

A.8 Joint Optimization . . . 64


List of Figures

1 Piano keyboard and MIDI note range . . . 9

2 Conventional autoencoder training . . . 21

3 Denoising autoencoder training . . . 23

4 Stacked denoising autoencoder training . . . 24

5 SDAE for chord recognition . . . 33

6 MR-SDAE for chord recognition . . . 34

7 Post-hoc multiple-comparison Friedman tests for Beatles restricted chord vocabulary . . . 40

8 Whisker plot for the Beatles restricted chord vocabulary . . . 41

9 Post-hoc multiple-comparison Friedman tests for Beatles extended chord vocabulary . . . 42

10 Whisker plot for Beatles extended chord vocabulary . . . 43

11 Post-hoc multiple-comparison Friedman tests for Billboard restricted chord vocabulary . . . 45

12 Post-hoc multiple-comparison Friedman tests for Billboard extended chord vocabulary . . . 46

13 Visualization of weights of the input layer of the SDAE . . . 48

14 Plot of sum of absolute values for the input layer of the SDAE . . . 48

15 Absolute training error for joint optimization . . . 64


List of Tables

1 Semitone steps and intervals. . . 10

2 Intervals and chords . . . 11

3 WCSR for the Beatles restricted chord vocabulary . . . 41

4 WCSR for the Beatles extended chord vocabulary . . . 43

5 WCSR for the Billboard restricted chord vocabulary . . . 45

6 WCSR for the Billboard extended chord vocabulary . . . 47


1 Introduction

The increasing amount of digitized music available online has given rise to demand for automatic analysis methods. A new subfield of information retrieval has emerged that concerns itself only with music: music information retrieval (MIR). Music information retrieval covers different subcategories, from analyzing features of a music piece (e.g., beat detection, symbolic melody extraction, and audio tempo estimation) to exploring human input methods (like "query by tapping" or "query by singing/humming") to music clustering and recommendation (like mood detection or cover song identification).

Automatic chord estimation is one of the open challenges in MIR. Chord estimation (or recognition) describes the process of extracting musical chord labels from digitally encoded music pieces. Given an audio file, the specific chord symbol and temporal position and duration have to be automatically determined.

The main evaluation programme for MIR is the annual "Music Information Retrieval Evaluation eXchange" (MIREX) challenge1. It consists of challenges in different sub-tasks of MIR, including chord recognition. Often improving one task can influence the performance in other tasks; e.g., finding a better beat estimate can improve the performance of finding the temporal positions of chord changes, or improve the task of querying by tapping. The same is the case for chord recognition. It can improve the performance of cover song identification, in which, starting from an input song, cover songs are retrieved: chord information is a useful if not vital feature for discrimination. Chord progressions also have an influence on the "mood" transmitted through music. Thus being able to retrieve the chords used in a music piece accurately could also be helpful for mood categorization, e.g., for personalized Internet radios.

Chord recognition is also valuable by itself. It can aid musicologists as well as hobby and professional musicians in transcribing songs. There is a great demand for chord transcriptions of well-known and also lesser-known songs. This manifests itself in the many Internet pages that hold manual transcriptions of songs, especially for guitar.2 Unfortunately, these mostly contain transcriptions only of the most popular songs, and often several different versions of the same song exist. Furthermore, they are not guaranteed to be correct. Chord recognition is a difficult task which requires a lot of practice even for humans.

1 http://www.music-ir.org/mirex

2 E.g. Ultimate Guitar: http://www.ultimate-guitar.com/, 911Tabs: http://www.911tabs.


2 Musical Background

In this section I give an overview of important musical terms and concepts used later in this thesis. I first describe how musical notes relate to physical sound waves in section 2.1, then how chords relate to notes in section 2.2, and finally other aspects of music that play a role in automatic chord recognition in section 2.3.

2.1 Notes and Pitch

Pitch describes the perceived frequency of a sound. In Western tonality pitches are labelled by the letters A to G. The transcription of a musically relevant pitch and its duration is called a note. Pitches can be ordered by frequency, whereby a pitch is said to be higher if the corresponding frequency is higher.

The human auditory system works on a logarithmic scale, which also manifests itself in music: musical pitches are ordered in octaves, repeating the note names, usually denoted in ascending order from C to B: C, D, E, F, G, A, B. We can denote different octave relationships with an additional number added as a subscript to the symbol described previously, so the pitch A0 is one octave lower than the corresponding pitch A1. Two pitches one octave apart differ by a factor of two in frequency. Humans typically perceive those two pitches as the same pitch (Shepard, 1964).

In music an octave is split into twelve roughly equal semitones. By definition each of the letters C to B are two semitone steps apart, except for the steps from E to F and B to C, which are only one semitone apart. To denote the notes that lie in between the named letters, the additional symbols ♯ (for a semitone step in the direction of increasing frequency) and ♭ (for a step in the direction of decreasing frequency) are used. For example, we can describe the musically relevant pitch between C and D both as C♯ and D♭. Because this system only defines the relationship between pitches, we need a reference frequency. In modern Western tonality the reference frequency of A4 at 440 Hz is the usual standard (Sikora, 2003). In practice slight deviations from this reference tuning may occur, e.g., due to instrument mistuning or similar. This reference pitch thus defines the respective frequencies of the other notes implicitly through the octave and semitone relationships. We can compute the corresponding frequencies for all other notes given a reference pitch with the following equation:

f_n = 2^{n/12} \cdot f_r,    (1)

where f_n is the frequency of a note n semitone steps away from the reference pitch f_r.
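As a quick illustration of equation (1), the following minimal Python sketch (my own illustration, not code from the thesis) computes the frequency of a note a given number of semitone steps away from A4 = 440 Hz:

```python
# Minimal sketch of equation (1): f_n = 2^(n/12) * f_r.
def note_frequency(n_semitones, f_ref=440.0):
    """Frequency of the note n_semitones steps above (negative: below) f_ref."""
    return 2.0 ** (n_semitones / 12.0) * f_ref

print(note_frequency(0))     # A4 -> 440.0 Hz
print(note_frequency(12))    # A5 -> 880.0 Hz
print(note_frequency(-12))   # A3 -> 220.0 Hz
print(note_frequency(3))     # C5 -> ~523.25 Hz
```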

The human ear can perceive a frequency range of approximately 20 Hz to 20 000 Hz. In practice this frequency range is not fully used in music. For example, the MIDI standard, which is more than sufficient for musical purposes in terms of octave range, covers only notes in semitone steps from C-1, corresponding to about 8.17 Hz, to G9, which is 12 543.85 Hz. A standard piano keyboard covers the range from A0 at 27.5 Hz to C8 at 4186 Hz. Figure 1 depicts a standard piano keyboard in relation to the range of frequencies of MIDI standard notes, with the corresponding physical sound frequencies indicated.


Figure 1: Piano keyboard and MIDI note range (MIDI notes 0–127, piano keys 1–88). White keys depict the range of the standard piano for those notes that are described by letters; black keys deviate a semitone from a note described by a letter. The gray area depicts the extension of the MIDI standard beyond the note range of a piano.


2.2 Chords

For the purpose of this thesis we define a chord as three or more notes played simultaneously. The distance in frequency between two notes is called an interval. In a musical context we can describe an interval as the number of semitone steps two notes are apart (Sikora, 2003). A chord consists of a root note, usually the lowest note in terms of frequency. The interval relationship of the other notes played at the same time defines the chord type. Thus a chord can be defined as a root note and a type. In the following we use the notation <root-note>:<chord-type>, proposed by Harte (2010). We can refer to the notes in musical intervals, in order of ascending frequency, as: root note, third, fifth, and, if there is a fourth note, seventh. In Table 1 we can see the intervals for the chords considered in this thesis and the semitone distances for those intervals. The root note and fifth have fixed intervals. For the seventh and third, we differentiate between major and minor intervals, differing by one semitone step.

For this thesis we restrict ourselves to two different chord vocabularies to be recognized, the first one containing only major and minor chord types. Both major and minor chords consist of three notes: the root note, the third and the fifth. The interval between root note and third distinguishes major and minor chord types (see tables 1 and 2): a major chord contains a major third, while the minor chord contains a minor third. We distinguish between twelve root notes for each chord type, for a total of 24 possible chords.

Burgoyne et al. (2011) propose a dataset which contains songs from the Billboard charts from the 1950s through the 1990s. This major-minor chord vocabulary accounts for 65% of the chords. We can extend this chord vocabulary to take into account 83% of the chord types in the Billboard dataset by including variants of the seventh chords, adding an optional fourth note to a chord. Hereby, in addition to simple major and minor chords, we add 7th, major 7th and minor 7th chord types to our chord-type vocabulary. Major 7th chords and minor 7th chords are essentially major and minor chords in which the added fourth note has the interval of a major seventh or a minor seventh respectively.

In addition to different chord types, it is possible to change the frequency order of the notes by "pulling" one note below the root note in terms of frequency. This is called chord inversion. Thus our extended chord vocabulary containing major, minor, 7th, major 7th and minor 7th chords also contains all possible inversions. We can denote this through an additional identifier in our chord syntax, <root-note>:<chord-type>/<inversion-identifier>, where the inversion identifier can be either 3, 5, or 7, indicating which chord note is played below the root note.

interval         number of semitone steps
root-note        0
minor third      3
major third      4
fifth            7
minor seventh    10
major seventh    11

Table 1: Semitone steps and intervals.


chord-type   intervals of notes
major        1, 3, 5
minor        1, ♭3, 5
7            1, 3, 5, ♭7
major7       1, 3, 5, 7
minor7       1, ♭3, 5, ♭7

Table 2: Intervals and chords. The root note is denoted as 1, the third as 3, the fifth as 5 and the seventh as 7. Minor intervals are denoted with ♭.

For example, E:maj7/7 would be a major 7 chord consisting of the root note E, a major third, fifth, and major seventh, with the major seventh played below the root note in terms of frequency.

It is possible, however, that in parts of a song no instruments, or only non-harmonic instruments (e.g., percussion), are playing. To be able to interpret this case we define an additional non-chord symbol, adding one more symbol to the 24 chord symbols of the restricted chord vocabulary and leaving us with 25 different symbols. The extended chord vocabulary contains major, minor, 7th, major 7th and minor 7th chord types (depicted in table 2) and all possible inversions. So, for each root note, this leaves us with three different chord symbols for each of the major and minor types (root position and two inversions) and four different chord symbols for each of the seventh chord types (root position and three inversions), thus 12 × (2 × 3 + 3 × 4) = 216 different symbols, plus the additional non-chord symbol.
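To make the count explicit, the following short Python enumeration (my own illustration; the label syntax follows the Harte notation used above, and the label "N" for the non-chord symbol is an assumption of this sketch) lists the extended vocabulary:

```python
# Enumerate the extended chord vocabulary: 12 roots x (2 triad types with 3 positions
# + 3 seventh types with 4 positions) + 1 non-chord symbol = 217 symbols.
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

CHORD_TYPES = {
    "maj":  ["", "/3", "/5"],        # triads: root position + two inversions
    "min":  ["", "/3", "/5"],
    "7":    ["", "/3", "/5", "/7"],  # seventh chords: root position + three inversions
    "maj7": ["", "/3", "/5", "/7"],
    "min7": ["", "/3", "/5", "/7"],
}

symbols = [f"{root}:{ctype}{inv}"
           for root in ROOTS
           for ctype, inversions in CHORD_TYPES.items()
           for inv in inversions]
symbols.append("N")  # non-chord symbol

print(len(symbols))  # 217
```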

Furthermore, we assume that chords cannot overlap, although this is not strictly true, for example, due to reverb, multiple instruments playing chords, etc. However, in practice this overlap is negligible and reverb is often not that long. Thus we regard a chord to be a continuous entity with designated start point, end point and a chord symbol (either consisting of the root note, chord type and inversion, or a non-chord symbol).

2.3 Other Structures in Music

A musical piece has several other components, some contributing additional harmonic content, for example vocals, which might also carry a linguistically interpretable message. Since a music piece has an overall harmonic structure and an inherent set of music theoretical harmonic rules, this information also influences the chords played at any time and vice versa, but does not necessarily contribute to the chord played directly.

The duration and the start and end points in time of a chord are influenced by rhythmic instruments, such as percussion. These do not contribute to the harmonic content of a music piece but are nonetheless interdependent with the other instruments in terms of timing, and thus with the beginning and end of a chord played.

These additional harmonic and non-harmonic components occupy the same frequency range as the components that directly contribute to the chord played. From this viewpoint, if we do not explicitly take these additional components into account, we are dealing with the additional task of filtering out this "noise" on top of the task of recognizing the chords themselves.


3 Related Work

Most musical chord estimation methods can broadly be divided into two sub-processes: preprocessing of features from wave-file data, and higher-level classification of those features into chords.

I first describe in section 3.1 the preprocessing steps of the raw waveform data, as well as the extensions and the refinements of its computation steps to take more properties of waveform music data into account. An overview of higher-level classification organized by methods applied is given in section 3.2. These not only differ in the methods per se, but also in what kind of musical context they take into account for the final classification. More recent methods take more musical context into account and seem to perform better. Since the methods proposed in this thesis are based on machine learning, I have decided to organize the description of other higher level classification approaches from a technical perspective rather than from a music-theoretical perspective.

3.1 Preprocessing / Features

The most common preprocessing step for feature extraction from waveform data is the computation of so-called pitch class profiles (PCPs), a human-perception-based concept coined by Shepard (1964). He conducted a human perceptual study in which he found that humans are able to perceive notes that are in octave relation as equivalent. A similar representation can be computed from waveform data for chord recognition. A PCP in a music-computational sense is a representation of the frequency spectrum wrapped into one musical octave, thus an aggregated 12-dimensional vector of the energy of the respective input frequencies. This is often called a chroma vector. A sequence of chroma vectors over time is called a chromagram. The terms PCP and chroma vector are used interchangeably in the chord recognition literature. It should be noted, however, that only the physical sound energy is aggregated: this is not purely musical harmonic information. Thus the chromagram may contain additional non-harmonic noise, such as drums, harmonic overtones and transient noise.

In the following I will give an overview of the basics of calculating the chroma vector and different extensions proposed to improve the quality of these features.

3.1.1 PCP / Chroma Vector Calculation

In order to compute a chroma vector, the input signal is broken into frames and converted to the frequency domain, which is most often done through a discrete Fourier transform (DFT), using a window function to reduce spectral leakage. Harris (1978) compares 23 different window functions and finds that the performance depends very much on the properties of the data. Since musical data is not homogeneous, there is no single best-performing windowing function. Different window functions have been used in the literature, and often the specific window function is not stated. Khadkevich and Omologo (2009a) compare the performance impact of using Hanning, Hamming and Blackman windowing functions on musical waveform data applied to the chord estimation domain. They state that the results are very similar for those three types. However, the Hamming window performed slightly better for window lengths of


1024 and 2048 samples (for a sampling rate of 11025 Hz), which are the most common lengths in automatic chord recognition systems today.

To convert from the Fourier domain to a chroma vector, two different methods are used. Wakefield (1999) sums the energies of frequencies in the Fourier space closest to the pitch of a chroma vector bin (and its octave multiples), aggregating the energy in a discrete mapping from the spectral frequency domain to the corresponding chroma vector bin and thereby converting the input directly to a chroma vector. Brown (1991) developed the so-called constant-Q transform, using a kernel matrix multiplication to convert the DFT spectrogram into a logarithmic frequency space. Each bin of the logarithmic frequency representation corresponds to the frequency of a musical note. After conversion into the logarithmic frequency domain, we can then simply sum up the respective bins to obtain the chroma vector representation. For both methods the aggregated sound energy in the chroma vector is usually normalized, either to sum to one or with respect to the maximum energy in a single bin. Both methods lead to similar results and are used in the current literature.
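As an illustration of the direct summation approach, here is a minimal numpy sketch (my own, not code from the thesis); the window choice, the frequency range (matching the 55–1661.2 Hz interval of the reference system described later) and the sum-to-one normalization are assumptions of this sketch:

```python
# Map each FFT bin to the pitch class of its nearest musical note and accumulate
# the magnitudes into a 12-dimensional chroma vector.
import numpy as np

def chroma_from_frame(frame, sr=11025, f_ref=440.0, f_min=55.0, f_max=1661.2):
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)

    chroma = np.zeros(12)
    for f, mag in zip(freqs, spectrum):
        if f < f_min or f > f_max:
            continue                                   # restrict the analysed range
        midi = 69 + 12.0 * np.log2(f / f_ref)          # MIDI note number (A4 = 69)
        chroma[int(round(midi)) % 12] += mag           # pitch class 0 = C, ..., 11 = B

    total = chroma.sum()
    return chroma / total if total > 0 else chroma
```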

3.1.2 Minor Pitch Changes

In Western tonality music instruments are tuned to the reference frequency of A4 above middle C (MIDI note 69), whose standard frequency is 440 Hz. In

some cases the tuning of the instruments can deviate slightly, usually less than a quartertone from this standard tuning: 415–445 Hz (Mauch, 2010). Most humans are unable to determine an absolute pitch height without a reference pitch. We can hear a mistuning of one instrument with some practice, but it is difficult to determine a slight deviation of all instruments from the usual reference frequency described above.

The bins for the chroma vectors are relative to a fixed pitch, thus minor deviations in the input will affect its quality. Minor deviations of the reference pitch can be taken into account through shifting the pitch of the chromagram bins. Several different methods have been proposed: Harte and Sandler (2005) use a chroma vector with 36 bins, 3 per semitone. Computing a histogram of energies with respect to frequency for one chroma vector and the whole song and examining the peak positions in the extended chroma vector enables them to estimate the true tuning and derive a 12-bin chroma vector, under the assumption that the tuning will not deviate during the piece of music. This takes a slightly changed reference frequency into account. Gómez (2006) first restricts the input frequencies from 100 to 5000 Hz to reduce the search space and to remove additional overtone and percussive noise. She uses a weighting function which aggregates spectral peaks not to one, but to several chromagram bins. The spectral energy contributions of these bins are weighted according to a squared cosine distance in frequency. Dressler and Streich (2007) treat minor tuning differences as an angle and use circular statistics to compensate for minor pitch shifts, which was later adapted by Mauch and Dixon (2010b).

Minor tuning differences are quite prominent in Western tonal music, and adjusting the chromagram can lead to performance increase, such that several other systems make use of one of the former methods, e.g.: Papadopoulos and Peeters (2007, 2008), Reed et al. (2009), Khadkevich and Omologo (2009a), Oudre et al. (2009).


3.1.3 Percussive Noise Reduction

Music audio often contains noise that cannot directly be used for chord recognition, such as transient or percussive noise. Percussive and transient noise is normally short, in contrast to harmonic components, which are rather stable over time. A simple way to reduce this is to smooth subsequent chroma vectors through filtering or averaging. Different filters have been proposed. Some researchers, e.g., Peeters (2006), Khadkevich and Omologo (2009b) and Mauch et al. (2008), use a median filter over time, after tuning and before aggregating the chroma vectors, to remove transient noise. Gómez (2006) uses several different filtering methods and derivatives based on a method developed by Bonada (2000) to detect transient noise and leaves a window of 50 ms before and after transient noise out of the chroma vector calculation, reducing the input space. Catteau et al. (2007) calculate a "background spectrum" by convolving the log-frequency spectrum with a Hamming window with a length of one octave, which they subtract from the original chroma vector to reduce noise.
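A minimal sketch of such temporal smoothing (my own illustration; the filter width is an arbitrary assumption):

```python
# Median-filter a chromagram along the time axis only, to suppress short
# transient/percussive frames while keeping stable harmonic content.
import numpy as np
from scipy.ndimage import median_filter

def smooth_chromagram(chromagram, width=9):
    """chromagram: array of shape (n_frames, 12)."""
    return median_filter(chromagram, size=(width, 1), mode="nearest")
```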

Because there are methods to estimate a beat from the audio signal (Ellis, 2007), and chord changes are more likely to appear on these metric positions, several systems aggregate or filter the chromagram only in between those detected beats. Ni et al. (2012) use a so-called harmonic percussive sound separation algorithm, described in Ono et al. (2008), which attempts to split the audio signal into percussive and harmonic components. After that they use the median chroma feature vector as representation for the complete chromagram between two beats. A similar approach is used by Weil et al. (2009), who also use a beat tracking algorithm and average the chromagram between two consecutive beats. Glazyrin and Klepinin (2012) calculate a beat-synchronous smoothed chromagram and propose a modified Prewitt filter from image recognition for edge detection, applied to music to suppress non-harmonic spectral components.

3.1.4 Repeating Patterns

Musical pieces have a very repetitive structure: e.g., in popular music, higher-level structures such as verse and chorus are repeated, and usually those are themselves repetitions of different harmonic (chord) patterns. These structures can be exploited to improve the chromagram by recognizing and averaging or filtering those repetitive parts to remove local deviation. Repetitive parts can also be estimated and used later in the classification step to increase performance. Mauch et al. (2009) first perform a beat estimation and smooth the chroma vectors in a prefiltering step. Then a frame-by-frame similarity matrix from the beat-synchronous chromagram is computed and the song is segmented into an estimation of verse and chorus. This information is used to average the beat-synchronous chromagram. Since beat estimation is a current research topic itself and often does not work perfectly, there might be errors in the beat positions. Cho and Bello (2011) argue that it is advantageous to use recurrence plots with a simple threshold operation to find similarities on a chord level for later averaging, thus leaving out the segmentation of the song into chorus and verse and the beat detection. Glazyrin and Klepinin (2012) build upon and alter the system of Cho and Bello. They use a normalized self-similarity matrix on the computed chroma vectors, using Euclidean distance as a comparison measure.
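A minimal sketch of such a normalized self-similarity (recurrence) matrix over chroma frames, using Euclidean distance (my own illustration; the normalization to [0, 1] is an assumption):

```python
import numpy as np

def self_similarity(chromagram):
    """chromagram: (n_frames, 12). Returns an (n_frames, n_frames) similarity matrix."""
    X = chromagram / (np.linalg.norm(chromagram, axis=1, keepdims=True) + 1e-12)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise Euclidean distance
    return 1.0 - d / (d.max() + 1e-12)                           # 1 on the diagonal
```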


3.1.5 Harmonic / Enhanced Pitch Class Profile

One problem of the computation of PCPs in general is to find an interpretation for overtones (energy at integer multiples of the fundamental frequency), since these might generate energy at frequencies that contribute to chroma vector bins other than the actual notes of the respective chord. For example, the overtones of A4 (440 Hz) are at 880 Hz and 1320 Hz, the latter of which is close to E6 (MIDI note 88)

at approximately 1318.51 Hz. Several different ways to deal with this have been proposed. In most cases the frequency range that is taken into account is restricted, e.g., approximately from 100 Hz to 5000 Hz (Lee, 2006; Gómez, 2006); most of the harmonic content is contained in this interval. Lee (2006) refines the chroma vector by computing the so-called "harmonic product spectrum", in which the product of the energy at octave multiples (up to a certain number) is calculated for each bin. The chromagram is then computed on the basis of this harmonic product spectrum. He states that multiplying the fundamental frequency with its octave multiples can decrease noise on notes that are not contained in the original piece of music. Additionally he finds a reduction of noise induced by "false" harmonics compared to conventional chromagram calculation. Gómez (2006) proposes an aggregation function for the computation of the chroma vector, in which the energy of the frequency multiples is summed, but first weighted by a "decay" factor, which depends on the multiple. Mauch and Dixon (2010a) use a non-negative least-squares method to find a linear combination of "note profiles" in a dictionary matrix to compute the log-frequency representation, similar to the constant-Q transform mentioned earlier.
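A minimal sketch of a harmonic product spectrum (my own illustration; the number of harmonics is an arbitrary assumption):

```python
import numpy as np

def harmonic_product_spectrum(spectrum, n_harmonics=4):
    """Multiply the spectrum with downsampled copies of itself, so energy at integer
    multiples of a fundamental reinforces that fundamental."""
    hps = np.array(spectrum, dtype=float, copy=True)
    for h in range(2, n_harmonics + 1):
        downsampled = spectrum[::h]              # bin k now holds energy at h times its frequency
        hps[: len(downsampled)] *= downsampled
    return hps
```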

3.1.6 Modelling Human Loudness Perception

Human loudness perception is not directly proportional to the power or amplitude spectrum (Ni et al., 2012), thus the different representations described above do not model human perception accurately. Ni et al. (2012) describe a method to incorporate this through a log10 scale for the sound power with respect

to frequency. Pauws (2004) uses a tangential weighting function to achieve a similar goal for key detection. They find an improvement on the quality of the resulting chromagram compared to non-loudness-weighted methods.

3.1.7 Tonnetz / Tonal Centroid

Another representation of harmonics is the so-called Tonnetz, which is attributed to Euler in the 18th century. It is a planar representation of musical notes on a 6-dimensional polytope, where pitch relations are mapped onto its vertices. Close musical harmonic relations (e.g., fifths and thirds) have a small Euclidean distance. Harte et al. (2006) describe a way to compute a Tonnetz from a 12-bin chroma vector, and report a performance increase for a harmonic change detection function, compared to standard methods.

Humphrey et al. (2012) use a convolutional neural network from the FFT to model a projection function from wave form input to a Tonnetz. They perform experiments on the task of chord recognition with a Gaussian mixture model, and report that the Tonnetz output representation outperforms state-of-the-art chroma vectors.


3.2 Classification

The majority of chord recognition systems compute a chromagram using one or a combination of methods described above. Early approaches use predefined chord templates and compare them with the computed frame-wise chroma features from audio pieces, which are then classified.

With the supply of more and more hand-annotated data, more data-driven learning approaches have been developed. The most prominent data-driven model adopted is taken from speech recognition: the hidden Markov model (HMM). Bayesian networks, which are a generalization of HMMs, are also used frequently. Recent approaches propose to take more musical context into account to increase performance, such as the local key, bass note, beat and song structure segmentation. Although most chord recognition systems rely on the computation of single chroma vectors, more recent approaches compute two chroma vectors for each frame: a bass and a treble chromagram (differing in frequency range), as it is reasoned that the sequence of bass notes has an important role in the harmonic development of a song and can colour the treble chromagrams due to harmonics.

3.2.1 Template Approaches

The chroma vector, as an estimate of the harmonic content of a frame of a music piece, should contain peaks at bins that correspond to the chord notes played. Chord template approaches use chroma-vector-like templates. These can be either predefined through expert knowledge or learned from data. Those templates are then compared, using a fitting function, with the computed chroma vector of each frame. The frame is then classified as the chord symbol corresponding to the best-fitting template.

The first research paper explicitly concerned with chord recognition is by Fujishima (1999), which constitutes a non-machine-learning system. Fujishima first computes simple chroma vectors as described above. He then uses predefined 12-dimensional binary chord patterns (either 1 or 0 for notes present or not present in the chord) and computes their inner product with the chroma vector. For real-world chord estimation, the set of chords consists of schemata for "triadic harmonic events, and to some extent more complex chords such as sevenths and ninths". Fujishima's system was only used on synthesized sound data, however. Binary chord templates with an enhanced chroma vector using harmonic overtone suppression were used by Lee (2006). Other groups use a more elaborate chromagram with tuning (36 bins) to handle minor pitch changes, while reducing the set of chord types to be recognized (Harte and Sandler, 2005; Oudre et al., 2009). Oudre et al. (2011) extend the methods already mentioned by comparing different filtering methods, as described in section 3.1.3, and measures of fit (Euclidean distance, Kullback-Leibler divergence and Itakura-Saito divergence) to select the most suitable chord template. They also take harmonic overtones of chord notes into account, such that bins in the templates for notes not occurring in the chord do not necessarily have to be zero. Glazyrin and Klepinin (2012) use quasi-binary chord templates, in which the tonic and the 5th are enhanced and the template is normalized afterwards. The templates are compared to smoothed and fine-tuned chroma vectors.
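A minimal sketch of Fujishima-style template matching for the 24 major/minor triads (my own illustration; it assumes chroma bin 0 corresponds to C, as in the earlier chroma sketch):

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    """Binary 12-dimensional templates for all major and minor triads."""
    templates = {}
    for root in range(12):
        for name, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates[f"{NOTE_NAMES[root]}:{name}"] = t
    return templates

def classify_frame(chroma, templates):
    """Return the label of the template with the largest inner product."""
    return max(templates, key=lambda label: float(np.dot(chroma, templates[label])))
```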


Chord templates can also be modelled as a Gaussian, or as a mixture of Gaussians as used by Humphrey et al. (2012), in order to obtain a probabilistic estimate of chord likelihood. To eliminate short spurious chords that only last a few frames, they use a Viterbi decoder. They do not use the chroma vector for classification, but a Tonnetz as described in section 3.1.7. The transformation function is learned from data by a convolutional neural network.

It should be noted that basically all chord template approaches can model chord "probabilities" that can in turn be used as input for higher-level classification methods or for temporal smoothing such as hidden Markov models (described in section 3.2.2), as shown by Papadopoulos and Peeters (2007).

3.2.2 Data-Driven Higher Context Models

The recent increase in availability of hand-annotated data for chord recognition has spawned new machine-learning-based methods. In the chord-recognition literature, different approaches have been proposed, from neural networks, to systems adopted from speech recognition, to support vector machines and others. More recent machine learning systems seem to capture more and more of the context of music. In this section I describe the higher-level classification models found in the literature, organized by the machine learning methods used.

Neural Networks Su and Jeng (2001) try to model the human auditory system with artificial neural networks. They perform a wavelet transform (as an analogy to the ear) and feed the output into a neural network (as an analogy for the cerebrum) for classification. They use a self-organizing map to determine the style of chord and the tonality (C, C# etc.). It was tested on classical music to recognize 4 different chord types (major, minor, augmented, and diminished). Zhang and Gerhard (2008) propose a system based on neural networks to detect basic guitar chords and their voicings (inversions) with the help of a voicing vector and a chromagram. The neural network in this case is first trained to identify and output the basic chords; a later post-processing step determines the voicing. Osmalsky et al. (2012) build a database with several different instruments playing single chords individually, part of it recorded in a noisy and part of it in a noise-free environment. They use a feed-forward neural net with a chroma vector as input to classify 10 different chords and experiment with different subsets of their training set.

HMM Neural networks do not take time dependencies between subsequent inputs into account. In music pieces there is a strong interdependency between subsequent chords, which makes classifying the chords of a whole music piece difficult to model based solely on neural networks. Since template and neural-network-based approaches do not explicitly take the temporal properties of music into account, a widely adopted method is to use a hidden Markov model, which has proven to be a good tool for the related field of speech recognition. The chroma vector is treated as the observation, which can be modelled by different probability distributions, and the states of the HMM are the chord symbols to be extracted. Sheh and Ellis (2003) pioneered HMMs for real-world chord recognition. They propose that the emission distribution be a single Gaussian with 24 dimensions, trained from data with expectation maximization. Burgoyne et al.


(2007) state that a mixture of Gaussians is more suitable as the emission distribution. They also compare the use of Dirichlet distributions as the emission distribution and conditional random fields as the higher-level classifier. HMMs are used with slightly different chromagram computations and training initialisations according to prior music-theoretic knowledge by Bello and Pickens (2005). Lee (2006) builds upon the systems of Bello and Pickens and of Sheh and Ellis, generates training data from symbolic files (MIDI) and uses an HMM for chord extraction. Papadopoulos and Peeters (2007) compare several different methods of determining the parameters of the HMM and the observation probabilities. They conclude that a template-based approach combined with an HMM with a "cognitive based transition matrix" shows the best performance. Later, Papadopoulos and Peeters (2008, 2011) propose an HMM approach focusing on (and extracting) beat estimates to take into account musical beat addition, beat deletion or changes in meter to enhance recognition performance. Ueda et al. (2010) use Harmonic Percussive Sound Separation chromagram features and an HMM for classification. Chen et al. (2012) cluster "song-level duration histograms" to take time duration explicitly into account in a so-called duration-explicit HMM. Ni et al. (2012) is the best-performing system of the 2012 MIREX challenge in chord estimation. It works on the basis of an HMM, bass and treble chroma, and beat and key detection.
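To illustrate the temporal smoothing that such HMM-based systems perform, here is a minimal Viterbi decoding sketch over per-frame chord scores (my own illustration; the array shapes and the use of log probabilities are assumptions, not a specific published system):

```python
import numpy as np

def viterbi(emission_log_probs, transition, initial):
    """emission_log_probs: (n_frames, n_chords) log-scores per frame;
    transition: (n_chords, n_chords) transition probabilities; initial: (n_chords,)."""
    n_frames, n_states = emission_log_probs.shape
    log_trans = np.log(transition)
    delta = np.log(initial) + emission_log_probs[0]
    backpointers = np.zeros((n_frames, n_states), dtype=int)

    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans          # scores[i, j]: best path ending i -> j
        backpointers[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emission_log_probs[t]

    path = [int(delta.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]                                # most likely chord index per frame
```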

Structured SVM Weller et al. (2009) compare the performance of HMMs and support vector machines (SVMs) for chord recognition and achieve state-of-the-art results using support vector machines.

n-grams Language and music are closely related. Both spoken language and music rely on audio data. Thus it makes sense to apply spoken-language-recognition approaches to music analysis and chord recognition. A dominant approach for language recognition is an n-gram model. A bigram model (n = 2) is essentially a hidden Markov model, in which one state only depends on the previous one. Cheng et al. (2008) compare 2-, 3-, and 4-grams, thus making one chord dependent on multiple previous chords. They use it for song similarity after a chord recognition step. In their experiments the simple 3- and 4-grams outperform the basic HMM system of Harte and Sandler (2005); they state that n-grams are able to learn the basic rules of chord progressions from hand-annotated data.

Scholz et al. (2009) use a 5-gram and compare different smoothing techniques, and find that modelling more complex chords with 7ths and 9ths should be possible with n-grams. They do not state how features are computed and interpreted.

Dynamic Bayesian Networks Musical chords develop meaning in their interplay with other characteristics of a music piece, such as bass note, beat and key: they cannot be viewed as an isolated entity. These interdependencies are difficult to model with a standard HMM approach. Bayesian networks are a generalization of HMMs, in which the musical context can be modelled more intuitively. Bayesian networks give the opportunity to model interdependencies simultaneously, creating a more sound model for music pieces from a music-theoretic perspective. Another advantage of a Bayesian network is that it can


directly extract multiple types of information, which may not be a priority for the task of chord recognition, but is an advantage for the extended task of general transcription of music pieces.

Cemgil et al. (2006) were among the first to introduce Bayesian networks for music computation. They do not apply the system to chord recognition but to polyphonic music transcription (transcription on a note-by-note basis). They implement a special case of the switching Kalman filter. Mauch (2010) and Mauch and Dixon (2010b) make use of a Bayesian network and incorporate beat detection, bass note and key estimation. The observations of the Bayesian network in the system are treble and bass chromagrams. Dixon et al. (2011) compare a similar system to a logic based system.

Deep Learning Techniques Deep learning techniques have beaten the state of the art in several benchmark problems in recent years, although for the task of chord recognition it is a relatively unexplored method. There are three recent publications using deep learning techniques. Humphrey and Bello (2012) call for a change in the conventional approach of using a variation of chroma vector and a higher level classifier, since they state recent improvements seem to bring only “diminishing return”. They present a system consisting of a convolutional neural network with several layers, trained to learn a Tonnetz from a constant-Q-transformed FFT, and subsequently classify it with a Gaussian mixture model. Boulanger-Lewandowski et al. (2013) make use of deep learning techniques with recurrent neural networks. They use different techniques including a Viterbi-like algorithm from HMMs and beam search to take temporal information into account. They report upper-bound results comparable to the state of the art using the Beatles Isophonics dataset (see section 6.5 for a dataset description) for training and testing. Glazyrin (2013) uses stacked denoising autoencoders with a 72-bin constant-Q transform input, trained to output chroma vectors. A self-similarity algorithm is applied to the neural network output and later classified with a deterministic algorithm, similar to the template approaches mentioned above.


4 Stacked Denoising Autoencoders

In this section I give a description of the theoretical background of stacked denoising autoencoders used for the two chord recognition systems in this thesis following Vincent et al. (2010). First a definition of autoencoders and their training method is given in section 4.1, then it is described how this can be extended to form a denoising autoencoder in section 4.2. We can stack denoising autoencoders to train them in an unsupervised manner and possibly get a useful higher level data abstraction by training several layers, which is described in section 4.3.

4.1 Autoencoders

Autoencoders or autoassociators try to find an encoding of given data in the hidden layers. Similar to Vincent et al. (2010) we define the following:

We assume a supervised learning scenario: a training set of n tuples of inputs x and targets t, D_n = {(x_1, t_1), ..., (x_n, t_n)}, where x ∈ R^d if the input is real-valued, or x ∈ [0, 1]^d if it is binary. Our goal is to infer a new, higher-level representation y of x. The new representation is again y ∈ R^{d'} or y ∈ [0, 1]^{d'}, depending on whether a real-valued or a binary representation is assumed.

Encoder A deterministic mapping f_θ that transforms the input x to a hidden representation y is called an encoder. It can be described as follows:

y = f_θ(x) = s(Wx + b),    (2)

where θ = {W, b}, with W a d × d' weight matrix and b an offset (or bias) vector of dimension d'. The function s(x) is a non-linear mapping, e.g., a sigmoid activation function 1/(1 + e^{-x}). The output y is called the "hidden representation".

Decoder A deterministic mapping g_{θ'} that maps the hidden representation y back to input space by constructing a vector z = g_{θ'}(y) is called a decoder. Typically this is a linear mapping:

z = g_{θ'}(y) = W'y + b',    (3)

or a mapping followed by a non-linearity:

z = g_{θ'}(y) = s(W'y + b'),    (4)

where θ' = {W', b'}, with W' a d' × d weight matrix and b' an offset (or bias) vector of dimension d. Often the restriction W^T = W' is imposed on the weights (tied weights). z can be regarded as an approximation of the original input data x, reconstructed from the hidden representation y.


Figure 2: Conventional autoencoder training. A vector x from the training set is projected by f_θ(x) to the hidden representation y, and hereafter projected back to input space using g_{θ'}(y) to compute z. The loss function L(x, z) is calculated and used as the training objective for minimization.

Training The idea behind such a model is to get a good hidden representation y, from which the decoder is able to reconstruct the original input as closely as possible. It can be shown that finding the optimal parameters for such a model can be viewed as maximizing a lower bound on the mutual information between the input and the hidden representation in the first layer (Vincent et al., 2010). To estimate the parameters we define a loss function. For a binary input x ∈ [0, 1]^d this can be the cross entropy:

L(x, z) = -\sum_{k=1}^{d} [ x_k \log(z_k) + (1 - x_k) \log(1 - z_k) ],    (5)

or, for real-valued input x ∈ R^d, the "squared error objective":

L(x, z) = ||x - z||^2.    (6)

Since we use real-valued input data, the squared error objective is used as the loss function in this thesis.

Given this loss function we want to minimize the average loss (Vincent et al., 2008):

θ^*, θ'^* = \arg\min_{θ, θ'} (1/n) \sum_{i=1}^{n} L(x^{(i)}, z^{(i)}) = \arg\min_{θ, θ'} (1/n) \sum_{i=1}^{n} L(x^{(i)}, g_{θ'}(f_θ(x^{(i)}))),    (7)

where θ^*, θ'^* denote the optimal parameters of the encoding and decoding functions (which might be tied) for which the loss function is minimized, and n is the number of training samples. This minimization can be achieved iteratively by backpropagation. Figure 2 visualizes the training procedure for an autoencoder.
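As a concrete illustration of equations (2), (4) and (6), here is a minimal numpy sketch of one stochastic gradient step for a single-layer autoencoder (my own illustration; tied weights, sigmoid activations in both encoder and decoder, and per-sample updates are assumptions of this sketch):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_step(x, W, b, b_prime, lr=0.01):
    """One SGD step on the squared-error objective (6) for one input vector x.
    W has shape (n_hidden, n_visible); the decoder uses the tied weights W.T."""
    y = sigmoid(W @ x + b)            # encoder, cf. eq. (2)
    z = sigmoid(W.T @ y + b_prime)    # decoder, cf. eq. (4) with W' = W.T

    # backpropagate L(x, z) = ||x - z||^2
    delta_out = 2.0 * (z - x) * z * (1.0 - z)      # error at the reconstruction
    delta_hid = (W @ delta_out) * y * (1.0 - y)    # error at the hidden layer

    W -= lr * (np.outer(delta_hid, x) + np.outer(y, delta_out))
    b -= lr * delta_hid
    b_prime -= lr * delta_out
    return W, b, b_prime, float(np.sum((x - z) ** 2))
```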

If the hidden representation y is of the same dimensionality as the input x, it is trivial to construct a mapping that yields zero reconstruction error, the identity mapping. Obviously this constitutes a problem since merely learning the identity mapping does not lead to any higher level of abstraction. To evade this problem a bottleneck is introduced, for example by using fewer nodes for a


hidden representation thus reducing its dimensions. It is also possible to impose a penalty on the network activations to form a bottleneck, and thus train a sparse network. These additional restrictions force the neural network to focus on the most “informative” parts of the data leaving out noisy “uninformative” parts. Several layers can be trained in a greedy manner to achieve a yet higher level of abstraction.

Enforcing Sparsity To prevent autoencoders from learning the identity mapping, we can penalize activation. This is described by Hinton (2010) for restricted Boltzmann machines, but can be used for autoencoders as well. The general idea is that nodes that fire very frequently are less informative, i.e., a node that is always active does not add any useful information and could be left out. We can enforce sparsity by adding a penalty term for large average activations over the whole dataset to the backpropagated error. We can compute the average activation of a hidden unit j over all training samples with:

\hat{p}_j = (1/n) \sum_{i=1}^{n} f_θ^j(x^{(i)}).    (8)

In this thesis the following addition to the loss function is used, which is derived from the KL divergence:

L_p = β \sum_{j=1}^{h} [ p \log(p / \hat{p}_j) + (1 - p) \log((1 - p) / (1 - \hat{p}_j)) ],    (9)

where \hat{p}_j is the average activation over the complete training set for hidden unit j, n is the number of training samples, p is a target activation parameter and β a penalty weighting parameter, all specified beforehand. The bound h is the number of hidden nodes. For a sigmoidal activation function p is usually set to a value that is close to zero, for example 0.05. A frequent setting for β is 0.1. This ensures that units will have a large activation only on a limited number of training samples and otherwise have an activation close to zero. We now simply add this weighted activation error term to L(x, z), described above.
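A minimal sketch of the penalty in equation (9), computed over a batch of hidden activations (my own illustration; the clipping for numerical safety is an assumption):

```python
import numpy as np

def sparsity_penalty(hidden_activations, p=0.05, beta=0.1):
    """hidden_activations: (n_samples, n_hidden) encoder outputs. Returns the
    beta-weighted KL-derived penalty of eq. (9), using the average activations of eq. (8)."""
    p_hat = hidden_activations.mean(axis=0)                  # eq. (8)
    p_hat = np.clip(p_hat, 1e-8, 1.0 - 1e-8)                 # avoid log(0)
    kl = p * np.log(p / p_hat) + (1.0 - p) * np.log((1.0 - p) / (1.0 - p_hat))
    return float(beta * kl.sum())
```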

4.2 Autoencoders and Denoising

Vincent et al. (2010) propose another training criterion in addition to the bottleneck. They state that an autoencoder can also be trained to "clean a partially corrupted input", also called denoising.

If noisy input is assumed, it can be beneficial to corrupt (parts of) the input of the autoencoder while training and use the uncorrupted input as target. The autoencoder is hereby encouraged to reconstruct a “clean” version of the corrupted input. This can make the hidden representation of the input more robust to noise, and can potentially lead to a better higher level abstraction of the input data.

Vincent et al. (2010) state that different types of noises can be considered. There is “masking noise”, i.e., setting a random fraction of the input to 0, “salt and pepper noise”, i.e., setting a random fraction of the input to either 0 or 1, and, especially for real-valued input, isotropic additive Gaussian noise, i.e. adding noise from a Gaussian distribution to the input. To achieve this, we


corrupt the initial input x into x̃ according to a stochastic mapping x̃ ∼ q_D(x̃|x). This corrupted input is then projected to the hidden representation as described before by means of y = f_θ(x̃) = s(Wx̃ + b). Then we can reconstruct z = g_{θ'}(y). The parameters θ and θ' are trained to minimize the average reconstruction error between the output z and the uncorrupted input x, but in contrast to "conventional" autoencoders, z is now a deterministic function of x̃ instead of x.

For our purpose, using additive Gaussian noise, we can train the denoising autoencoder with the squared error loss function L_2(x, z) = ||x − z||^2. Parameters can be initialized at random and then optimized by backpropagation. Figure 3 depicts the training of a denoising autoencoder.

Figure 3: Denoising autoencoder training. A vector x from the training set is corrupted with q_D and converted to the hidden representation y. The loss function L(x, z) is calculated from the output and the uncorrupted input and used for training.
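A minimal sketch of one denoising training step with additive Gaussian corruption (my own illustration; it mirrors the autoencoder_step sketch above, with the corruption level σ as an arbitrary assumption):

```python
import numpy as np

def denoising_step(x, W, b, b_prime, sigma=0.1, lr=0.01, rng=np.random.default_rng()):
    """Corrupt x, encode/decode the corrupted version, but compute the error
    against the clean x. W has shape (n_hidden, n_visible); decoder uses W.T."""
    x_tilde = x + rng.normal(0.0, sigma, size=x.shape)     # q_D: isotropic Gaussian noise
    y = 1.0 / (1.0 + np.exp(-(W @ x_tilde + b)))           # encode the corrupted input
    z = 1.0 / (1.0 + np.exp(-(W.T @ y + b_prime)))         # reconstruct in input space

    delta_out = 2.0 * (z - x) * z * (1.0 - z)              # loss uses the CLEAN input x
    delta_hid = (W @ delta_out) * y * (1.0 - y)
    W -= lr * (np.outer(delta_hid, x_tilde) + np.outer(y, delta_out))
    b -= lr * delta_hid
    b_prime -= lr * delta_out
    return W, b, b_prime
```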

4.3 Training Multiple Layers

If we want to train deep networks (or initialize their training parameters for supervised backpropagation), we need a manner to extend the approach from a single layer, as described in the previous sections, to multiple layers.

As described by Vincent et al. (2010), this can easily be achieved by repeating the process for each layer separately. Depicted in figure 4 is such a greedy layer-wise training. First we propagate the input x through the already trained layers. Note that we do not use additional corruption noise yet. Next we use the uncorrupted hidden representation of the previous layer as input for the layer we are about to train. We train this specific layer as described in the previous sections. The input to the layer to be trained is first corrupted by q_D and then projected into latent space by using f_θ^{(2)}. We then project it back to the "input" space of the specific layer with g_{θ'}^{(2)}. Using an error function L, we can optimize the projection functions with respect to the defined error, and therefore possibly obtain a useful higher-level representation. This process can be repeated several times to initialize a deep neural network structure, circumventing the usual problems that arise when initializing deep networks at random and then applying


backpropagation.

Next we can apply a classifier on the output of this deep neural network trained to suppress noise. Alternatively, we can add another layer of hidden nodes for classification purposes on top of the previously unsupervised-trained network structure and apply standard backpropagation to fine-tune the network weights according to our supervised training targets t.

Figure 4: Training of several layers in a greedy unsupervised manner. The input is propagated without corruption. To train an additional layer, the output of the first layer is corrupted by q_D and the weights are adjusted with f_θ^{(2)}, g_{θ'}^{(2)} and the respective loss function. After training for this layer is completed, we can train subsequent layers.
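A minimal sketch of this greedy layer-wise pretraining, reusing the denoising_step sketch from section 4.2 (my own illustration; the layer sizes, training schedule and weight initialization are arbitrary assumptions):

```python
import numpy as np

def pretrain_stack(data, layer_sizes, epochs=10, rng=np.random.default_rng(0)):
    """data: (n_samples, d) input matrix; layer_sizes: list of hidden-layer sizes.
    Requires the denoising_step function from the earlier sketch."""
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        d = inputs.shape[1]
        W = rng.normal(0.0, 0.01, size=(n_hidden, d))
        b, b_prime = np.zeros(n_hidden), np.zeros(d)
        for _ in range(epochs):
            for x in inputs:
                W, b, b_prime = denoising_step(x, W, b, b_prime, rng=rng)
        layers.append((W, b))
        # propagate WITHOUT corruption to obtain the next layer's training input
        inputs = 1.0 / (1.0 + np.exp(-(inputs @ W.T + b)))
    return layers
```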

4.4 Dropout

Hinton et al. (2012) were able to improve performance on several other recognition tasks, including MNIST for handwritten digit recognition and TIMIT, a database for speech recognition, by randomly omitting a fraction of the hidden nodes from training for each sample. This is in essence training a different model for each training sample, each iterated on one training sample only. According to Hinton et al. this prevents the network from overfitting. In the testing phase we make use of the complete network again. Thus what we are effectively doing with dropout is averaging: averaging many models, each trained on one training sample. This has yielded an improvement in different modelling tasks (Hinton et al., 2012).
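A minimal sketch of dropout applied to a hidden layer (my own illustration; it uses the classic formulation in which the full network is kept at test time and activations are scaled by the keep probability):

```python
import numpy as np

def dropout_forward(y, drop_prob=0.5, train=True, rng=np.random.default_rng()):
    """Randomly zero a fraction of the hidden activations during training;
    at test time use the whole network, scaled to the expected activation."""
    if train:
        mask = rng.random(y.shape) >= drop_prob
        return y * mask
    return y * (1.0 - drop_prob)
```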


5 Chord Recognition Systems

In this section I describe the structure of three different approaches to classify chords.

1. We first describe the structure of a comparison system: a simplified version of the Harmony Progression Analyzer as proposed by Ni et al. (2012). The features computed can be considered state of the art. We discard, however, additional context information like key, bass and beat tracking, since the neural network approaches developed in this thesis do not take this into account (although it should be noted that in principle the approaches developed in this thesis could be extended to take this additional context information into account as well). The simplified version of the Harmony Progression Analyzer will serve as a reference system for performance comparison.

2. A neural network initialized by stacked denoising autoencoder pretraining with later backpropagation fine-tuning can be applied to an excerpt of the FFT to estimate chord probabilities directly, which then can be smoothed with the help of an HMM, to take temporal information into account. We substitute the emission probabilities with the output of the stacked denoising autoencoders.

3. This approach can be extended by adding filtered versions of the FFT over different time spans to the input. We extend the input to include two additional vectors, median-smoothed over different time spans. Here again additional temporal smoothing is applied in a post-classification process.

In section 5.1 we describe the comparison system and briefly the key ideas incorporated in the computation of state-of-the-art features. Since the two other approaches described in this thesis make use of stacked denoising autoencoders that interpret the FFT directly, we describe beneficial pre-processing steps in section 5.2.1. In section 5.2.2 we describe a stacked denoising autoencoder approach for chord recognition in which the outputs are chord symbol probabilities directly, and in section 5.2.3 we propose an extension of this approach, inspired by a system developed for face recognition and phone recognition by Tang and Mohamed (2012) using a so-called multi-resolution deep belief network, and apply it to chord recognition with the use of stacked denoising autoencoders. Appendix A describes the theoretical foundation of applying a joint optimization of the HMM and neural network for chord recognition.

5.1 Comparison System

In this section we describe the basic comparison system against which the other implemented approaches are evaluated. It reflects the structure of most current approaches and uses state-of-the-art features for chord recognition.

Most recent chord recognition systems rely on an improved computation of the PCP vector and take extra information into account, such as bass notes or key information. This extra information is usually incorporated into a more elaborate higher-level framework, such as multiple HMMs or a Bayesian network.


The comparison system consists of the computation of state-of-the-art PCP vectors for all frames, but only a single HMM for later classification and temporal alignment of chords, which allows for a fairer comparison with the stacked denoising autoencoder approaches. The basic computation steps described in the following are used in the approach described by Ni et al. (2012). They split the computation of features into a bass chromagram and a treble chromagram, and track them with two additional HMMs. The computed frames are aligned according to a beat estimate. To make this more elaborate system comparable, we compute only one chromagram containing both bass and treble, use a single HMM for temporal smoothing, and do not align frames according to a beat estimate.

We first describe, in section 5.1.1, the basic steps of computing the PCP features that have predominantly been used in chord recognition for the last 15 years; thereafter, in section 5.1.2, we describe the extensions of the basic PCP used in the comparison system.

5.1.1 Basic Pitch Class Profile Features

The basic pipeline for computing a pitch class profile as a feature for chord recognition consists of three steps:

1. The signal is projected from the time domain to the frequency domain through a Fourier transform. Often files are downsampled to 11 025 Hz to allow for faster computation; this is also done in the reference system. The range of frequencies is restricted through filtering, to analyse only frequencies below, e.g., 4000 Hz (about the range of the keyboard of a piano, see figure 1) or similar, since other frequencies carry less information about the chord notes played and introduce more noise to the signal. In the reference system a frequency range from approximately 55 Hz to 1661.2 Hz is used, as this interval is proposed in the original system (Ni et al., 2012).

2. The second step consists of a constant-Q transform, which projects the amplitude of the signal in the linear frequency space to a logarithmic representation of the signal amplitude, in which each constant-Q transform bin represents the spectral energy with respect to the frequency of a musical note.

3. In a third step, the bins representing one musical note and its octave multiples are summed, and the resulting vector is sometimes normalized.

In the following, we describe the constant-Q transform and the computation of the PCP in more detail.

Constant-Q transform After converting the signal from the time to the frequency domain through a discrete or fast Fourier transform, we can apply an additional transform to make the frequency bins logarithmically spaced. This transform can be viewed as a set of filters in the time domain, each filtering a frequency band according to a logarithmic scaling of the center frequencies of the constant-Q bins. Originally the transform was proposed as an additional term in the Fourier transform, but Brown and Puckette (1992) showed that it is computationally more efficient to filter the signal in Fourier space, thus applying the set of filters, themselves transformed into Fourier space, to the signal also in Fourier space. This can be realized with a matrix multiplication. This transformation process to logarithmically spaced bins is called the constant-Q transform (Brown, 1991).

The name stems from the factor $Q$, which describes the relationship between the center frequency of each filter and the filter width, $Q = \frac{f_k}{\Delta f_k}$. $Q$ is a so-called quality factor which stays constant, $f_k$ is the center frequency and $\Delta f_k$ the width of the filter. We can choose the filters such that they filter out the energy contained in musically relevant frequencies (i.e., frequencies corresponding to musical notes):

$$f_{k_{cq}} = (2^{\frac{1}{B}})^{k_{cq}} f_{\min}, \qquad (10)$$

where $f_{\min}$ is the frequency of the lowest musical note to be filtered, $f_{k_{cq}}$ the center frequency corresponding to constant-Q bin $k_{cq}$, and $B$ denotes the number of constant-Q frequency bins per octave, usually $B = 12$ (one bin per semitone). Setting $Q = \frac{1}{2^{\frac{1}{B}} - 1}$ establishes a link between musically relevant frequencies and the filter width of our filterbank.
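As a small illustration of equation (10) and the choice of $Q$, the following sketch computes the quality factor and the constant-Q center frequencies for the frequency range used in the reference system; the variable names are mine.

```python
import numpy as np

f_min, f_max = 55.0, 1661.2   # frequency range used in the reference system
B = 12                        # constant-Q bins per octave (one per semitone)

Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)                  # constant quality factor
n_bins = int(np.ceil(B * np.log2(f_max / f_min)))   # bins needed to reach f_max
f_cq = f_min * 2.0 ** (np.arange(n_bins) / B)       # eq. (10): center frequencies

print(round(Q, 1))            # 16.8
print(np.round(f_cq[:4], 2))  # 55, 58.27, 61.74, 65.41 Hz (A1, A#1, B1, C2)
```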

Different types of filters can be used to aggregate the energy in relevant frequencies and to reduce spectral leakage. For the comparison system we make use of a Hamming window, as also described by Brown and Puckette (1992):

$$w(n, f_{k_{cq}}) = 0.54 + 0.46 \cos\left(\frac{2\pi n}{M(f_{k_{cq}})}\right), \qquad (11)$$

where $n = -\frac{M(f_{k_{cq}})}{2}, \ldots, \frac{M(f_{k_{cq}})}{2} - 1$, and $M(f_{k_{cq}})$ is the window size for constant-Q bin $k_{cq}$, computable from $Q$, the corresponding center frequency $f_{k_{cq}}$ and the sampling rate of the input signal $f_s$, with $n$ the current input bin in the time domain (Brown, 1991):

$$M(f_{k_{cq}}) = Q \frac{f_s}{f_{k_{cq}}}. \qquad (12)$$

We can now compute the filters and thus the respective sound power in the signal filtered according to a musically-relevant set of center frequencies.

Instead of applying these filters in the time domain, it is computationally more efficient to do so in the spectral domain, by projecting the window functions to Fourier space first. We can then apply the filters through a matrix multiplication in frequency space. As denoted by Brown and Puckette (1992), for bin $k_{cq}$ of the constant-Q transform we can write:

$$X^{cq}[k_{cq}] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, K[k, k_{cq}], \qquad (13)$$

where $k_{cq}$ describes the constant-Q transform bin, $X[k]$ the signal amplitude at bin $k$ in the Fourier domain, $N$ the number of Fourier bins, and $K[k, k_{cq}]$ the value of the Fourier transform of our filter $w(n, f_{k_{cq}})$ for constant-Q transform bin $k_{cq}$ at Fourier bin $k$.
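The sketch below illustrates equations (11) to (13): it builds a Hamming-windowed filter for every constant-Q bin, transforms the filters to Fourier space once, and applies them to the FFT of a frame by a matrix multiplication. It follows Brown and Puckette in spirit, but the window placement, the normalization and the use of the kernel's complex conjugate are simplifying assumptions of mine.

```python
import numpy as np

def cqt_kernel(f_min, n_bins, B, fs, n_fft):
    """Spectral kernel K[k, k_cq]: the FFT of a Hamming-windowed complex
    exponential for every constant-Q bin (eqs. 11 and 12)."""
    Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)
    K = np.zeros((n_fft, n_bins), dtype=complex)
    for k_cq in range(n_bins):
        f_k = f_min * 2.0 ** (k_cq / B)              # eq. (10)
        M = int(np.ceil(Q * fs / f_k))               # eq. (12): window length
        n = np.arange(M)
        w = 0.54 - 0.46 * np.cos(2 * np.pi * n / M)  # Hamming window, eq. (11)
        kernel = w * np.exp(2j * np.pi * f_k * n / fs) / M
        K[:, k_cq] = np.fft.fft(kernel, n_fft)       # filter in Fourier space
    return K

def cqt_frame(x, K):
    """Eq. (13): constant-Q magnitudes of one frame via a matrix
    multiplication with the conjugate of the spectral kernel."""
    X = np.fft.fft(x, K.shape[0])
    return np.abs(X @ np.conj(K)) / K.shape[0]

# toy usage: a 220 Hz sine at fs = 11 025 Hz peaks two octaves above 55 Hz
fs, n_fft = 11025, 4096
K = cqt_kernel(f_min=55.0, n_bins=60, B=12, fs=fs, n_fft=n_fft)
x = np.sin(2 * np.pi * 220.0 * np.arange(n_fft) / fs)
print(np.argmax(cqt_frame(x, K)))  # 24
```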

Choosing the right minimum frequency and quality factor will result in constant-Q bins corresponding to harmonically relevant frequencies. Having transformed the linearly spaced amplitude per frequency into musically spaced constant-Q transform bins, we can now continue to aggregate notes that are one octave apart, thereby reducing the dimension of the feature vector significantly.


PCP Aggregation Shepard’s (1964) experiments on human perception of music suggest that humans can perceive notes one octave apart as belonging to the same group of notes, known as pitch classes. Given these results we compute pitch class profiles based on the signal energy in logarithmic spectral space. As described by Lee (2006):

$$PCP[k] = \sum_{m=0}^{N_{cq}-1} |X^{cq}(k + mB)|, \qquad (14)$$

where $k = 1, 2, \ldots, B$ is the index of the PCP bin and $N_{cq}$ is the number of octaves in the frequency range of the constant-Q transform. Usually $B = 12$, so that one bin for each musical note in one octave is computed. For pre-processing, e.g., correction of minor tuning differences, $B = 24$ or $B = 36$ are also sometimes used. Hereafter the resulting vector is usually normalized, typically with respect to the $L_1$, $L_2$ or $L_\infty$ norm.
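A short sketch of equation (14) follows, assuming a constant-Q magnitude vector that starts at the lowest analysed note and spans whole octaves; the choice of normalization is left as a parameter.

```python
import numpy as np

def pcp(x_cq, B=12, norm="inf"):
    """Fold constant-Q bins that are whole octaves apart into B pitch
    classes (eq. 14) and normalize the result."""
    n_octaves = len(x_cq) // B
    folded = np.abs(x_cq[: n_octaves * B]).reshape(n_octaves, B).sum(axis=0)
    if norm == "L1":
        return folded / folded.sum()
    if norm == "L2":
        return folded / np.linalg.norm(folded)
    return folded / folded.max()                 # L-infinity normalization

# toy usage: 60 constant-Q bins = 5 octaves at B = 12
chroma = pcp(np.random.rand(60))
```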

5.1.2 Comparison System Simplified Harmony Progression Analyzer

In this section I describe the refinements made to the very basic chromagram computation defined above. The state-of-the-art system proposed by Ni et al. (2012) takes additional context into account: they state that tracking the key and the bass line provides useful additional information for recognizing musical chords.

For a more accurate comparison with the stacked denoising autoencoder approaches, which cannot easily take such context into account, we discard the musical key, bass and beat information that is used by Ni et al. We compute the features with the code that is freely available from their website and adjust it to a fixed step size of 1024 samples at a sampling rate of 11 025 Hz, thus a step size of approximately 0.09 s per frame, instead of a beat-aligned step size.

In addition to a so-called harmonic percussive sound separation algorithm as described by Ono et al. (2008), which attempts to split the signal into a harmonic and a percussive part, Ni et al. implement a loudness-based PCP vector and correct for minor tuning deviations.

5.1.3 Harmonic Percussive Sound Separation

Ono et al. (2008) describe a method to discriminate between the percussive contribution to the Fourier transform and the harmonic one. This can be achieved by exploiting the fact that percussive sounds most often manifest themselves as bursts of energy spanning a wide range of frequencies but lasting only a limited time. Harmonic components, on the other hand, span a limited frequency range but are more stable over time. Ono et al. present a way to estimate the percussive and harmonic parts of the signal contribution in Fourier space as an optimization problem which can be solved iteratively:

$F_{h,i}$ is the short-time Fourier transform of an audio signal $f(t)$ and $W_{h,i} = |F_{h,i}|^2$ is its power spectrogram. We minimize the $L_2$ norm of the power spectrogram gradients, $J(H, P)$, with $H_{h,i}$ the harmonic component and $P_{h,i}$ the percussive component, where $h$ is the frequency bin and $i$ the time in Fourier space:

$$J(H, P) = \frac{1}{2\sigma_H^2} \sum_{h,i} (H_{h,i-1} - H_{h,i})^2 + \frac{1}{2\sigma_P^2} \sum_{h,i} (P_{h-1,i} - P_{h,i})^2, \qquad (15)$$

subject to the constraint that

$$H_{h,i} + P_{h,i} = W_{h,i}, \qquad (16)$$

$$H_{h,i} \geq 0, \qquad (17)$$

and

$$P_{h,i} \geq 0, \qquad (18)$$

where $W_{h,i}$ is the original power spectrogram, as described above, and $\sigma_H$ and $\sigma_P$ are parameters to control the smoothness vertically and horizontally. Details for an iterative optimization procedure can be found in the original paper.
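The following is a minimal sketch of one possible iterative scheme for this objective: a gradient step on $J(H, P)$ alternated with a projection back onto the constraints (16) to (18). It is not the exact update rule of Ono et al.; the step size, smoothness parameters and variable names are my own assumptions.

```python
import numpy as np

def hpss(W, sigma_h=0.3, sigma_p=0.3, n_iter=100, step=0.05):
    """Split a power spectrogram W (frequency x time) into a harmonic part H
    and a percussive part P by gradient descent on J(H, P) (eq. 15), projected
    after every step onto the constraints H + P = W, H >= 0, P >= 0."""
    H = 0.5 * W
    for _ in range(n_iter):
        P = W - H
        # gradients of the horizontal (harmonic) and vertical (percussive)
        # smoothness terms: discrete second differences over time and frequency
        dH = np.zeros_like(H)
        dH[:, 1:-1] = (2 * H[:, 1:-1] - H[:, :-2] - H[:, 2:]) / sigma_h**2
        dP = np.zeros_like(P)
        dP[1:-1, :] = (2 * P[1:-1, :] - P[:-2, :] - P[2:, :]) / sigma_p**2
        # step in the direction that lowers J while keeping H + P = W,
        # then clip H back into the box [0, W]
        H = np.clip(H + step * (dP - dH), 0.0, W)
    return H, W - H

# toy usage
W = np.abs(np.random.randn(128, 200)) ** 2
H, P = hpss(W)
```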

5.1.4 Tuning and Loudness-Based PCPs

Here we describe further refinements of the PCP vector: first how to take minor deviations (less than a semitone) from the reference tuning into account, and then an addition proposed by Ni et al. (2012) to model human loudness perception.

Tuning To take into account minor pitch shifts of the tuning of a specific song, features are fine-tuned as described by Harte and Sandler (2005). Instead of computing a 12-bin chromagram directly, we can compute multiple bins for each semitone, as described in section 5.1.1 by setting B > 12 (e.g., B = 36). We can then compute a histogram of sound power peaks with respect to frequency and select a subset of constant-Q bins to compute the PCP vectors, shifting our reference tuning according to the small deviations of that song.
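A possible sketch of this tuning step, assuming a chromagram `pcp36` of shape (frames, 36), i.e. three bins per semitone: pick the sub-semitone offset that collects the most energy and keep only those bins. The peak histogram of Harte and Sandler is simplified here to a plain energy sum.

```python
import numpy as np

def tuned_chroma(pcp36):
    """pcp36: array of shape (n_frames, 36), three bins per semitone."""
    # energy collected by each of the three possible sub-semitone offsets
    energy = [pcp36[:, offset::3].sum() for offset in range(3)]
    offset = int(np.argmax(energy))      # offset that best matches the tuning
    return pcp36[:, offset::3]           # (n_frames, 12) tuned chromagram

# toy usage
chroma12 = tuned_chroma(np.random.rand(400, 36))
```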

Loudness-Based PCPs Since human loudness perception of sound with respect to frequency is not linear, Ni et al. (2012) propose a loudness weighting function.

First we can compute a “sound power level matrix”:

$$L_{s,t} = 10 \log_{10} \frac{\|X_{s,t}\|^2}{p_{ref}}, \qquad s = 1, \ldots, S, \; t = 1, \ldots, T, \qquad (19)$$

where $p_{ref}$ indicates the fundamental reference power, and $X_{s,t}$ the constant-Q transform of our input signal as described in the previous section ($s$ denoting the constant-Q transform bin and $t$ the time). They propose to use A-weighting (Talbot-Smith, 2001), in which we add a specific value depending on the frequency. An approximation to human sensitivity of loudness perception with respect to frequency is then given by:

$$L'_{s,t} = L_{s,t} + A(f_s), \qquad s = 1, \ldots, S, \; t = 1, \ldots, T, \qquad (20)$$

where

$$A(f_s) = 20 \log_{10}\left(R_A(f_s)\right) + 2.00 \qquad (21)$$

and

$$R_A(f_s) = \frac{12200^2 f_s^4}{(f_s^2 + 20.6^2)\sqrt{(f_s^2 + 107.7^2)(f_s^2 + 737.9^2)}\,(f_s^2 + 12200^2)}. \qquad (22)$$

Having calculated this, we can proceed to compute the pitch class profiles as described above, using $L'_{s,t}$.

Ni et al. normalize the loudness-based PCP vector after aggregation according to:

$$X_{p,t} = \frac{X'_{p,t} - \min_{p'} X'_{p',t}}{\max_{p'} X'_{p',t} - \min_{p'} X'_{p',t}}, \qquad (23)$$

where $X'_{p,t}$ denotes the value for PCP bin $p$ at time $t$. Ni et al. state that, due to this normalization, specifying the reference sound power level $p_{ref}$ is not necessary.
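A sketch of equations (19) to (23), assuming `X_cq` holds constant-Q magnitudes (bins by frames) with center frequencies `f_cq`; the reference power is set to 1, which the normalization in equation (23) renders irrelevant anyway.

```python
import numpy as np

def a_weight(f):
    """A-weighting approximation in dB (eqs. 21 and 22)."""
    ra = (12200.0**2 * f**4) / (
        (f**2 + 20.6**2)
        * np.sqrt((f**2 + 107.7**2) * (f**2 + 737.9**2))
        * (f**2 + 12200.0**2)
    )
    return 20.0 * np.log10(ra) + 2.00

def loudness_pcp(X_cq, f_cq, B=12):
    """X_cq: constant-Q magnitudes (bins x frames); f_cq: bin center frequencies."""
    L = 10.0 * np.log10(np.abs(X_cq) ** 2 + 1e-12)   # eq. (19) with p_ref = 1
    L = L + a_weight(f_cq)[:, None]                  # eq. (20)
    n_octaves = L.shape[0] // B
    chroma = L[: n_octaves * B].reshape(n_octaves, B, -1).sum(axis=0)  # eq. (14)
    lo, hi = chroma.min(axis=0), chroma.max(axis=0)
    return (chroma - lo) / (hi - lo + 1e-12)         # eq. (23), per frame

# toy usage: 60 constant-Q bins, 100 frames
f_cq = 55.0 * 2.0 ** (np.arange(60) / 12.0)
chroma = loudness_pcp(np.random.rand(60, 100), f_cq)
```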

5.1.5 HMMs

In this section we give a brief overview of the hidden Markov model (HMM), as far as it is important for this thesis. It is a widely used model for speech as well as chord recognition.

A musical song is highly structured in time – certain chord sequences and transitions are more common than others – but PCP features do not take any time dependencies into account by themselves. A temporal alignment can increase the performance of a chord recognition system. Additionally, since we compute the PCP features from the amplitude of the signal alone, which is noisy with regard to chord information due to percussion, transient noise and other sources, the resulting feature vector is not clean. HMMs, in turn, are designed to deal with noisy data, which adds another argument for using HMMs for temporal smoothing.

Definition There exist several variants of HMMs. For our comparison system we restrict ourselves to an HMM with a single Gaussian emission distribution for each state. For the stacked denoising autoencoders we use the output of the autoencoders directly as a chord estimate and as emission probability. An HMM with a Gaussian emission probability is a so-called continuous-densities HMM. It is capable of interpreting multidimensional real-valued input such as the PCP vectors we use as features, described above in section 5.1.1.

An HMM estimates the probability of a sequence of latent states corresponding to a sequence of lower-level observations. As described by Rabiner (1989), an HMM can be defined as a 5-tuple consisting of:

1. $N$, the number of states in the model.

2. $M$, the number of distinct observations, which in the case of a continuous-densities HMM is infinite.

3. $A = \{a_{ij}\}$, the state transition probability distribution, where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, $1 \leq i, j \leq N$, and $q_t$ denotes the current state at time $t$. If the HMM is ergodic (i.e., all transitions to every state from every state are possible), then $a_{ij} > 0$ for all $i$ and $j$. Transition probabilities satisfy the stochastic constraints

$$\sum_{j=1}^{N} a_{ij} = 1.$$
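To illustrate how such an HMM performs the temporal smoothing in this thesis, the sketch below runs a Viterbi decoder in log space, treating per-frame chord probabilities (for instance the output of the stacked denoising autoencoder) as emission scores. The uniform initial distribution and the self-biased transition matrix are placeholder assumptions; in practice they are estimated from training data.

```python
import numpy as np

def viterbi(emission_probs, transition, initial):
    """emission_probs: (T, N) per-frame chord probabilities,
    transition: (N, N) with rows summing to 1, initial: (N,).
    Returns the most likely chord index for every frame."""
    T, N = emission_probs.shape
    log_e = np.log(emission_probs + 1e-12)
    log_a = np.log(transition + 1e-12)
    delta = np.log(initial + 1e-12) + log_e[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_a          # best predecessor per state
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_e[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):               # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path

# toy usage: 24 chord classes (major and minor), self-biased transitions
N, T = 24, 500
A = np.full((N, N), 0.01 / (N - 1)) + np.eye(N) * (0.99 - 0.01 / (N - 1))
frame_probs = np.random.dirichlet(np.ones(N), size=T)
smoothed = viterbi(frame_probs, A, np.full(N, 1.0 / N))
```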
