Predicting song popularity using machine learning techniques

Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)
(c) Introduction
(d) Theoretical background
(e) Model
(f) Data
(g) Empirical Analysis
(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the headings of the sections. You have a free choice in how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and the year of publication for the first reference, and use the first name with et al. and the year of publication for subsequent references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number
(d) Date of submission of the final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Predicting song popularity using Machine Learning techniques

V.E.G.M. Van den Akker

Student number: 10538755

Date of version: January 12, 2018
Master's programme: Econometrics
Specialisation: Big Data Business Analytics
Supervisor: Dhr. dr. N. P. A. van Giersbergen
Second reader: Dhr. dr. J. C. M. van Ophem

Automatic music classification for search engines and for organizing music has become a popular research subject in recent years, due to the rapid growth in online music availability. This research tries to classify songs into popularity groups using the songs' acoustic features in combination with their lyrics. Before the classification task can be executed, the lyrics are processed with the Natural Language Processing techniques Word2Vec and Latent Dirichlet Allocation. For the classification task a range of Logistic Regression models and Support Vector Machine models, with and without lyrics, is used. The performance of the classifications is visualized by the Receiver Operating Characteristic curve and measured by the Area Under the Curve (AUC), after which the DeLong test is used to test whether the AUCs of the estimated models differ significantly from each other. The AUC of the best performing Support Vector Machine model was found to be 0.5409375; this model includes the ten most frequent words as input variables for the lyrics and is based on the most and least popular quantile of the ordered data set. The best performing Logistic Regression model is the model without the lyrics, based on the most and least popular quantile; its AUC is 0.6146032, which makes it the best performing model of this study. Further research can improve on this classification mainly by using a bigger data set.


Statement of Originality

This document is written by student Veerle Van den Akker, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Related Work
3 Methodology
  3.1 Natural Language Processing
    3.1.1 Word2Vec
    3.1.2 Latent Dirichlet Allocation
  3.2 Classification
    3.2.1 Cross Validation
    3.2.2 Support Vector Machine
    3.2.3 Logistic Regression
4 Data
  4.1 Top 40 Netherlands
  4.2 Acoustic song analysis
    4.2.1 MIRToolbox
    4.2.2 Mixed in Key
  4.3 Lyrics
  4.4 Exploratory Data Analysis
    4.4.1 Correlation
    4.4.2 Numeric Variable Exploration
5 Results
  5.1 Latent Dirichlet Allocation
  5.2 Classification with and without Word2Vec
    5.2.1 Logistic Regression
    5.2.2 Support Vector Machine
    5.2.3 Logistic Regression versus Support Vector Machine
  5.3 Classification with frequent words
    5.3.1 Logistic Regression
    5.3.2 Support Vector Machine
    5.3.3 Logistic Regression versus Support Vector Machine
  5.4 Classification 2 quantiles
    5.4.1 Logistic Regression
    5.4.2 Support Vector Machine
    5.4.3 Logistic Regression versus Support Vector Machine
6 Conclusion
A LDA Topics Proportions
B DeLong test results


Chapter 1

Introduction

Due to the massive increase in computer usage and internet access over the past decade, there has been an explosion in the volume of music available online. The amount of music on the internet is growing rapidly, and there are many services through which music can be streamed or downloaded. Due to this increase in availability, it is becoming impossible for people to oversee all this information. In order to keep it comprehensible, it is necessary to organize digital music collections and to build music recommendation systems. Therefore automatic music classification for search engines and organizing music has an important place in Music Information Retrieval (MIR). Music Information Retrieval is the interdisciplinary science of retrieving information from music. Those involved in MIR have backgrounds in library science, information science, musicology, music theory, audio engineering, computer science, law and business (Downie, 2003).

In the last decade, a lot of research has been done in Music Information Retrieval using different techniques and theories. One of the major fields in Music Information Retrieval is the classification of music into genres. Mayer, Neumayer and Rauber (2008) present findings from investigating advanced lyrics features in combination with acoustic feature sets in order to classify songs into genres. They state that not only the most obvious representation of a song has descriptive power but also the lyrics, which describe a song in terms of content words that are orthogonal to its sound and which, due to their rhyme structure, differ greatly from other texts. Using the lyrics in combination with audio features of a song and statistical features such as words per minute, Mayer et al. (2008) investigate to what extent these sets of features can be exploited for genre classification.

Another major field in Music Information Retrieval is the automatic indexing of music. Logan, Kositsky and Moreno (2014) explored the use of song lyrics for automatic indexing of music. They applied a standard text processing technique, Probabilistic Latent Semantic Analysis, to characterize the semantic content of the lyrics. This technique measures the similarity between text documents by converting each document to a characteristic vector. Every component of the vector represents the likelihood that the document is about a pre-learnt topic. The topics are distinguished by the frequency of the words. In addition to song similarity based on lyrics, they determined artist similarity and again used the lyrics for this classification task.

Besides genre classification and music similarity it is also interesting to classify the artists of songs. Knees, Pampalk and Widmer (2008) use features of artists that can be found on the web. They obtained information about the artists by making use of internet search engines, using constraints to make sure that they retrieved only the top-ranked information pages. The obtained information about the artists is used to classify them. Secondly, they investigate the impact of fluctuations in the retrieved content on the results over time. In addition, they also try to classify artists into a couple of genres. They primarily use the Support Vector Machines method to classify the artists. Besides Support Vector Machines they apply the k-nearest neighbours method for classification, specifically to evaluate the performance of the extracted features in similarity based applications.

The above mentioned method, the Support Vector Machine (SVM) method, is a common method for classification tasks. This method is also used by Knees et al. (2008). Another study in which this model is used is that of Çoban (2017). This study uses Support Vector Machine algorithms and investigates the impact of feature selection and different feature groups on Music Genre Classification. Prior to applying SVM, Çoban extracted textual features from Turkish lyrics using different feature extraction models; one of the models used for the text analysis is the Word2Vec model.

All of the above mentioned studies focus on classifying or clustering songs using information about the corresponding artist, the lyrics, acoustic features or the corresponding genre. Other research is available in which a combination of different features is used for automatic genre classification or similarity measures. Berenzweig, Logan, Ellis and Whitman (2004) stated that it is impossible to compare results from different authors, because the different studies do not share the required common ground: they all use different databases and different evaluation methods. None of the research done in the Music Information Retrieval industry uses data from the Top 40 of the Netherlands and the corresponding acoustic as well as lyrical characteristics to classify song popularity in the Netherlands. The main goal of this research is to investigate how well songs can be classified into popularity groups using several song characteristics, such as the lyrics, but also acoustic features. Examples of the acoustic features used in this research are the beats per minute, the zero cross rating and the musical key of a song. In this research, a popular song is defined as a song that has received a high ranking in the weekly Top 40 chart of the Netherlands. The data is obtained from the weekly ranking lists from 2014 up to and including 2016.

The remainder of this thesis is organized as follows. In Chapter 2 previous research related to the research question of this study is discussed. The related work is followed by an introduction to the techniques and methods which are used in this research to classify the popularity of a song. Next, in Chapter 4, the data that is used in this research is described, together with an explanation of how that data is obtained. Thereafter the results from the several classification models are discussed and compared to each other. Ultimately, Chapter 6 formulates a conclusion and discusses which extensions could be made to this thesis in possible further research.


Chapter 2

Related Work

As already mentioned in the introduction, automatic music classification for search engines and organizing music has an important place in Music Information Retrieval. Its importance and relevance are increasing with the rapid growth of the internet and of digital music collections. In this chapter previous work in the field of music classification is discussed. In particular, the methods applied in previous research are discussed, how previous studies extracted several song characteristics, and how those song characteristics are used for the different classification tasks.

There are several studies that investigated which features of music are useful for music classification. An example is the Mel Frequency Cepstral Coefficients (MFCCs), features which are commonly used in research from the MIR community. MFCCs are a short-time spectral decomposition of an audio signal. The MFCCs were initially features for speech recognition but are now standard features for music classification and music similarity studies. The usage of the MFCCs for music classification and similarity is promoted by Logan (2000): she stated that the logarithmic Mel scale performs no worse than a linear scale on a music discrimination task. However, the results of her study do not indicate whether MFCCs are well suited for music similarity or classification tasks. The idea of MFCCs is to capture the short time spectrum in accordance with human perception. Therefore, a lot of studies emphasize the use of MFCCs in music classification. An example of a study which uses the MFCCs for music classification is that of Xu, Maddage and Shao (2005), who state that the Mel-frequency cepstrum has proven to be highly effective in recognizing the structure of music signals. They use the MFCCs and other song features to characterize music content. Another study which emphasizes the use of MFCCs in music classification is the research of Li, T., Ogihara and Li, Q. (2003). They stated that MFCCs in combination with timbral features are suitable for genre classification by various machine learning classification algorithms. The timbral features are used to differentiate mixtures of sounds that possibly have similar rhythmic and pitch contents. The timbral features that Li et al. (2003) used in their research are: MFCCs, Spectral Centroid, Spectral Roll-off, Spectral Flux, Zero Crossing and Low Energy. These features are explained in more detail in Section 4.2.


Another study which evaluated which audio features to use for a classification task is that of McKinney and Breebaart (2003). They evaluated four audio feature sets on their ability to classify general audio classes and popular music genres. They used the following feature sets in their research: (1) low-level signal properties; (2) Mel-Frequency Cepstral Coefficients (MFCCs); (3) psychoacoustic features including roughness, loudness and sharpness; and (4) an auditory model representation of temporal envelope fluctuations. The low-level signal properties and MFCCs are two feature sets which are commonly used in the MIR industry; the other two feature sets are new and are based on models of human auditory processing. The low-level signal feature set consists of the Root-Mean-Squared level of a song, the spectral centroid, the bandwidth, the zero crossing rate, the spectral roll-off frequency, the band energy ratio, the delta spectrum magnitude, pitch and pitch strength. Most of these low-level signal features are also used in the research of Li et al. (2003). In the study of McKinney and Breebaart (2003) the psychoacoustic feature set is based on estimates of the perceived roughness, loudness and sharpness. The fourth feature set in their research is based on a model representation of temporal envelope processing by the human auditory system. The low-level signal properties which are used by Li et al. (2003) are almost all also used in the research of Boletsis, Gratsani, Chasanidou, Karydis and Kermanidis (2011). Boletsis et al. (2011) performed a large scale similarity measurement on musical data using mainstream content and context methodologies, and in addition they tested the accuracy of the examined methodologies against objective metadata and real-life user listening data. If the studies above are compared, it is clear that many studies use their own audio features, but they also have a lot of audio features in common, namely the features that describe the timbre of a song. The timbre of a song is very important for human hearing to distinguish two sounds; timbral features make it possible for a human to differentiate sounds that possibly have a similar rhythm and pitch.

Another study which uses the MFCCs is that of Berenzweig et al. (2004). They explored music similarity by measuring it in several ways. For their acoustic based similarity measures they mainly used techniques which are solely based on the audio content. These are the opposite of subjective measures, which involve human judgments. They used probabilistic feature modeling and comparison for their similarity measure. Before they could start their similarity measurements they needed to transform the raw audio into a feature space, which is a numerical representation in which dimensions measure different properties of the input. Many features have been proposed, but Berenzweig et al. (2004) concentrate on features derived from the MFCCs. They state that the Mel-Cepstrum captures the overall spectral shape, which contains important information about the timbre, the quality of the singer's voice and the production effects.

The studies discussed until now only use audio content features. As already mentioned in the previous chapter, Mayer et al. (2008) used in their research not only audio features but also lyrics features for their classification. Audio features are mainly used since the sound is the most obvious representation of a song, but Mayer et al. (2008) state that not only the obvious representation of an audio file (its sound) has potential for typical music information retrieval tasks, but also its lyrics, which describe a song in terms of content words. They assume that a song's text content can help in better understanding the meaning of a song; besides this, the lyrics also exhibit a certain structure, since they are organized in choruses and verses. Their conclusion is that the combination of audio features and lyrics features is beneficial in two ways: it gives a possible reduction in dimensionality and it significantly improves the accuracy of the classification.

Not only Mayer et al. (2008) emphasize the use of lyrics in music classification: in the study of Çoban (2017) it is investigated whether the use of audio features or lyrics features is more effective for automatic music genre classification of Turkish songs. Çoban tries to answer the question whether it is possible to improve MIR performance by using a combination of both lyrics and audio features in genre classification. As audio features Çoban used the most commonly used timbral features, such as the Root Mean Squared, the zero crossing rate, the compactness of a song, the beat histograms and the MFCCs. For the textual features he used several models, amongst which the traditional Bag of Words (BoW), Word2Vec, Structural and Statistical Text Features (SSTF) and the NGram representation. Çoban emphasizes the use of lyrics features since lyrics are textual content with their own structure, which often contains rhymes. These statistical and structural features are easier to obtain from lyrics than from melodic content according to Çoban. Before Çoban could use the lyrics he applied several text preprocessing techniques, namely: conversion to lowercase, removing punctuation, stemming of the document, removing stop words, text normalization, ASCII conversion (since the lyrics are in Turkish) and term weighting. Not all preprocessing steps were applied for all models, since this depends on which representation model was used for the lyrics. For the classification Çoban uses the Support Vector Machine model, which is the most commonly used method in MIR research and appears to be the most successful method in MIR as a classifier (Çoban, 2017). According to the experimental results, Çoban (2017) obtained the best performance by using both the lyrics features and the audio features. Therefore he suggests using both types of features in classification tasks in MIR research. Regarding the different feature models used for the lyrics, the BoW and NGram features are the most successful as lyrics features; however, these models have the disadvantage that they result in a high dimensional feature space. The Word2Vec model does not perform badly, but it has few features and therefore it is challenging to compare its results with those of the other models. All together, Word2Vec shows promising performance.

As discussed earlier, several studies are focused on classification tasks in the music industry: grouping songs by genre or artist is very common, although little research has been done on classifying and grouping songs by popularity. One of the few studies on this subject is that of Dhanaraj and Logan (2005). They tried to detect whether there are certain quantifiable factors in songs which make it more likely that a song is going to reach the musical top charts. Although societal, cultural and other qualitative factors play a role in the popularity of a song, Dhanaraj and Logan (2005) assume that the psychology that makes a song popular is not entirely unpredictable. They assume that the popularity of a song is partly based on the quality of the music as perceived by a large group of people. Therefore they investigated whether certain features make a song more likely to be popular. They used acoustic features and lyrics-based features in their research. They sought an unknown intrinsic universal quality in the acoustic content of a song, and therefore extracted features which describe the main sounds present in a song. The features that describe the main sound are, according to Dhanaraj and Logan (2005), the MFCCs, with a focus on the timbral aspects of those features. For the lyrics they used a method which describes the semantic content of each song, namely Probabilistic Latent Semantic Analysis (PLSA). For the classification task they used the Support Vector Machine algorithm and the Boosting Classifier. They concluded that there is some distinguishable thread connecting hit songs, but they state that more research is needed.

In this chapter a range of studies regarding music classification has been presented and discussed. The majority of those studies focus on genre classification and music similarity; only little research has been done concerning the popularity of a song. Therefore this study tries to investigate how well songs can be classified into popularity groups. A couple of the studies discussed in this chapter state that using lyrical characteristics in combination with acoustic features works best for music classification, thus this study uses a combination of these features as well. The most common method in these studies is the Support Vector Machine model; this model is also used in this study and is discussed in more detail in the next chapter.


Chapter 3

Methodology

This chapter explains the methods and models which are used in this research to analyze the data and to perform the classification task. First Section 3.1 introduces the techniques and methods which are used for the natural language processing. The methods discussed in this section are the Word2Vec model and the Latent Dirichlet Allocation. Next Section 3.2 discusses the methods and models which are used for the classification task. In this section, first k-fold cross validation is briefly discussed since this method is used in combination with the classification models. Next the Support Vector Machine model is presented and explained and lastly the Logistic Regression model is introduced.

3.1 Natural Language Processing

This section covers the two techniques which are used to process the lyrics of the songs. The first method that is introduced is the Word2Vec model, whose output is used as input for the classification models. Secondly, the Latent Dirichlet Allocation model is discussed, which is used to detect whether there are certain topics in the lyrics.

3.1.1 Word2Vec

The Word2Vec method is widely used for Natural Language Processing (NLP) tasks. It was created by a team of researchers from Google and is popular since it can capture semantic relations among words in text data. There are two variations of the Word2Vec method, namely the Continuous Bag of Words (CBOW) model and the Skip-Gram model. The differences between these two models are their input, their output, their parameters and their goal. In fact, the two models are each other's opposite: the CBOW model can be used to predict a word given a context, and the Skip-Gram model can be used to predict the context of a given word. The context is defined as the words that are likely to appear around the target word.

The two models will be briefly discussed to get a better understanding of how the Word2Vec model processes text data. The CBOW model uses the surrounding words of the target word as input. This input is processed by the hidden layer to predict the target word, and eventually the output is a word. The Skip-Gram model is, as mentioned above, the opposite of the CBOW model: it uses the target word as the input layer and delivers the surrounding words of the input word as output. In Figure 3.1 the two models are presented, and this figure shows clearly that the two models are each other's opposite. It may be clear that if the output of the CBOW model is used as input for the Skip-Gram model, the new output will be the same as the original input of the CBOW model. The restriction is that the models need to be well trained, and trained on the same data set; otherwise the output of one model is not equal to the input of the other (Rong, 2014).

Figure 3.1: Word2Vec models. (a) Continuous Bag of Words model; (b) Skip-Gram model.

As mentioned, it is important that the models are well trained. The training of the two models differs somewhat; therefore the training of the CBOW model is discussed first, followed by the training of the Skip-Gram model. The training of the CBOW model is as follows: the model parameters are updated stepwise by going through context-target word pairs which are generated from a training corpus, so that the effects on the vectors accumulate. After many iterations the relative positions of the input and output vectors stabilize, completing the training. The training of the Skip-Gram model is somewhat different: there, the prediction error is summed across all context words in the output layer. Therefore it is necessary to update every element of every layer, which results in an output matrix for each training step. Since every element of every layer is updated in the Skip-Gram model, the training of the CBOW model is faster. Another advantage of the CBOW model is that it performs better when the data is limited. The advantage of the Skip-Gram model, on the other hand, is that it does a better job for infrequent words in the document.

The main idea of both Word2Vec models is to map each word into an n-dimensional vector and to detect the semantic similarity of words by calculating the distance towards the surrounding words, namely the euclidean distance. The output of the Word2Vec model is that every word is represented as a 100-dimensional vector which captures the semantic relations among the words. Since many words share a similar meaning, the euclidean distance between those words will be very small. The Word2Vec feature space is used as input for the classification models, since this representation of the lyrics still captures the semantic relationships between the words in the lyric, instead of using the lyrics as a whole as input for the classification models. This study uses the CBOW model, for three reasons: the study of Çoban (2017) shows that the CBOW model performed better than the Skip-Gram model; the CBOW model performs better when the data is limited, and the data is limited in this study since lyrics are very repetitive and the number of unique words in a lyric is therefore small; and the CBOW model is computationally less intensive.
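To make this concrete, below is a minimal sketch of training a CBOW Word2Vec model on tokenized lyrics, assuming the gensim library; the thesis does not specify an implementation, and the toy lyrics are illustrative only.

```python
# A minimal sketch of training a CBOW Word2Vec model on song lyrics.
# gensim is an assumption for illustration; the toy lyrics are not real data.
from gensim.models import Word2Vec

# Each lyric is tokenized into a list of lowercase words.
lyrics = [
    ["love", "me", "like", "you", "do"],
    ["baby", "love", "you", "like", "crazy"],
]

model = Word2Vec(
    sentences=lyrics,
    vector_size=100,  # 100-dimensional word vectors, as in this study
    sg=0,             # sg=0 selects the CBOW architecture
    window=5,         # context words considered around the target word
    min_count=1,      # keep all words; lyrics vocabularies are small
)

vector = model.wv["love"]             # the 100-dimensional vector for "love"
print(model.wv.most_similar("love"))  # semantically closest words
```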

3.1.2 Latent Dirichlet Allocation

This section introduces Latent Dirichlet Allocation (LDA), another commonly used method in NLP. It is a generative probabilistic model which can be used for discrete data. LDA is a three-level hierarchical Bayesian model which tries to model a collection of items as a set of topics (Blei, Ng & Jordan, 2003). In this research LDA is used for text data, therefore the explanation of this model focuses on how the model handles documents, corpora and words. The method allows sets of observations to be explained by unobserved groups, which explain why some parts of the data are similar to each other. These unobserved groups are referred to as topics. A word is a discrete unit from a vocabulary indexed by $\{1, \dots, V\}$, with $V$ the number of words in the vocabulary. Each word is represented as a unit vector $w$ of length $V$, whose $v$th element equals 1 while all other elements equal 0. A document is defined as a sequence of $N$ words, with notation $d = \{w_1, \dots, w_N\}$.

In this research a document refers to the lyric of a song. The corpus is a collection of $M$ documents, which means that in this research the corpus consists of $M$ lyrics. The notation of the corpus is $D = \{d_1, \dots, d_M\}$.

In the LDA model, documents are represented as a random mixture of latent topics. The first step in the generative process is that for each document a topic distribution is generated from a Dirichlet distribution with parameter $\alpha$, so $\theta \sim \mathrm{Dir}(\alpha)$. Then for each topic a word distribution is generated from a Dirichlet distribution with parameter $\beta$. The third step is that for each word of a document a topic is chosen according to a multinomial distribution with the parameter generated by the Dirichlet topic distribution, $z_n \sim \mathrm{Multinomial}(\theta)$. The last step is that a word is chosen from a multinomial probability distribution conditioned on the topic: choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, the multinomial probability conditioned on the latent topic $z_n$. Figure 3.2 shows the graphical representation of the model.


Figure 3.2: Graphical representation of the LDA Model

From this figure it may be clear that an LDA topic node $z_n$ is sampled for each word $w_n$. There are three levels in the LDA representation. The parameters $\alpha$ and $\beta$ are corpus-level parameters, so they are sampled once in the process. The variables $\theta_d$ are document-level variables, sampled once for every document, and the variables $z_{dn}$ and $w_{dn}$ are word-level variables, sampled once for every word in each document (Blei et al., 2003). This means that each document can have multiple topics.
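As an illustration of this generative process, the sketch below samples a small synthetic corpus; the vocabulary size, number of topics and hyperparameter values are arbitrary choices, not values from this study.

```python
# A sketch of the LDA generative process described above, using numpy.
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N = 50, 3, 10, 20   # vocabulary size, topics, documents, words/doc
alpha, beta = 0.1, 0.01      # Dirichlet hyperparameters (illustrative values)

# One word distribution per topic, each drawn from Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)

corpus = []
for _ in range(M):
    theta = rng.dirichlet(np.full(K, alpha))        # theta ~ Dir(alpha), per document
    z = rng.choice(K, size=N, p=theta)              # topic z_n for each word
    words = [rng.choice(V, p=phi[zn]) for zn in z]  # word w_n from p(w | z_n, beta)
    corpus.append(words)
```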

The main output of the Latent Dirichlet Allocation is the set of K topics. The interpretation of the topics can be hard. Therefore it is convenient to order the topics from most positive to most negative, since the LDA output will be used for the classification of popular and unpopular songs. To order the topics, the weighted average rating of each topic is calculated according to the scores of the songs. The scores of the songs are discussed later.

For LDA various inference techniques have been proposed; for instance, Gibbs sampling and variational inference methods can be used. Gibbs sampling is a special case of Markov Chain Monte Carlo (MCMC). As mentioned by Asuncion, Welling, Smyth and Teh (2009), Gibbs sampling is comparable to the other inference techniques in terms of performance. They found that the performance of the different inference techniques is highly comparable, since the differences in performance largely disappear when appropriate hyperparameters are used. Therefore this study uses Gibbs sampling, as this method is less difficult to implement and its results are comparable to those of the other inference techniques.

The Latent Dirichlet Allocation model uses a document-term matrix as input, yet since not all words in a text are equally important it is possible to apply Term Frequency-Inverse Document Frequency (TF-IDF) weighting before implementing LDA. The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. This helps to adjust for the fact that some words appear more frequently in general. The TF-IDF metric quantifies the relative importance of a word to a document and is calculated for every word as follows:

$$TF\text{-}IDF(t, d, D) = TF(t, d) \cdot IDF(t, D) = TF(t, d) \cdot \log\left(\frac{|D|}{1 + |\{d \in D : t \in d\}|}\right), \qquad (3.1)$$

where $t$ denotes the term, $d$ the document and $D$ the collection of documents, $D = \{d_1, d_2, \dots, d_n\}$, with $n$ the number of documents. Thus $|D|$ is the size of the document space. The denominator of $IDF(t, D)$ contains $|\{d \in D : t \in d\}|$, the number of documents in which term $t$ appears. The term frequency part of the TF-IDF formula is very simple: $TF(t, d)$ counts the number of times each word appears in each document. In this study, $d$ refers to a document which is the lyric of a song.
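A direct implementation of Equation 3.1 could look as follows; this is a sketch for illustration (the thesis does not show code), with toy lyrics as input.

```python
# A direct implementation of the TF-IDF weighting of Equation 3.1.
import math
from collections import Counter

def tf_idf(term, doc, docs):
    tf = Counter(doc)[term]                      # TF(t, d): raw count of t in d
    n_containing = sum(term in d for d in docs)  # |{d in D : t in d}|
    idf = math.log(len(docs) / (1 + n_containing))
    return tf * idf

lyrics = [["love", "you", "love"], ["dance", "all", "night"], ["love", "dance"]]
print(tf_idf("night", lyrics[1], lyrics))  # "night" occurs in only one lyric
```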

3.2 Classification

In this section the models used for the classification task of this research are presented. Before the classification models are discussed, k-fold cross validation is introduced, since this method is used in combination with the classification techniques. Next the Support Vector Machine model is presented, and lastly the Logistic Regression model is discussed.

3.2.1 Cross Validation

Since the available data is limited, it is preferable to use k-fold cross validation to set the training and test sets for the model. The main reason for using k-fold cross validation is that it has a lower variance than a single hold-out set estimator. If the training set is composed of 80% of the total data set and the remaining 20% is used as the test set, the performance estimate of the model will vary a lot over different data samples. K-fold cross validation, on the other hand, uses the proportion (k-1)/k of the data set to train the model and the remaining proportion to test it (Bishop, 2006, pp. 32-34). Therefore k-fold cross validation reduces the variance, since it averages over k different partitions, ultimately resulting in a less sensitive performance estimate. The accuracy of the model under k-fold cross validation is the overall number of correct classifications divided by the number of instances in the data set (Kohavi, 1995). K-fold cross validation is not always the best method for all classification models. For instance, if the order of the objects in the data matters, it is not useful to use cross validation. Another disadvantage is that the number of training iterations that needs to be performed increases by a factor of k, which can be problematic if the model is already computationally hard.
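A minimal sketch of this procedure, assuming scikit-learn (an illustrative library choice; the thesis does not name one):

```python
# k-fold cross validation: each fold serves once as the test set.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # toy feature matrix
y = np.array([0, 1] * 5)           # toy binary labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # (k-1)/k of the data trains the model, the remaining fold tests it;
    # averaging the k test scores gives a less variable performance estimate.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```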

3.2.2 Support Vector Machine

So far, the techniques applied in preparation for the classification have been discussed. For the classification task this research uses the Support Vector Machine (SVM) model. The Support Vector Machine is a machine learning technique which became popular some years ago for solving problems in classification, regression and novelty detection. SVM has the important property that the determination of the model parameters corresponds to a convex optimization problem, which means that any local solution is also a global optimum (Bishop, 2006, pp. 325-356).


SVM aims to create a decision boundary between two or more classes. This research focuses on two-class classification. The Support Vector Machine model works as follows: suppose there are several data points which need to be classified into two classes. The goal of SVM is to find a hyperplane which separates the data points into exactly two classes. As there may exist many hyperplanes which do so, SVM chooses the hyperplane with the largest margin between the data points. The margin is defined as the distance between the hyperplane and the closest data points. The location of the hyperplane, which is the boundary, is determined by a subset of the data points; these data points are known as support vectors. Summarized, SVM aims to maximize the margin while minimizing the error. The two-class classification problem using linear models is of the form:

$$y(x) = w^T \phi(x) + b, \qquad (3.2)$$

where $y(x)$ indicates the discrimination index, $\phi(x)$ denotes a fixed feature-space transformation, $w$ is the weight vector and $b$ is the bias parameter. The training data set consists of $N$ input vectors $x_1, \dots, x_N$ with corresponding target values $t_1, \dots, t_N$, where $t_n \in \{-1, 1\}$. If the data set is linearly separable in the feature space, then there exists at least one choice of the parameters $w$ and $b$ such that points with $t_n = +1$ satisfy $y(x_n) > 0$ and points with $t_n = -1$ satisfy $y(x_n) < 0$. This means that all training data points satisfy $t_n y(x_n) > 0$.

The model presented in Equation 3.2 assumes that the training data points are linearly separable in the feature space $\phi(x)$. The result is that the Support Vector Machine gives exact separation of the training data, even if the corresponding decision boundary is not linear. In practice, exact linear separation of the training data is not always possible, since the class-conditional distributions may overlap, which can lead to poor generalization. Therefore the Support Vector Machine needs to allow some of the data points to be misclassified. This implies that some data points are allowed to be on the 'wrong side' of the hyperplane, but those misclassified data points are penalized in proportion to their distance from the hyperplane. Hence the slack variable $\xi \geq 0$ is introduced, which gives a slight penalization in the optimization problem for misclassification. In other words, the goal is to maximize the margin while softly penalizing points that lie on the wrong side of the hyperplane (Bishop, 2006, pp. 325-356). This optimization problem can be solved with Lagrange multipliers and is presented as follows:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad t_n(w^T \phi(x_n) + b) \geq 1 - \xi_n, \quad \xi_n \geq 0, \qquad (3.3)$$

where $n = 1, \dots, N$, the variable $\xi_n$ is a slack variable for each training data point and $C > 0$ is the specified penalty parameter of the error term. Data points for which $\xi_n = 0$ are correctly classified; this means that they are either on the margin or on the correct side of the margin (Bishop, 2006, pp. 325-356).


The Support Vector Machine is fundamentally a two-class classifier. However, the SVM can be used for classification problems with more than two classes: various methods have been proposed for combining multiple two-class Support Vector Machines into a multiclass classifier. This research does not elaborate on multiclass classifiers, since this study only uses SVM as a two-class classifier.

The model described above can only be used for linear classification. A non-linear classification can be made by applying kernel functions to maximum-margin hyperplanes. This method allows working with complex feature spaces without the need to address them explicitly: it calculates the inner product of two data points, which makes it possible to construct a non-linear decision boundary in high dimensional spaces. The right choice of kernel function is of high priority for the success of the classification task of the Support Vector Machine model. There are four basic kernels which can be used for SVM:

$$K(x, x') = x^T x' \qquad \text{Linear Kernel} \qquad (3.4a)$$
$$K(x, x') = (\gamma x^T x' + c)^d, \; \gamma > 0 \qquad \text{Polynomial Kernel} \qquad (3.4b)$$
$$K(x, x') = \tanh(\gamma x^T x' + c) \qquad \text{Sigmoidal Kernel} \qquad (3.4c)$$
$$K(x, x') = \exp(-\gamma \|x - x'\|^2), \; \gamma > 0 \qquad \text{Radial basis function (RBF) Kernel} \qquad (3.4d)$$

where $\gamma$ is the kernel parameter. Choosing the right kernel function is hard, but a good starting point is acquiring prior knowledge from other researchers and testing from there.
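A sketch of fitting a two-class SVM with these kernels, assuming scikit-learn and synthetic data; here C and gamma correspond to the penalty parameter in Equation 3.3 and the kernel parameter in Equations 3.4b-d.

```python
# A two-class SVM with an RBF kernel; the data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 songs, 5 acoustic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy popularity labels

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)  # kernel could also be "linear",
clf.fit(X, y)                              # "poly" or "sigmoid" (Eq. 3.4a-c)
print(clf.predict(X[:5]))
```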

3.2.3 Logistic Regression

Logistic Regression is another commonly used method for classification tasks. In essence, the Logistic Regression model calculates the class membership probability for one of the two categories in the data set; the goal is to model the probability of the dependent variable being equal to 0 or 1. The Logistic Regression model takes the form:

$$p_i = P[y_i = 1 \mid X = x_i] = \frac{e^{x_i'\beta}}{e^{x_i'\beta} + 1}. \qquad (3.5)$$

The SVM model, which is commonly used for classification, does not produce coefficients for the features, only the actual classification results. Therefore it is very useful to also use Logistic Regression, which does produce estimated coefficients for the features. Logistic models are usually fit by maximum likelihood (Friedman, Hastie & Tibshirani, 2001, pp. 119-127). To test the validity of the Logistic Regression model, McFadden's pseudo R-squared can be used, which measures the predictive power of the model. McFadden's R-squared is defined as:

$$R^2_{McFadden} = 1 - \frac{\log(L_c)}{\log(L_{null})}, \qquad (3.6)$$

where $L_{null}$ is the value of the likelihood function for a model with no predictors and $L_c$ is the value of the likelihood function for the fitted model. Both log-likelihoods are negative, since the likelihood contribution from each observation is a probability between zero and one. If the model does not do a good job at prediction in comparison with the null model, then $L_c$ is not much larger than $L_{null}$ and McFadden's R-squared is therefore close to 0, meaning that the model does not have great predictive power. Conversely, if the model predicts well in comparison with the null model, McFadden's R-squared is close to 1.
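As a sketch, McFadden's R-squared can be computed with statsmodels as below; the synthetic data stands in for the song features and is not from this study.

```python
# Computing McFadden's pseudo R-squared (Equation 3.6) with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 3)))
y = (X[:, 1] + rng.normal(size=200) > 0).astype(int)

res = sm.Logit(y, X).fit(disp=0)
mcfadden = 1 - res.llf / res.llnull  # log(L_c) over log(L_null)
print(mcfadden)
```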

A potential problem with Logistic Regression is overfitting of the model, which can occur if the model contains a lot of variables relative to the number of observations. The solution to this problem is adding a penalization to the regression, similar to the Support Vector Machine, where a penalty term penalizes the model for misclassification. The optimization problem of the penalized Logistic Regression takes the form:

$$\min_{\beta} \left\{ -\sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right] + \lambda P(\beta) \right\}, \qquad (3.7)$$

where $\lambda$ is the penalty parameter, $P(\beta)$ is the penalty function and the first term is the negative of the log-likelihood function, which is defined as the cross-entropy error function. This research uses two different penalty functions, namely the Lasso and the Ridge. The Lasso penalty function, first proposed by Tibshirani (1996), is a penalized least squares method which performs continuous shrinkage and automatic variable selection simultaneously; by shrinking, it removes the effect of the least relevant coefficients. The other penalty function used in this research leads to the Ridge regression, which minimizes the residual sum of squares subject to a bound on the L2-norm of the coefficients (Hoerl and Kennard, 1970). The result of this penalization is that the coefficients will be smaller but remain non-zero. A combination of the two penalty functions results in the elastic net. The two penalty functions are defined as follows:

$$\text{Ridge:} \quad P(\beta) = \sum_{j=1}^{p} \beta_j^2 \qquad (3.8a)$$
$$\text{Lasso:} \quad P(\beta) = \sum_{j=1}^{p} |\beta_j| \qquad (3.8b)$$

The choice of $\lambda$ in Equation 3.7 is very important: $\lambda$ controls the trade-off between the penalty and the fit of the model. If $\lambda$ is too small, the model tends to overfit the data and the estimates will have a high variance. If, conversely, $\lambda$ is too large, the model tends to underfit the data and the results are potentially biased.
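A sketch of Lasso- and Ridge-penalized Logistic Regression, assuming scikit-learn; note that scikit-learn parameterizes the penalty through C = 1/λ, so a small C corresponds to a large λ in Equation 3.7.

```python
# Lasso (L1) and Ridge (L2) penalized Logistic Regression on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

lasso = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)
ridge = LogisticRegression(penalty="l2", C=0.5, solver="liblinear").fit(X, y)
print(lasso.coef_)  # Lasso shrinks the least relevant coefficients to zero
print(ridge.coef_)  # Ridge shrinks coefficients but keeps them non-zero
```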

Not only the validity of the Logistic Regression models themselves has to be tested, but the models also have to be compared with each other and with the Support Vector Machine models. The performance of the different models can be measured using the accuracy ratio and the Receiver Operating Characteristic (ROC). The ROC can be used to graph the sensitivity and specificity over a range of thresholds; this is done by means of the ROC curve.


The curve is created by plotting the True Positive Rate (TPR), the sensitivity, on the y-axis against the False Positive Rate (FPR) on the x-axis. The TPR is calculated as the percentage of correctly classified positive observations relative to the total number of true observations, TPR = TP/(TP + FN), where TP is the number of True Positive classifications and FN the number of False Negative classifications. The FPR is calculated as the False Positive classifications relative to the true negative observations, FPR = FP/(FP + TN), where FP are the False Positive classifications and TN the True Negative classifications. The best possible model has no False Positive (FP) and no False Negative (FN) classifications, which corresponds to the upper left corner of the ROC graph. For this reason the ROC curve can be used to visualize the performance of a model: the more convex the graph, the better the model performs, since it is closer to the upper left corner. A diagonal line can be added to the ROC graph; this line represents random classification. The classification performance of a model can be expressed as a numerical value using the ROC curve, namely the Area Under the Curve (AUC), which takes the form:

$$AUC = \int_{-\infty}^{\infty} TPR(T) \, FPR'(T) \, dT, \qquad (3.9)$$

where $T \in (0, 1)$ are the various thresholds, $TPR(T)$ is the True Positive Rate and $FPR'(T)$ is the derivative of the False Positive Rate. The value of the AUC is always between 0 and 1. If the value of the AUC is close to 0.5, the two classes are statistically identical; the closer the AUC gets to 1, the better the model performs. If two models are compared using the ROC curve and the AUC, the DeLong test can be used. The DeLong test tests the statistical significance of the difference between the areas under the curve of the two models.
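A sketch of computing the ROC curve and the AUC, assuming scikit-learn; the labels and scores are toy values. The DeLong test itself is not part of scikit-learn (it is available, for instance, in the R package pROC).

```python
# ROC curve and AUC from true labels and classifier scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.7])  # e.g. predicted p_i

fpr, tpr, thresholds = roc_curve(y_true, scores)  # FPR on x-axis, TPR on y-axis
print(roc_auc_score(y_true, scores))              # area under the curve
```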


Chapter 4

Data

In this chapter the data used in this study is introduced. The first section explains how the popularity of a song is obtained and calculated. The next section briefly explains how music works and how music can be represented. This study uses several acoustic features for the classification; these acoustic features are obtained with two different programs. These two programs are introduced, together with the corresponding acoustic features. Section 4.3 explains how the lyrics of the songs are collected and a preliminary data description is presented. The last section shows the results of the exploratory data analysis, which is used to gather initial insights into the data before the classification models are estimated.

4.1 Top 40 Netherlands

This research uses the Top 40 of the Netherlands. The weekly top charts between 2014 and 2016 are used; in those years there were in total 469 unique songs. This research uses the weekly top 40 chart to obtain an objective measurement of the popularity of a song. The data is limited to the weekly top charts from 2014 to 2016 since it is highly probable that the popularity of a song is time dependent. The objective measurement of popularity is implemented by giving a score to every song according to its rank in the top 40 on a weekly basis: if a song was the number one song in a certain week, it earned 40 points for that week; the second song of that week earned 39 points, and so on down to the number 40 of that week, which earned 1 point. This distribution of points is done for every week. Some songs in the 2014 chart entered the top 40 for the first time in 2013, and a couple of songs present in the 2016 chart were still in the top chart for a couple of weeks in 2017. Therefore it is necessary to correct for the censoring in the scores of these songs. Thus, for every song present between 2014 and 2016, this research checked when the song entered the top 40 and when it left. The total number of weeks that a song was present in the top chart varies from 2 to 102 weeks in the data set, as can be seen from Figure 4.1. The next step in processing the popularity of a song is aggregating the weekly scores per song to obtain the total score. Since there is a lot of variation in the number of weeks, there is also a lot of variation in the total scores of the songs: the song with the highest total score has a score of 1449, the song with the lowest total score has a score of 4, and the mean total score is 330. Since there is a lot of variation in the scores and this study uses a two-class classification, the ordered data set is divided into four quantiles; the first and second quantile are considered to contain the non-popular songs and the third and fourth quantile the popular songs. The ordered data set is divided into four quantiles because in Section 5.4 only the first and fourth quantile are used for the classification, as an attempt to improve it. Considering that the artist may also influence the popularity of a song, it is interesting to check how many songs per artist on average were present in the top chart; this average, visible in the overview below, is approximately 2.

Figure 4.1: Overview of the data obtained from the top chart
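A sketch of this scoring scheme in pandas; the column names and the tiny chart are assumptions for illustration only.

```python
# Rank 1 earns 40 points in a week, rank 40 earns 1 point; weekly points
# are summed per song and the totals are split into four quantiles.
import pandas as pd

chart = pd.DataFrame({
    "week": ["2014-01", "2014-01", "2014-02", "2014-02"],
    "song": ["Song A", "Song B", "Song A", "Song B"],
    "rank": [1, 40, 2, 39],
})

chart["points"] = 41 - chart["rank"]              # rank 1 -> 40, rank 40 -> 1
total_score = chart.groupby("song")["points"].sum()
quantile = pd.qcut(total_score, 4, labels=False)  # 0 = least, 3 = most popular
print(total_score, quantile, sep="\n")
```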

4.2 Acoustic song analysis

Before the extraction of acoustic song features can be performed, it is important to understand how the sound of a song is represented. The sound of a song in essence consists of sound waves, which are vibrations of the air. These vibrations can be visualized as a waveform; an example of the waveform of a song can be seen in Figure 4.2. Two main properties are key to the vibrations in a song: the amplitude and the frequency. The amplitude is the size of the vibration of a sound and determines how loud the sound is, so a larger vibration generates a louder sound. The second property is the frequency, the speed of the vibration, which determines the pitch of the sound. The frequency of a sound is measured as the number of wave cycles that occur in one second, and its unit of measurement is the Hertz (Hz). A frequency of 1 Hz is one wave cycle per second; if a sound is 10 Hz, there are ten wave cycles per second. This means that the higher the number of Hz, the shorter and closer together the waves are. Each song is a function from time to amplitude, as can be seen in the waveform representation of a song in Figure 4.2.


Figure 4.2: The waveform of a song

4.2.1 MIRToolbox

This research uses two different programs to extract different song features. One of those programs is the MIRToolbox, a Matlab toolbox dedicated to the extraction of musically related features from audio recordings such as wav and mp3 files. The toolbox has many functions; Figure 4.3 presents an overview of the main features (Lartillot & Toiviainen, 2007).

Figure 4.3: Overview of the musical features that can be extracted with MIRtoolbox. Source: (Lartillot & Toiviainen, 2007)

The MIRtoolbox is widely used for different purposes, amongst which timbre analysis, speech recognition, music similarity and rhythm analysis. This research uses the toolbox to incorporate as many features as possible that identify a song according to previous studies. Figure 4.3 shows an overview of all the main features. The different processes start from the audio signal, which is on the left side of the figure. This overview also captures the complexity of the different features: the simplest feature computations are at the top of the figure and the more detailed and complex computations are organised towards the bottom. An example of one of the simplest features is the zero crossing rate, which is only based on a simple description of the audio waveform itself. An example of one of the hardest features to compute is the pulse clarity feature, which roughly describes how easily a listener can perceive the underlying rhythmic and metrical pulsation of a piece of music. Each feature is related to one of the musical dimensions which are traditionally defined in music theory, namely: the pitch, the tonality, the dynamics of a song, the rhythm and the timbre. In Figure 4.3 the features related to rhythm are presented in bold italics (Tempo, Pulse Clarity and Fluctuations), simple italics highlight the features related to the timbre, boldface characters highlight the features related to the pitch, the tonality and the dynamics, and all operators in grey italics can be applied to many other different representations (for instance to statistical moments).

In the remainder of this subsection the different features obtained with the MIRtoolbox are discussed. First the Mel Frequency Cepstral Coefficients are explained in detail, followed by the other song features.

Mel Frequency Cepstral Coefficients

A song is considered a complex signal since it consists of a combination of instrumental music and a human voice. Considering these aspects of a song, it is useful to use features which rely on the cepstrum. The cepstrum is the Inverse Fourier Transform (IFT) of the logarithm of the spectrum of a signal. This research uses the Mel Frequency Cepstral Coefficients (MFCCs); as their name indicates, these features are based on the cepstrum of a signal. The MFCCs are a common way to describe the timbre of a song and they are short term spectral based features. The strength of the MFCCs lies in the fact that they are able to represent the amplitude spectrum of a song in a compact representation. The computational process of these features is presented in Figure 4.4 and every step of this process is discussed below the figure.

Figure 4.4: Steps for computing MFCCs

As can be seen from Figure 4.4, the first step of the process is to load the audio file; next the audio signal is divided into a number of frames of fixed duration. Since the audio sequence is not infinite, the frames may overlap if the signal is divided into frames of fixed duration. The application of the Fourier Transform, which is used later in the process, requires the infinite time before and after the sequence to be replaced by zeroes, which means that the finite signal will possibly lead to discontinuities at the borders. To avoid problems due to these discontinuities, a windowing function can be used for the sampling of frames; a window function which works particularly well with the Fourier Transform is the Hamming window. Therefore, in this study the Hamming window function is used to divide the sequence into frames. These steps provide a cepstral feature for every frame. The next step in the process is that the amplitude spectrum for each frame needs to be computed. This is obtained by applying the Discrete Fourier Transform (DFT) to each frame, making use of the following equation:

c_k = \frac{1}{N} \sum_{j=0}^{N-1} f_j \exp\left( -\frac{2\pi i j k}{N} \right), \qquad k = 0, 1, \ldots, (N/2) - 1, \qquad (4.1)

where N is the number of sampling points within a frame. In the next step of the process the logarithm of the amplitude spectrum is taken. The logarithm is used instead of a linear function since the relation between perceived loudness and the amplitude spectrum is more logarithmic than linear. Thus, so far we have an N-dimensional spectrum with N the frame size.
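To make the framing, windowing and DFT steps concrete, the following is a minimal Python sketch of the process so far (not the MIRtoolbox implementation, which is a MATLAB toolbox; the frame and hop sizes are illustrative assumptions):

```python
import numpy as np

def frame_log_spectra(signal, frame_size=1024, hop_size=512):
    """Log-amplitude spectrum per Hamming-windowed frame (cf. Equation 4.1)."""
    window = np.hamming(frame_size)  # tapers the frame borders towards zero
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop_size : i * hop_size + frame_size] * window
        coeffs = np.fft.fft(frame) / frame_size      # DFT of the frame
        amplitude = np.abs(coeffs[: frame_size // 2])  # keep k = 0,...,N/2 - 1
        spectra.append(np.log(amplitude + 1e-12))    # log of amplitude spectrum
    return np.array(spectra)

# Example: one second of a 440 Hz tone sampled at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
print(frame_log_spectra(np.sin(2 * np.pi * 440 * t)).shape)  # (n_frames, 512)
```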

The following step smoothens the spectrum to make it perceptually meaningful, in line with the human auditory system; this has led to the development of the Mel scaling, developed by Stevens, Volkmann and Newman (1937). There is no single Mel-scale formula, but the approximation that is widely used is presented in Equation 4.2.

F_{Mel} = \frac{1000}{\log(2)} \log\left( 1 + \frac{F_{Hz}}{1000} \right), \qquad (4.2)

where F_{Mel} is the resulting frequency on the Mel scale, measured in mels, and F_{Hz} is the normal frequency measured in Hz. The relationship between the normal frequency and the Mel frequency is linear below 1 kHz and logarithmic above 1 kHz, which conforms to the human auditory system. The relationship between the frequency and the Mel frequency can be seen in Figure 4.5.

Figure 4.5: The relationship between Mel and Frequency
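As an illustration, Equation 4.2 can be transcribed directly into code; this is a sketch, and the MIRtoolbox may use a slightly different internal variant:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Equation 4.2: Mel frequency from frequency in Hz."""
    return 1000.0 / np.log(2.0) * np.log(1.0 + f_hz / 1000.0)

print(hz_to_mel(1000))  # 1000.0 -- the two scales coincide at 1 kHz
print(hz_to_mel(100))   # ~137.5 -- roughly linear below 1 kHz
print(hz_to_mel(8000))  # ~3170  -- strong compression of high frequencies
```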

The output of applying the Mel scaling is a set of Mel spectrum vectors, a representation of the spectrum that is sensitive in a way similar to how human hearing works. Some aspects of this shape are more relevant than others, which is why the Discrete Cosine Transformation (DCT) is used instead of the Fourier Transform: the DCT is better at keeping important information in the low coefficients of the Mel frequency spectrum. It is important to capture these low coefficients since the lower-order coefficients make good features, in the sense that they represent simple aspects of the spectral shape. After computing the MFCCs for all the frames, the coefficients in the vectors are the weighted sums of all spectral components obtained after the DCT. As a result only a few coefficients need to be used, since the signature of the complete spectrum is still embedded in them; the first 13 MFCCs are therefore enough to represent the timbre of a song. The MFCCs can be represented as follows:

MFCC(m) = \sum_{k=1}^{K} \log(D_k) \cos\left( \frac{m\pi}{K} \left( k - \frac{1}{2} \right) \right), \qquad (4.3)

where m is the coefficient index with m = 0, 1, \ldots, K - 1, and D_k is the Mel frequency spectrum input.
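The whole pipeline described above is implemented in many audio libraries. As a hedged illustration, the Python library librosa (an analogue of the MATLAB MIRtoolbox routine used in this study; the file name below is a placeholder) produces the coefficients in a few lines:

```python
import librosa

# "song.mp3" is a hypothetical file name; n_mfcc=13 matches the choice above
y, sr = librosa.load("song.mp3")
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
mfcc_means = mfccs.mean(axis=1)  # one summary value per coefficient, per song
```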

Song Features

The MFCCs are not the only important features describing a song. This section introduces the other features used in the classification models in this research, ordered according to their computational complexity, starting with the simplest. The features used besides the MFCCs are (1) the temporal length, (2) the zero-crossing rate, (3) the Root Mean Squared (RMS) energy, (4) the lower energy, (5) the centroid, (6) the spectral roll-off and (7) the pulse clarity. To interpret the importance of the features in the classification it is key to understand what every feature means, so a short explanation per feature follows.

The first feature used in this research besides the MFCCs is the temporal length of a song. The temporal length, measured in samples, indicates a form of repetitiveness, since most songs are organized in repeated units of equivalent length.

The second feature is the zero-crossing rate, which is based on a simple description of the audio waveform. It can be used to examine the similarity between two or more accelerometer sensors, measuring whether two sets of time-series measurements exhibit similar patterns. It computes the rate of sign changes along the signal; that is, it counts how many times the waveform crosses the x-axis.
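A minimal sketch of this sign-change count (the MIRtoolbox additionally offers normalisation per second or per sample):

```python
import numpy as np

def zero_crossing_rate(signal):
    """Rate of sign changes: how often the waveform crosses the x-axis."""
    signs = np.sign(signal)
    return np.sum(signs[1:] != signs[:-1]) / len(signal)

# A 440 Hz tone crosses zero roughly 880 times per second
sr = 22050
t = np.arange(sr) / sr
print(zero_crossing_rate(np.sin(2 * np.pi * 440 * t)) * sr)  # ~880
```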

The third feature used besides the MFCCs is the Root Mean Squared (RMS) energy, a good measure of the power of the sound signal. The RMS indicates the global energy of the signal and is computed by taking the root of the average of the squared amplitude. From this feature the lower energy can be computed: the lower energy of a song is the percentage of frames where the RMS energy is lower than a selected threshold. For instance, a song with very loud frames but also many silent frames would have a high low-energy rate. It indicates how much of a sound is quiet relative to the rest of the sound.
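A sketch of both quantities follows; the mean RMS is used here as the threshold, which is a common default but an assumption, since the exact MIRtoolbox threshold may differ:

```python
import numpy as np

def rms_per_frame(signal, frame_size=1024):
    """Root of the average squared amplitude, computed per frame."""
    n = len(signal) // frame_size
    frames = signal[: n * frame_size].reshape(n, frame_size)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def low_energy_rate(signal, frame_size=1024):
    """Share of frames whose RMS falls below the threshold (here: mean RMS)."""
    rms = rms_per_frame(signal, frame_size)
    return np.mean(rms < rms.mean())
```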


The MFCCs are commonly used to describe the timbre, but basic statistics of the spectrum also provide some timbral characteristics. The spectral distribution of a song can be described by statistical moments, one of which is the centroid. This feature is the geometric centre of the distribution of a song and is a measure of central tendency for the random variable. Another basic statistic of the spectrum used in this study that gives some timbral characteristics is the spectral roll-off of a song. The spectral roll-off point is a measure of the right skewness of the power spectrum; it is defined as the fraction of bins in the power spectrum at which 85% of the power is at lower frequencies.
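Both statistics can be sketched from a single amplitude spectrum (the MIRtoolbox computes them per frame and then summarises; this simplified version assumes one spectrum and its frequency axis are given):

```python
import numpy as np

def spectral_centroid(spectrum, freqs):
    """Amplitude-weighted mean frequency: the geometric centre of the spectrum."""
    return np.sum(freqs * spectrum) / np.sum(spectrum)

def spectral_rolloff(spectrum, freqs, fraction=0.85):
    """Frequency below which `fraction` of the spectral power lies."""
    cumulative = np.cumsum(spectrum ** 2)
    return freqs[np.searchsorted(cumulative, fraction * cumulative[-1])]
```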

A high-level musical dimension feature is the pulse clarity. This feature estimates the rhythmic clarity, which indicates the strength of the beats: it conveys how easily a listener can perceive the underlying rhythmic or metrical pulsation of a given musical piece.

Figure 4.6 shows an overview of the variables obtained using the MIRtoolbox, including the mean, the minimum value and the maximum value of each variable. From this overview it is clear which variables have a large variation, namely the zero-crossing rate and the spectral roll-off, as there is a big gap between the highest and lowest values of these variables in the data set used for this research.

Figure 4.6: Overview of MirToolbox variables

4.2.2 Mixed in Key

Besides the MIRtoolbox this study uses another program, called Mixed in Key, to obtain acoustic song features. One feature provided by the program is the key detection of a song. It scans the songs and reports the results using the Camelot Wheel notation, which is presented in Figure 4.7. On this wheel, musical keys can be interpreted as "hours" on a clock. The letter B represents Major keys and the letter A represents Minor keys.


Figure 4.7: Camelot Wheel Mixed in Key

Songs with the same number are quite similar to each other concerning their melodies and bass lines. This notation is used by DJs to make smooth transitions between tracks and to ensure that the vocals, melodies and bass lines of their songs sound good together. For instance, for a song which is an 8A, the songs that go well together with it are 8B, 9A and 7A.
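This neighbouring-key rule is simple enough to encode directly; the helper below is illustrative (Mixed in Key itself only reports the key code):

```python
def compatible_keys(key):
    """Neighbouring keys on the Camelot Wheel, e.g. '8A' -> 8B, 9A, 7A."""
    number, letter = int(key[:-1]), key[-1]
    other = "B" if letter == "A" else "A"
    up = number % 12 + 1           # one hour clockwise (wraps 12 -> 1)
    down = (number - 2) % 12 + 1   # one hour counter-clockwise (wraps 1 -> 12)
    return [f"{number}{other}", f"{up}{letter}", f"{down}{letter}"]

print(compatible_keys("8A"))   # ['8B', '9A', '7A']
print(compatible_keys("1B"))   # ['1A', '2B', '12B']
```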

Mixed in Key also provides the tempo of a track, also known as the beats per minute (BPM) of a song. The beats per minute for the music in the used data set varies from 80 to 158: the song with a value of 80 has the slowest tempo and the track with a value of 158 has the highest tempo of all songs.

Besides the musical key, the program also analyzes the music files and provides the general energy level of a track, which ranges from 1 to 10. The lowest energy level is 1, which means that the track does not have a traceable beat. Energy levels 2, 3 and 4 are considered chill-out and lounge music. The most common energy levels are 5, 6 and 7: level 5 covers some Deep House, Tropical House and Minimal genres, level 6 is very often used in night clubs and level 7 is rather high-energy. The highest energy level is 10, which only very few tracks have, since it is not very mainstream. In the data set used in this study the lowest energy level that occurs is a 2 and the highest is a 9. Figure 4.8 presents an overview of the variables obtained from the Mixed in Key program with their corresponding means, together with the highest and the lowest values of the variables.



4.3 Lyrics

This research uses the songs that occurred in the top 40 of the Netherlands between 2014 and 2016. Since no large lyrics data set was publicly available due to copyright, this research collected its own lyrics based on the top 40 of the Netherlands. Song lyrics are widely available on the internet, but they come in the form of user-generated content. This research chose to use Lyricsmode1, since this website has a large set of lyrics and is subjectively high in consistency.

Even so, this database could not deliver the lyrics for all the songs in the top 40 of the Netherlands, so this research complemented the missing lyrics using songteksten2, a Dutch website. Since the top 40 of the Netherlands is used as data set, some lyrics contain Dutch words or are entirely in Dutch. These lyrics and all other non-English lyrics are removed, since it is better to have a data set that is consistent in language for the execution of natural language processing tasks. Therefore, this research only uses the songs with fully English lyrics. After the songs without lyrics and the songs with non-English lyrics are omitted, the remaining data set consists of 346 songs. The songs without observations in Mixed in Key or the MIRtoolbox are removed as well, so eventually the data set used for this research contains 326 unique songs. In total 143 observations are removed from the original 469 unique songs.

Before the lyrics can be used for analysis a lot of cleaning needs to be done. The lyrics are made into a corpus that can easily be used for the text preparation. To obtain a consistent data set, all the lyrics are converted to lower case and all punctuation is removed. Next all stop words are removed using an English dictionary, and lastly stemming is applied, which reduces inflected words to their word stem, base or root form. A word cloud of the 100 most frequent words in the lyrics is presented in Figure 4.9.
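A minimal sketch of this cleaning pipeline in Python using NLTK (the stop-word list and stemmer shown here are standard NLTK components and are assumed, not necessarily the exact dictionary and stemmer used in this study):

```python
import string
from nltk.corpus import stopwords       # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

def clean_lyric(text):
    text = text.lower()                                                # lower case
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    stops = set(stopwords.words("english"))                           # stop words
    stemmer = PorterStemmer()                                         # stemming
    return [stemmer.stem(w) for w in text.split() if w not in stops]

print(clean_lyric("Loving you is all I ever wanted!"))  # ['love', 'ever', 'want']
```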

Figure 4.9: Wordcloud lyrics

To give a better overview of the most frequent words, Figure 4.10a shows the top 10 most frequent words in the lyrics. This graph shows that several words are used very often but are quite meaningless; for example the word "got" is used more than 600 times but does not really carry a meaning. Therefore, as mentioned in Section 3.1.2, the Term Frequency-Inverse Document Frequency (TF-IDF) is calculated.

1 http://lyricsmode.com

The TF-IDF quantifies the relative importance of a word in a document. Figure 4.10b shows the top ten most frequent words after applying the TF-IDF to the document-term matrix. The main difference between the two graphs is the frequency: in Figure 4.10b the frequency is much lower than in Figure 4.10a, precisely because TF-IDF quantifies the relative importance of a word in a document. As mentioned above, many words in the text carry no meaning, but it is remarkable that the words in the TF-IDF frequency plot are not that common if you think about the lyrics of a song, and they are also hard to interpret. For instance, there is no clear explanation or meaning for the word "whoa", which is present in Figure 4.10b. The words in Figure 4.10a are more in line with what one expects from the lyrics of popular songs, for example the words "love" and "feel". That the frequency of the most frequent words drops sharply after TF-IDF is not hard to understand: lyrics are very repetitive, so the words that occur very often are weighted down by applying TF-IDF.

(a) Whole data set (b) TF-IDF

Figure 4.10: Top ten most frequent words present in lyrics
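As an illustration of the weighting just described, a short scikit-learn sketch on a toy corpus (the two example lyrics are invented; the study's own weighting variant may differ):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

lyrics = ["love love love you baby",            # toy corpus, standing in for
          "feel the beat tonight tonight"]      # the cleaned lyric of each song
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(lyrics)        # documents x terms matrix
terms = np.array(vectorizer.get_feature_names_out())
for row in tfidf.toarray():                     # top-3 weighted terms per song
    print(terms[row.argsort()[::-1][:3]])
```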

This research uses the Word2Vec model to represent the lyrics in the classification models, which has proven to be useful in combination with classification tasks. The Word2Vec method also captures semantic relationships among words, which TF-IDF cannot capture. For these reasons it is not necessary to use the TF-IDF model. This decision is also based on the fact that applying TF-IDF to the lyrics causes a big loss of important words that may characterize certain types of songs. The clearest example of this loss is the word "love", which is present in Figure 4.10a but absent after applying TF-IDF. The TF-IDF model is useful in many text analyses but may not be as useful for song lyrics. For these reasons the TF-IDF model is not used in this research; instead only Word2Vec is used.
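A minimal gensim sketch of training Word2Vec on tokenised lyrics follows; the vector size, window and the way a song-level vector is formed (averaging) are illustrative assumptions, since the study's hyperparameters are not restated here:

```python
from gensim.models import Word2Vec

# Tokenised lyrics, one list of words per song (toy examples)
sentences = [["love", "you", "baby"], ["feel", "the", "beat", "tonight"]]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

vec = model.wv["love"]  # 50-dimensional embedding for a single word
# One way to obtain a song-level vector: average the word vectors of its lyric
song_vec = sum(model.wv[w] for w in sentences[0]) / len(sentences[0])
```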



4.4 Exploratory Data Analysis

In this section the results of the Exploratory Data Analysis (EDA) are presented and discussed. EDA is very important, since it provides insights into the data before the data is used in statistical models. In this study EDA is used first to gain insight into the relationship between the explanatory variables and the target variable, and second to gain insight into the relationships among the explanatory variables themselves.

4.4.1 Correlation

One of the most important things to consider in exploratory data analysis is the correlation between the explanatory variables. A correlation close to -1 or 1 means that the variables are highly correlated. Highly correlated explanatory variables can cause instability in the model; to fix this, one of the highly correlated variables should be omitted from the model.

Figure 4.11 shows the correlations between the explanatory variables defined in this study. From this figure it can be concluded that the variable Energy Level is highly correlated with the variable Centroid and with the first Mel Frequency Cepstral Coefficient (MFCC1). Besides these correlations, the Centroid variable and MFCC1 are also highly correlated. The Energy Level variable is omitted from the model, since it is highly correlated with two other variables and the p-values of these correlations are smaller than 0.05, namely approximately equal to zero. The Centroid and MFCC1 variables also appear highly correlated, but the p-value of their correlation is equal to 0.1013, meaning that it is not significant at the 5% significance level. Therefore neither of these two variables is omitted from the model, since the correlation is not significant and it is possible that these variables have explanatory power. Thus, based on the correlations between the variables and their p-values, only the variable Energy Level is omitted in the remainder of this study.
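The selection rule above can be sketched as follows; the data frame `features` and the thresholds in the final query are hypothetical placeholders:

```python
import pandas as pd
from scipy.stats import pearsonr

def correlation_table(df):
    """All pairwise Pearson correlations with their p-values."""
    rows = []
    cols = df.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r, p = pearsonr(df[a], df[b])
            rows.append((a, b, r, p))
    return pd.DataFrame(rows, columns=["var1", "var2", "corr", "p_value"])

# Flag candidates for omission, e.g.:
# correlation_table(features).query("abs(corr) > 0.8 and p_value < 0.05")
```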


Figure 4.11: Correlation plot of the features

4.4.2 Numeric Variable Exploration

Besides the correlation of the explanatory variables it is also interesting to investigate the distributions of the numeric explanatory variables in association with the target variable. The target variable in this study is popularity, with the two possible values popular and unpopular, based on the scores in the top 40 of the Netherlands.

Figure 4.12a presents the non-normalized histogram of the tempo of a song in association with the popularity of a song. Figure 4.12b presents the normalized histogram of the same relation, which is useful to identify whether there is a certain pattern in the relationship between the predictive variable and the target variable. A non-normalized histogram should always be provided next to the normalized histogram, since the normalized histogram does not provide any information about the frequency of the variable.
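A sketch of how such an overlay pair can be produced (the column names `tempo` and `popular` and the bin count are hypothetical):

```python
import matplotlib.pyplot as plt
import numpy as np

def popularity_histograms(df, var, bins=15):
    """Left: raw counts with popularity overlay; right: share popular per bin."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    groups = [df.loc[df["popular"] == v, var] for v in (0, 1)]
    ax1.hist(groups, bins=bins, stacked=True, label=["unpopular", "popular"])
    ax1.set_title("Histogram")
    ax1.legend()
    counts, edges = np.histogram(df[var], bins=bins)
    pop, _ = np.histogram(df.loc[df["popular"] == 1, var], bins=edges)
    share = np.divide(pop, counts, out=np.zeros(len(counts)), where=counts > 0)
    ax2.bar(edges[:-1], share, width=np.diff(edges), align="edge")
    ax2.set_title("Normalized histogram")
    plt.show()

# e.g. popularity_histograms(features, "tempo")
```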

(a) Histogram (b) Normalized histogram

Figure 4.12: Histogram of the tempo with popularity overlay

From Figure 4.12b there is no real evidence that certain values of the tempo of a song indicate a popular or an unpopular song, since the distribution of popularity is approximately 50% for all values of the explanatory variable. These visualizations are made for all the explanatory variables, but none of the variables show a pattern that can be used to cluster the variable.

Even though for some values the normalized histogram suggests that they contribute to the popularity of a song, this is caused by those values containing only one observation, which can only be seen from the non-normalized histogram. This is why it is very useful to provide a non-normalized histogram next to the normalized one. An example of a variable where this happens is the Musical Key: from Figure 4.13b it looks like all the songs with Musical Key 6A/6B are popular, but this value contains only one observation, as can be seen in Figure 4.13a.

(a) Histogram (b) Normalized histogram

Figure 4.13: Histogram of the Musical Key with popularity overlay

Analyzing all variables separately leads to the observation that none of the variables exhibit a clear pattern in the relationship between the explanatory variable and the target variable. Consequently, none of the variables are clustered into groups; they are used in their original format in the classification models in the next chapter.
