Final version: 26th of August 2014. Supervisor: Manos Tsagkias

Manos Tsagkias, signature:

Ilya Markov, signature:


LEARNING RECOMMENDATION ENSEMBLES FOR MUSIC DISCOVERY


Bart Eijk (10575308)
Thesis Master Information Science – Track Human Centered Multimedia
University of Amsterdam, Faculty of Science


Learning Recommendation Ensembles for Music Discovery

Master thesis. Company: Shuffler.fm, supervisor: Manos Tsagkias

Bart Eijk

University of Amsterdam, Amsterdam, The Netherlands

bart.eijk@gmail.com

ABSTRACT

We compare the difference in recommender performance between implicit data (listens) and explicit data (favourites) in the setting of music discovery. By comparing the results of content-based and collaborative filtering algorithms for the task of predicting tastemaker site subscriptions, we show that users with a very high number of listens receive worse recommender output from listen-based recommender algorithms, and users with a very high number of favourites receive better recommender output from favourite-based recommenders, as compared to users with an average ratio of listens to favourites.

Furthermore, we analyze the performance of an ensemble recommender that combines content-based and collaborative filtering algorithms with diverse input signals (listens, favourites and site subscriptions). Our conclusion is that the performance of an ensemble recommender depends on the ratio of relevant items to non-relevant items in the individual recommenders' outputs.

Keywords

ensemble recommender systems, collaborative filtering, content-based filtering, click data, learning to rank, data fusion

1. INTRODUCTION AND MOTIVATION

Nowadays, we are subject to an information overload. News stories, videos, music and e-books are all widely available on the web, which has led to an increase in our media consumption. The problem we face is not one of scarcity; it is the abundance of information and the lack of ways to find what we are looking for. Moreover, we are often unaware of our specific information needs, which is especially true in the case of music and movies. [5, p. 271]

Music consumption, too, has changed in recent years. People less often maintain a personal music library that acts as a representative reflection of their musical taste. The way music is listened to has shifted from analogue media towards online listening. YouTube statistics show that 9 out of the 10 most popular videos on YouTube are music videos and - at the moment of writing - Spotify has over 24 million active users, of which over 6 million are paying users. [15, 23]

In this setting - the amount of music available on the internet and the shift towards the web in music consumption - we find the challenge of music recommendation: providing a user with new and relevant music. This can either be done by authorities in the music scene (such as magazines and blogs) or automatically by a music recommender.

The first research in the area of recommender systems dates back to 1994, when the first papers on collaborative filtering were published [18]. Recently, music recommendation has become a research subject of its own, as shown by the increasing academic interest. [4, p. 2]

Information about how and when a user has interacted with items plays a large role in recommender systems. In music recommendation, a variety of user data is available (such as playcounts or favouritecounts for songs), along with various methods for generating recommendations (e.g. using meta-information of tracks or by finding users that are similar to the active user).

This master thesis describes the development and evaluation of an ensemble recommender system at Shuffler.fm, an online music discovery service that relies on the aforementioned authorities in music: online tastemaker blogs. We look into the available data and propose different input signals for different recommender algorithms. At Shuffler.fm, both implicit data (playcounts: what songs the user has played) and explicit data (favouritecounts: what songs the user likes) are available. By analyzing the differences in performance between recommender algorithms that use either of the above-mentioned input signals, we show how explicit data relates to implicit data with regard to recommender performance.


Two research questions are posed, which are central to this master’s thesis:

1. What is the recommendation effectiveness of a recommender system trained on implicit and explicit data?

2. Can recommendation effectiveness be improved by training a recommendation ensemble based on recommender systems that use either implicit or explicit data?

After analyzing the differences between recommender systems that use either explicit or implicit data, we propose an ensemble method that combines the outputs of the individual algorithms using weights (computed by machine learning methods). We hypothesize that by combining different recommenders into an ensemble, the weaknesses of the individual recommenders are smoothed out, and thus the overall effectiveness of the recommender increases.

2. RELATED WORK

Recommender systems are created with the aim of providing a user with interesting new items, based on the existing interactions between items and the user. The aim of a recommender system can be formalized as the recommendation problem: “estimating ratings for items that have not been seen by a user.” [1, p. 2]

Celma (2010) mentions a two-way approach (based on [19]) to the recommendation problem. The problem of predicting user ratings for unseen items is split up into two subproblems: the prediction problem and the problem of presenting a user with a list of n items. [4, p. 21-22]

The prediction problem can be formalized as follows:

• Let U be the set of m users, U = {u_1, u_2, ..., u_m}, and I be the set of n items, I = {i_1, i_2, ..., i_n}.

• Every user u_i has a set of items I_{u_i} that the user has rated (note: a rating can be binary - such as a song that the user did or did not play - or on an arbitrary scale (e.g. 1-5)). The items that the user has rated are part of the set of items I, i.e. I_{u_i} ⊆ I. In case the user is new to the system, the set is empty: I_{u_i} = ∅.

• The predicted rating by user u_a for item i_j is expressed by the function P_{u_a,i_j}.

The recommendation problem now boils down to determining the function P_{u_a,i_j}. Recommender systems utilize both explicit and implicit interaction between the user and the set of items and use this data as input for the function P_{u_a,i_j}. One approach is to rely on the information of users similar to the current user to estimate the rating for unrated item i_j. This is called collaborative filtering. Another popular approach is to find items I_{sim(j)} that are similar to unrated item i_j - based on item characteristics - so that I_{sim(j)} ⊆ I_{u_a}: content-based filtering. [4]

In the next sections we elaborate on collaborative filtering and content-based filtering, after which we explore hybrid/ensemble recommenders: recommender systems that use characteristics or methods of both collaborative and content-based filtering.

2.1 Collaborative filtering

Collaborative filtering is an approach that builds upon users similar to the active user u_a. A recommender algorithm will look for users similar to the active user, so that I_{u_a} ∩ I_{u_i} (where u_i is any similar user) is maximized. The system may now recommend items to the active user from the set I_{u_i} \ I_{u_a}, i.e. items that have been rated by user u_i but not by user u_a.
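
To make the set formulation above concrete, the following is a minimal sketch of user-based collaborative filtering over binary interaction data. It is an illustration under assumed data structures (a dict mapping users to sets of item ids), not the implementation used in this thesis; all names are hypothetical.

```python
from collections import Counter

def recommend_cf(active_user, interactions, top_n=5):
    """User-based collaborative filtering on binary interaction data.

    interactions maps each user to the set of items that user has rated
    (played, favourited, ...). Candidate items come from users whose item
    sets overlap with the active user's, weighted by the size of that
    overlap, and items the active user already rated are excluded.
    """
    items_a = interactions[active_user]
    scores = Counter()
    for user, items in interactions.items():
        if user == active_user:
            continue
        overlap = len(items_a & items)      # |I_{u_a} ∩ I_{u_i}|
        if overlap == 0:
            continue
        for item in items - items_a:        # I_{u_i} \ I_{u_a}
            scores[item] += overlap         # weight by user similarity
    return [item for item, _ in scores.most_common(top_n)]

# Toy example: u1 and u2 overlap on t1 and t2, so t3 is recommended to u1.
interactions = {
    "u1": {"t1", "t2"},
    "u2": {"t1", "t2", "t3"},
    "u3": {"t9"},
}
print(recommend_cf("u1", interactions))     # ['t3']
```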

Collaborative systems have drawbacks such as the new user problem (a new user has no rated items, I_{u_i} = ∅), the new item problem (new items have not yet been rated by users, which makes them difficult to recommend), the sparsity of ratings, and real-time recommendations for larger scale systems. [1, p. 18-19]

To deal with performance and scaling issues, Amazon.com uses a collaborative filtering technique that clusters similar users. [13] By clustering users, the problem becomes a classification task: what cluster(s) does the user fit in? A user can now be represented by a single class, or by a vector that contains the similarity of that user to a number of classes, and recommendations are made according to the most prevalent items in a user cluster.

Although collaborative filtering is a popular approach in recommender systems, it also has drawbacks when used for music discovery purposes. In the case of Shuffler.fm, the new item problem poses a challenge, since Shuffler.fm centers around music blogs with a continuous stream of new music. A new song has little to no user interaction history, which leaves very little input data for the recommender. Besides, recommended items are often very similar to input items or already known to the user, which is a drawback for a music discovery service that thrives on novel items. [24]

2.2 Content-based filtering

Another approach to the prediction problem is content-based filtering, which relies on characteristics of items. In music recommenders, if a user likes song x, the system will recommend songs that are similar to song x, based on meta-information such as artist, composer, genre or album. Other music recommender systems use extracted audio features as input for the algorithm. [24] The challenge in content-based filtering is the extraction of information from items. Content-based filtering has strong roots in text retrieval. Because of “significant and early advancements made by the information retrieval and filtering communities and because of the importance of several text-based applications, many current content-based systems focus on recommending items containing textual information, such as documents, Web sites (URLs), and Usenet news messages”. [1, p. 5-6]

Assuming that the features of item i_m have been extracted, i_m can now be represented by a content vector C_m = {c_{m,1}, c_{m,2}, ..., c_{m,N_T}}, where N_T is the number of features in the content vector. Recommending new items to a user is done by comparing the content vectors of every item in I_{u_i} to the content vectors of unseen items, using a similarity measure such as cosine similarity. Then, the most similar items are recommended to the user. [24]

Content-based algorithms only recommend items based on item similarity, which - in the case of music discovery - limits the recommender to items from the same genre, artist or style. Moreover, recommended items may even be too similar to already-known items. This problem is often referred to as over-specialization. [1, p. 10]

2.3 Hybrid recommendation systems

We have discussed the two main approaches to recommender algorithms and the drawbacks and limitations of both methodologies. To overcome such drawbacks and limitations, the two approaches are often combined to create more robust recommender systems that are able to cater to a heterogeneous user population.

One way to combine recommenders is by using an ensemble of recommenders. Ensemble recommenders have recently been a popular subject of study, with promising results in music applications [3]. The advantage of ensemble recommenders on large datasets is scalability and versatility. An ensemble recommender system may incorporate tens of different algorithms and is therefore better able to deal with different kinds of users than individual algorithms. In addition, the performance of an ensemble recommender can easily be improved by adding new algorithms or removing old ones, as opposed to a single recommender algorithm that needs to be rewritten or replaced when better performance is required.

Lommatzsch et al. (2013) describe a recommender ensemble based on semantic data of movie ratings, where individual recommender algorithms are combined by machine learning techniques. The authors find that the ensemble recommender has a Mean Average Precision of 0.13, as opposed to a block matrix recommender that has a MAP of 0.10. [14]

A challenge in ensemble recommenders is determining what combination of recommenders leads to the best results. Bellogín (2011) introduces a performance predictor for ensemble systems that can aid in solving the problem of dynamic weighting in ensemble recommenders. The predictor is shown to have a strong (around 0.5) positive correlation with recommender performance, thus recommenders can be combined into an ensemble by adhering to the predictor. [2]

Dooms (2013) proposes an ensemble recommender with personalized weights for each user, to overcome the differences in recommender performance between users. In that research, data from several movie rating databases (such as MovieLens 1M) is used as input for “over 20 algorithms from the recommendation framework MyMediaLite”. [6, p. 444] A genetic algorithm then determines the ideal weights for individual users by minimizing RMSE (Root Mean Squared Error). For every user (out of ten sampled users), the ensemble recommender significantly outperforms the best-performing single recommender algorithm.

Ma et al. (2009) describe a recommender ensemble in which the user's preference for items and the social relationships of a user are used in learning a ‘social trust ensemble’. The algorithm outperforms other state-of-the-art algorithms, especially in the case of few input signals. They motivate their choice to take social relationships into consideration by stating that a user's behaviour on the internet is influenced by trusted friends of the user. [16]

Based on the former research, we propose an ensemble recommender with personalized weights for each user, learned by offline learning to rank (Google's sofia-ml [20]), using normalized counts for the content-based algorithms. Ensemble recommenders have been shown to significantly improve performance on music data [3].

2.4 Implicit data in recommender systems

Recommender systems often cope with a lack of explicit data (i.e. a rating or explicit interest from user u_a for item i_j). Web store users do not purchase items often, so the system incorporates implicit data as well (i.e. the items the user has viewed). Das et al. (2007) mention two main differences between implicit click data and explicit ratings:

1. Assuming clicks are positive votes for an item leads to noisy data, because users may click items for different reasons.

2. Because of the binary nature of clicks, they can only tell us that a user has a positive interest in a certain item. Clicks cannot be used to capture negative interest. [5]


A user study on judging the abstracts of web pages returned by a web search has shown that implicit click feedback shows “reasonable agreement with explicit judgement of the pages”. [11, p. 160] The researchers concluded that there is a difference between implicit and explicit feedback, but that the abundance of implicit data available on the internet might overcome this gap.

Jawaheer et al. (2010) use Last.fm data to compare the reliability of implicit data to that of explicit data in a music recommendation application. The research shows that there is a (small) difference between implicit and explicit data. However, the authors provide no methods for combining both kinds of data in a recommender system and use only one collaborative filtering algorithm in their analysis. [10] In the current research, both implicit and explicit input signals are used to train multiple algorithms.

3. CASE STUDY: SHUFFLER.FM

To give a better view of the input data and the context of the recommender system, Shuffler.fm is discussed in this section. We first explain the aim of Shuffler.fm by formulating a user goal, after which we establish the recommendation setting.

Shuffler.fm is described on its website as “... an online music discovery service. [...] Our crawlers go through thousands of music blogs and magazines to gather the newest music from around the web. We collect the most popular and recent tracks into one source: Shuffler.fm.” [21]

In more abstract terms of data and information, Shuffler.fm can be viewed as a website that acts as a mediator between the user's information need and potential tastemaker websites that can supply in that information need: “With the help of the online tastemakers like magazine editors, bloggers, curators and DJs. We help you filter the music information overload and bring you the latest and greatest by the web's leading voices in music culture.” [22]

From the perspective of a user, we verbalize the setting of Shuffler.fm as:

I want to find new and recent music that is relevant to my musical taste.

A user can choose to listen to the most popular recent tracks, or choose a specific artist, genre or tastemaker website and listen to the corresponding songs. Shuffler.fm presents songs to the user in a frame, accompanied by a top bar that can be used to navigate through the current playlist, play/pause/skip/‘star’ the current song and control the volume (see Figure 1).

3.1 Recommendation task

Figure 1: A screenshot of Shuffler.fm, currently playing a starred song, which is second in the playlist. The song is featured on tastemaker website Pitchfork.

At Shuffler.fm, a user can subscribe to tastemaker sites, artists or other users' profiles. The recommendation task is to recommend tastemaker music sites to the user, which is why we use the term subscription to denote a (tastemaker) site subscription. Secondly, the recommender should display multiple tastemaker websites instead of only a single one. By offering multiple recommendations, a user can select a website that satisfies his/her current information need, without the need to dismiss a recommendation or refresh the web page. A tastemaker site is defined as a website (e.g. an online music magazine or a blog) that is considered an authority on music, i.e. the website provides regularly timed updates on current trends in music. Songs on Shuffler.fm are exclusively provided by tastemaker sites, which means that every track on Shuffler.fm belongs to one or more tastemaker sites (a popular new track may be featured on 40 different tastemaker sites).

The focus on tastemaker site recommendations is not an arbitrary choice; Shuffler.fm's unique selling point is the aspect of music curation. Shuffler.fm does not offer artists' entire music catalogues. Instead, its focus is on recent music that was featured on tastemaker sites, which is a list of websites curated by the Shuffler.fm staff.

4. DATA

We now elaborate on the available input data at Shuffler.fm. We provide statistics on genres, users, listens, favourites and subscriptions, and emphasize the need for three different input signals and two different recommender algorithms.


Figure 2: A visual representation of users (x: listens, y: favourites, colour of the dot: subscriptions).

Figure 3: A logarithmic plot of the trackcount per genre of all songs on Shuffler.fm.

As of May 14th 2014, Shuffler.fm counts 47,504 active users (a user is active when he/she has a confirmed e-mail address and has logged in at least once), 1,863 active tastemaker blogs (blogs that have posted at least one track in the past 2 months and are not disabled or deleted), 385,822 (non-unique) and 43,994 genres (genre information is prone to noise).

4.1.1 Music genres on Shuffler.fm

The most common genres on Shuffler.fm are electronic/dance, indie/alternative and hip-hop/rap. We note the popularity of these genres, since the content-based algorithms may show a genre bias. Moreover, we note a long-tail type distribution (akin to Zipf's law) when observing the number of tracks per genre (see Figure 3). A more in-depth view of genres is provided in Appendix A.1, where graphs and a table containing sites with their subscriptioncount and most common genre can be found.

4.1.2 Listens, favourites and subscriptions

Listens, favourites and subscriptions are the three most common interactions between users and items at Shuffler.fm.


                Mean   Median
Listens         94.2     12.0   495.2
Favourites      10.5      3.0    37.5
Subscriptions    4.5      2.0    14.5

Table 1: Descriptive statistics of listen-, subscription- and favouritecounts.

Listens and favourites are both types of data that denote a relationship between a user and a track. A subscription denotes a relationship between a user and a site. Descriptive statistics of favourites, listens and subscriptions of all users are listed in Table 1, and graphs describing the interactioncount (where an interaction is a listen, favourite or subscription) for each of the data types can be found in Appendix A.2.

By observing the scatterplot in which the three data types are brought together (see Figure 2), we see a heterogeneous group of users, with a trend towards users with a large number of listens. While only 6 users have more than a thousand favourites, 272 users have listened to more than a thousand tracks. More importantly, from Figure 2 and the graphs in Appendix A.2 we conclude that there is a large group of users with a small number of interactions (e.g. 2,144 users have a listencount of 1 and 13,983 users have a favouritecount of 1). These numbers translate to the horizontal and vertical lines in the lower left corner of the scatter plot in Figure 2.

4.2 Input signals

We have shown the heterogeneity of users on Shuffler.fm, which necessitates different input signals for the recommender system. To deal with the diversity in listen-, favourite- and subscription counts of users, a recommender system that incorporates all three signals as input is required. We now elaborate on the characteristics of each input signal.

4.2.1 Listens and favourites

Both listens and favourites denote a relationship between a user u_a and a track t_i. For listens, this is an implicit relationship; a user has listened to the track, thereby expressing an implicit positive interest. We consider favourites to be an explicit relationship; after listening to a track, a user has actively expressed a positive interest.

4.2.2 Subscriptions

A subscription denotes a user u_a with a positive explicit interest in site s_i. As a result of a subscription, it is expected that a user will interact with tracks from site s_i. Therefore we expect an overlap between the tracks a user has interacted with and the tracks of the sites the user is subscribed to.

Algorithm   Type                      Input signal
CB-Listen   content-based             listens
CB-Fav      content-based             favourites
CB-Sub1     content-based             subscriptions
CB-Sub2     content-based             subscriptions
CF-Listen   collaborative filtering   listens
CF-Fav      collaborative filtering   favourites
CF-Sub      collaborative filtering   subscriptions

Table 2: Type of recommender algorithm and input data per recommender

5. ALGORITHMS

The ensemble recommender will feature both collaborative filtering and content-based algorithms. By implementing both, we expect to overcome the weaknesses that are apparent in one of the two methods (such as the new item problem for the collaborative filtering algorithm). For an overview of all individual recommenders, we refer the reader to Table 2.

First, we describe the content-based algorithm and the related genre vector. Next, we explain the collaborative filtering algorithm, after which we elaborate on the workings of the ensemble recommender by describing the process of combining recommender algorithms.

5.1 Content-based algorithm

In the content-based algorithms, each user and tastemaker site is represented by a genre vector G_{u_a}, where u_a is a user or tastemaker site. By using a vector-based approach, similarity can be easily and quickly computed using cosine similarity (a popular similarity measure in information retrieval). Cosine similarity is insensitive to the magnitude of the vectors, since it is a measure of the angle between two vectors in an inner product space.

How can we describe a user's taste in music? And how can we describe a tastemaker site's taste in music? We propose an approach that uses a track's genre information to describe a user or tastemaker site. Meta-information about the genres of tracks (stored in Shuffler.fm's database) is used to form a bag-of-genres vector G.

The algorithm will first determine the length of G by listing all track genres on Shuffler.fm. Next, G_{u_a} is created for tastemaker site u_a from all genres of tracks on u_a. The string containing the genre of a track is first split (a track may have multiple genres). Next, the individual genres are stemmed (by removing special characters such as hyphens, spaces or forward/backward slashes), e.g. hip-hop → hiphop.


Non-normalized (raw) counts were found to perform worse on music data, compared to normalized counts. For this reason we have used a similar approach to [10], where normalization of genre information on artist level yielded good results.

The algorithm counts the number of artists on a site and the genres associated with those artists. The vector is updated with weight w for every genre associated with an artist, where w is defined as:

    w = l_a / ( Σ_{a∈A} l_a × g_a )        (1)

where, for artist a, l_a is the interactioncount (i.e. tracks, listens or favourites for artist a), Σ_{a∈A} l_a is the sum of the interactioncounts of all artists A (i.e. all tracks, listens or favourites of a user or site) and g_a is the number of genres associated with the artist, according to the tracks of the current user or site.
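
A minimal sketch of how such a weighted bag-of-genres vector could be built from Equation 1 is given below. The input layout (a list of (artist, interaction count, genre string) tuples) and the stemming rule are illustrative assumptions rather than the exact Shuffler.fm implementation.

```python
import re
from collections import defaultdict

def stem_genre(genre):
    """Normalize a genre label by lowercasing and removing hyphens,
    slashes, backslashes and spaces, e.g. 'Hip-Hop' -> 'hiphop'."""
    return re.sub(r"[-/\\ ]", "", genre.strip().lower())

def genre_vector(artist_rows):
    """Build a bag-of-genres vector for a user or site.

    artist_rows: list of (artist, interaction_count, genre_string) where
    genre_string may contain several genres separated by commas.
    Each genre of artist a receives weight l_a / (sum_a l_a * g_a), so an
    artist's total contribution equals its normalized interaction count.
    """
    total = sum(count for _, count, _ in artist_rows)       # Σ_{a∈A} l_a
    vector = defaultdict(float)
    for _, count, genre_string in artist_rows:
        genres = [stem_genre(g) for g in genre_string.split(",") if g.strip()]
        if not genres or total == 0:
            continue
        w = count / (total * len(genres))                   # Equation (1)
        for g in genres:
            vector[g] += w
    return dict(vector)

rows = [("Artist X", 8, "hip-hop, rap"), ("Artist Y", 2, "indie")]
print(genre_vector(rows))   # {'hiphop': 0.4, 'rap': 0.4, 'indie': 0.2}
```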

Listens and favourites

In the CB-Listen and CB-Fav algorithms, a user's genre vector is constructed from either listens or favourites as input signal. Next, the vector is compared to the genre vectors of all sites using the cosine similarity measure. The recommender returns the most similar sites.

Subscriptions

The CB-Sub1 algorithm takes the genre vectors of all tastemaker sites user u_a is subscribed to and sums them into one vector. This resulting vector can be interpreted as a mean of all genre vectors of the sites the user is subscribed to. Next, it compares the vector to all genre vectors of other sites and returns the sites with the highest cosine similarity.

The CB-Sub2 algorithm has the same input data as CB-Sub1, but uses a similarity matrix that is constructed based on the cosine similarity between all sites. For all site subscriptions of a user, the algorithm lists the most similar sites (according to the values in the similarity matrix). By summing the cosine similarity values of the most similar sites, the algorithm finds the sites that are most similar to each individual site in the user's subscriptions (as compared to being most similar to the mean of all sites in algorithm CB-Sub1). The result is a list of tuples (site, cosine similarity sum), sorted by similarity.

5.2 Collaborative filtering algorithm

The second type of algorithm is based on collaborative filtering. The most commonly used approach to collaborative filtering is to represent users as a vector of length N, where N is the total number of items. The vector contains positive or negative ratings for each item (based on the rating of the user for said item) and is often pre-computed for all users. By doing so, the recommender can compute the cosine similarity between the current user and all users and efficiently recommend items that have high positive ratings in the vectors of similar users. [13]

Efficiency is not a requirement in the current context, which is why the implementation of the collaborative filtering algorithm follows a different, although similar, approach. First, all users U_{sim(a)} are listed; users that are similar to the user u_a, based on the intersection of item interactions (i.e. users that have a high number of listened tracks in common). The algorithm then returns I_{U_{sim(a)}}, i.e. the items that occur most often in the interactions of the users U_{sim(a)}.
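
A rough sketch of this two-step procedure (select similar users by interaction overlap, then return the items occurring most often in their interactions) is shown below; the overlap threshold and all names are illustrative assumptions.

```python
from collections import Counter

def similar_users(active_user, interactions, min_overlap=2):
    """U_{sim(a)}: users sharing at least min_overlap interacted items
    (e.g. listened tracks) with the active user."""
    items_a = interactions[active_user]
    return [u for u, items in interactions.items()
            if u != active_user and len(items_a & items) >= min_overlap]

def cf_recommend(active_user, interactions, min_overlap=2, top_n=10):
    """Return I_{U_{sim(a)}}: the items occurring most often in the
    interactions of similar users, excluding items the active user has
    already interacted with."""
    counts = Counter()
    for u in similar_users(active_user, interactions, min_overlap):
        counts.update(interactions[u] - interactions[active_user])
    return [item for item, _ in counts.most_common(top_n)]

interactions = {
    "alice": {"t1", "t2", "t3"},
    "bob":   {"t1", "t2", "t4"},
    "carol": {"t2", "t3", "t4", "t5"},
    "dave":  {"t9"},
}
print(cf_recommend("alice", interactions))   # ['t4', 't5']
```

For CF-Listen and CF-Fav the returned items are tracks that are subsequently mapped to sites; for CF-Sub the items are sites directly.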

5.2.1 Listens and favourites

For the input signals listens and favourites, the recommender first computes the similar tracks I_{U_{sim(a)}}. Next, these tracks are mapped to sites, based on the track intersection between a site's tracks and the tracks in I_{U_{sim(a)}}. The sites are ranked according to the number of intersecting tracks.

5.2.2 Subscriptions

When using subscriptions as input data, the recommender returns I_{U_{sim(a)}}: the sites with the highest number of subscriptions among the similar users U_{sim(a)}.

If user u_a is subscribed to a large number of sites (around 40 or more), the list of sites returned by the algorithm closely resembles a list of all sites ranked by number of subscriptions, which means that the recommender will recommend sites based on overall popularity. The reason for this is that the set of similar users U_{sim(a)} - users that have one or more site subscriptions in common with the current user - ends up being quite large and becomes a representative sample of all users. In order to solve this, we first compute the popularity of all sites, based on the number of users subscribed to a site. We then normalize the subscription counts to values between 0 and 1 and do the same for the subscription counts of the sites in I_{U_{sim(a)}}. The recommender output is a list of sites and similarity values, where the similarity value is computed by subtracting the normalized popularity value for a site from the value in I_{U_{sim(a)}}.
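
The popularity correction described above could look roughly like the following sketch: both the global subscription counts and the counts among similar users are scaled to [0, 1] (here by dividing by the maximum, one possible choice of normalization), and the global popularity is subtracted from the local score. All names and data are illustrative.

```python
def normalize(counts):
    """Scale a dict of counts to [0, 1] by dividing by the largest count."""
    peak = max(counts.values(), default=0)
    if peak == 0:
        return {k: 0.0 for k in counts}
    return {k: v / peak for k, v in counts.items()}

def popularity_corrected_scores(similar_user_sub_counts, global_sub_counts):
    """Similarity score per site = normalized subscription count among
    similar users minus normalized global popularity, so sites that are
    merely popular overall are pushed down the ranking."""
    local = normalize(similar_user_sub_counts)
    popularity = normalize(global_sub_counts)
    return {site: local[site] - popularity.get(site, 0.0) for site in local}

# The globally dominant site ends up below the niche site for this user.
local_counts = {"big-site": 50, "niche-blog": 40}
global_counts = {"big-site": 2800, "niche-blog": 120}
scores = popularity_corrected_scores(local_counts, global_counts)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```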

The above-mentioned problem (where the recommender output resembles a list of sites ranked by overall popu-larity) was not apparent with listens and favourites as input signals. We expect the reason for this to be the large number and variety of tracks, combined with the sparsity in listens or favourites for a track.


5.3 Ensemble recommender

The goal of the ensemble recommender is to create a synthesis of the different outputs of each recommender in such a way that the ensemble recommender outperforms the individual recommenders. We expect the ensemble to outperform a single algorithm because, by combining different input signals and algorithms, the ensemble is expected to be better at handling the different kinds of user behaviour that we have observed in Figure 2.

We have chosen an approach where the outputs of individual recommenders are combined using weights. Weights are computed for each individual user, to account for differences in user behaviour. For example, one user may have a high number of listens but no favourites, while another user may have a small number of listens and a large number of favourites. By using personalized weights, we expect the ensemble recommender to be better at handling a heterogeneous group of users. We concentrate on finding the set of weights that provides the highest precision, since recall is not of great importance in our context.

Input data consists of test and training data for each user, where each line in the data represents a site that is either a positive or a negative example. Training and test data contain both positive items (sites the user is subscribed to) and negative items (randomly selected sites the user is not subscribed to) in a 1:3 ratio. For each site, seven features are provided, one for each recommender algorithm (see Table 4 for an overview of all recommender algorithms).

The last step in the process is to combine the outputs of the individual recommenders based on the optimal weights. The ensemble output for each site is computed by summing the weighted outputs of each individual recommender algorithm.
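
A minimal sketch of this final combination step is given below: each recommender's similarity score for a site is multiplied by the user's learned weight for that recommender, and the weighted scores are summed per site. The weight values and names are illustrative.

```python
def ensemble_scores(per_recommender_scores, weights):
    """Combine individual recommender outputs into one ranked list.

    per_recommender_scores: dict of recommender name -> {site: score}
        for the current user.
    weights: dict of recommender name -> learned weight for this user.
    Returns a list of (site, combined score), best first.
    """
    combined = {}
    for name, site_scores in per_recommender_scores.items():
        w = weights.get(name, 0.0)
        for site, score in site_scores.items():
            combined[site] = combined.get(site, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

outputs = {
    "CF-Sub":    {"site-a": 0.9, "site-b": 0.4},
    "CF-Fav":    {"site-b": 0.8, "site-c": 0.6},
    "CB-Listen": {"site-c": 0.5},
}
weights = {"CF-Sub": 1.0, "CF-Fav": 0.58, "CB-Listen": 0.0}
print(ensemble_scores(outputs, weights))
# [('site-a', 0.9), ('site-b', 0.864), ('site-c', 0.348)]
```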

6. EXPERIMENTAL SETUP

After describing the algorithms themselves and the input signals, we now elaborate on the evaluation of the recommender system. The experiment goal is to predict site subscriptions for individual users. We do so by training the recommender on training data consisting of different input signals and checking the recommender output for sites in the test data.

We first describe the implementation details of the algorithms in the Shuffler.fm environment, after which we elaborate on the data set and data processing.

6.1 Implementation details

All recommenders were implemented in the Ruby on Rails environment of Shuffler.fm. To significantly speed up the performance of the recommenders, certain aspects (such as computing genre vectors for all tastemaker sites) are cached using memcached.

Group   Favourites-to-listens ratio   User count
A       lowest (F ≪ L)                64
B       average (F:L ≈ 1:10)          29
C       largest (F ≫ L)               28
D       F ≈ L                         20

Table 3: The ratio of favourites (F) to listens (L) and the number of users in each group. The first group (A) has users with the lowest ratio, the second (B) has users with an average favourites-to-listens ratio (1:~10), the third group (C) has the largest ratio, and the users of the fourth group (D) have approximately an equal number of listens and favourites.

Sofia-ml is used to find the ideal weights for each user. After several test runs with different settings and learner algorithms in sofia-ml, the ROMMA algorithm [12] was found to return the best and most consistent results for each combination of user and recommender algorithm. Input data files for sofia-ml are formatted in the SVM-light sparse data format.
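
For reference, the sketch below shows how one training example per candidate site could be written in the SVM-light sparse format that sofia-ml reads (a label followed by index:value pairs with 1-based feature indices). The feature ordering, file name and example values are assumptions for illustration.

```python
def svmlight_line(label, feature_values):
    """Format one example in SVM-light sparse format:
    '<label> <index>:<value> ...', omitting zero-valued features."""
    parts = [str(label)]
    for index, value in enumerate(feature_values, start=1):
        if value != 0.0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)

# One positive and one negative candidate site, each described by the
# scores of the seven individual recommenders (ordering is assumed).
positive_site = [0.0, 0.1, 0.0, 0.2, 0.5, 0.6, 0.9]   # user is subscribed
negative_site = [0.0, 0.0, 0.1, 0.0, 0.2, 0.1, 0.3]   # randomly sampled

with open("user_train.dat", "w") as f:
    f.write(svmlight_line(1, positive_site) + "\n")
    f.write(svmlight_line(-1, negative_site) + "\n")
# Example line: "1 2:0.1 4:0.2 5:0.5 6:0.6 7:0.9"
```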

Overall, the CF-Sub recommender proved to be the most efficient. We noted a correlation between the size of the input signals and recommendation time for the collaborative filtering algorithm. This effect was not apparent in the content-based algorithms. We hypothesize that the abstraction offered by genre vectors is the reason for this difference in recommendation time. However, efficiency was not considered an issue in implementation, as it is out of the scope of the current research and its experimental test setting.

6.2 Data set

To preserve sufficient site subscriptions in the test data, users with fewer than 20 subscriptions were filtered out. Users are automatically assigned to one of four groups, to account for the differences in user behaviour. The characteristics and the number of users per group can be found in Table 3.

Discerning four different user groups gives us two advantages: (1) the groups are based on the ratio of favourites to listens, which will give us more insight into the difference in recommender effectiveness between explicit and implicit data, and (2) by evaluating different groups, we can test whether the ensemble recommender is able to handle differences in user behaviour.

6.3 Training and testing


Algorithm   MAP     P@5     P@10    P@20
Random      0.004   0.002   0.004   0.003
CB-Listen   0.014   0.002   0.006   0.010
CB-Fav      0.015   0.005   0.008   0.010
CB-Sub1     0.014   0.010   0.012   0.012
CB-Sub2     0.017   0.022   0.019   0.015
CF-Listen   0.050   0.073   0.060   0.045
CF-Fav      0.061   0.096   0.065   0.049
CF-Sub      0.082   0.111   0.095   0.079

Table 4: Performance of all individual recommender algorithms. The Random algorithm returns random sites and serves as a baseline for comparison.

Training and test data are chronologically divided in a 75:25 ratio; the training data contains input data (listens, favourites or subscriptions) up until the date of the x-th subscription, where x = 0.75 × the total number of subscriptions. The test data is made up of the remaining 25% of subscriptions.
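
A sketch of this chronological split, assuming each subscription carries a timestamp (all names here are hypothetical):

```python
def chronological_split(subscriptions, train_fraction=0.75):
    """Split a user's subscriptions chronologically.

    subscriptions: list of (timestamp, site) tuples. The oldest 75% form
    the training cut-off; the most recent 25% are held out for testing.
    """
    ordered = sorted(subscriptions)               # oldest first
    cut = int(train_fraction * len(ordered))      # x = 0.75 * total
    return ordered[:cut], ordered[cut:]

subs = [(1, "site-a"), (2, "site-b"), (3, "site-c"), (4, "site-d")]
train, test = chronological_split(subs)
print(train)   # [(1, 'site-a'), (2, 'site-b'), (3, 'site-c')]
print(test)    # [(4, 'site-d')]
```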

6.4 Evaluation

Without the means to do online A/B testing, the current approach is an offline approximation of a scenario where real users are asked to rate sites that are returned by the recommender on a binary scale (relevant or not relevant).

The recommender output for each user is a list of tuples (site, similarity score) that is ranked by similarity (i.e. the most relevant site is the first item in the list). Based on our context and goal, we compute Mean Average Precision and Precision@5, @10 and @20. Precision@x is most important to us, because P@x tells us the number of relevant sites in the first x sites returned by a recommender.
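
For clarity, a small sketch of the two measures as used on a ranked list of sites: precision at cut-off k, and average precision per user (one common formulation, averaged over users to obtain MAP). Names and data are illustrative.

```python
def precision_at_k(ranked_sites, relevant, k):
    """Fraction of the top-k returned sites that are relevant."""
    top_k = ranked_sites[:k]
    return sum(1 for site in top_k if site in relevant) / k

def average_precision(ranked_sites, relevant):
    """Mean of the precision values at every rank where a relevant site
    appears; 0.0 if no relevant site is returned."""
    hits, score = 0, 0.0
    for rank, site in enumerate(ranked_sites, start=1):
        if site in relevant:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

ranked = ["s1", "s2", "s3", "s4", "s5"]     # recommender output, best first
relevant = {"s1", "s4"}                     # sites in the user's test data
print(precision_at_k(ranked, relevant, 5))  # 0.4
print(average_precision(ranked, relevant))  # (1/1 + 2/4) / 2 = 0.75
```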

Recommender outputs were evaluated using Perf (http://osmot.cs.cornell.edu/kddcup/software.html) and processed using Python.

7. RESULTS & ANALYSIS

In this section, we examine the results of the experiments according to our research questions. First, the differences between implicit and explicit data are discussed, by comparing the outcomes of the individual recommender algorithms. Second, we analyze the performance of the ensemble recommender.

7.1 Recommendation effectiveness using implicit and explicit data signals

What is the recommendation effectiveness of a recommender system trained on implicit and explicit data?

Table 4 provides an overview of the performance of each individual recommender algorithm. We note that the performance of the content-based algorithms is inferior to that of their collaborative-filtering equivalents; in some cases the content-based algorithms are even outperformed by the random algorithm. We will first examine this poor performance, after which we compare the recommendation effectiveness of algorithms that use either implicit or explicit data.

7.1.1 Performance of content-based algorithm

One challenge in using genre vectors is the noisy character of genre data (as we have noted in the section on input data). Besides being noisy, genre information is also highly subjective and not based on certain standards [8]. This may have led to its poor performance in contrast to the collaborative filtering algorithm.

When considering the individual content-based recommenders, we ascribe the poor performance of the CB-Listen and CB-Fav algorithms to the indirect relation between input signals and output data. This can be further explained by considering the following example. A user has a preference for hip-hop and indie music. The user is subscribed to two kinds of tastemaker sites: sites that specialize in hip-hop and sites that specialize in indie music. Furthermore, the user listens to both hip-hop and indie tracks. The genre vector for this user reflects both genres, and therefore the algorithm returns sites that have both hip-hop and indie songs or any symbiosis of the two genres (indie rap?), instead of sites specializing in one of the two genres (akin to the user's subscriptions).

In conclusion, the content-based genre vector is limited in describing a user's varied music taste. The content-based algorithms that utilize genre vectors find sites that are sufficient in providing music that is a “mean” of the genres in the input data.

7.1.2 Differences in content-based algorithms

We start by examining the results of the content-based algorithms, according to the four different user groups in Table 3. The recommendation effectiveness over all user groups (Table 4) shows no large difference in performance between the CB-Listen and CB-Fav algorithms. By observing the difference between user groups (see Appendix B), we note a small difference in P@5 in favor of the CB-Fav algorithm for user groups B and C. The content-based algorithms that use subscriptions as input signals (CB-Sub1 and CB-Sub2) show a small increase in performance over CB-Listen and CB-Fav, although this difference is only clearly noticeable in the mean P@5 of the CB-Sub2 algorithm. We expect the reason for this advantage to be the direct relation between input and output data (i.e. sites) of CB-Sub1 and CB-Sub2. These algorithms use site genre vectors, which can directly be compared using cosine similarity (see 5.1). In the CB-Listen and CB-Fav algorithms, input signals need to be transformed into genre vectors to allow for comparison with sites. Creating genre vectors based on a user's listens or favourites is done using the same approach that is used for sites. By doing so, it is assumed that a user's listens or favourites can be translated into a user's musical taste (as reflected by the vector) by the same process that is used for sites, where the site's tracks are used to create the genre vector. The higher MAP of the CB-Sub1 and CB-Sub2 algorithms hints that this assumption might be invalid.

7.1.3 Differences in collaborative-filtering algorithms

When comparing favourites to listens as input data for the collaborative filtering algorithm, the CF-Fav algorithm significantly outperforms the CF-Listen algorithm. When comparing P@5, the CF-Fav algorithm shows a 32% increase in performance over the mean of all recommenders. The difference in P@20 is only 8%.

CF-Sub is the best performing collaborative filtering algorithm and also the best overall recommender algorithm. It is only outperformed by the CF-Fav algorithm in user group C, when comparing MAP and P@5. We ascribe the better performance of the CF-Sub algorithm - when compared to the other collaborative filtering algorithms - to the direct relation between input signals and output data (as mentioned in 7.1.2). The CF-Listen and CF-Fav algorithms use track mapping to transform input data to sites.

7.1.4 Users with a high listencount

The CF-Listen algorithm is expected to perform better for users in user group A, since the users in that group have a high listen count (i.e. more input signals). More listens lead to more information for the collaborative filtering algorithm to distinguish between similar users (because it has a higher number of listens to use in comparing users). Therefore, we expect the CF-Listen recommender to return more relevant results for users with a higher listencount.

However, by comparing performance between user groups (see Appendix B.1), we see no increase for user group A. An explanation for this unexpected result would be the variety within large numbers of listens. Users with a high listen count are expected to listen to tracks that span a large number of genres and artists. By comparing the genre vectors of 5 random users from group A (users with a high listencount) to users from group C (users with a low listencount), we note that the genre vectors of users in group A are of greater length (which means these users listen to a larger variety of genres). This diversity in genres may be a cause of noise in listen data, which could be the reason for the worse recommender performance of the CF-Listen algorithm.

To confirm our hypothesis, we test it on the 10 users with the highest listencount in the data set, so that the difference in recommender performance should be clearly visible. We consider the performance of the CF-Listen algorithm for the 10 users with the highest listencount from user group A. Users from this group have an average of 6,820 listens, compared to the average of 1,487 listens for user group A as a whole. P@5 for these 10 users is 0.0. However, P@10 = 0.38, which is the same for the 10 users and for all users in group A. Still, the Mean Average Precision for the 10 users with the highest listencount in user group A is 0.037; 25% below the MAP for all users in user group A.

When we observe the performance of the CB-Listen algorithm for the same group of 10 users, we can draw a similar conclusion. P@5, P@10 and P@20 are all 0.0. Furthermore, the MAP for these users is 0.007, 50% lower than the MAP for all users.

Finally, we check for poor performance of the 10 users with the highest listencount in the CB-Fav and CF-Fav algorithms. The above-mentioned results may be caused by errors or noise that are specific to one or more of these users. However, we observe no large deviation from the average values of user group A. Therefore, we conclude that listen-based algorithms perform worse for users with a high listen count, compared to the performance of favourite- or subscription-based algorithms for these users.

7.1.5 Users with a high favourite count

We now examine whether the above results also apply to users with a high number of favourites when considering the CF-Fav algorithm. Based on our former observations for users with a high listencount, we expect poorer performance of the CF-Fav algorithm for users with a high favouritecount.

Yet, by observing the results for user group C (users with a relatively high favouritecount), we see a clearly better performance: P@5(CF-Fav) = 0.164, which is the highest P@5 of all four user groups (noticeably higher than user groups A and B; only slightly higher than user group D).

To double-check our observations we look further into these findings. Consider the performance of the CF-Fav algorithm for the 10 users in the data set with the highest number of favourites, compared to the performance of the two algorithms for all user groups. For these 10 users, MAP(CF-Fav) = 0.137, which is a 125% increase over the mean of all user groups and a 41% increase over the users in the same group. Moreover, for the 10 users with the highest favouritecount, P@5(CF-Fav) = 0.260, P@10(CF-Fav) = 0.15 and P@20(CF-Fav) = 0.105, which are all noticeable increases over both the users in user group C and the means for all user groups (see Table 4).


User group   MAP     P@5     P@10    P@20
A            0.087   0.094   0.077   0.070
B            0.055   0.062   0.048   0.047
C            0.088   0.150   0.114   0.075
D            0.102   0.160   0.110   0.095
Mean         0.083   0.117   0.087   0.072

Table 5: Recommender performance of the ensemble recommender over the four user groups. Bold values are improvements over the highest recommender score for that user group.


This means that favourite-based algorithms have a better performance for users with a high number of favourites, when compared to users with an average number of favourites.

7.1.5.1 Conclusion

If we combine the above findings with the average performance of the listen- and favourite-based recommender algorithms, we can draw two conclusions. First, there is no difference in performance between listen-based and favourite-based recommenders for users that have an average listen-to-favourite ratio. This means that the recommendations from these two algorithms can be complementary. We will explore this in the next section on ensemble recommenders.

Second, the favourite-based algorithms have a higher performance for users with a very large number of favourites, and the listen-based algorithms have a lower performance for users with a very high listencount.

7.2 Performance of ensemble recommender

Can recommendation effectiveness be improved by training a recommendation ensemble based on recommender systems that use either implicit or explicit data?

The evaluation of the ensemble recommender can be found in Table 5. Bold values denote an improvement over the best performing recommender algorithm in each user group, since such an improvement is expected when combining algorithms. The ensemble outperforms the individual recommenders in MAP and P@5, which is in line with our hypothesis.

7.2.1 Performance of ensemble recommender

We will examine the process of the ensemble recommender from the perspective of two individual users in our data set: user u_i, for which the ensemble outperforms a single algorithm, and user u_j, for which the ensemble shows no improvement over the best performing individual algorithm. By analyzing user u_i, we show how the ensemble recommender successfully improves recommendation effectiveness for a user. By analyzing user u_j, we show when and how the ensemble recommender fails to improve recommendation effectiveness.

Outputs of the individual recommenders and the weights for u_i and u_j can be found in Appendix B.2. For reference purposes, we have also provided the average weights for each user group in Table 10 in Appendix B.2. By observing Table 8 and Table 9 (in Appendix B.2), we notice that algorithms with a high MAP tend to have a higher weight. Although it may seem like a direct correlation, it is not. Weights are determined by the recommender's similarity values for sites in the individual recommender output and not according to the MAP performance of each algorithm. For this reason, two algorithms with an equal P@5 (e.g. CF-Sub and CF-Listen for u_i) have very different weights.

Consider user u_i, a user for which the ensemble recommender is more effective than a single recommender. We note a positive site match as the first result in the recommender output of CF-Sub. This same site is ranked first in the ensemble output. The rest of the top five results in the ensemble recommender are two other relevant sites and two non-relevant sites from the recommender output of CF-Fav. We conclude that for u_i, the ensemble recommender has a higher P@5 and is therefore better than a single individual algorithm.

Now consider user u_j, for which the recommender ensemble did not outperform the highest scoring individual recommender (see Table 9 in Appendix B.2). Again, the CF-Sub algorithm has the highest associated weight, followed by the CF-Fav algorithm. The top result in the ensemble recommender is a positive site match that is also the top result for CF-Fav and CF-Sub. In the output of CF-Sub we find two other relevant sites at positions 22 and 28, but these sites are pushed to lower positions in the ensemble output due to non-relevant results from the CF-Fav algorithm (where the next positive site match is the 64th result).

The above example illustrates a challenge in our approach to ensemble data fusion. Due to high noise (i.e. a high number of non-relevant items compared to relevant items) in the individual recommender outputs, sites that are relevant are often pushed back by non-relevant sites. This provides an explanation for the lack of a performance boost of the ensemble recommender.

8. CONCLUSION

We have performed a two-fold research: (1) we have examined the differences between explicit and implicit user data by comparing recommender performance, and (2) we have analyzed the performance of a recommender ensemble on music discovery data.


Our first finding is that users with a very high listencount receive recommendations that are worse when compared to the recommendations of users with an average ratio of listens to favourites. Users with a very high number of favourites receive better recommendations compared to average users. We have attributed these observations to the high track diversity in the listens of users with a very high listen count, leading to noise in the recommender input data.

Our second finding is that the data fusion method in our approach to forming an ensemble of recommenders relies on the performance of the individual recommender algorithms. Due to a surplus of non-relevant sites in the individual recommender outputs, relevant sites (i.e. positive matches) receive a lower rank when combined with non-relevant items in the ensemble output. If the individual recommenders show good performance (and diversity in positive site matches), the recommender ensemble is more likely to outperform an individual recommender. The importance of diversity in individual recommender output was also noted by Tiemann et al. (2007).

9. FUTURE WORK

The noted difference between implicit listen data and explicit favourite data is a subject of future work, especially since it is not in accordance with the outcome of previous research, where a similar performance between explicit and implicit data was noted [10]. Future work should focus on the generalization of our findings, by using other music recommendation data sets or different characteristics for user groups. Secondly, our findings should be re-evaluated by online A/B testing, since offline evaluation might not suffice for recommender systems that have the goal of recommending novel items. [9]

Furthermore, we emphasize the importance of testing different data fusion methods for ensemble recommenders for music discovery, such as CombMNZ [7] or WTGF [17], in order to improve ensemble recommender performance. Another way to improve the ensemble recommender is to tune the ratio of positive to negative examples per user while training the ensemble. Finally, there is a need for more recommender algorithms, to increase diversity in recommender outputs and overall ensemble performance.

10. ACKNOWLEDGEMENTS

Manos Tsagkias has been a great source of inspiration and enthusiasm during the process of this master's thesis. He has provided numerous valuable suggestions, such as proposals for recommender algorithms, and corrections on the final draft. The author would also like to thank Ilya Markov for being the second reader of the master's thesis.

11. REFERENCES

[1] G. Adomavicius and A. Tuzhilin. Towards the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[2] A. Bellogín. Predicting performance in recommender systems. In Proceedings of the fifth ACM conference on Recommender systems, pages 371–374. ACM, 2011.
[3] E. Bernhardsson. Implementing a scalable music recommender system. Skolan för datavetenskap och kommunikation, Kungliga Tekniska högskolan, 2009.
[4] Ò. Celma. Music Recommendation and Discovery: The Long Tail, Long Fail, and Long Play in the Digital Music Space. Springer, 2010.
[5] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international conference on World Wide Web, pages 271–280. ACM, 2007.
[6] S. Dooms. Dynamic generation of personalized hybrid recommender systems. In Proceedings of the 7th ACM conference on Recommender systems, pages 443–446. ACM, 2013.
[7] E. A. Fox and J. A. Shaw. Combination of multiple searches. NIST Special Publication SP, pages 243–243, 1994.
[8] E. Gstrein, F. Kleedorfer, R. Mayer, C. Schmotzer, G. Widmer, O. Holle, and S. Miksch. Adaptive personalization: A multi-dimensional approach to boosting a large scale mobile music portal. In Fifth Open Workshop on MUSICNETWORK: Integration of Music in Multimedia Applications, pages 1–8, 2005.
[9] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53, 2004.
[10] G. Jawaheer, M. Szomszor, and P. Kostkova. Comparison of implicit and explicit feedback from an online music recommendation service. In Proceedings of the 1st international workshop on information heterogeneity and fusion in recommender systems, pages 47–51. ACM, 2010.
[11] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 154–161. ACM, 2005.
[12] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):361–387, 2002.
[13] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.
[14] A. Lommatzsch, B. Kille, and S. Albayrak. Learning hybrid recommender models for heterogeneous semantic data. In Proceedings of the 28th Annual ACM Symposium on Applied Computing, pages 275–276. ACM, 2013.
[15] S. Ltd. Information — Spotify Press. http://press.spotify.com/us/information/, Jan. 2014.
[16] H. Ma, I. King, and M. R. Lyu. Learning to recommend with social trust ensemble. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 203–210. ACM, 2009.
[17] I. Markov and N. Vassilieva. Image retrieval: Color and texture combining based on query-image. In Image and Signal Processing, pages 430–438. Springer, 2008.
[18] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on Computer supported cooperative work, pages 175–186. ACM, 1994.
[19] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pages 285–295. ACM, 2001.
[20] D. Sculley. Large scale learning to rank. In NIPS 2009 Workshop on Advances in Ranking, pages 1–6, 2009.
[21] Shuffler.fm. Shuffler.fm: Radio powered by tastemaker music sites and blogs. http://shuffler.fm/pages/jobs, Jan. 2014.
[22] Shuffler.fm. Shuffler.fm: Radio powered by tastemaker music sites and blogs. http://shuffler.fm/pages/learn-more, July 2014.
[23] Videotrine. The most viewed videos on YouTube in the world of all time - videotrine.com. http://en.videotrine.com/all/youtube/all-time, Jan. 2014.
[24] K. Yoshii, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno. Hybrid collaborative and content-based music recommendation using probabilistic model with latent user preferences. In ISMIR, 2006.


APPENDIX

A. STATISTICS

A.1 Site statistics

Figure 4: The number of subscribed users per tastemaker site, ranked by number of subscriptions.

Figure 5: The number of listens per site, ranked by number of listens.

Site name                  G               S
Pitchfork: Track Reviews   hip-hop         2802
Pretty Much Amazing        hip-hop         2314
Stereogum                  british         1512
FACT magazine              jazz            1303
The FADER                  hip-hop         1303
EARMILK                    electronic      1183
XLR8R                      mixtape         1151
GORILLA VS. BEAR           ambient         972
BOILER ROOM                remix           951
brooklynvegan              ambient         934
Indie Shuffle              indie           886
The Line Of Best Fit       indie           696
Dazed Digital              rap             690
NME New Music Radar        death metal     686
The Music Ninja            remix           648
Yours Truly                minimal synth   638
Okayplayer                 hip-hop         550
DIY                        pop             543
Aquarium Drunkard          country         542
Dummy Mag                  electronic      525
The Wild Honey Pie         indie           508
Dancing Astronaut          electronic      500
Mad Decent Blog            remix           484
Prefix                     mixtape         479
stupidDOPE                 hip-hop         473
Turntable Kitchen          indie           463
Tiny Mix Tapes             ambient         456
Consequence of Sound       hip-hop         448
La.Ga.Sta.                 house           441
2dopeboyz                  hip-hop         439
deepgoa                    remix           404
HYPETRAK                   hip-hop         398
Potholes In My Blog        chicago         397
SOUNDISSTYLE               remix           387
FEEL MY BICEP              mixtape         386
ISO50 Blog                 electronic      384
yvynyl                     indie           380
NME Reviews                electronic      359
Hard Candy                 remix           346
The 405                    electronic      341

Table 6: The 40 sites with the largest number of subscriptions (S) and their most common genre (G).


A.2 Statistics on input signals

Figure 6: Number of listens per user (users ranked by number of listens)

Figure 7: Number of favourites per user (users ranked by number of favourites). Notice the amount of users with only 1 favourite.

Figure 8: Number of subscriptions per user (users ranked by number of subscriptions).


B. RESULTS

B.1 Individual recommender performance

                Rand    CB-Lis  CB-Fav  CB-Su1  CB-Su2  CF-Lis  CF-Fav  CF-Sub  Mean
Group A (F ≪ L)
  MAP           0.005   0.009   0.015   0.012   0.018   0.040   0.035   0.089   0.028
  P@5           0.006   0.009   0.006   0.009   0.028   0.050   0.053   0.103   0.033
  P@10          0.006   0.006   0.012   0.011   0.023   0.048   0.038   0.083   0.029
  P@20          0.004   0.005   0.008   0.012   0.016   0.039   0.032   0.077   0.024
Group B (average ratio)
  MAP           0.003   0.018   0.013   0.012   0.018   0.030   0.027   0.061   0.023
  P@5           0.000   0.000   0.007   0.007   0.021   0.028   0.028   0.055   0.018
  P@10          0.003   0.010   0.010   0.007   0.017   0.031   0.021   0.066   0.021
  P@20          0.002   0.016   0.012   0.010   0.010   0.029   0.017   0.067   0.020
Group C (F ≫ L)
  MAP           0.004   0.017   0.014   0.019   0.019   0.058   0.097   0.073   0.037
  P@5           0.000   0.000   0.007   0.014   0.029   0.093   0.164   0.107   0.052
  P@10          0.000   0.007   0.011   0.014   0.025   0.079   0.104   0.107   0.043
  P@20          0.002   0.007   0.012   0.014   0.025   0.052   0.073   0.082   0.033
Group D (F ≈ L)
  MAP           0.004   0.012   0.016   0.012   0.013   0.073   0.086   0.105   0.040
  P@5           0.000   0.000   0.000   0.010   0.010   0.120   0.140   0.180   0.058
  P@10          0.005   0.000   0.000   0.015   0.010   0.080   0.100   0.125   0.042
  P@20          0.005   0.010   0.008   0.010   0.010   0.060   0.073   0.090   0.033
Mean
  MAP           0.004   0.014   0.015   0.014   0.017   0.050   0.061   0.082
  P@5           0.002   0.002   0.005   0.010   0.022   0.073   0.096   0.111
  P@10          0.004   0.006   0.008   0.012   0.019   0.060   0.065   0.095
  P@20          0.003   0.010   0.010   0.012   0.015   0.045   0.049   0.079

Table 7: Recommender performance per algorithm per user group. The overall best-performing algorithm is the CF-Sub algorithm, although it is outperformed in Group C by the CF-Fav algorithm.


B.2 Results for individual users u_i and u_j

Algorithm   MAP     P@5   P@10   P@20    Weight
CB-Listen   0.008   0.0   0.0    0.0      0.0000
CB-Fav      0.007   0.0   0.0    0.0      0.0042
CB-Sub1     0.009   0.0   0.0    0.0      0.0288
CB-Sub2     0.005   0.0   0.0    0.0     -0.0209
CF-Listen   0.083   0.2   0.1    0.1      0.0526
CF-Fav      0.044   0.0   0.0    0.05    -0.0427
CF-Sub      0.160   0.2   0.1    0.05     1.0000
Ensemble    0.409   0.6   0.3    0.15

Table 8: Individual recommender and ensemble performance and associated (non-normalized) weights for user u_i in the data set.

Algorithm   MAP     P@5   P@10   P@20    Weight
CB-Listen   0.004   0.0   0.0    0.0     -0.0049
CB-Fav      0.004   0.0   0.0    0.0      0.0029
CB-Sub1     0.006   0.0   0.0    0.0      0.0700
CB-Sub2     0.007   0.0   0.0    0.0     -0.0714
CF-Listen   0.127   0.2   0.1    0.05    -0.4330
CF-Fav      0.127   0.2   0.1    0.05     0.5782
CF-Sub      0.019   0.0   0.0    0.0      1.0000
Ensemble    0.121   0.2   0.1    0.05

Table 9: Individual recommender and ensemble performance and associated (non-normalized) weights for user u_j in the data set.

             A          B          C          D
CB-Listen   -0.0143     0.0025     0.0073    -0.0849
CB-Fav      -0.0125    -0.0370     0.0043     0.0155
CB-Sub1      0.0491     0.0422     0.0390     0.0136
CB-Sub2     -0.0058     0.0018    -0.0269    -0.0046
CF-Listen    0.0767     0.0797     0.0732     0.0850
CF-Fav      -0.0090    -0.0463     0.0060    -0.0827
CF-Sub       0.9526     0.9198     0.9601     0.9492

Table 10: Individual weights per user group. Note that the overall best performing algorithm (CF-Sub) also has the overall highest weights.
