
MSc Econometrics

Track: Big Data Business Analytics

Master Thesis

Building a recommendation system for Amazon

by

Joost Jansen

10155864

March 20, 2017

Supervisor:

Dr N.P.A. van Giersbergen

Second reader: Dr J.C.M. van Ophem


Statement of Originality

This document is written by Student Joost Jansen who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Literature Review
  2.1 Baseline estimator
  2.2 K-Nearest Neighborhood model
    2.2.1 Pearson Correlation
    2.2.2 Cosine Similarity
    2.2.3 Case Study Similarity Calculations
  2.3 Latent factor models
    2.3.1 Stochastic Gradient Descent
    2.3.2 Case Study regularization term and biases
    2.3.3 Learning rate optimization
    2.3.4 Momentum and Adagrad
    2.3.5 Case Study Learning rate optimization
3 Data and Setup
4 Results
5 Discussion


1 Introduction

The rise of the internet has made it easier for consumers to read reviews and get recommendations about movies, restaurants, bars and physical products. As a result, the interest in recommendation systems has risen in the past few years. Almost all of the large webshops and internet streaming services have their own recommendation system in place. The goal of these recommendation systems is to optimize the revenue of the company that built them by personalising the experience of the consumers. One of their goals is to increase the sales made on top of the initial product a user is buying; these sales are called cross-sales. Amazon CEO Jeff Bezos already stated in 2006 that 35% of their total revenue of 10.7 billion dollars was generated via cross-sales, indicating the importance of good recommendations. Netflix also saw the importance of good recommendations for their customers: in 2006 they announced a 1 million dollar prize for the team that most improved their recommendation system, requiring a minimum improvement of a 10% decrease in the RMSE (Bennett & Lanning, 2007).

In the early years of recommendation systems these were mostly focused on numerical ratings. These ratings are often a number of stars or a score that customers give to products they have bought or viewed via the website. The first step in a recommendation system is to predict the ratings of a user on a set of items, after which the item with the highest expected rating is recommended. Two methodologies are broadly used to make recommendations based on numerical ratings. The first are K-Nearest Neighborhood models (KNN) as described by (Borchers, Herlocker, Konstan, & Riedl, 1999), where similarities are calculated between users or items and used for making recommendations. They are easy to compute and have already shown good results in recommending products to consumers. A second commonly used method is Latent Factor Modelling; in this research the Latent Factor Model as described by (Bell, Koren, & Volinsky, 2009) will be discussed and investigated. This method makes use of Matrix Factorization (MF), where one vector of user factors and one vector of item factors are formed. The expected rating is then calculated as the inner product of these vectors.

In this research a dataset from the Amazon website will be used (McAuley, Pandey, & Leskovec, 2015). For Amazon, recommendation systems are really important. A very popular quote from Amazon CEO Jeff Bezos is "If I have 2 million customers on the Web, I should have 2 million stores on the Web." With this quote he is pointing at the fact that the products presented on his website should be adjusted to every customer, to optimize the revenue Amazon generates. A recommendation system is a tool that can help to reach this goal. The dataset consists of separate subsets of item categories from May 1996 until July 2014. The ratings are on a 5-star rating scale, accompanied by textual reviews from the users.

The main goal of this research is to investigate the optimal way to make recommendations on the Amazon dataset. Multiple settings of the K-Nearest Neighborhood model and the Matrix Factorization model will be tested, so not only is a comparison made across the models, but the optimal settings per model are also investigated. The expectation is that the MF-model will outperform the KNN-model, since it is an iterative process in which a loss function is optimized. In the next chapter a literature review will be conducted, explaining the different models that are used in this research, including some case studies to investigate optimal parameter settings. The third section will discuss the data and setup of the research, followed by the results in section 4 and the conclusion in section 5.

2 Literature Review

In this chapter the different methods that will be used for building a recommendation system are discussed. The first subsection covers recommendation systems based on a baseline estimator. The second subsection discusses a recommendation system using a K-Nearest Neighborhood model (KNN-model), followed in the last subsection by the Matrix Factorization model (MF-model).

2.1 Baseline estimator

To compare the more complicated models discussed in the next subsections against something simple, a baseline estimator is introduced. This baseline estimator predicts the rating of an item by the average rating of that item, meaning that for every item the prediction is the same for every user:

\hat{r}_{u,i} = \hat{r}_i = \frac{\sum_{u \in U} r_{u,i}}{K},    (1)

where U is the set of users who rated item i and K is the number of users in U.

The expectation is that this will lead to better recommendations than a baseline estimator that uses randomly generated ratings, making it more difficult for the other models to improve on the baseline. But as the estimator is not personalised, the expectation is that this baseline estimator is not really useful on its own in a recommendation system. This baseline estimator will also be partly implemented in the KNN-model from the next subsection.
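As an illustration, the sketch below computes these per-item averages in Python. The thesis does not provide code; the nan-coded rating matrix and the function name are assumptions made here.

```python
import numpy as np

def baseline_predict(ratings):
    """Baseline estimator of equation (1): predict each rating by the
    item's average rating, identical for every user.

    ratings: array of shape (n_users, n_items), np.nan where unrated.
    """
    item_means = np.nanmean(ratings, axis=0)          # column-wise observed mean
    return np.tile(item_means, (ratings.shape[0], 1))

# Tiny example: 3 users, 2 items.
R = np.array([[5.0, np.nan],
              [4.0, 2.0],
              [np.nan, 4.0]])
print(baseline_predict(R))  # every row equals the item averages [4.5, 3.0]
```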

2.2 K-Nearest Neighborhood model

K-Nearest Neighborhood models calculate similarities between either users u or items i. Ratings are then predicted using the items/users with the highest similarity to the item or user. These items/users with the highest similarity are often referred to as 'Nearest Neighbors' (NN); the predictions are then based on K of these Nearest Neighbors, hence the name K-Nearest Neighborhood model. When KNN-models were introduced, the main focus was on user-oriented neighborhood models as in (Borchers et al., 1999).

One of the first problems these models had to overcome is that of matrix sparsity: for a lot of the item/user combinations the rating is not available, making it harder to calculate accurate similarities between users. A second very common problem for KNN-models is the cold start problem. This problem arises when a user has not rated any of the nearest neighbors of the item for which a prediction is made. As the predicted rating is based upon the ratings a user has given to these nearest neighbors, making a prediction is then impossible. This is always the case when a user is new or has rated only one item. In contrast, when a user has rated a lot of items, it is expected that the predicted rating is more trustworthy, as it is based upon more information.

Therefore the popularity of item-based neighborhood models has increased, as in (Karypis, Konstan, Riedl, & Sarwar, 2001). These models also suffer from the cold start problem, but as there are in general fewer items in a dataset than users, and an item in general has more ratings than a user has given on average, more information is available to calculate similarities and make recommendations. As stated by (Karypis et al., 2001), it is very common that even really active users have rated less than 1% of the items in a dataset. Expecting that the performance of an item-based KNN-model will be higher than that of a user-based one, this research will focus on item similarities.

There are multiple ways to calculate the similarities between the items in the dataset. Most of these similarity measures are either correlation-based or cosine-based. In this paper one of each type is discussed, the first one being the Pearson Correlation coefficient.

2.2.1 Pearson Correlation

The first step in the calculation of item similarities with the Pearson Correlation is the isolation of users that have rated both items; otherwise the items can not be compared. Let the set of users who rated both items i and j be denoted by U, then the correlation is calculated by:

s_{i,j} = PC(i,j) = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_i)^2} \, \sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_j)^2}},    (2)

where s_{i,j} is the correlation-based similarity between items i and j, r_{u,i} is the rating of user u on item i and \bar{r}_i is the average rating of item i. The value of the Pearson Correlation is in the range [-1,1]; the closer the coefficient is to an absolute value of 1, the stronger the relationship between the two items. The relationship can be negative as well as positive: when the correlation is negative, a higher score on one item leads to a lower predicted rating on the other item, and vice versa. For selecting the Nearest Neighbors the highest values are taken into account, meaning that negative similarities will only be selected when the number of neighbors used in the prediction is higher than the number of items with a positive correlation coefficient.
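As a concrete sketch of equation (2), again with assumed names and the nan-coded rating matrix from the earlier snippet (not code from the thesis):

```python
import numpy as np

def pearson_similarity(ratings, i, j):
    """Pearson Correlation between items i and j, equation (2)."""
    # Isolate the users who rated both items.
    both = ~np.isnan(ratings[:, i]) & ~np.isnan(ratings[:, j])
    if both.sum() < 2:
        return 0.0                                # too few common raters
    ri, rj = ratings[both, i], ratings[both, j]
    di, dj = ri - ri.mean(), rj - rj.mean()       # center by item averages
    denom = np.sqrt((di ** 2).sum()) * np.sqrt((dj ** 2).sum())
    return 0.0 if denom == 0 else float((di * dj).sum() / denom)
```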

2.2.2 Cosine Similarity

A second method to calculate the similarity between items is the Cosine Similarity, which is somewhat similar to the Pearson Correlation. As Karypis et al. (2001) state, the items are seen as two vectors in an n-dimensional user space, where n stands for the total number of users in the dataset. The similarity between two items is the cosine of the angle between these two vectors:

s_{i,j} = \cos(\vec{i}, \vec{j}) = \frac{\vec{i}^{\,T} \vec{j}}{\|\vec{i}\| \, \|\vec{j}\|} = \frac{\sum_{u=1}^{N} r_{u,i} \, r_{u,j}}{\sqrt{\sum_{u=1}^{N} r_{u,i}^2} \, \sqrt{\sum_{u=1}^{N} r_{u,j}^2}}.    (3)

The value of the Cosine Similarity is in the range [0,1] when the elements of the vectors are non-negative, which is true in this case as the elements of the vectors consist of the ratings. When the similarity between two items is high, the Cosine Similarity will be close to 1, whereas for a weak relationship between the ratings of the two items the Cosine Similarity will be close to 0. As can be seen in the formulas above, there is a clear relationship between the Cosine Similarity and the Pearson Correlation coefficient:

PC(i,j) = \cos(\vec{i} - \bar{i}, \vec{j} - \bar{j}),    (4)

where PC(i,j) is the Pearson Correlation coefficient between items i and j, the right-hand side is the Cosine Similarity between the mean-centered item vectors, and \bar{i} and \bar{j} are the averages of the ratings of items i and j. The Pearson Correlation can thus be seen as a centered version of the Cosine Similarity.
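A matching sketch of equation (3); treating unobserved ratings as zero in the sums is an assumption about the convention used, as is the helper name:

```python
import numpy as np

def cosine_similarity(ratings, i, j):
    """Cosine Similarity between items i and j, equation (3)."""
    # Missing ratings contribute zero to every sum.
    vi = np.nan_to_num(ratings[:, i])
    vj = np.nan_to_num(ratings[:, j])
    denom = np.sqrt((vi ** 2).sum()) * np.sqrt((vj ** 2).sum())
    return 0.0 if denom == 0 else float(vi @ vj / denom)
```

On fully observed vectors, applying this function after subtracting each item's average rating reproduces the Pearson Correlation, which is exactly the relationship of equation (4).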

The biggest difference between these similarity measures is their range: all values of the Cosine Similarity are positive, while the Pearson Correlation can also take negative values. In selecting the nearest neighbors a choice therefore has to be made for the Pearson Correlation: they can be selected either on the absolute value of the coefficients or on the highest positive correlation coefficient. In this thesis it is chosen not to select the nearest neighbors based on the absolute value, but to select the items with the highest positive correlation coefficient.

To see the differences between the Cosine Similarity and the Pearson Correlation, the similarities are calculated on all ratings of the 100 most rated movies. This dataset of the 100 most rated movies has 21,369 different users, with a total of 71,140 ratings, which means that on average the users have rated 3.3 items. The average of the ratings is 4.125 with a standard deviation of 1.179. Figure 1 shows an overview of the distribution of the ratings: more than half of the ratings are the highest possible rating, 5 stars, and 76% of the ratings are either 4 or 5 stars, meaning the ratings lie really close together. For comparison, the distribution of all movie ratings is also presented; the total dataset consists of 1,697,533 ratings, with 123,960 users and 50,052 items, meaning that on average the users have rated 13.69 movies. The average is 4.11 with a standard deviation of 1.198.

[Figure 1: Distribution of the ratings, for the total dataset and for the 100 most rated movies]

The distribution of the 100 most rated movies is clearly very similar to the overall distribution of all movie ratings. In Figure 2 the histograms of the Pearson Correlation and the Cosine Similarity coefficients of the 100 most rated movies are presented. At first sight the distributions of the coefficients look really similar. Only a small share of the Pearson coefficients is negative, and for both similarity measures most of the coefficients lie in the range [0, 0.05]. The mean of the Pearson Correlation coefficients is 0.014 with a standard deviation of 0.019; the mean of the Cosine Similarity coefficients is 0.046 with a standard deviation of 0.039. So the spread of the Cosine Similarity is larger, despite its smaller range of possible values. That the mean of the Cosine Similarity is higher than that of the Pearson Correlation is logical, considering the possible ranges of the similarities.

[Figure 2: Histograms of the similarity coefficients of the 100 most rated movies (Cosine Similarity and Pearson Correlation)]

To give more intuition for the similarity measures, a case study is conducted in the next subsection. As a first indication of the differences between the measures on the 100 most rated movies, the most and least similar movie pairs are shown for both similarity measures in Tables 1 and 2.


Movie 1                      Movie 2                         Similarity
Star Wars Episode 1          Star Wars Episode 2              0.313
Downton Abbey Season 2       Downton Abbey Season 3           0.293
Lord of the Rings 1          Lord of the Rings 2              0.258
Firefly: Complete series     Serenity                         0.256
Downton Abbey Season 2       Downton Abbey Season 1           0.231
Iron Man                     Star Wars Episode 2             -0.024
Star Wars Episode 1          Pirates of the Caribbean 1      -0.022
Downton Abbey Season 1       Wolf of Wall Street             -0.016
Band of Brothers             Captain America                 -0.015
Firefly: Complete series     Captain America                 -0.015

Table 1: Most and least similar movies, Pearson Correlation

Movie 1                      Movie 2                           Similarity
Star Wars Episode 1          Star Wars Episode 2               0.472
Firefly: Complete series     Serenity                          0.461
Lord of the Rings 1          Lord of the Rings 2               0.399
Downton Abbey Season 2       Downton Abbey Season 3            0.381
Downton Abbey Season 2       Downton Abbey Season 1            0.360
Gladiator                    Downton Abbey Season 2            0.0004
Star Wars Episode 2          The Best Exotic Marigold Hotel    0.001
Star Wars Episode 2          Dallas Buyers Club                0.001
Star Wars Episode 1          Dallas Buyers Club                0.001
Passion of the Christ        Downton Abbey Season 4            0.001

Table 2: Most and least similar movies, Cosine Similarity

It is clear that the 5 most similar combinations of movies are the same for both the Pearson Correlation and the Cosine Similarity. That the Star Wars, Lord of the Rings and Downton Abbey sequels are most similar to each other is a result that would be expected upfront. A bigger difference is seen in the non-similar movies. Striking is the fact that the Star Wars movies are really dissimilar to quite a few movies, as can be seen from the tables. Looking at the ratings of these two Star Wars movies, a big part can probably be explained by the fact that these movies score on average really low in the Amazon dataset.

2.2.3 Case Study Similarity Calculations

To see what the difference is between the similarity measures from the previous subsection, a small case study is conducted to show their behaviour in practice. Three datasets are formed: one with eight movies that seem to be similar, one with eight unrelated movies, and one where all these 16 movies are combined. The eight similar movies are all in the category 'superhero movies', or related to these kinds of movies, namely: World War Z, Thor, Blade Runner, Marvel's The Avengers, The Dark Knight Rises, Batman Begins, Iron Man 3 and Man of Steel. The eight non-similar movies are: Despicable Me, Star Trek Into Darkness, Frozen, Django Unchained, Titanic, Lincoln, Avatar and Ted. To be able to make a prediction, a user has to have rated at least one nearest neighbor of the item for which a prediction is being made; when a user has rated only one of the neighbors, the prediction will be equal to the rating of that neighbor. The movies in the first dataset seem to be more related, which comes back in the fact that there are almost twice as many users who have rated at least two movies in the similar dataset compared to the non-similar dataset, namely 271 versus 154. These users have also rated more of the movies in the dataset on average, 3.58 vs. 3.46. In the total dataset there are 851 users who have rated a minimum of two movies, so quite a lot of the users have rated movies that do not seem to be similar. In Table 3 some more characteristics of the datasets are shown.

Dataset       # Movies   # Users   # Ratings   Mean ratings   Std ratings
Similar       8          271       970         4.193          1.072
Non-similar   8          154       534         4.264          1.001
Total         16         851       3,296       4.180          1.097

Table 3: Overview of the datasets case study


As Table 3 shows, the three datasets are quite similar to each other, with the average of the ratings just slightly above 4 and the standard deviation of the ratings around 1. In Table 4 an overview per movie of the datasets is given.

Dataset   Movie                     # Ratings   Mean ratings   Std ratings
Sim       Marvel's The Avengers     2,110       4.519          0.8938
Sim       World War Z               1,584       3.677          1.2523
Sim       Man of Steel              1,281       3.820          1.3239
Sim       Batman Begins             995         4.443          1.0078
Sim       Thor                      930         4.305          0.9980
Sim       Iron Man 3                904         4.017          1.2185
Sim       Blade Runner              899         4.376          1.0900
Sim       The Dark Knight Rises     852         3.920          1.3029
Non-sim   Star Trek Into Darkness   1,974       4.312          1.0482
Non-sim   Frozen                    1,517       4.547          0.9224
Non-sim   Avatar                    1,461       4.120          1.3367
Non-sim   Lincoln                   1,093       4.062          1.2126
Non-sim   Despicable Me             1,029       4.668          0.7462
Non-sim   Ted                       1,005       3.479          1.4711
Non-sim   Django Unchained          949         4.155          1.1720
Non-sim   Titanic                   933         3.998          1.4301

Table 4: Overview ratings per movie

As could already be expected, the average ratings of the movies are really similar to each other. There seems to be a small trend in the non-similar set where the more often a movie is rated, the higher its average rating; for the similar dataset there is no clear trend. The next step is to calculate the similarities for all the movies in the datasets with the Pearson Correlation and the Cosine Similarity. In Tables 5 and 6, the similarity calculations for the Pearson Correlation and the Cosine Similarity are shown.


      x1       x2       x3       x4       x5       x6       x7       x8
x1    1
x2   -0.043    1
x3    0.075    0.053    1
x4    0.003    0.003    0.222    1
x5    0.063    0.063    0.116    0.050    1
x6   -0.011   -0.010    0.101    0.100    0.070    1
x7   -0.014   -0.006    0.179    0.127    0.005    0.133    1
x8    0.023    0.037    0.108    0.076    0.156    0.126    0.156    1

Table 5: Similarity Similar Dataset Pearson Correlation

      x1       x2       x3       x4       x5       x6       x7       x8
x1    1
x2    0.298    1
x3    0.247    0.432    1
x4    0.240    0.384    0.607    1
x5    0.137    0.300    0.491    0.328    1
x6    0.289    0.270    0.358    0.281    0.284    1
x7    0.201    0.336    0.629    0.454    0.343    0.549    1
x8    0.233    0.317    0.551    0.356    0.402    0.516    0.690    1

Table 6: Similarity Similar Dataset Cosine Similarity

As the Pearson Correlation coefficients are in the range [-1,1] and the Cosine Similarity in the range [0,1], the similarities can not be compared directly. It is more interesting to see whether both similarity measures select the same neighbors, which is not the case. For example, looking at the first item x1, the nearest neighbours for the Pearson Correlation, from closest to most distant, are x3, x5, x8, x4, x6, x7, x2; for the Cosine Similarity these are x2, x6, x3, x4, x8, x7, x5. These neighbor orderings are clearly very different, which can have a big impact on the prediction of the ratings. But as the sample size is relatively small, it is too soon to draw the conclusion that the Pearson Correlation and Cosine Similarity measures are significantly different in selecting the nearest neighbors.

For the non-similar dataset the similarity coefficients are presented in Tables 7 and 8. It is hard to directly compare the coefficients of Table 7 to those of Table 5, but at first glance the coefficients of Table 7 seem to have higher values.

      x9       x10      x11      x12      x13      x14      x15      x16
x9    1
x10   0.038    1
x11   0.007    0.053    1
x12  -0.007   -0.049    0.009    1
x13   0.056    0.077    0.015   -0.032    1
x14  -0.015    0.051    0.023    0.036    0.055    1
x15   0.020    0.093    0.020    0.023    0.164    0.065    1
x16  -0.017    0.051    0.116   -0.090    0.061    0.103    0.071    1

Table 7: Similarity non-Similar Dataset Pearson Correlation

      x9       x10      x11      x12      x13      x14      x15      x16
x9    1
x10   0.432    1
x11   0.292    0.501    1
x12   0.176    0.294    0.273    1
x13   0.283    0.373    0.217    0.476    1
x14   0.251    0.430    0.408    0.408    0.494    1
x15   0.229    0.396    0.220    0.385    0.571    0.577    1
x16   0.236    0.339    0.463    0.221    0.309    0.458    0.322    1

Table 8: Similarity non-Similar Dataset Cosine Similarity

For the Cosine Similarity the coefficients seem smaller compared to the coefficients of the similar dataset in Table 6. What the effect of these smaller coefficients is will be seen in the prediction of the ratings. In Tables 9 and 10 the results are shown for the total dataset with all 16 movies.


      x1      x2      x3      x4      x5      x6      x7      x8      x9      x10     x11     x12     x13     x14     x15     x16
x1    1
x2   -0.043   1
x3    0.075   0.053   1
x4    0.003   0.003   0.222   1
x5    0.063   0.063   0.116   0.050   1
x6   -0.011  -0.010   0.101   0.100   0.070   1
x7   -0.014  -0.006   0.179   0.127   0.005   0.133   1
x8    0.023   0.037   0.108   0.076   0.156   0.126   0.156   1
x9    0.021   0.038   0.015   0.035   0.045  -0.001  -0.014   0.071   1
x10  -0.007  -0.012   0.074   0.192   0.091   0.021   0.038   0.072   0.038   1
x11   0.041  -0.005   0.091   0.135   0.054  -0.047  -0.017   0.062   0.007   0.053   1
x12   0.034  -0.004   0.029   0.043   0.108   0.109   0.031   0.013  -0.007  -0.049   0.009   1
x13  -0.019   0.074   0.105   0.036   0.082  -0.013  -0.010   0.127   0.056   0.077   0.015  -0.032   1
x14  -0.009   0.003   0.201   0.126   0.103   0.185   0.160   0.124  -0.015   0.051   0.023   0.036   0.055   1
x15   0.007   0.030   0.161   0.069   0.049   0.054   0.076   0.077   0.020   0.093   0.020   0.023   0.164   0.065   1
x16  -0.030   0.017   0.174   0.050   0.027   0.075   0.071   0.044  -0.017   0.051   0.116  -0.090   0.061   0.103   0.071   1

Table 9: Similarity Total Dataset Pearson Correlation

      x1      x2      x3      x4      x5      x6      x7      x8      x9      x10     x11     x12     x13     x14     x15     x16
x1    1
x2    0.269   1
x3    0.151   0.242   1
x4    0.155   0.244   0.427   1
x5    0.078   0.193   0.390   0.195   1
x6    0.173   0.155   0.227   0.194   0.203   1
x7    0.123   0.201   0.420   0.285   0.194   0.377   1
x8    0.152   0.209   0.368   0.251   0.246   0.360   0.481   1
x9    0.181   0.300   0.185   0.121   0.111   0.089   0.084   0.141   1
x10   0.219   0.346   0.328   0.313   0.170   0.224   0.191   0.221   0.264   1
x11   0.083   0.189   0.255   0.245   0.141   0.156   0.192   0.157   0.146   0.288   1
x12   0.071   0.155   0.301   0.169   0.284   0.221   0.174   0.160   0.087   0.151   0.172   1
x13   0.155   0.165   0.273   0.172   0.246   0.254   0.211   0.235   0.147   0.172   0.110   0.254   1
x14   0.181   0.174   0.376   0.255   0.198   0.480   0.518   0.482   0.114   0.229   0.175   0.186   0.258   1
x15   0.126   0.231   0.255   0.163   0.187   0.227   0.240   0.242   0.169   0.199   0.115   0.227   0.393   0.255   1
x16   0.063   0.123   0.242   0.148   0.088   0.210   0.274   0.263   0.119   0.166   0.257   0.175   0.158   0.262   0.179   1

Table 10: Similarity Total Dataset Cosine Similarity

It has to be noted that for the Cosine Similarity the similarities between items from the same dataset differ from the values in Tables 6 and 8. This is due to the fact that more ratings are added to the calculations of the similarities: users who rated only one item from a dataset were filtered out in the previous calculations, but if they rated one item from the similar dataset and one from the non-similar dataset, their ratings are included in the total dataset. Because the Pearson Correlation isolates the users that have rated both items in the calculation of the similarity, those coefficients stay unchanged. The expectation would be that the coefficients between items from the similar dataset are higher than those within the non-similar dataset and those between items of the two datasets. In Table 11 the averages of the similarity measures are shown.

Dataset       Pearson Correlation   Cosine Similarity
Similar       0.070                 0.249
non-Similar   0.194                 0.194
Cross-set     0.056                 0.213

Table 11: Averages Similarity measures

It does not make sense to directly compare the Pearson Correlation averages to those of the Cosine Similarity. But what is striking is that for the Pearson Correlation the average of the coefficients of the similar dataset is smaller than that of the non-similar dataset, which contradicts what would be expected. For the Cosine Similarity this is not the case: there the mean similarity coefficient of the similar dataset is higher than that of the non-similar dataset. What is also striking is that the similarities within the similar dataset are close to the cross-set similarities, where it would be more expected that the similarities of the non-similar dataset would be close to the cross-set similarities. This holds for the Cosine Similarity as well as for the Pearson Correlation.

[Figure 3: Histograms of the similarity coefficients for the similar and non-similar datasets (Cosine Similarity and Pearson Correlation)]


To better analyse the Pearson Correlation and Cosine Similarity measures for the similar and non-similar movies, Figure 3 shows the histograms of the similarity coefficients. It is surprising to see that for the Pearson Correlation almost all correlations are positive, where especially for the non-similar movies negative correlations would be expected. In fact, the correlations are higher for the non-similar movies than for the similar movies. For the Cosine Similarity the coefficients seem slightly higher for the similar movie dataset than for the non-similar dataset. For the Pearson Correlation, the correlations between similar movies seem just as strong as the correlations between movies from the similar dataset and the non-similar dataset, which would be expected to be unrelated. Concluding, the higher correlations for more similar movies are not seen in these datasets, which contradicts the expectation. This can be due to the fact that the movies in the non-similar dataset all score really high on average, making the ratings of the movies close to each other and hence the similarity measures really high. In a K-Nearest Neighborhood model, movies apparently obtain a high correlation when their average ratings are similar; this matters more than whether the movies are really similar to each other content-wise.

The next step in this case study is to look at the predictions of the ratings made with these similarity measures. The weighted sum of the ratings is calculated with the following formula:

\hat{r}_{u,i} = \frac{\sum_{n \in N} s_{i,n} \, r_{u,n}}{\sum_{n \in N} |s_{i,n}|},    (5)

where \hat{r}_{u,i} is the predicted rating for user u on item i and N is the set of similar items incorporated in the calculation of the predicted rating. The size of N is the number of nearest neighbours used in the recommendation system. A higher number of neighbours will in general lead to a higher prediction accuracy, but also to a higher computation time. It also has to be taken into account that the neighbours can be so unrelated that adding an extra neighbour is nothing more than adding random noise to the model. The items with the highest predicted ratings are the items that are recommended to the user. In practice it will occur that for a user-item combination no neighbors are available to make a recommendation; when this is the case, the baseline estimator from the previous subsection is used.
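A sketch of this weighted-sum prediction, given a precomputed item-item similarity matrix; the neighbor selection, names and fallback behaviour are assumptions made for illustration, not the thesis's implementation:

```python
import numpy as np

def predict_rating(ratings, sims, u, i, k=10):
    """Weighted-sum prediction of equation (5) for user u on item i.

    sims: (n_items, n_items) item-item similarity matrix.
    Returns np.nan when no neighbor is usable, so the caller can fall
    back to the baseline estimator.
    """
    rated = np.where(~np.isnan(ratings[u]))[0]      # items user u has rated
    rated = rated[rated != i]
    if rated.size == 0:
        return np.nan
    # The k rated items most similar to item i are its nearest neighbors.
    neighbors = rated[np.argsort(sims[i, rated])[::-1][:k]]
    s = sims[i, neighbors]
    if np.abs(s).sum() == 0:
        return np.nan
    return float((s * ratings[u, neighbors]).sum() / np.abs(s).sum())
```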

To compare the performance of the similarity measures, ratings are predicted and the accuracy is measured with the Root Mean Squared Error (RMSE). The RMSE is calculated in the following way:

RMSE = \sqrt{\frac{\sum_{(u,i) \in K} (r_{u,i} - \hat{r}_{u,i})^2}{|K|}},    (6)

where K is the set of ratings that are predicted, \hat{r}_{u,i} is the predicted rating and r_{u,i} is the real rating of user u on item i. The lower the RMSE of a model, the more accurate the predictions are. In Table 12 the results are shown for different numbers of neighbors, for both the Pearson Correlation and the Cosine Similarity measure, on the similar and the non-similar dataset. As a reference, the RMSE is first calculated with the baseline estimator as predictor of the ratings: the RMSE of the baseline estimator on the similar dataset is 0.778 and on the non-similar dataset it is 0.763.

           1      2      3      4      5      6      7
PC, sim    0.785  1.043  1.393  1.575  1.677  1.722  1.745
PC, non    0.763  1.065  1.365  1.515  1.617  1.716  1.726
Cos, sim   0.810  1.068  1.414  1.579  1.708  1.832  1.902
Cos, non   0.763  1.154  1.539  1.734  1.910  2.057  2.145

Table 12: RMSE Case Study

In the table, PC stands for the Pearson Correlation, Cos for the Cosine Similarity, sim for the similar dataset and non for the non-similar dataset. The column headers are the numbers of neighbors that have been used to make the predictions of the ratings. As expected, with only one neighbor used in the model the RMSE is close to the results of the baseline estimator: when there is no nearest neighbor available to make the prediction, the average rating of the movie is used. As the number of neighbors increases, the RMSE at first also increases. It is also clear that the Pearson Correlation performs better than the Cosine Similarity for both the similar and the non-similar dataset. What is striking is that for the Cosine Similarity the predictions on the similar dataset are more accurate, while for the Pearson Correlation the predictions on the non-similar dataset are more accurate. It will be interesting to see in the next steps of this thesis what happens with the accuracy of the predictions for a large number of nearest neighbors, and whether the KNN-model can outperform the predictions of the baseline estimator.

In Table 13 the predictions for the complete dataset are shown for different numbers of neighbors used in the prediction of the ratings. Again the Pearson Correlation outperforms the Cosine Similarity measure. The RMSE of the Pearson Correlation also starts decreasing when more than 14 nearest neighbors are used for the predictions, whereas for the Cosine Similarity the RMSE is still increasing with the number of neighbors. Comparing the results up to 7 neighbors, the results are more accurate than those in Table 12, which is also what is expected as the dataset is larger.

       1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
PC     0.737  0.935  1.181  1.400  1.524  1.634  1.692  1.763  1.823  1.873  1.895  1.902  1.918  1.812  1.789
Cos    0.731  0.998  1.297  1.526  1.662  1.764  1.841  1.919  1.967  2.021  2.081  2.119  2.155  2.186  2.210

Table 13: RMSE Case Study Total Dataset

The next step is to see what happens when the ratings of the 100 most rated movies are used as the dataset, with the number of neighbors increasing up to 99. The results are presented in Figure 4.

[Figure 4: Neighborhood model performance, RMSE against the number of neighbors for the Pearson Correlation and the Cosine Similarity]


The graph looks somewhat counterintuitive at first, with the RMSE increasing with the number of neighbors. The reason is that for this dataset the baseline estimator seems to be a better predictor than the KNN-model. The more nearest neighbors are involved in the model, the less the baseline estimator is used, and therefore the performance of the model goes down. Striking is the fact that for the Pearson Correlation a maximum is reached and the line starts to decrease; this could indicate that the accuracy of the model would improve if the number of neighbors were even higher. Unfortunately a higher number of neighbors is not possible for this dataset, since all the items are already used as possible neighbors. For the Cosine Similarity the RMSE keeps increasing with the number of neighbors, and its RMSE exceeds that of the Pearson Correlation even before the latter has reached its maximum. From this case study the conclusion can be drawn that the Pearson Correlation (slightly) outperforms the Cosine Similarity in predicting the ratings. Therefore the Pearson Correlation will be used in this thesis.

2.3 Latent factor models

Another type of model commonly used for recommendation systems is the Latent Factor (LF) model as described by (Bell et al., 2009). They state that LF-models are superior to the KNN-model described in the previous subsection. Latent factor models characterize items and users on multiple factors. For items these factors are, for example, the genre of the movie or the age category it is most liked by; they can also be less well-defined factors like the atmosphere in the movie. For users, each factor measures how much the user likes the movies that score high on the corresponding item factor. Matrix Factorization models try to explain the rating patterns by characterizing the users and items on a selected number of factors; MF-models are amongst the most popular and successful latent factor models in use. These models form vectors of item- and user-related factors from the rating patterns, in such a way that the user-item relations are mapped on a space of dimension f, where f is the number of factors. For the users this is the vector p_u ∈ R^f and for the items the vector q_i ∈ R^f. For the item vector q_i the elements measure the extent to which the item possesses each of the factors; in the user vector p_u the elements measure the extent to which the user likes items that score high on the corresponding factors.

The input for the model is explicit feedback, which in the case of this thesis consists of the ratings in the dataset, where the users explicitly show what they think about an item. In MF-models it would also be possible to make use of implicit feedback, which is feedback a user does not give knowingly; this could for instance be the search pattern of a user, or the behavior on the website. Unfortunately no implicit feedback is available in the Amazon dataset.

To calculate the predicted ratings, the dot product of the two vectors is taken:

\hat{r}_{u,i} = q_i^T p_u,    (7)

where \hat{r}_{u,i} is again the predicted rating of user u on item i. This model is sensitive to overfitting, hence a regularization term is imposed on the model. The model is then defined as:

\hat{r}_{u,i} = q_i^T p_u - \lambda (\|q_i\|^2 + \|p_u\|^2),    (8)

where λ is the regularization term for the p_u and q_i vectors. A possibility to extend the model is to add biases for the users and items; these biases can correct for more general trends in the data. In practice there will be users that on average rate items higher than other users do, and likewise some items receive higher ratings than other items; the biases correct for these trends. After including the biases, the dot product q_i^T p_u really measures the interaction between a specific item and user, with the more general trends corrected for via the biases. The biases are described as:

b_{u,i} = \mu + b_i + b_u,    (9)

where b_{u,i} is the total bias factor for user u and item i, \mu is the overall average of all ratings, and b_i and b_u are the deviations of item i and user u from the overall mean. Including the biases in the rating prediction with regularization term then gives:

\hat{r}_{u,i} = \mu + b_i + b_u + q_i^T p_u - \lambda_1 (\|q_i\|^2 + \|p_u\|^2) - \lambda_2 (b_u^2 + b_i^2),    (10)

where \lambda_1 is the regularization term for the user and item vectors as discussed before and \lambda_2 is the regularization term for the bias terms. To arrive at rating predictions, the following minimization problem has to be solved:

CR(\theta) = \min_{Q,P,b} \sum_{(u,i) \in K} (r_{u,i} - \mu - b_i - b_u - q_i^T p_u)^2 + \lambda_1 (\|q_i\|^2 + \|p_u\|^2) + \lambda_2 (b_u^2 + b_i^2),    (11)

where Q and P are the matrices formed by the vectors q_i and p_u respectively. This minimization is solved with Stochastic Gradient Descent (SGD), which is explained in the next subsection.

2.3.1 Stochastic Gradient Descent

For the optimization of (11) the Stochastic Gradient Descent algorithm is used (Robbins & Monro, 1951). This algorithm tries to find the global minimum of the optimization problem by updating the parameters in the opposite direction of the gradient of the objective function. It calculates the gradients of the parameters and updates the parameters for each training example in the set. This differs from more conventional Gradient Descent schemes like Batch Gradient Descent (BGD), where the gradients are calculated over the whole dataset. In practice this means that the computation time per update for BGD is higher than for SGD, as one update requires the gradient over the whole dataset, where SGD uses just one training example. With SGD it is also possible to learn online, which is not the case with BGD. Learning online means that the dataset can be updated and extended continuously; this can be an advantage for Amazon, for instance, when they want to keep their review dataset up to date. As SGD uses one training example for every update, the variance of SGD is higher compared to BGD. This makes the value of the learning rate extra important, as the learning rate determines the size of the updates and can be the determining factor in whether a global minimum is reached; more on setting the learning rate follows later in this section. For SGD it is also important that the data is shuffled for every iteration: keeping the dataset ordered can lead to a bias in the optimization, especially for a dataset like the Amazon one, which is originally sorted on the time and date at which the reviews were made. The updating of the parameters in SGD is done according to the following formula:

\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i)}; y^{(i)}).    (12)

In this formula θ stands for the parameters that are optimized, η is the learning rate of the algorithm and \nabla_\theta J(\theta) are the calculated gradients of the parameters. For the minimization problem in (11), the updates of the parameters are:

q_i = (r_{u,i} - \hat{r}_{u,i}) \, p_u - \lambda_1 q_i    (13)
p_u = (r_{u,i} - \hat{r}_{u,i}) \, q_i - \lambda_1 p_u    (14)
b_u = (r_{u,i} - \hat{r}_{u,i}) - \lambda_2 b_u    (15)
b_i = (r_{u,i} - \hat{r}_{u,i}) - \lambda_2 b_i,    (16)

where \hat{r}_{u,i} is the predicted rating as described before. The term (r_{u,i} - \hat{r}_{u,i}) is the error term of the prediction, the deviation of the prediction from the real rating. Optimal values have to be found for the regularization terms, the learning rate, the number of dimensions f and the number of iterations. In general it is to be expected that a higher number of dimensions will lead to more accurate predictions, but also to a higher computation time; the same goes for the number of iterations of the algorithm. In the next subsection the optimization of the learning rate will be discussed. For the regularization term a small case study is conducted on the 100 most rated movies, to see what the effect is of different values of this term. The effect of adding the biases to the model will also be investigated. Afterwards, different techniques for optimizing the learning rate will be investigated.
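To make the update step concrete, below is a minimal sketch of one SGD epoch for the biased MF-model of equations (10) and (13)-(16). The data layout, initialisation and function name are assumptions, and the updates are the gradient steps of (13)-(16) scaled by the learning rate η:

```python
import numpy as np

def sgd_epoch(triples, P, Q, bu, bi, mu, eta=0.001, lam1=0.5, lam2=0.5):
    """One SGD pass over a list of (user, item, rating) triples.

    P: (n_users, f) user factors, Q: (n_items, f) item factors,
    bu, bi: user and item bias vectors, mu: overall mean rating.
    """
    np.random.shuffle(triples)       # reshuffle every epoch to avoid order bias
    for u, i, r in triples:
        pred = mu + bi[i] + bu[u] + Q[i] @ P[u]
        e = r - pred                 # prediction error (r_ui - predicted r_ui)
        qi_old = Q[i].copy()         # update both vectors from the old values
        Q[i]  += eta * (e * P[u] - lam1 * qi_old)
        P[u]  += eta * (e * qi_old - lam1 * P[u])
        bu[u] += eta * (e - lam2 * bu[u])
        bi[i] += eta * (e - lam2 * bi[i])
```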

2.3.2 Case Study regularization term and biases

This case study uses the same dataset as the case study for the KNN-model, namely the ratings of the 100 most rated movies in the Amazon dataset. Important to note is that a filtering has been applied: users are only included if they have rated at least 5 movies. The dataset consists of 19,157 ratings from 1,971 users. In this case study the number of dimensions f, the number of iterations, the regularization terms and the biases are investigated. First the regularization terms are investigated; the biases are included in the model, the number of iterations is set to 100, the number of dimensions is 5 and the learning rate η is set to 0.001. The model is trained on 80% of the dataset and the RMSE is calculated on the test set consisting of the other 20%. In Table 14, λ_1 is the regularization term of the P and Q matrices and λ_2 is the regularization term of the biases b_u and b_i.

λ1     λ2     RMSE
0      0      1.163
0.1    0.1    1.130
0.2    0.2    1.120
0.3    0.3    1.127
0.4    0.4    1.121
0.5    0.5    1.117
0.6    0.6    1.119
0.7    0.7    1.128
0.5    0      1.146
0      0.5    1.118
0.5    0.25   1.127
0.25   0.5    1.111

Table 14: Results regularization terms

It is clear that the RMSE decreases as the regularization terms increase up to a value of 0.5; when the value of the regularization terms is higher than 0.5, the RMSE starts increasing. It has also been checked whether the accuracy of the model improves when one of the regularization terms is decreased while the other is kept at 0.5. As the table shows, this generally makes the RMSE of the model higher, so it is chosen to keep both regularization terms at 0.5 for the rest of this study. The next step in this case study is to try different values for the number of dimensions of the P and Q matrices, and to check whether the addition of the biases to the model indeed improves the accuracy on the test set. For the number of dimensions, 3 different values are investigated: 3, 5 and 10. In Table 15 the results are shown as the RMSE of the predictions on the test set; as mentioned before, the learning rate is 0.001 and the regularization terms are set to 0.5.

          RMSE
f     no bias   with bias
3     1.323     1.114
5     1.118     1.117
10    1.218     1.109

Table 15: Results for different dimensions and biases

It is clear that including the bias terms improves the accuracy of the model: for every number of dimensions the RMSE on the test set is lower than that of the model without the bias terms. As expected, the accuracy of the models improves with the number of dimensions of the P and Q matrices, except for the model without biases, where the accuracy decreases at 10 dimensions. For the model with the bias terms the highest accuracy is indeed reached with the highest number of dimensions. For the rest of this thesis the number of dimensions will therefore be set to 10 and the bias terms will be included. An even higher number of dimensions can be expected to perform better still, but as the computation time would be significantly higher, it is chosen to keep the number of dimensions at 10. To look at the optimal number of iterations of the SGD algorithm, Figure 5 presents the performance of the model over the number of iterations, with the dimension set to 10, bias terms included and both regularization terms set to 0.5.

[Figure 5: SGD learning curve, performance of the predictions on the training and test data over the number of iterations]

The RMSE clearly decreases over the number of iterations, which is what would be expected with SGD. Running the model for 100 iterations takes quite some time, so it is more practical to look for a lower value. The algorithm converges quite fast to a minimum value: around 40 iterations the RMSE is no longer decreasing. This convergence point is expected to lie at a different number of iterations for different datasets and settings of the model, especially for the learning rate optimization techniques from the following section.

2.3.3 Learning rate optimization

The learning rate η determines how large the updates of the parameters are in the SGD algorithm. In Gradient Descent algorithms the objective is to let the objective function converge to the global minimum, and to achieve this it is important to determine the optimal value of the learning rate. Too low a value of the learning rate leads to really slow convergence, while too large a value can make the updates too large and let the objective function fluctuate around the minimum, or even diverge. There are multiple algorithms available to determine this value; in the next subsection two algorithms to optimize the learning rate will be explained: Momentum and Adagrad.

2.3.4 Momentum and Adagrad

In this subsection I will explain the differences between the Momentum and the Adagrad algorithms for updating the learning rate of the Stochastic Gradient Descent algorithm. A case study is also described to show the differences in performance of the two algorithms, compared to a fixed value of the learning rate. Stochastic Gradient Descent algorithms can have problems moving around local optima. Around local optima so-called ravines often appear, which are areas where the surface is much steeper in one dimension than in the others. The problem for SGD is that once in a ravine, it moves out of it really slowly.

A solution to this problem can be the Momentum updating scheme (Zeiler, 2012). The Momentum updating scheme adds the previous update vector, multiplied by a constant γ, to the current update. This γ is the so-called momentum coefficient, which is often set to 0.9. The update then looks like the following formula:

v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta; x^{(i)}; y^{(i)})    (17)
\theta = \theta - v_t    (18)

In practice this means that the momentum term increases the updates for dimensions whose gradients keep pointing in the same direction, and reduces the size of updates for dimensions whose gradients change direction. The algorithm is therefore expected to perform better, but the learning rate still has to be set manually. In particular, convergence is expected to occur faster than with the standard SGD algorithm.

The Adagrad algorithm deals really well with sparse data, which is important in this thesis since the dataset of Amazon reviews is very sparse. It handles sparse data by performing larger updates for infrequent parameters and smaller updates for frequent parameters. This means that Adagrad uses different learning rates for different parameters at every time step t, where the standard SGD algorithm and the Momentum updating scheme use one learning rate for all parameters at every time step. To explain this, first define the updating scheme as before for every parameter j:

\theta_{t,j} = \theta_{t-1,j} - \eta \nabla_\theta J(\theta_j; x^{(i)}; y^{(i)}).    (19)

At every step and for every parameter, Adagrad modifies the global learning rate η based on the past gradients of the parameter θ_j in the following way:

\theta_{t,j} = \theta_{t-1,j} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t-1} g_{\tau,j}^2}} \nabla_\theta J(\theta_j; x^{(i)}; y^{(i)}),    (20)

\theta_{t,j} = \theta_{t-1,j} - \frac{\eta}{\sqrt{G_{t,jj}}} \nabla_\theta J(\theta_j; x^{(i)}; y^{(i)}).    (21)

In these formulas the denominator is the l2 norm of all previous gradients, per dimension; G_t is the diagonal matrix with the sums of the squared past gradients on its diagonal. So the algorithm uses the global learning rate for all parameters, but since it is corrected with each parameter's own past gradients, the effective learning rate differs per dimension. In practice this means that dimensions with larger gradients get a smaller learning rate than dimensions with small gradients, which get a larger learning rate. The new parameter estimate can be computed for all dimensions at once with an element-wise multiplication with the gradient vector g_t:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2}} \odot g_t.    (22)

A disadvantage of the Adagrad algorithm is that the learning rate converges quickly towards zero. The reason for this is the accumulation of the squared gradients in the denominator: since all the terms added to the sum are positive, the denominator keeps growing. When the learning rate goes towards zero, the algorithm is no longer able to add information in the updating of the parameters. Adagrad can also converge slowly for dimensions with a high initial gradient or parameter value, since such parameters use a really low learning rate through all the remaining iterations.
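The two updating schemes differ only in how the raw gradient step is transformed. Below is a minimal sketch of both, for a single parameter vector; the names and the small epsilon in the Adagrad step (which avoids division by zero and is not spelled out in the text) are assumptions:

```python
import numpy as np

def momentum_step(theta, grad, v, eta=0.001, gamma=0.9):
    """Momentum update of equations (17)-(18); v accumulates past updates."""
    v = gamma * v + eta * grad
    return theta - v, v

def adagrad_step(theta, grad, G, eta=0.001, eps=1e-8):
    """Adagrad update of equation (22); G accumulates squared gradients,
    giving every dimension its own effective learning rate."""
    G = G + grad ** 2
    return theta - eta / np.sqrt(G + eps) * grad, G
```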

2.3.5 Case Study Learning rate optimization

To see what the difference in performance is between the learning rate optimization schemes, a small case study is conducted for different values of the learning rate and different numbers of iterations. For this case study the dataset of Amazon Instant Video is used, the smallest product category for which a dataset is available. The dataset contains 5,072 users, who have rated a total of 10,842 different items; the number of ratings is 48,836. This dataset is split into a training and a test set, where 80% of the ratings form the training set and the accuracy of the predictions is tested on the remaining 20%.

The results for all three learning rate optimization schemes are presented in Table 16, where MF is the MF-model with a manually set learning rate, MF Mom is the optimization where the learning rate is tuned with the Momentum updating scheme and MF Ada is the optimization where the Adagrad algorithm is used. In the table η stands for the (initial) learning rate and n for the number of iterations.

η       n      MF     MF Mom    MF Ada
0.01    25     1.136  diverge   1.145
0.001   25     1.119  1.126     1.951
0.0001  25     1.763  1.145     17.452
0.01    50     1.140  diverge   1.153
0.001   50     1.118  1.125     1.239
0.0001  50     1.435  1.125     12.267
0.01    100    1.139  diverge   1.155
0.001   100    1.118  1.128     1.128
0.0001  100    1.119  1.115     5.738
0.0001  500    1.106  1.098     1.209
0.0001  1,000  1.106  1.099     1.054

Table 16: Results Case Study

For the MF model with a learning rate of 0.01, the loss function converges fast and the RMSE stays the same over the number of iterations. The same goes for a learning rate of 0.001, with the difference that the performance is better. This could be due to the fact that with a learning rate of 0.01 the loss function does not reach a global minimum, but a local one. For a learning rate of 0.0001 the loss function has still not converged after 100 iterations; therefore the model is also run with higher numbers of iterations, 500 and 1,000. This lowest learning rate achieves the best performance, but needs a high number of iterations. As the computation time increases drastically, this number of iterations will not be used in the results section.

For the Momentum updating scheme, a learning rate of 0.01 can not be used, as the loss function diverges for this value. For the 0.001 value of the learning rate the loss function again converges within 25 iterations, with performance almost equal to that of the MF model. For the lowest learning rate, a high number of iterations is again needed to make sure that the loss function converges; the performance is then slightly better than that of the MF model.

For the Adagrad algorithm it indeed seems that, when the number of iterations is high enough, the learning rate does not have a big influence on the performance of the model. For the learning rates 0.01 and 0.001 the model converges fast. The learning rate of 0.0001 reaches the highest accuracy overall, but again takes a lot of iterations. Therefore the initial learning rate is set to 0.001 for all of the optimizations of the MF-model in the next section, and the number of iterations is set to 50 for all models: since the datasets will be large, it is important that the number of iterations is not too large, so that the computation time does not become too long.


3 Data and Setup

In this section a short overview is given of the data used in the next section and of the setup of the models. For the K-Nearest Neighborhood model the Pearson Correlation is used to calculate the similarities between the items. A comparison will be made for 50, 100 and 500 neighbors. As mentioned before, when there are no nearest neighbors available to make a prediction, the baseline estimator is used.

For the Matrix Factorization model a comparison will be made for all the learning rate optimization schemes. For all these three optimization schemes the regularization terms are set to 0.5, bias terms are included and the number of iterations will be 50. The learning rate for all of the optimization schemes of the MF-model will be equal to 0.001. The KNN-model and the MF-model are compared to the baseline estimator which consists of the average rating of the item.

In Table 17 an overview is presented of the data used in the next section. The models are trained for four different categories and their performance will be compared. All datasets are filtered in such a way that every user has rated at least five items in the dataset. The data comes from the Amazon website and was retrieved via (McAuley et al., 2015).

Category            # Ratings   # Users   # Items   Mean ratings   Std ratings
Digital Music       57,111      4,022     3,568     4.216          1.084
Video Games         194,545     16,856    10,671    4.079          1.196
Sports & Outdoors   239,257     24,182    18,354    4.400          0.978
Movies & TV         1,538,738   92,201    50,046    4.097          1.198

Table 17: Data Overview
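The filtering described above can be sketched with pandas; the file and column names are assumptions about the rating files, not the thesis's code:

```python
import pandas as pd

def filter_min_ratings(df, min_ratings=5):
    """Keep only the ratings of users who rated at least min_ratings items."""
    counts = df.groupby("user_id")["item_id"].transform("count")
    return df[counts >= min_ratings]

# Hypothetical usage on one category file:
# df = pd.read_csv("ratings_Digital_Music.csv",
#                  names=["user_id", "item_id", "rating", "timestamp"])
# df = filter_min_ratings(df)
```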

The datasets differ in size in order to test the performance of the models on datasets of different sizes. For all categories the average of the ratings is really similar and the standard deviations are also close to each other. It is expected that the performance on the Movies & TV dataset will be better than on the other datasets, as this dataset is the largest.


4 Results

In Table 18, the results are shown for the models described in the previous subsections.

Category            Size   BL     NN50   NN100   NN500   MF     MF Mom   MF Ada
Digital Music       57K    0.272  0.771  0.867   0.960   2.724  1.092    1.042
Video Games         195K   0.323  0.793  0.905   1.035   2.743  1.093    1.029
Sports & Outdoors   239K   0.284  0.839  0.923   1.006   2.508  0.999    0.964
Movies & TV         1.5M   0.272  0.548  0.812   0.883   2.147  0.972    0.916
Average             2M     0.288  0.738  0.877   0.971   2.531  1.039    0.988

Table 18: Results

In general the predictions are more accurate for larger datasets; the highest accuracy in terms of RMSE is reached on the Movies & TV dataset. Striking is the fact that the baseline estimator is outperforming all of the models by a landslide, with an average RMSE of 0.288. It is also clear that the more neighbors are added to the KNN-model, the less accurate the predictions become, which was already observed in Section 2. The average RMSE of the KNN-model with 50 nearest neighbors is 0.738, against 0.971 with 500 nearest neighbors. This is due to the fact that when the number of neighbors increases, fewer predictions are made with the baseline estimator, and the predictions of the baseline estimator appear to be better than those of the KNN-model.

The RMSEs of the Matrix Factorization model are higher than those of both the baseline estimator and the KNN-model. The MF-model performs best when the Adagrad optimization scheme is used for the learning rate, with an average RMSE of 0.988. This is significantly lower than when no optimization scheme is used in the MF-model, which has an average RMSE of 2.531. The Adagrad optimization scheme performs just slightly better than the MF-model with the Momentum updating scheme.

It is expected that the baseline estimator performs so well due to the fact that most of the ratings are either 4 or 5 stars. Therefore additional results are generated for the Movies & TV dataset, where the test set is split up per rating level. The Movies & TV category is chosen as it is the biggest dataset and the performance of the models is the highest on this dataset.

Subset   Size      BL    NN50   NN100   NN500      MF   MF Mom   MF Ada
1         19K   3.189   2.669   2.033   1.014   1.945    1.828    2.143
2         19K   2.117   1.474   0.981   0.937   0.843    0.812    0.772
3         37K   0.832   0.724   0.697   0.620   0.156    0.185    0.120
4         71K   0.098   0.424   0.522   0.603   1.115    1.158    1.104
5        162K   0.303   0.607   0.651   0.766   2.097    2.046    1.989
Total    308K   0.272   0.548   0.812   0.883   2.147    0.972    0.916

Table 19: RMSE per rating subset

The total line in Table 19 repeats the RMSE on the full Movies & TV test set. The table supports the hypothesis that the baseline estimator performs so well because of the distribution of the ratings. Overall the baseline estimator still performs best, and on the subset with the 4-star ratings it keeps outperforming all other models by a wide margin; the same holds for the subset with the 5-star ratings. As expected, however, the RMSE of the baseline estimator increases the further a subset moves away from the average of approximately 4 stars: whereas its RMSE on the 4-star subset equals 0.098, it equals 3.189 on the 1-star subset, where it is even the worst performing estimator.
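The per-subset numbers in Table 19 follow from grouping the test set by the observed rating before computing the error; a minimal sketch, assuming aligned arrays y_true and y_pred:

```python
import numpy as np
import pandas as pd

def rmse_per_rating(y_true: np.ndarray, y_pred: np.ndarray) -> pd.Series:
    """RMSE of the predictions within each true-rating level (1 to 5 stars)."""
    df = pd.DataFrame({"rating": y_true, "sq_err": (y_true - y_pred) ** 2})
    return df.groupby("rating")["sq_err"].mean().pow(0.5)
```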

This also influences the KNN-models, as can be seen in the table: when the performance of the baseline estimator deteriorates, so does that of the KNN-model. The effect is strongest when the number of nearest neighbors is set to 50, which is logical since that model falls back on the baseline estimator most often. The higher the number of nearest neighbors, the smaller the effect of splitting the test set by rating level. With 500 neighbors the RMSE ranges from 0.603 to 1.014, a much smaller gap than with 50 neighbors, where it ranges from 0.424 to 2.669.

For the MF-models a different pattern is seen in the table. In contrast to the baseline estimator and the KNN-model, the MF-models do not perform best on the subset closest to the average rating, but on the middle subset with the 3-star ratings. On that subset the RMSE of the MF-model is, for every learning rate optimization scheme, considerably lower than any result of the KNN-model. The further a subset lies from this middle subset, the worse the performance of the MF-model. The MF-model with the Adagrad optimization scheme still performs best overall across the subsets, although on some subsets the Momentum scheme or the standard MF-model outperforms it.

To further filter out the effect of the baseline estimator and allow a fairer comparison between the KNN-model and the MF-model, another accuracy measure is used in Table 20. Again the test set is split up by the number of stars given as rating, but instead of the RMSE the percentage of correctly predicted ratings is reported. As the baseline estimator uses the average of the ratings, and most items in the Movies & TV category have an average rating of approximately 4, its accuracy on all subsets other than the 4-star subset will be very low. This filters out the effect it has on the KNN-model.

Subset   Size      BL    NN50   NN100   NN500      MF   MF Mom   MF Ada
1         19K   0.006   0.121   0.205   0.286   0.205    0.296    0.312
2         19K   0.012   0.089   0.167   0.248   0.276    0.398    0.432
3         37K   0.086   0.134   0.297   0.389   0.346    0.451    0.509
4         71K   0.886   0.716   0.637   0.546   0.289    0.402    0.418
5        162K   0.094   0.129   0.304   0.384   0.198    0.305    0.342
Total    308K   0.281   0.274   0.372   0.411   0.245    0.353    0.386

Table 20: Fraction of correctly predicted ratings per rating subset
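How a continuous predicted rating counts as "correct" is not spelled out above; the sketch below assumes a prediction is correct when, rounded to the nearest whole star and clipped to the 1-5 range, it equals the observed rating:

```python
import numpy as np
import pandas as pd

def accuracy_per_rating(y_true: np.ndarray, y_pred: np.ndarray) -> pd.Series:
    """Fraction of correctly predicted ratings per true-rating level.

    Assumption: a prediction is correct when it rounds to the observed rating.
    """
    rounded = np.clip(np.rint(y_pred), 1, 5)
    df = pd.DataFrame({"rating": y_true, "hit": rounded == y_true})
    return df.groupby("rating")["hit"].mean()
```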

Still, the baseline estimator is in total the best performing estimator, with 28.1% of the ratings predicted correctly. It only performs well on the subset with the 4-star ratings, which is logical as most items have an average rating close to 4. Comparing the KNN-model to the MF-model on this subset is therefore not very informative. The other subsets are more interesting, as there the baseline estimator does not improve the KNN-model; using it is now even a disadvantage for the KNN-model, because whenever the baseline estimator is used the prediction will almost certainly be wrong.

In general, the accuracy of the KNN-model improves with the number of neighbors used in the predictions: it goes from 27.4% with 50 neighbors to 41.1% with 500 neighbors. This is even with the 4-star subset included, on which the accuracy decreases with the number of neighbors. When the 4-star subset is excluded from the total, the accuracy rises from 9.5% with 50 nearest neighbors to 27.5% with 500, a substantial increase. The trend is the same for all subsets except the 4-star subset: accuracy improves with the number of neighbors.

For the Matrix Factorization model the trend is the same on all subsets. The accuracy is lowest for the standard MF-model; the Momentum optimization scheme already improves it from 24.5% to 35.3%, and the Adagrad scheme outperforms both with 38.6%, which is still lower than the 41.1% of the KNN-model with 500 nearest neighbors. However, if the 4-star subset is left out of the total accuracy calculation, the MF-model with the Adagrad scheme reaches 28.1%, which is higher than the 27.5% of the KNN-model with 500 neighbors. It can therefore be stated that when the baseline estimator does not influence the KNN-model positively, the MF-model outperforms the KNN-model.

To further investigate the effect of the distribution of the ratings in the Amazon dataset, a sample is taken from the total Movies & TV dataset: for every rating level 20,000 ratings are randomly selected, yielding a balanced dataset of 100,000 ratings. Table 21 shows the results for this subsample alongside the results for the total Movies & TV dataset, to allow a good comparison.

Dataset   Size      BL    NN50   NN100   NN500      MF   MF Mom   MF Ada
Subset    100K   1.005   0.683   0.642   0.598   2.438    1.001    0.968
Total     1.5M   0.272   0.548   0.812   0.883   2.147    0.972    0.916

Table 21: RMSE on the Movies & TV dataset
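The balanced sample can be drawn with a stratified draw per star level; a minimal sketch, assuming a pandas DataFrame with a rating column and at least 20,000 ratings per level (the fixed random_state is only for reproducibility):

```python
import pandas as pd

def balanced_sample(ratings: pd.DataFrame, per_level: int = 20_000) -> pd.DataFrame:
    """Randomly draw the same number of ratings from every star level (1 to 5)."""
    return ratings.groupby("rating").sample(n=per_level, random_state=0)
```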


The baseline estimator performs considerably worse on the balanced subset of the dataset: its RMSE on the total dataset equals 0.272, while on the subset it equals 1.005. It is even the worst performing model except for the Matrix Factorization model where no optimization scheme is used for the learning rate. For the KNN-model the RMSE now decreases with the number of neighbors used in the predictions: with 50 nearest neighbors the RMSE equals 0.683, while with 500 nearest neighbors it equals 0.598. The KNN-model also still outperforms the MF-model, even though the positive effect of the baseline estimator is now gone. For the MF-model there is no real change in the trend of its performance on the subset with the more equally distributed ratings. Its RMSE is higher than on the whole dataset, which is possibly due to the smaller size of the subset.

5 Discussion

In this thesis it is investigated what the best model would be to use in a recommendation system for Amazon, by comparing two types of models: a K-Nearest Neighborhood model and a Matrix Factorization model. These two models are compared to a baseline estimator, which is the average rating of the item for which a recommendation is made. The baseline estimator outperforms both the KNN-model and the MF-model on the datasets of all four selected product categories. This was not expected, but looking back at the distribution of the ratings it does make sense that this baseline estimator performs so well: most ratings are 4 or 5 stars, out of a possible 5, so the average of the ratings often lies between 4 and 5 and is therefore close to most of the ratings. If the baseline estimator were actually used in a recommendation system, all users would receive the same recommendations, consisting of the items with the highest average ratings. As the goal of a recommendation system is to personalise the experience of the user, actually using the baseline estimator is not an option. On the datasets of the product categories the K-Nearest Neighborhood model with 50 nearest neighbors outperforms the Matrix Factorization model. As the baseline estimator is also used within the KNN-model, and its performance is so high, the real difference in performance between the KNN-model and the MF-model is investigated further by splitting up the dataset. The Movies & TV test set, which consists of 20% of the ratings randomly selected from the total dataset, is split up per rating level. The baseline estimator clearly performs best on the subset with the 4-star ratings, which is logical as the average rating of most items is approximately 4 stars. The further a subset lies from this 4-star subset, the worse the performance of the baseline estimator. On the subsets with the 1, 2 and 3-star ratings the MF-model outperforms the KNN-model, probably because the baseline estimator negatively influences the performance of the KNN-model on these subsets.

When the percentage of correctly predicted ratings is used instead of the RMSE, different trends appear in the data. The KNN-model improves as the number of neighbors increases, instead of deteriorating as it does under the RMSE. The KNN-model with 500 neighbors performs best, also outperforming the MF-model.

Finally, a subset has been taken from the Movies & TV dataset in which the ratings are distributed equally over the star levels. On this subset the KNN-model with 500 nearest neighbors again performs better than the MF-model. The performance of the KNN-model improves when the ratings are distributed more equally, whereas the predictions of the MF-model become less accurate on this subset, which is probably due to its smaller size.

In conclusion, it can be stated that the KNN-model outperforms the MF-model. This is partly due to the positive effect the baseline estimator has on the results of the KNN-model on the Amazon product categories, but the KNN-model also outperforms the MF-model when subsets of the dataset are investigated. Using the baseline estimator itself is not a real option for Amazon, as it does not personalise the experience for the users. For the Amazon dataset it seems most appropriate to use the smallest number of nearest neighbors investigated in this thesis; if the ratings were distributed more equally, a higher number of nearest neighbors would be preferred.

If some form of MF-model were to be used, it would be logical to choose the MF-model with the Adagrad updating scheme for the learning rate. This could be interesting for further research in which textual reviews are added to the recommendation system, as this is not possible in the KNN-model. For this thesis, however, the conclusion is that the KNN-model is the preferred model to use in a recommendation system for Amazon.


References

Bell, R., Koren, Y., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37.

Bennett, J., & Lanning, S. (2007). The Netflix prize. Proceedings of KDD Cup and Workshop 2007, 35.

Borchers, A., Herlocker, J. L., Konstan, J. A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, 230-237.

Karypis, G., Konstan, J., Riedl, J., & Sarwar, B. (2001). Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th international conference on World Wide Web, 285-295.

McAuley, J., Pandey, R., & Leskovec, J. (2015). Inferring networks of substitutable and complementary products. Knowledge Discovery and Data Mining.

Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400-407.

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
