
Master Thesis

In Cervisia Veritas:

A Treatise Concerning the Accuracy Conditions of

Recommender System Techniques

Author: A.A. Zwartsenberg

Student Number: 10808574

Date: November 30, 2018

Programme: MSc Econometrics

Track: Free Track

Supervisor: N.P.A. van Giersbergen

Second Reader: K. Pak

Capgemini Supervisor: M. Markus

Abstract

In this thesis a recommender system was built and its performance evaluated while layers of complexity were added. Single Ratings, Multiple Ratings and Reviews frameworks were investigated, and the performance of both neighborhood models and latent factor models was evaluated. Increasing the complexity of the models by adding more ratings and by including reviews led to increased accuracy, provided the lower bound on the number of ratings per user exceeded a certain threshold.


Statement of Originality

This document is written by Student Arthur Zwartsenberg, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Theory
   2.1 Background
   2.2 Neighborhood Approach
   2.3 Model-Based Approach
   2.4 Problem Statement
3 Data
4 Methodology
   4.1 Neighborhood Model
   4.2 Latent Factor Model
5 Single Ratings
   5.1 Neighborhood Approach
      5.1.1 User-Based Collaborative Filtering
      5.1.2 Item-Based Collaborative Filtering
   5.2 Analysis
   5.3 Model-Based Approach
      5.3.1 Analysis
6 Multiple Ratings
   6.1 Neighborhood Approach
      6.1.1 Analysis
   6.2 Model-Based Approach
      6.2.1 Analysis
7 Reviews
   7.1 Neighborhood Approach
      7.1.1 Analysis
   7.2 Model-Based Approach
      7.2.1 Topic Models
      7.2.2 Analysis
8 Sensitivity Analysis
9 Conclusion
References

1 Introduction

In the capitalist age individuals have the freedom to choose from any number of products. With an increase in products catering to the wide variety of interests of individuals, and with the advent of the digital age and the resulting increase in information, it has become harder to make a choice from this plethora of products and services. Creating recommendations personalized to individual interests has therefore become more and more important. Since Malone et al. (1987) created one of the first recommender systems more than three decades ago, these systems have come a long way and have come into widespread use in several industries. Recommender Systems have seen wide application in society for use by ordinary customers, such as recommending products when browsing online retailers like Amazon and recommending movies and series when watching Netflix. Recommender Systems utilize statistical learning and pattern searching on user and item data in order to make improved product recommendations to users, by gathering information and algorithmically linking it to individual preferences, and subsequently recommending items based on these preferences (Chen et al., 2015).

These recommendations contribute to the profitability of companies by improving conversion rates, promoting sales of related items and improving customer loyalty through added value (Schafer et al., 2001). Furthermore, user satisfaction is increased by being able to find items of interest despite the ballooning of available products, and by acquiring items that are better suited to one's preferences. The relevance and economic impact of Recommender Systems is illustrated by the Netflix Challenge, in which the online media services company Netflix announced it would award $1 million to the team that managed to improve its predictions of movie and series ratings by 10%. The challenge finally ended in September 2009, three years after its unveiling, when the prize was awarded to BellKor's Pragmatic Chaos team (Koren and Bell, 2015). With the advent of Big Data in the Digital Age, the impact and profitability of these types of algorithms are only slated to increase.

The fundamental purpose of a Recommender System is to recommend items that a user might like. This boils down to two distinct tasks: predicting the ratings of items that a user has not rated, and then making a recommendation by returning a list of the items with the highest predicted ratings. Thus, the problem at hand is to find an algorithm that gives the most accurate predictions.

Although Recommender Systems have come a long way, they are still imperfect. More specifically, many suffer from the cold start problem, in which a new user has little or no information available on which to base predictions (Chen et al., 2015; Lika et al., 2014). This is related to the rating sparsity problem: in a user-item rating matrix most elements will be empty, i.e., the matrix is sparse. Many users provide only a few ratings, leaving little information for the system to train itself on.

Although many web services already use Recommender Systems, there are still places where their use is less widespread. One of these is the website BeerAdvocate, where users can rate and review several tens of thousands of beers. With such diversity, a Recommender System can take a user by the hand through the fog of uncertainty and help in selecting which beer to sample. In this thesis an attempt will be made to develop that Recommender System by training it on data from BeerAdvocate's website.

Several methods of predicting ratings will be investigated. We start out with the basic framework of only looking at a single global rating and compare the accuracy of Neighborhood methods and Latent Factor models. We gradually expand the complexity of the models by including ratings for subcategories of the beers, so that we can compare the accuracy of the two techniques in the different frameworks and investigate whether the more advanced models actually yield improved performance. Finally, we add review information and compare the accuracy of both methods, again investigating whether the more advanced frameworks provide better accuracy than less complex models.

In the end we know which model has the highest accuracy and under which conditions this performance is achieved. This method could in theory be implemented on BeerAdvocate's website to improve beer recommendations.

We start with a theoretical foundation in Section 2, followed by a description of the data in Section 3 and a discussion of the methodology in Section 4. The models are then built, starting with the Single Ratings framework in Section 5, adding ratings for subcategories to the information set in Section 6, and adding reviews in Section 7. In Section 8 a sensitivity analysis is performed to see whether the conclusions still hold under different basic assumptions. Finally, we discuss the results and conclude in Section 9.

2 Theory

2.1 Background

Recommender Systems can make use of various types of inputs in order to make predictions and suggestions. One such input is implicit feedback, which indirectly captures the opinion of users, for example through a person's browsing or purchase history (Oard et al., 1998). Implicit feedback may yield information and guidance for a Recommender System in instances where little explicit information is available. In those cases an opinion can generally be inferred, though not with high precision, as it relies on circumstantial indicators that may not be directly related to preferences. The most prevalent and intuitive kind of data to use is explicit feedback, where users directly report their feelings about items (Koren and Bell, 2015). An example is a user rating a product numerically, or leaving a review that captures the user's feeling towards that item. In some cases explicit feedback is not available or not ready for use; in such cases implicit feedback can be used to make recommendations. However, explicit feedback is most common, especially in the form of ratings.

Ratings can take on different forms. Continuous ratings, used for example in the Jester joke recommendation system, can take on any value in an interval such as [-10, 10]; however, such a system is not widely used (Aggarwal et al., 2016). Another option is binary ratings, which take values in {0, 1}, indicating whether a user dislikes or likes an item. Ordinal ratings are categorical ratings where users choose between categories such as strongly disagree, disagree, agree and strongly agree. A key feature of ordinal ratings is the hierarchical order between the categories: strongly agree signifies a more positive preference than agree, which in turn signifies a more positive preference than disagree, and so on. Furthermore, the categories are not necessarily equidistant; there might be a larger difference in preference between agree and disagree than between strongly agree and agree, which makes analysis more difficult as this distance variability needs to be taken into account. Finally, the most prevalent system is interval-based ratings, which allow several discrete numeric options with fixed intervals between them. A common example is the range [1, 5] with the integers as possibilities (Aggarwal et al., 2016).

Recommender System techniques can be grouped into two categories: content filtering and collaborative filtering (CF). Content filtering is based on creating a user or item profile from the characteristics of users or items, respectively (Koren et al., 2009). User profiles might contain demographic information such as age and level of education, while item profiles could contain information such as genre and actors in the case of movies. CF, by contrast, records past user behavior, such as rating and review history, and uses this to create predictions without requiring explicit user or item profiles. CF has recently received much attention, as it played a large role in winning the Netflix Prize (Koren et al., 2009). CF techniques have a number of advantages, mainly related to their simplicity and intuitiveness. Because of the intuitive approach, their decisions are easy to explain and interpret. When explanatory power is important CF techniques may be more useful, as it is straightforward to explain why a particular recommendation was made. This stands in stark contrast to more advanced methods such as Neural Networks, which tend to suffer from the black box problem: it is hard to explain why the model made the recommendation it did (Zhang et al., 2018). Recommendations are also stable with regard to newly added items and users, as these can easily be inducted into the data set, after which similarity calculations and recommendations are possible.

Although CF has been popular and easily implementable in recent years, there have been several issues related to scalability and usability in real time. With large data sets collaborative filtering techniques are computationally expensive. For each user u the similarity to all other users has to be calculated, so computation time is a quadratic function of the number of users, namely n(n − 1) in the case of n users. In some cases millions of users are present, so that trillions of similarities have to be determined (Deshpande and Karypis, 2004). Since many recommendations have to be made on short notice, such systems become infeasible when too many users are present. And although user-item matrices may be sparse, user-user matrices tend to be very dense, and a small update in user behavior may require updates in large sections of the user-user matrix. As new items and new users are added these huge matrices need to be updated continually, which may prevent efficient use when new information is received in real time.

Another drawback is the problem of sparsity. Since a typical user might have rated only a few items out of thousands, and similarly an item may have been rated by only a few users, it is sometimes difficult to find overlap and compute similarity. This is especially apparent in the cold start problem, where new users have no ratings and thus no similarities, making it hard to make recommendations.

CF techniques generate specific item recommendations for users. Fundamentally, CF relies on matching items to users based on similarities to items a user liked in the past, or on similarities between different users. Two main approaches exist to facilitate this interaction: neighborhood methods and latent factor methods.

2.2 Neighborhood Approach

An advantage of Neighborhood methods is that minimal assumptions about the underlying structure of the data are made, which makes them insensitive to model misspecification. Because of this, Neighborhood models are flexible, easy to interpret and able to make predictions in the absence of large quantities of data. Neighborhood methods try to infer the relationship between different users or, alternatively, different items, called User-Based Collaborative Filtering (UBCF) and Item-Based Collaborative Filtering (IBCF) respectively (Gao et al., 2015). User-based approaches determine the preference of a user for an item based on similar users. Item-based approaches, on the other hand, determine the preference of a user for an item based on ratings by that user for similar items. Thus, UBCF maps several users to one item, whereas IBCF maps several items to one user. Both techniques have their pros and cons, and conditions under which they perform best. UBCF relies on ratings by similar users, so if there is high variation between users the recommendations have less predictive power. Conversely, IBCF may be more accurate in some instances, but will recommend items similar to those rated in the past, leaving no room for accidental fortunate discoveries, called serendipity.

The term Neighborhood model implies that similar users or similar items need to be found. The two concepts are briefly introduced below, after which they will be elaborated upon further.

User-based collaborative filtering Users that are similar have similar tastes, and therefore give similar ratings to items. If user u and user v have given similar ratings to items, then user v's observed rating for item i can be used to predict user u's unobserved rating for item i.

Item-based collaborative filtering Items that are similar will have similar ratings by the same user. If item i and item j are similar, then the rating of item j by user u can be used to predict the rating of item i.

IBCF techniques are more stable with regard to changes in ratings (Aggarwal et al., 2016). As the number of users tends to be much larger than the number of items, there is a high probability that two users have only a small number of mutually rated items, leading to potentially large fluctuations in the calculated similarity and inaccurate predictions. Moreover, since there is a constant influx of new users, neighborhood similarity needs to be computed and updated frequently, which can be time-consuming (Aggarwal et al., 2016).

From a theoretical standpoint IBCF techniques should provide more accurate recommendations, since the predicted ratings are based on the user's own ratings. They start by finding a set of items similar to the target item, in our case beers; the ratings of these similar items are then extrapolated to the target item. Having other ratings by the same user as a baseline means the predicted rating is highly indicative of the preferences of the user. This is in contrast to UBCF, where ratings are extrapolated from other users who might have overlapping but distinct preferences. Despite the better accuracy of IBCF in theory, the differences in UBCF that generally lead to worse performance also lead to novel discoveries. Whereas IBCF recommends items similar to previously rated items, UBCF is more likely to recommend novel items which may contain positive surprises, because users similar to the target user may have rated exotic items which can then be recommended. Which reasoning applies depends on the data set at hand and the characteristics of the users and beers in it (Aggarwal et al., 2016). It ultimately depends on the task at hand which method is preferred. Users with a Netflix subscription benefit more from serendipity, as a new unknown TV series to binge watch convinces them of the value of their monthly payment. People on Amazon, however, shop for a certain purpose, so an unrelated exotic item is unwanted.

A drawback of the Neighborhood approach is that, since predictions are based on similarity between either users or items, accuracy suffers when items or users have few ratings available. Moreover, since suggestions are based on neighbors, the results are more sensitive to outliers. These problems are less prevalent in the Latent Factor approach.

2.3 Model-Based Approach

In contrast to the Neighborhood approach, model-based techniques such as the Latent Factor approach make recommendations by using the ratings matrix to train a parametric model, which can then be applied to predicting ratings and making recommendations. The main difference from the neighborhood approach is that an assumption is made about the underlying structure of the data and a model is fitted based on the specified structure. Parameters are trained by learning, whereas in the neighborhood approach no model is specified and prediction is done by analogy (Chen et al., 2015). One model-based approach is to introduce latent factors. Latent factor models take a distinct approach by utilizing both item and user rating characteristics (Koren et al., 2009). They approach Collaborative Filtering with the goal of finding hidden latent features that explain the observed ratings; examples include Latent Dirichlet Allocation (LDA), developed by Blei et al. (2003), and Neural Networks (Salakhutdinov et al., 2007). Latent factors can be discovered using Singular Value Decomposition (SVD), a well-known technique in information retrieval that is helpful in identifying latent factors (Koren and Bell, 2015). Applying SVD to the CF setting can be problematic, however, as the user-item matrix is sparse: SVD is undefined when information about the matrix is incomplete, and fitting it on the few observed ratings is prone to overfitting. A way to overcome this was to use imputation, filling in ratings for the unobserved cases (Sarwar et al., 2000). However, this is computationally burdensome as many ratings have to be computed. It also increases the amount of data, making techniques that are optimized for sparse matrices less efficient. Furthermore, the imputed ratings may be distorted, as there is no way to verify their accuracy. Recent works therefore suggested modelling only the observed ratings and combating overfitting through regularization (Canny, 2002).

The benefit of latent factor models is that computers can discover characteristics which might elude the human eye, uncovering hitherto unseen patterns. By inferring rating patterns from user and item characteristics, the model forms vectors of factors (Koren and Bell, 2015). Higher correspondence between user and item factors implies a higher probability of a positive experience, and hence a recommendation. These methods have become popular due to their scalability: with n users, m items and f factors per user and item, only f · (n + m) parameters need to be estimated, compared to the n(n − 1) similarity calculations of the Neighborhood approach. With rapidly growing online user bases, which can number in the millions, this scalability has become more and more useful for practical applications.
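To make the comparison concrete with the dimensions of the cleaned data set reported in Section 3, and assuming, purely for illustration, f = 50 factors: a latent factor model would estimate f · (n + m) = 50 · (9,894 + 4,988) ≈ 7.4 · 10^5 parameters, whereas the neighborhood approach requires n(n − 1) = 9,894 · 9,893 ≈ 9.8 · 10^7 user-user similarity calculations, roughly two orders of magnitude more.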

2.4 Problem Statement

The task of building a Recommender System has two components: predicting ratings and returning a list of the items with the highest predicted ratings. This can be formalized as follows: we have a set of $n$ users $U = \{u_1, \ldots, u_n\}$, a set of $m$ items $I = \{i_1, \ldots, i_m\}$ and a user-item ratings matrix $R \in \mathbb{R}^{n \times m}$ which comprises the ratings given by users to the items (Aiolli, 2013). As not every user has rated every item, some elements $r_{ui}$ of $R$ will be absent. Given $V_u$, the set of items that user $u$ rated, predict the rating $\hat{r}_{ui}$ of each item $i \notin V_u$, and return the list of the $k$ items with the highest predicted rating, with $k$ chosen arbitrarily. Returning a list is straightforward, so the problem boils down to predicting ratings as accurately as possible. We will try out different techniques in different frameworks, and see which techniques predict most accurately.

We first look at the most basic neighborhood and latent factor models and evaluate their performance. Then we will gradually make them more complex to see if their performance is improved. Subsequently, the models are expanded to include ratings for subcategories. Finally, reviews are added.

3 Data

The data set was scraped from the website BeerAdvocate and contained over 1.5 million pairs of ratings and reviews of different beers by different users, with only explicit feedback information. It contained no useful information that could be used to create content-based profiles, so we had to rely on collaborative filtering techniques. The data set comprised 1,563,784 ratings and reviews by 33,252 users of 65,459 beers, brewed by 5,799 breweries and grouped into 64 categories. This is summarized in Table 1.

Reviews     Users    Beers    Breweries  Categories
1,563,784   33,252   65,459   5,799      64

Table 1: An overview of the data set.

Every observation concerned a user giving a beer 5 ratings and one review. The 5 ratings were for the dimension Overall (the global rating) and for four aspects: Appearance, Aroma, Palate and Taste. Ratings are on a discrete scale and can take 9 different values, namely 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5 and 5.

A large number of users had rated only 1 beer, and a large number of beers had only one rating. Inferring user or item properties from such small quantities of information is difficult (Isinkaye et al., 2015); therefore a selection from this data set was made. It is common to train a model on 80% to 90% of the data and test it on the remaining 10% to 20% (Kohavi et al., 1995). We want to be able to test the model on at least 1 rating per user, so in order to make sure that the model is effective for every user we need at least 5 ratings per user (if we use 20% for the test set). However, since evaluating on a single rating is sensitive to outliers, we want to test the model on at least 2 ratings per user for extra statistical certainty, which means we need at least 10 ratings per user when 20% is used for the test set. Analogous reasoning holds for a lower bound on the number of ratings per beer, but in the case of beers more ratings are necessary: it is more difficult to infer objective properties of items from ratings by users, as these ratings are prone to user biases. Therefore the lower bound is set at 50, which from the literature seems a high enough number to allow properties to be inferred (Isinkaye et al., 2015). These lower bounds are applied to the data set, which results in many beers and users being dropped. Later, we will vary the lower bounds in the sensitivity analysis to see whether they impact the results.

The number of beers with at least 50 ratings is 5,096. However, in our final data set we want every user to have at least 10 ratings and every beer to have at least 50 ratings simultaneously, so the final number of users and beers will be lower. Imagine a certain user u having given 10 ratings, including a rating for a beer i with 20 ratings and a beer j with 50 ratings. If we remove all beers that have been rated fewer than 50 times in total, beer i will be removed. Because beer i is no longer in the data set, user u now has 9 ratings and has to be removed as well. Because user u is removed and he had rated beer j, beer j now has 49 ratings, which is less than 50, so it has to be dismissed too, starting the whole cascade from scratch. This network effect can have significant consequences, leading to a completely empty data set if certain lower bounds are chosen. The final data set will therefore have far fewer observations than the original 1,563,784 ratings. After cleaning the data set of users with fewer than 10 ratings and beers with fewer than 50 ratings, the remaining data set has the properties shown in Table 2 (a sketch of this iterative filtering is given after the table).

Reviews     Users   Beers   Breweries  Categories
1,181,878   9,894   4,988   858        63

Table 2: An overview of the cleaned data set.
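The cascade described above can be resolved mechanically by iterating the two filters until a fixed point is reached. Below is a minimal sketch in Python/pandas, assuming a ratings DataFrame with hypothetical columns user_id and beer_id; the thesis does not show its implementation.

```python
import pandas as pd

def apply_bounds(df: pd.DataFrame, min_user: int = 10, min_beer: int = 50) -> pd.DataFrame:
    """Iterate the lower bounds to a fixed point: drop users with fewer than
    min_user ratings and beers with fewer than min_beer ratings, and repeat
    until both bounds hold simultaneously (the cascade described in the text)."""
    while True:
        before = len(df)
        df = df[df.groupby("user_id")["user_id"].transform("size") >= min_user]
        df = df[df.groupby("beer_id")["beer_id"].transform("size") >= min_beer]
        if len(df) == before:        # nothing removed: the cascade has settled
            return df
```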

After cleaning, enough users and beers are left to perform the analysis; descriptive statistics of the cleaned data set are shown in Figures 3, 4 and 5. Since the point of this thesis is to predict ratings of beers, some variation in the ratings is required: if all ratings were 4, rating prediction would be redundant because one could simply predict a 4 every time. In Figure 3 we can see that the ratings 3.5, 4 and 4.5 are the most frequent by a large margin. However, even the rating with the lowest frequency (1) still has 6,546 occurrences, which means there is enough variation in the ratings to make useful predictions.

In Figures 4 and 5 we see the average rating of users and beers respectively. We can see there is considerable variation in average ratings, indicating that user and beer biases have to be included in the models. More details on biases will be explained in the corresponding sections.


Figure 1: The number of ratings per user, sorted in decreasing order, with user 1 defined as the user with the most ratings and user 9,894 the fewest. In order to account for privacy issues, only a number is used to identify users in the data set. The most prolific rater has given over 3,000 ratings, whereas the users with the fewest ratings still in the data set have rated 10 beers (with thousands having rated exactly 10), as per the lower bound.


Figure 2: The number of ratings per beer, sorted in decreasing order, with beer 1 defined as the beer with the most ratings and beer 4,988 the fewest. In order to account for privacy issues, only a number is used to identify beers in the data set. The most-rated beer has been rated around 3,000 times. The counts then drop gradually, but the beers with the fewest ratings have still been rated around 50 times (with hundreds having exactly 50 ratings), as per the lower bound.

Figure 3: Distribution of the ratings in the cleaned data set. We can see that the rating of 4 is the most prevalent, with the rating of 1 being least prevalent.


Figure 4: Distribution of the average rating of users in the cleaned data set.


4 Methodology

We use a standard 80/20 training/test split of the data, as is common in Recommender Systems (Kohavi et al., 1995). Since the fit of a machine learning model on the training data does not necessarily generalize to new unseen data, the test set is withheld in order to evaluate the performance of the model on unseen data. The training data is used to train the model, but since some models rely on parameter values that have to be chosen empirically, certain parameter values have to be chosen and evaluated. These values are estimated on the validation set, a subset of the 80% training set, and the optimal parameter values are selected according to their performance on the validation set. Since data has a stochastic nature, the results of the validation will depend on the data in the validation set; had a different validation set been chosen, the optimal values might have been different. To combat this and provide statistical reliability we use cross-validation: in k-fold Cross-Validation the training set is split into k parts and every part is used as validation set once. See Figure 6 for a detailed explanation. Since we have a data set with a high number of users, randomly selecting 80% of the data as the training set could mean that some users are only present in the training set and some only in the test set, especially since most users have only 10 ratings each. Therefore we use a variant of the All-but-X evaluation scheme, popularized by Breese et al. (1998). This method withholds x observations per user to test on, while the remaining observations are used for training the model. Since some users have rated 10 beers whereas others have rated a thousand, we use a percentage of observations instead of a fixed number: we withhold 20% of the observations per user to test on and 20% to validate on, in accordance with a 60/20/20 split. This ensures that the model generalizes well to all users, irrespective of the number of items they have rated. We therefore slightly modify standard k-fold CV to reflect this reasoning.

Figure 6: The evaluation scheme. In our case, select 20% as the test set and withhold it. From the remaining 80% (the training set) choose 25% (20% of the total data set) as the validation set, which is used to tune the model parameters. Perform k-fold Cross-Validation by randomly dividing the training set into k (nearly) equal parts, training on k − 1 parts and validating on the remaining part. For every run, choose a different validation set, so that in the end every part of the training set has been used as validation set exactly once. Select the model that performs best according to the averaged validation error.

Algorithm 1 k-fold Cross-Validation

1: for each user u do
2:    Get I_u, the set of all items rated by user u
3:    Split the ratings of I_u into an 80% training set (T_u) and a 20% test set
4:    Split T_u into k parts T_u1, ..., T_uk
5: Aggregate the T_uj of all users u into T_j, j ∈ {1, ..., k}, where T_j is the training fold of the j-th run
6: for each run k do
7:    Aggregate all T_j for j = 1, ..., k into T
8:    Train the model on T^(−k) = T \ T_k, with validation set V_k = T_k
9:    Predict the ratings of the validation set V_k and calculate the error (= CV_k)
10: Calculate CV = (1/K) Σ_{k=1}^{K} CV_k
11: Choose the parameter values that minimize CV

We choose k = 4 for the number of cross-validation runs. For users with 10 ratings we have 8 of their ratings in the training set, and we want to validate on at least one rating per user. Validating on a single rating is liable to outliers, however, so to improve statistical accuracy we place 2 ratings per user in each validation set. That means we perform 8/2 = 4 cross-validation runs. The optimal parameter values are then chosen as those that yield the lowest validation error (= CV). The model is trained with the optimal parameter values on the combined training and validation set, and its performance is gauged on the test set to obtain a final predictive measure.
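The per-user 60/20/20 split can be sketched as follows in Python. NumPy/pandas, the column name user_id and the function name are assumptions for illustration, as the thesis does not show code.

```python
import numpy as np
import pandas as pd

def per_user_split(ratings: pd.DataFrame, test_frac: float = 0.2,
                   k: int = 4, seed: int = 0):
    """Per-user variant of the All-but-X scheme: withhold test_frac of every
    user's ratings as test data and deal the rest round-robin into k CV folds."""
    rng = np.random.default_rng(seed)
    fold = np.full(len(ratings), -1)              # -1 = test, 0..k-1 = CV fold
    for _, pos in ratings.groupby("user_id").indices.items():
        pos = rng.permutation(pos)                # shuffle this user's rows
        n_test = max(1, round(test_frac * len(pos)))
        fold[pos[n_test:]] = np.arange(len(pos) - n_test) % k
    train = ratings[fold >= 0].copy()
    train["fold"] = fold[fold >= 0]               # fold id used for validation
    test = ratings[fold == -1]
    return train, test
```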

Different methods exist to measure the performance of the model, such as the Mean Squared Error (MSE), the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). All are variations of the same concept: the difference between the predicted and the actual ratings. We choose the RMSE, the root of the mean squared distance between predicted rating $\hat{r}$ and actual rating $r$, because it is common in Recommender Systems (Aggarwal et al., 2016). Furthermore, it gives an intuitive interpretation of the performance of the model: the average distance of the prediction to the actual rating. The formula is

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{u \in U} \sum_{i \in I_u} (r_{ui} - \hat{r}_{ui})^2}{N}}, \qquad (4.1)$$

where $u$ is a user, $U$ is the set of all users, $I_u$ is the set of items rated by user $u$, $N$ is the number of ratings in the set over which the RMSE is evaluated, and $r_{ui}$ and $\hat{r}_{ui}$ are the observed and predicted ratings of user $u$ for item $i$, respectively. Even though the possible ratings are discrete (1.5, 2, etc.), we choose to predict continuous values. Since recommending is about returning the list with the highest predicted ratings, it becomes unclear which item to recommend if classification is performed or if predicted ratings are rounded off to the nearest value in the rating scale; one may, for example, be left with several predicted 4's. If continuous ratings are predicted, it is clear which items to recommend in a top-N list. Furthermore, accuracy is improved by sticking to continuous prediction (Pazzani and Billsus, 2007).
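A minimal sketch of Equation (4.1) in Python, added here for illustration (the thesis does not show code):

```python
import numpy as np

def rmse(r_obs: np.ndarray, r_pred: np.ndarray) -> float:
    """Root Mean Squared Error of Equation (4.1)."""
    return float(np.sqrt(np.mean((r_obs - r_pred) ** 2)))

# e.g. rmse(np.array([4.0, 3.5, 5.0]), np.array([3.8, 4.0, 4.5])) ≈ 0.42
```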

4.1 Neighborhood Model

We use 4-fold Cross-Validation (CV), as described in Algorithm 1. For each CV run, the ratings of the validation set are predicted using the formula described in the relevant section, while the parameters are varied. In the case of neighborhood models no assumptions have to be made about the underlying structure of the data (Cremonesi et al., 2010). The parameter values that lead to the lowest average RMSE over the validation sets are chosen to obtain the optimal model. This optimal model is used to predict the ratings of the test set, yielding the final accuracy of the model. This procedure is executed for the Neighborhood method in all three frameworks discussed in this thesis: Single Ratings, Multiple Ratings, and Multiple Ratings and Reviews. More details will be given in the corresponding sections.

4.2 Latent Factor Model

We similarly use 4-fold CV for all Latent Factor models discussed in this thesis, as described in Algorithm 1. For each CV run, the ratings of the validation set are predicted using the formula described in the relevant section, while the parameters are varied simultaneously.


In the Latent Factor model an assumption is made about the underlying structure of the data. Therefore multiple parameters have to be estimated. The combination of parameters that leads to the lowest average RMSE over the validation sets is chosen. Using these optimal parameters, the model is trained using the entire training set and the ratings in the test set are predicted. The resulting RMSE is the final accuracy of the model. As for the Neighborhood method, this is done for three different frameworks: Single Ratings, Multiple Ratings, and Multiple Ratings and Reviews.

Latent Factor (LF) model optimization is a problem without a unique or globally optimal solution, and the objective function depends on the values of the latent factors. Since a closed-form solution is not available, several methods exist to optimize LF models, such as Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD), where ALS can be seen as a particular instance of a nonlinear Least Squares solution and SGD is a variant of Gradient Descent, a first-order iterative optimization algorithm that finds the minimum of a function (Comon et al., 2009). Both methods have wide application in Recommender Systems, and as such abundant literature exists to justify their use and detail their applications (Wang et al., 2012a). Since SGD is more intuitive and more common for the models that we will be investigating, this method is chosen. In the general case, Gradient Descent calculates the error between actual and predicted ratings over all observations:

$$E = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2, \qquad (4.2)$$

where $y_n$ is the $n$-th observation of the target variable (the entity that is being predicted), $\hat{y}_n$ is the predicted value of $y_n$ according to the model and $N$ is the number of observations. The algorithm then adjusts the weights of the factors in the direction opposite to the gradient. This ensures that the parameters move towards the local minimum, in the following manner:

$$\theta_{t+1} \leftarrow \theta_t - \gamma \frac{\partial E}{\partial \theta}, \qquad (4.3)$$

where $\theta_{t+1}$ and $\theta_t$ are the to-be-estimated parameters at time $t+1$ and time $t$ respectively, $E$ is the error as described in Equation (4.2), and $\gamma$ is a learning rate that determines the magnitude of the adjustment and therefore controls the time it takes to converge (Chee and Toulis, 2017). The parameters are thus updated once per epoch (a pass over the entire training set), i.e. only after an entire pass over the training set of size $N$. In the case where a data set comprises millions of observations, this entails that it takes a significant amount of time to update the parameters, increasing calculation time by a factor of $10^6$ over a method that updates incrementally after a single observation. As the algorithm only terminates when a local minimum is reached, and convergence can take several tens of epochs, Gradient Descent becomes infeasible where large data sets are concerned (Li et al., 2018). In order to solve this problem we use Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent Instead of summing over all observations before making a single update, with SGD the error and gradient are calculated and the parameters updated after every single observation (Comon et al., 2009). The error becomes

$$E = (y_n - \hat{y}_n)^2, \qquad (4.4)$$

where $y_n$ is the $n$-th observation and $\hat{y}_n$ its predicted value. Notice the difference between Equations (4.2) and (4.4): the error is a sum over all observations in the former case versus a single observation in the latter. The error is calculated for every observation, sequentially and exactly once per observation per pass, so all observations in the data set are taken into account. This more frequent updating greatly improves convergence time (Elahi, 2011). In our case, where parameters need to be optimized, we therefore use SGD. More details specific to the model used will be described in the corresponding section.

We loop our algorithm through the training set for several epochs until convergence. The algorithm has converged when the training error stops decreasing, indicating that a minimum has been reached, or when the decrease in RMSE from one epoch to the next is lower than 0.001. Depending on the learning rate, the step size γ might be too large to settle exactly into the local minimum, so the algorithm repeatedly overshoots it, oscillating back and forth; at that point the training error will not decrease further and the algorithm can be terminated. We choose a threshold of 0.001 because with a learning rate of γ = 0.01 a smaller improvement can indicate that such oscillation around the minimum is taking place (Chee and Toulis, 2017).
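As an illustration of the update rules (4.3) and (4.4) and the stopping criterion above, the following toy sketch applies SGD to a simple linear model. It is not the thesis's latent factor model (which is introduced in Section 5.3); all names and defaults are illustrative assumptions.

```python
import numpy as np

def sgd_linear(X: np.ndarray, y: np.ndarray, lr: float = 0.01,
               tol: float = 0.001, max_epochs: int = 100, seed: int = 0):
    """Toy SGD on a linear model: update the parameters after every single
    observation (Eq. 4.4) and stop once the epoch-to-epoch RMSE improvement
    drops below tol (convergence or oscillation around the minimum)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    prev = np.inf
    for _ in range(max_epochs):
        for n in rng.permutation(len(y)):        # one full pass = one epoch
            err = y[n] - X[n] @ theta            # e_n = y_n - y_hat_n
            theta += lr * 2 * err * X[n]         # theta <- theta - lr * dE/dtheta
        cur = np.sqrt(np.mean((y - X @ theta) ** 2))
        if prev - cur < tol:                     # improvement too small: stop
            break
        prev = cur
    return theta
```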

5 Single Ratings

In this section the accuracy of both the Neighborhood and the Matrix Factorization techniques is evaluated in the Single Ratings framework. That is, only a single rating, the global rating, is taken into account.


5.1 Neighborhood Approach

5.1.1 User-Based Collaborative Filtering

In User-Based Collaborative Filtering (UBCF) a matrix is obtained in which the elements are the ratings of users for items. Then the users with tastes most similar to the active user u (the user for whom the ratings are predicted) are identified, called the Nearest Neighbors (NNs) (Aiolli, 2013). The NNs are determined by calculating the similarity between the active user and all other users in the data set. The predicted rating is then a function of the ratings of the k users that have the highest similarity with user u, i.e. the k-Nearest Neighbors (kNNs). Different similarity measures exist, which will be discussed below. The calculation of similarity can be difficult because different users may have biases towards liking or disliking items, and it is unclear to what extent similarity in ratings is a consequence of similarity in taste. Furthermore, it is unclear how to calculate similarity when users have not rated the same items. These hindrances are addressed by finding the overlapping ratings between users. Denote by $I_u$ the set of items rated by user $u$ and by $I_v$ the set of items rated by user $v$, and define the set of common ratings between users $u$ and $v$ as $S_{uv} = I_u \cap I_v$. It is possible that $S_{uv} = \emptyset$, since thousands of items are usually present in most environments and most users rate only a handful of them, making ratings sparse. The similarity is then computed from the values in $S_{uv}$ according to one of several similarity measures. Distance measures can also be used, in which case the similarity is the inverse of the distance (Isinkaye et al., 2015). We will compare the accuracy of several similarity measures, namely the Cosine and Pearson similarities, which are in common use in recommender systems (Aggarwal et al., 2016). Moreover, they have nice theoretical properties and are intuitive, meaning they are easily understood (Isinkaye et al., 2015). In the following paragraphs the details of each similarity measure are discussed.

Cosine The Cosine similarity is a measure that evaluates the similarity between non-zero vectors of an inner product space. It differs from other similarity measures in that it is a vector-space model based on linear algebra rather than statistics (Isinkaye et al., 2015). It measures distance by comparing the angle between two m-dimensional vectors. The Cosine method is widely used in text mining and related fields, as text documents can easily be represented as vectors. The formula is

$$s_{uv,C} = \frac{\sum_{i \in S_{uv}} r_{ui} r_{vi}}{\sqrt{\sum_{i \in S_{uv}} r_{ui}^2}\,\sqrt{\sum_{i \in S_{uv}} r_{vi}^2}} \quad \forall u, v \in U, \qquad (5.1)$$

with $s_{uv,C}$ the Cosine similarity between users $u$ and $v$, $r_{ui}$ the rating of user $u$ for item $i$, $r_{vi}$ the rating of user $v$ for item $i$, $S_{uv}$ the set of all items that have been rated by both users $u$ and $v$, and $U$ the set of all users.
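A minimal sketch of Equation (5.1) in Python, assuming each user's ratings are stored as a dictionary mapping item to rating (an illustrative choice, not the thesis's implementation):

```python
import numpy as np

def cosine_sim(r_u: dict, r_v: dict) -> float:
    """Cosine similarity of Equation (5.1), computed over S_uv = I_u ∩ I_v."""
    common = sorted(set(r_u) & set(r_v))          # items rated by both users
    if not common:
        return 0.0                                # no overlap: no similarity
    a = np.array([r_u[i] for i in common])
    b = np.array([r_v[i] for i in common])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```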

Pearson Another measure is Pearson's correlation coefficient, a similarity measure that describes the degree to which two variables are linearly related (Huang, 2008). It is formalized as follows:

$$s_{uv,P} = \frac{\sum_{i \in S_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in S_{uv}} (r_{ui} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in S_{uv}} (r_{vi} - \bar{r}_v)^2}} \quad \forall u, v \in U, \qquad (5.2)$$

with $s_{uv,P}$ the Pearson's correlation coefficient between users $u$ and $v$, $r_{ui}$ the rating of user $u$ for item $i$, $r_{vi}$ the rating of user $v$ for item $i$, $\bar{r}_u$ and $\bar{r}_v$ the average ratings of users $u$ and $v$ respectively, $S_{uv}$ the set of all items that have been rated by both users $u$ and $v$, and $U$ the set of all users. Strictly speaking, the averages $\bar{r}_u$ and $\bar{r}_v$ should be calculated only over the items included in $S_{uv}$. Every time user $u$ is compared to another user, only the ratings on items they have in common are included in $S_{uv}$, so $\bar{r}_u$ and $\bar{r}_v$ would differ depending on which user user $u$ is compared to; this would entail calculating a different $\bar{r}_u$ for every comparison. In practice, however, it is common to calculate $\bar{r}_u$ only once per user, since the resulting differences in accuracy are only minor (Aggarwal et al., 2016). Furthermore, it can be argued that retaining a single $\bar{r}_u$ is more informative: if the set of common ratings $S_{uv}$ contains only one commonly rated item, which can easily happen in sparse data sets, centering over $S_{uv}$ subtracts each rating from itself (the average of a single rating is that rating), making both numerator and denominator zero and leaving the correlation indeterminate.
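A matching sketch of Equation (5.2), with the user means passed in precomputed once per user, as the text recommends (again under the illustrative dictionary representation):

```python
import numpy as np

def pearson_sim(r_u: dict, r_v: dict, mean_u: float, mean_v: float) -> float:
    """Pearson similarity of Equation (5.2); mean_u and mean_v are computed
    once over all of a user's ratings rather than per overlap S_uv."""
    common = sorted(set(r_u) & set(r_v))
    if len(common) < 2:
        return 0.0           # a single common item would make Eq. (5.2) indeterminate
    a = np.array([r_u[i] for i in common]) - mean_u
    b = np.array([r_v[i] for i in common]) - mean_v
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```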

The similarity is calculated between the target user u and all other users. The predicted rating can then be based on the ratings of the k-nearest neighbors, i.e. the ratings of the k users with the highest similarity. The number of observed ratings in the k-NN group may vary significantly depending on the item: the k users who are most similar to user u globally may not have rated a specific item i. Therefore, when predicting the rating for item i, the predicted rating is based on the k most similar users that have rated item i. The predicted rating $r_{ui}$ is then a function of the weighted average of the ratings of the k-Nearest Neighbors, the formula for which will be discussed later. A drawback of this method is that different users may rate differently: some optimistic raters may on average rate higher than pessimistic raters, regardless of the beers sampled (Aggarwal et al., 2016). This is also what we saw in Figure 4, where user means varied significantly. By calculating similarities based on raw ratings, the user biases are essentially mixed into the calculation, because a rating is made up of both item preference and bias. To combat this, the ratings are corrected by subtracting user u's average rating from all of user u's ratings. This mean-corrected rating is defined as follows:

$$\dot{r}_{ui} = r_{ui} - \bar{r}_u \quad \forall i \in I_u, \qquad (5.3)$$

where $\dot{r}_{ui}$ is the mean-corrected rating of user $u$ for item $i$, $\bar{r}_u$ is the average rating of user $u$ and $I_u$ is the set of all items rated by user $u$. The mean-corrected ratings of item $i$ by the kNNs are used to make a mean-corrected prediction for item $i$, after which the mean rating of user $u$ is added back to yield the actual predicted rating of user $u$ for item $i$. The predicted rating is then formulated as:

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in \Sigma_u^{(k)}(i)} s_{uv}\, \dot{r}_{vi}}{\sum_{v \in \Sigma_u^{(k)}(i)} s_{uv}}, \qquad (5.4)$$

where $\hat{r}_{ui}$ is the predicted rating of user $u$ for item $i$, $\bar{r}_u$ is the average rating of user $u$, $\Sigma_u^{(k)}(i)$ is the set of the $k$ users most similar to user $u$ that have rated item $i$, $s_{uv}$ is the similarity of user $v$ to user $u$ and $\dot{r}_{vi}$ is the mean-corrected rating of user $v$ for item $i$. The number of nearest neighbors $k$ on which the prediction is based can be chosen in different ways, but it is most common to fit the model on the data and select the value of $k$ that yields the best results (Konstan and Riedl, 2012); more details follow later. A minimal sketch of this prediction step is given below, after which we move on to the other well-established Collaborative Filtering method: Item-Based Collaborative Filtering (IBCF).
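A sketch of Equation (5.4), assuming precomputed similarity and user-mean lookups stored as dictionaries (illustrative names and data structures, not the thesis's implementation):

```python
def predict_ubcf(u, i, sims, ratings, means, k=90):
    """Equation (5.4): user u's mean plus the similarity-weighted average of
    the mean-corrected ratings of the k most similar users that rated item i."""
    raters = [v for v in ratings if v != u and i in ratings[v]]
    raters.sort(key=lambda v: sims[u][v], reverse=True)
    top = raters[:k]                                   # Sigma_u^(k)(i)
    den = sum(sims[u][v] for v in top)
    if den == 0:
        return means[u]                                # fall back to the user mean
    num = sum(sims[u][v] * (ratings[v][i] - means[v]) for v in top)
    return means[u] + num / den
```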

5.1.2 Item-Based Collaborative Filtering

In IBCF the methodology is similar to UBCF, but similarity is calculated between items instead of users. Just as users differ in their average ratings, items differ in quality and hence in ratings, as could also be seen in Figure 5. To compensate for this difference, the ratings for item i are mean-corrected:

$$\tilde{r}_{ui} = r_{ui} - \bar{r}_i \quad \forall u \in U_i, \qquad (5.5)$$

where $\tilde{r}_{ui}$ is the item-mean-corrected rating of user $u$ for item $i$ (contrast this with $\dot{r}_{ui}$, the user-mean-corrected rating described in the UBCF section) and $\bar{r}_i$ is the average rating of item $i$ over $U_i$, the set of all users that have rated item $i$. After obtaining the mean-corrected ratings one can calculate the Adjusted Cosine similarity, a common measure in IBCF (Aggarwal et al., 2016) that measures the similarity between items based on the mean-corrected ratings. The Adjusted Cosine method is a variant of the Cosine similarity mentioned above, and is called adjusted because it uses the mean-corrected ratings when calculating the Cosine similarity, which was not done in the standard Cosine case. Although the standard Cosine and Pearson methods described in the UBCF section can also be used for IBCF, the Adjusted Cosine similarity tends to provide better results (Aggarwal et al., 2016). The formula is

$$s_{ij,AC} = \frac{\sum_{u \in S_{ij}} \tilde{r}_{ui} \tilde{r}_{uj}}{\sqrt{\sum_{u \in S_{ij}} \tilde{r}_{ui}^2}\,\sqrt{\sum_{u \in S_{ij}} \tilde{r}_{uj}^2}} \quad \forall i, j \in I, \qquad (5.6)$$

where $s_{ij,AC}$ is the Adjusted Cosine similarity between items $i$ and $j$, $\tilde{r}_{ui}$ and $\tilde{r}_{uj}$ are the mean-corrected ratings of user $u$ for items $i$ and $j$ from Equation (5.5), respectively, $S_{ij}$ is the set of users that have rated both items $i$ and $j$ and $I$ is the set of all items. The idea is to use the information contained in the known ratings of user $u$ to predict ratings for similar items: if items are rated similarly, then the preference of a user for them should also be similar. Since this concerns ratings of the same user, there is not as much variation as there is between different users, in theory providing a more reliable estimator. Based on the similarity, the top $k$ items most similar to item $i$ can be determined, and the rating of an item can then be predicted as the weighted average of the ratings of similar items. This is formalized as:

$$\hat{r}_{ui} = \frac{\sum_{j \in \Sigma_i^{(k)}(u)} s_{ij,AC}\, r_{uj}}{\sum_{j \in \Sigma_i^{(k)}(u)} s_{ij,AC}}, \qquad (5.7)$$

where $\hat{r}_{ui}$ is the predicted rating of user $u$ for item $i$, $s_{ij,AC}$ is the Adjusted Cosine similarity between items $i$ and $j$, $r_{uj}$ is the rating of user $u$ for item $j$ and $\Sigma_i^{(k)}(u)$ is the set of the $k$ items most similar to item $i$ that user $u$ has rated. A minimal sketch of this prediction step is given below. Now that we have discussed the techniques, we will implement them in practice.
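A matching sketch of Equation (5.7), under the same illustrative data-structure assumptions as the UBCF sketch:

```python
def predict_ibcf(u, i, item_sims, ratings, k=50):
    """Equation (5.7): similarity-weighted average of user u's own ratings on
    the k items most similar to item i (Adjusted Cosine similarities)."""
    rated = [j for j in ratings[u] if j != i]
    rated.sort(key=lambda j: item_sims[i][j], reverse=True)
    top = rated[:k]                                    # Sigma_i^(k)(u)
    den = sum(item_sims[i][j] for j in top)
    if den == 0:
        return 0.0                                     # no usable neighbors
    return sum(item_sims[i][j] * ratings[u][j] for j in top) / den
```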

5.2 Analysis

We now implement the aforementioned techniques on our data set and evaluate the results. The ratings in the validation and test sets are predicted using the k-NN technique and the accuracy of the methods is evaluated. We compare UBCF using the two described similarity measures, Cosine and Pearson; the exact steps are described in Algorithm 2. We then perform IBCF with the Adjusted Cosine measure, as described in Algorithm 3.

Algorithm 2 UBCF

1: for each user u do
2:    for each user v do
3:       Calculate the similarity s_uv between user u and user v
4:    for each item i do
5:       Construct the set Σ_u^(k)(i) of the k users most similar to user u that have rated item i, according to s_uv
6:       Predict r̂_ui based on the ratings of these k most similar users

Algorithm 3 IBCF

1: for each item i do
2:    for each item j do
3:       Calculate the similarity s_ij between item i and item j
4:    for each user u do
5:       Construct the set Σ_i^(k)(u) of the k items most similar to item i that user u has rated, based on s_ij
6:       Predict r̂_ui based on the ratings of these k most similar items

In order to make sure that every part of the training set has been in the validation set, we perform 4-fold Cross-Validation, as described in Algorithm 1. We then select the parameter k for the k-Nearest Neighbors scheme that yields the best model, based on the lowest average RMSE when predicting the ratings of the validation set over the 4 cross-validation runs. In some cases, even though k is set to a certain value, the number of ratings available for an item may be smaller than k. In that case the number of available ratings is used, effectively making the number of neighbors min(k, |Σ_u^(k)(i)|). The optimal value of k for each method (Cosine, Pearson and Adjusted Cosine) is then used for the model, after which the ratings of the test set are predicted to determine the final performance of the three methods.

Figure 7: Plot of the RMSE on the validation set against different values of k in the kNN prediction scheme. For the Cosine method the RMSE keeps decreasing but is flattening out, suggesting a value of k = 100 or k = 110 might be optimal; this was investigated by extending the range beyond the initially selected maximum, to k = 110 and k = 120, but there the RMSE increased again, indicating overfitting. For the Pearson method the minimum is reached at k = 90, after which the RMSE starts increasing again. For the Adjusted Cosine the minimum is reached at k = 50.

In kNN it is important to check a wide range of neighbor counts, as the optimal number depends on the data set. If too many neighbors are used in the prediction scheme, the opinions of too many users with little similarity to the active user are taken into account, leading to biased predictions. On the other hand, with too few neighbors the results are sensitive to outliers: due to chance, two users could be deemed similar by the measure even though they share little preference, because some users may have given similar ratings despite differences in taste. An optimum is reached at a point that balances these two opposing forces. In the literature, values in the interval [10, 100] are usually investigated, as this provides a range wide enough to contain the optimum (Zhang et al., 2015). As Zhang et al. (2015) make use of a similar data set and model, we have no reason to deviate from this. We take steps of 10, since there should be no significant difference between, say, 80 or 81 neighbors.

We can see the results of the model validation in Figure 7. IBCF performed worst, being close to 10% less accurate than the UBCF Cosine and Pearson methods. The Cosine and Pearson similarity methods had very similar accuracies and a similar progression, slowly decreasing in RMSE as k increased until reaching a minimum around 90 or 100. We would expect an optimum to be reached within our interval, which is indeed the case for the Adjusted Cosine and Pearson similarities; for the Cosine similarity, however, there is no interior optimum. Since most users had only very few ratings, more neighbors led to more accurate results: when few neighbors were used, the similarity was based on dissimilar neighbors, leading to biased results. These results are only indicative of the training set, however, so to determine the final fit we choose the optimal values found above, k = 100 and k = 90 for the UBCF Cosine and Pearson methods respectively, and k = 50 for the IBCF method with Adjusted Cosine similarity. These parameter values are used for the final model that predicts the ratings of the held-out test set. The results are shown in Table 3.

Parameters   Cosine (UBCF)   Pearson (UBCF)   Adjusted Cosine (IBCF)
RMSE         0.6226          0.6213           0.6824
k            100             90               50

Table 3: Performance of the model on the test set in the kNN prediction method for the different similarities. UBCF is the User-Based Collaborative Filtering method, IBCF the Item-Based Collaborative Filtering method. k is the number of neighbors used when predicting the ratings.

In terms of accuracy the UBCF Cosine and UBCF Pearson similarity methods are virtually equivalent, yielding a similar RMSE. Although Pearson has a slightly lower RMSE, this effect is marginal and could be due to the random nature of the data split. The Adjusted Cosine measure does worse, with an RMSE that is nearly 10% higher than the user-based methods. This could be due to the way similarity is calculated between items in this case. Calculating similarity between users by comparing the ratings two users have given to the same item is a good indication of preference: if they have rated an item the same, they must have similar preferences. In contrast, basing predicted ratings on two similar ratings by the same user is imperfect: since no item characteristics are available, the similar ratings could be due to distinct characteristics of the two items each appealing to the user, providing no basis for expecting that other similar items will also be appealing. Future research could try to include item characteristics and see whether the difference holds. This could also be done for user characteristics, but is less necessary, since rating behavior already provides enough information to base a model on. Now that we have the results of kNN with a single global rating, we will later see whether expanding this model to include multiple ratings yields more accurate predictions.

5.3 Model-Based Approach

A popular application of the latent factor approach is Matrix Factorization (MF). MF models map user and item characteristics to a joint latent space of dimension f, and preferences of users towards items are computed as inner products between latent factors that reflect item characteristics and latent factors that represent user preferences towards those characteristics (Koren and Bell, 2015). We will start by discussing the basic MF model and evaluate its performance before adding layers of complexity, to see whether the increased complexity is better able to describe the data. The MF approach can be formalized as follows: each user $u$ has an associated factor vector $q_u \in \mathbb{R}^f$, and each item $i$ has a factor vector $p_i \in \mathbb{R}^f$. As every user and every item has a set of $f$ factors, the dimensions of $q$ and $p$ are $n \times f$ and $m \times f$ respectively (with $n$ the number of users, $m$ the number of items, and $f$ the number of factors). For item $i$, $p_i$ measures the extent to which the item possesses the particular factors, which can be either positive or negative. For user $u$, $q_u$ measures the extent of the user's preference towards those factors, which can again be positive or negative. The resulting dot product of $q_u$ and $p_i$ measures the preference of user $u$ towards item $i$, formalized as:

$$\hat{r}_{ui} = q_u^T p_i \quad \forall u \in U, \ \forall i \in I, \qquad (5.8)$$

where $\hat{r}_{ui}$ is the predicted rating of user $u$ for item $i$, and $q_u$, $p_i$ are the factor vectors of user $u$ and item $i$, respectively.


This model tries to capture the interaction between users and items to yield a predicted rating. However, it implicitly assumes that all variation in ratings stems from these interactions, since no user-specific or item-specific terms are included. A reasonable assumption is that variation in ratings is at least partially explained by user- or item-specific effects, independent of interactions, otherwise known as biases (Koren et al., 2009). More specifically, some users are prone to give higher or lower ratings than others, regardless of the quality of the beer. Similarly, some beers are of better quality and will thus be rated higher, independent of who rated them. This is also what can be seen in Figures 4 and 5, where user and item averages differ significantly, adding weight to the assumption that user- and item-specific factors should be accounted for. Therefore, user- and item-specific biases should be included in the model. While there are specific differences in user and item average ratings, ratings also tend to hover around a global mean: as we saw in Figure 3, most ratings were centered around the 4 mark, meaning that whatever the user and item, a rating near this value is likely. To take this into account, we add a global bias, as is also done in the literature (Koren et al., 2009).

A benefit of MF techniques is that they allow flexibility in the equations to be estimated: the model can be adapted while remaining in the same framework and subject to the same optimization techniques (Koren and Bell, 2015). This means we can gradually increase the complexity of the model, as we will do later on. Taking the aforementioned biases into account, the predicted rating of our model becomes:

$$\hat{r}_{ui} = \mu + \beta_u + \beta_i + q_u^T p_i \quad \forall u \in U, \ \forall i \in I, \qquad (5.9)$$

where $\mu$ is the global average, $\beta_u$ and $\beta_i$ are the user and item biases respectively, and $q_u$ and $p_i$ are the factor vectors of user $u$ and item $i$, respectively. This is the first basic MF model that we will analyze and evaluate before adapting it to include multiple ratings, and finally multiple ratings and reviews.
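As a small worked example with hypothetical values: against a global mean of $\mu = 3.8$, a generous user ($\beta_u = 0.3$) rating a slightly below-average beer ($\beta_i = -0.2$) whose characteristics happen to match his tastes ($q_u^T p_i = 0.5$) receives a predicted rating of:

```latex
\hat{r}_{ui} = 3.8 + 0.3 - 0.2 + 0.5 = 4.4 .
```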

We first fit the model on the training set, and subsequently find the optimal parameters by calculating the model's accuracy when predicting the values of the validation set according to the RMSE defined in Equation (4.1), using 4-fold Cross-Validation as described in Algorithm 1. The model contains parameters $\mu$, $\beta_u$, $\beta_i$, $q$ and $p$ that need to be optimized, using SGD as described


in Section 4.2. The error function per observation is:

$$E_{ui} = (r_{ui} - \mu - \beta_u - \beta_i - q_u^T p_i)^2 \quad \forall u \in U, \ \forall i \in I, \qquad (5.10)$$

where $E_{ui}$ is the squared error for the observation of user $u$ on item $i$, $\mu$ is the global mean, $\beta_u$ is the bias of user $u$, $\beta_i$ is the bias of item $i$, and $q_u$ and $p_i$ are the factor vectors of user $u$ and item $i$, respectively.

A caveat of MF-based techniques is the risk that the parameters fit themselves to the training data without improving generalization performance, since the parameter values are determined by minimizing the error on the training data (Schölkopf et al., 2002). An observation usually contains both signal and noise, and by fitting the training data too closely the model also fits the noise, with the result that it does not generalize well to observations outside the training set, where other realizations of random noise are present. This is called overfitting (Said and Bellogín, 2018). Parameters can attain very high values in order to fit observations that are partially due to noise, counterbalanced by very low values in other parameters. Model regularization addresses this problem and increases generality by decreasing the test set error while keeping the training set error roughly the same (Cheng and Luo, 2018; Lemberger, 2017). This is done by adding a penalty term weighted by a coefficient λ. The penalty commonly takes one of two forms, the L1 and L2 norms: L1 ensures that the sum of the absolute values of the parameters is small, whereas L2 ensures that the sum of their squared values is small.
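To make the distinction explicit, for a generic parameter vector $\theta$ the two penalties can be written as follows (a standard restatement, not notation taken from this thesis; $\lambda_1$ and $\lambda_2$ weigh the respective penalties):

```latex
\Omega_{L_1}(\theta) = \lambda_1 \sum_j |\theta_j|,
\qquad
\Omega_{L_2}(\theta) = \lambda_2 \sum_j \theta_j^{2}.
```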

It has been observed in practice that the L1 penalty can drive many parameter values to exactly 0, making it an appropriate candidate for feature selection when many parameters are believed to be irrelevant (Ng, 2004). In our case, however, we have no reason to assume that some parameters are irrelevant, since the factors together describe rating behavior, with every factor capturing a certain characteristic. We therefore use the L2 norm, for the aforementioned reason of preventing values from getting too large. Since we only use L2, its regularization coefficient $\lambda_2$ will from now on be referred to as $\lambda$. The total error function to be minimized is then:

$$\min_{q,p,\beta,\mu} E = \sum_{u \in U} \sum_{i \in I_u} \left[ (r_{ui} - \mu - \beta_u - \beta_i - q_u^T p_i)^2 + \lambda\left(\beta_u^2 + \beta_i^2 + \|q_u\|^2 + \|p_i\|^2\right) \right], \qquad (5.11)$$

with parameters as in Equation (5.9), with the inclusion of the regularization coefficient $\lambda$, which ensures that $\beta_u$, $\beta_i$, $q_u$ and $p_i$ do not grow so large that they overfit the training data. Note that $\mu$ is not regularized, since it is the intercept: penalizing the intercept would make the procedure depend on the origin chosen, i.e. if a constant $c$ were added to every $r_{ui}$, the fitted intercept would


not shift by that same amount. We want $\mu$ to capture only a vertical shift of the ratings, hence the lack of regularization. Before we start the Stochastic Gradient Descent optimization algorithm we have to initialize the parameters.

Parameter initialization The parameters that have to be initialized are $q$, $p$, $f$, $\lambda$, $\gamma$, $\beta_u$, $\beta_i$ and $\mu$. It is common to randomize the starting values of $q$ and $p$ (which are of dimension $n \times f$ and $m \times f$ respectively, for $n$ users, $m$ items and $f$ factors) (Linshi, 2014). Since preferences are unknown a priori, they can be drawn from a uniform distribution. Ratings $r$ can take values between 1 and 5, and the predicted rating $\hat{r}_{ui}$ is the inner product of a $1 \times f$ transposed user factor $q_u^T$ and an $f \times 1$ item factor $p_i$. The resulting predicted rating $\hat{r}_{ui} = \sum_{j=1}^{f} q_{uj}\, p_{ij}$ lies between 1 and 5 if the elements of $q_u$ and $p_i$ lie between $\sqrt{1/f}$ and $\sqrt{5/f}$. To accommodate the fact that users can like as well as dislike items, meaning the elements can be positive or negative respectively, the initial values of the factor vectors are sampled from a $\mathrm{UNIF}\!\left(-\sqrt{5/f}, \sqrt{5/f}\right)$ distribution.

The number of latent factors $f$, the number of dimensions on which the algorithm scores each user and each beer, is an important choice, since it determines the number of dimensions on which a user-item interaction is based. In the literature, values between 20 and 100 are used (Koren et al., 2009; Cremonesi et al., 2010). We want enough factors to uncover hidden relationships, but not so many that factors become redundant. Since no theoretical basis exists for selecting $f$, we will search the range suggested by the literature, [20, 100], extending the lower bound to 10 because a few powerful characteristics might already suffice to capture the interactions. We check this interval in increments of 10, because smaller steps yield no significant difference in performance (Koren and Bell, 2015). We then compare the performance of the model across the different values of $f$ and see which value yields the best fit. Similarly, the regularization coefficient $\lambda$ determines the fit of the model on the data by limiting the extent to which the parameters can attain large values. In the literature, values between $\lambda = 0.001$ and $\lambda = 10$ are usually used (Cheng and Luo, 2018; Koren et al., 2009), but the optimal value depends on the data, so we will determine empirically which value of $\lambda$ gives the best fit.

$\gamma$ is also a parameter, but it only controls the speed of convergence, so it is less relevant for finding the most accurate model. Choosing a constant rate for $\gamma$ means a larger oscillation radius and a lower probability of convergence, but a faster exit from the transient


phase. Conversely, a decreasing rate takes longer to leave the transient phase but is more likely to converge (Chee and Toulis, 2017; Elahi, 2011). $\gamma$ is usually set at a low value, such as $\gamma = 0.01$ or $\gamma = 0.02$, so that a local minimum is not overshot. This does mean that convergence takes longer and may require several epochs (Nóbrega and Marinho, 2014). We take $\gamma = 0.01$ in order to maximize the probability that we reach a minimum. Unless stated otherwise, $\gamma$ is chosen as 0.01 where appropriate.

$\mu$ is found by calculating the average rating over the whole training set:

$$\mu = \frac{1}{N} \sum_{u \in U} \sum_{i \in I_u} r_{ui},$$

where $U$, $I_u$ and $N$ are the set of users, the set of items user $u$ has rated, and the number of observations, respectively. $\beta_u$ and $\beta_i$ are then found by subtracting the global mean from the average ratings of user $u$ and from the average ratings of item $i$, respectively:

$$\beta_u = \frac{1}{\|I_u\|} \sum_{i \in I_u} r_{ui} - \mu \quad \text{and} \quad \beta_i = \frac{1}{\|U_i\|} \sum_{u \in U_i} r_{ui} - \mu,$$

with $\|I_u\|$ and $\|U_i\|$ the sizes of the sets $I_u$ and $U_i$ respectively, where $U_i$ is the set of users that have rated item $i$. Now that we have the starting values of the model we can start to optimize and determine the fit.
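As a minimal sketch of this initialization, assuming the training data is represented as a list of (user, item, rating) triples with integer ids (our own representation, not the thesis's code):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def init_params(train, n_users, n_items, f):
    """Initialize mu, beta_u, beta_i, q and p as described above.
    train: iterable of (user, item, rating) triples."""
    bound = np.sqrt(5.0 / f)
    q = rng.uniform(-bound, bound, size=(n_users, f))  # user factors
    p = rng.uniform(-bound, bound, size=(n_items, f))  # item factors
    mu = np.mean([r for _, _, r in train])             # global mean
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in train:
        by_user[u].append(r)
        by_item[i].append(r)
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    for u, rs in by_user.items():
        b_u[u] = np.mean(rs) - mu  # user's average deviation from mu
    for i, rs in by_item.items():
        b_i[i] = np.mean(rs) - mu  # item's average deviation from mu
    return mu, b_u, b_i, q, p
```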

Parameter optimization With the parameter values as chosen above we will start the algorithm, looping through the whole training set for multiple epochs until a minimum is reached, updating the parameters according to the framework as described in Equation (4.3). More specifically:

$$q_{u,n+1} \leftarrow q_{u,n} - \gamma(-e_{ui}\, p_{i,n} + \lambda q_{u,n}), \qquad (5.12)$$

$$p_{i,n+1} \leftarrow p_{i,n} - \gamma(-e_{ui}\, q_{u,n} + \lambda p_{i,n}), \qquad (5.13)$$

$$\beta_{u,n+1} \leftarrow \beta_{u,n} - \gamma(-e_{ui} + \lambda \beta_{u,n}), \qquad (5.14)$$

$$\beta_{i,n+1} \leftarrow \beta_{i,n} - \gamma(-e_{ui} + \lambda \beta_{i,n}), \qquad (5.15)$$

and

$$\mu_{n+1} \leftarrow \mu_n + \gamma e_{ui}, \qquad (5.16)$$

with variables as defined before, where the subscript $n+1$ signifies the new parameter values if observation $n$ concerned user $u$ and item $i$, $e_{ui} = r_{ui} - \mu - \beta_u - \beta_i - q_u^T p_i$ is the prediction error (so that $E_{ui} = e_{ui}^2$ as in Equation (5.10)), and $\gamma$ is the learning rate. We then loop through the data set until convergence, as explained in Section 4.2.
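The updates above translate directly into code. Below is a hedged sketch of a single SGD epoch under the same assumptions as the initialization sketch (train as a list of (user, item, rating) triples); the default γ and λ follow the values discussed in the text, and in practice this function would be called repeatedly until the convergence criterion of Section 4.2 is met.

```python
import random

def sgd_epoch(train, mu, b_u, b_i, q, p, gamma=0.01, lam=0.01):
    """One stochastic pass over the training triples, applying
    updates (5.12)-(5.16). Arrays are modified in place; the
    updated global mean mu is returned."""
    random.shuffle(train)  # visit observations in random order
    for u, i, r in train:
        e = r - (mu + b_u[u] + b_i[i] + q[u] @ p[i])  # prediction error e_ui
        q_old = q[u].copy()  # keep pre-update q_u for p_i's step
        q[u] += gamma * (e * p[i] - lam * q[u])
        p[i] += gamma * (e * q_old - lam * p[i])
        b_u[u] += gamma * (e - lam * b_u[u])
        b_i[i] += gamma * (e - lam * b_i[i])
        mu += gamma * e  # the global mean is not regularized
    return mu
```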


5.3.1 Analysis

We train our model using 4-fold CV on the training set and validate the performance on the validation set, as described in Algorithm 1. In our case, we tune the number of factors ($f$) and the regularization coefficient ($\lambda$). Since a value of $\lambda$ that is optimal at one value of $f$ need not be optimal at another, we optimize $f$ and $\lambda$ simultaneously, performing a grid search over combinations of both and selecting the combination with the lowest RMSE. We first need a neighborhood in which to search for the optimal $\lambda$, as $\lambda = 0.001$ to $\lambda = 10$ is too broad an interval, and we need a value for $f$ before we can run the optimization algorithm. The number of parameters is linear in $f$, so if a value of $\lambda$ is optimal at one value of $f$, the optimal value at another $f$ may differ but not by several orders of magnitude. We can therefore find the optimal $\lambda$ at a specific $f$ and be reasonably certain that the optimum at other values of $f$ lies in an interval around it, provided we choose that interval wide enough; during the grid search we will still span several orders of magnitude around the found optimum to make sure it is contained in the interval. We choose $f = 100$ for this first run: this is the case with the most parameters, so the regularization term has the most influence and the model is most sensitive to $\lambda$, giving a good starting point for the interval of $\lambda$ during the grid search. We start by validating the model over the whole range of $\lambda$, from $\lambda = 0.001$ to $\lambda = 10$, with $f = 100$. The results can be seen in Figure 8; the convergence of the algorithm is shown in the Appendix, in Figure 21.
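A sketch of the grid search just described, reusing the init_params and sgd_epoch sketches from above. The fold construction, fixed epoch count and λ grid are illustrative assumptions; in the thesis the number of passes is governed by the convergence criterion of Section 4.2.

```python
import itertools
import numpy as np

def grid_search(folds, n_users, n_items, n_epochs=30,
                f_grid=range(10, 101, 10),
                lam_grid=(0.001, 0.0032, 0.01, 0.032, 0.1)):
    """4-fold CV grid search over (f, lambda). folds: list of
    (train, valid) splits, each a list of (user, item, rating)
    triples. Returns the (f, lambda) pair with lowest mean RMSE."""
    best_f, best_lam, best_rmse = None, None, np.inf
    for f, lam in itertools.product(f_grid, lam_grid):
        fold_rmse = []
        for train, valid in folds:
            mu, b_u, b_i, q, p = init_params(train, n_users, n_items, f)
            for _ in range(n_epochs):  # fixed epochs stand in for convergence
                mu = sgd_epoch(list(train), mu, b_u, b_i, q, p, lam=lam)
            sq_err = [(r - (mu + b_u[u] + b_i[i] + q[u] @ p[i])) ** 2
                      for u, i, r in valid]
            fold_rmse.append(np.sqrt(np.mean(sq_err)))
        if np.mean(fold_rmse) < best_rmse:
            best_f, best_lam, best_rmse = f, lam, np.mean(fold_rmse)
    return best_f, best_lam, best_rmse
```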

There we see that for very low values of $\lambda$ the RMSE shows only minor variation, decreasing and then increasing again, so the effect of the regularization is not entirely stable but remains roughly constant. Conversely, for too high a value of $\lambda$ the performance deteriorates markedly, starting from around $\lambda = 0.032$, indicating that too strong a regularization does not give the model the flexibility it needs to fit the data. A minimum is reached around $\lambda = 0.01$. We will therefore perform our grid search in an interval around $\lambda = 0.01$, varying the number of factors $f$ and $\lambda$ simultaneously. The results can be seen in Figure 9.

In Figure 9 we can see the performance of the model on the validation set. The RMSE is lowest for values around $\lambda = 0.01$ and changes noticeably as $\lambda$ moves away from this value. As there is little overfitting, the model appears well-specified: no significant decrease in RMSE can be achieved with a substantially different structural form.


Figure 8: Plot of the average RMSE (after 4-fold CV) of the validation set for different values of λ and f = 100. A minimum is seen around λ = 0.01, yielding an RMSE of 0.59652.

This could also be due to the fact that we initialized the factor vectors with values inversely proportional to their dimension $f$, effectively ensuring that the number of factors matters little for the overall result. Still, there is some variation, which implies there is some degree of overfitting and that a lower number of factors is preferred. The RMSE stays relatively constant for higher values of $f$, such as in the range [80, 100], but drops sharply as the number of factors decreases towards 20. This indicates that scoring a user-item interaction on 80 or more dimensions does not reflect reality well: users appear to take relatively few factors into account, leaving the optimal value of $f$ in the lower range of the investigated interval. Within the interval investigated for $f$ there is thus little overfitting, which would possibly only set in at much higher values of $f$. Taking the optimal values $f = 20$ and $\lambda = 0.01$ for the optimal model, we validate its performance on the test set, the results of which are seen in Table 4.


Figure 9: Plot of the average of the RMSE of the validation set after 4-fold CV for different values of f and λ. A minimum RMSE of 0.59647 is found for λ = 0.01 and f = 20.

Method   Factors (f)   Regularization Term (λ)   RMSE
MF-α     20            0.01                      0.5952

Table 4: Performance of the optimal MF model on the test set. This basic Matrix Factorization (MF) model in the Single Ratings framework is dubbed MF-α.
