
Master Thesis

MSc. Marketing Intelligence

Faculty of Economics and Business

Department of Marketing

The effectiveness of implicit feedback for an online news recommender system

January 2019

By Floris Kalk

s2173212 Wielewaalplein 284 9713BR Groningen f.kalk@student.rug.nl +31611779272


Abstract


Contents

Abstract
1. Introduction
1.1. Types of recommendation systems
1.2. News Recommenders
1.3. The Netflix prize
1.4. Problem statement and research question
1.5. Contributions
1.6. Structure
2. Theoretical framework
2.1. Recommender systems in different products and services
2.2. News recommender systems
2.3. Simple methods
2.4. Collaborative filtering methods
2.5. The history behind matrix decomposition
2.5.1. Principal Component Analysis
2.5.2. Singular Value Decomposition
2.5.3. Different matrix factorization methods
2.6. Matrix Factorization
2.7. Explicit feedback and implicit feedback
2.8. Offline and Online experiments
2.10. Data and privacy
3. Research Design
3.1. Data collection
3.2. Data description
3.3. Datasets
3.3.1. User data
3.3.2. Item data
3.4. Data pre-processing
3.5. Variable creation
3.5.1. Page time variable
3.5.2. Word count variable
3.5.3. Hit bottom variable
3.6. Methodology
3.6.1. The MF algorithm
3.6.2. Hyperparameters
3.6.3. Implementation of the algorithm
3.6.4. Offline database testing of MF
3.6.4.1. Mean Percentile Ranking
3.6.4.2. Precision at K
3.6.5. Online Real-time testing of MF
3.6.5.1. Online Evaluation criteria
3.6.5.2. A/B-testing
3.6.5.3. Two conditions
3.6.5.4. Hypotheses for A/B-test
3.6.5.5. Sample size and test duration
4. Results
4.1. Offline evaluation results
4.1.1. Implicit feedback variable
4.1.2. M@P and MPR
4.2. Online evaluation results
4.2.1. Parameter optimization
4.2.2. Click through rate
4.3.3. Total article views per user
5. Discussion and conclusion
5.1. Conclusions
5.2. Managerial implications
5.3. Limitations
5.4. Recommendations for further research
5.5. General conclusion
6. References


1. Introduction

Recommendation systems (RSs) can be really powerful. In 1988, the mountain-climbing book ‘Touching the Void’ was released, which narrated the adventure of Joe Simpson in the Peruvian Andes. It did not gain a lot of attention initially. However, some years later, another book about mountain-climbing was released, called ‘Into Thin Air’. This book did gain a lot of attention and generated a lot of sales, also via Amazon’s online bookstore. Amazon’s RS noticed people buying both books and started recommending ‘Touching the Void’ to people buying or considering ‘Into Thin Air’. This resulted in a lot of sales and attention, and eventually ‘Touching the Void’ even became more popular than its counterpart. Without Amazon’s online bookstore and its RS, the book would not have been seen by so many potential buyers and would never have gained so much popularity (Anderson, 2004).

RSs are all around us, and that is not surprising. When people are confronted with too much information, they suffer a so-called information overload. Research has shown that too many choices negatively affect cognitive ability, leading to poor-quality choices (Iyengar & Lepper, 2000). Therefore, RSs were designed to overcome this information overload problem (Kille, Lommatzsch & Brodt, 2015).

Since the initial years of development, RSs have become big business. This makes sense, since consumers are susceptible to recommendations and generally perceive them as positive (Fitzsimons & Lehmann, 2004). The success of companies such as Netflix relies heavily on their ability to recommend the right movies to their users. In fact, Netflix's Chief Product Officer Neil Hunt argued that their personalization and recommendation systems combined are worth $1B per year. Moreover, 80 percent of the content users view on Netflix is recommended by algorithms, while only 20 percent comes from user searches (McAlone, 2016). It is evident that recommendation plays a huge role in the final choice of content.

RSs have the potential to play a big role in online news by aiding our search for news. By helping in the browsing process, the information overload can be reduced: the RS filters articles for you, thereby helping you solve this decision problem (Kille et al. 2015).


articles, videos and graphics a day, the Wall Street Journal 240 stories, and the New York Times publishes 230 pieces of content (Meyer, 2016).

The first time the topic of news recommendation really caught the attention of the general public was when we learned about the Facebook algorithms that potentially played a decisive role in the US elections. Facebook recommendation algorithms have been criticized for focusing on emotion-centered engagement (Manjoo & Roose, 2017). Even though the exact contents of their strategy to increase engagement are unknown, critics worldwide agree that their recommendation systems play a big role in steering sentiment of the general public.

Luckily, news recommendation systems can also be used in a more fruitful way, by implementing systems that are designed not just for maximizing emotional response, but for content and articles that matter.

Matrix Factorization (MF) is a widely used method developed for RSs; companies such as Netflix make use of it. MF uses similarities between user behavior to recommend items and is a form of collaborative filtering. User behavior displays underlying user preferences, which are used to calculate so-called factor scores for every item-user combination. By comparing the behavior of one user to that of other users, latent factors are discovered that drive user preferences. By applying reduced rank matrix factorization, the preferences are split up into two rectangular tables: one for user factor scores and one for item factor scores. To discover whether a user will like an item, we calculate the inner product of the corresponding item and user factor scores. This total score represents the suspected liking, or expected preference, for an item. The higher the score, the more likely the user is to like the item.

Apart from collaborative filtering, there are other methods used in RSs.

1.1. Types of recommendation systems


in the past and making recommendations based on these past preferences for product attributes (Ansari, Essegaier & Kohli, 2000; Wedel & Kannan, 2016).

Collaborative filtering, as used in MF, does not use content to group information. Instead, it uses the opinion of peers to build up recommendations, thereby recommending items to a user based on the items consumed by users with similar tastes (Liu, Dolan & Pedersen, 2010; Garcin, Zhou, Faltings, & Schickel, 2012; Felfernig, Friedrich, Jannach & Zanker, 2006).

This method is used in practice by Amazon and Netflix. Their systems aim to find users that give similar ratings to products or movies, hereby determining taste. Then products which fall into similar taste categories are recommended to that user (Hardesty, 2017).

The third type of filtering combines both methods. Hybrid filtering features a mixture of aspects from both information filtering and collaborative filtering (Gu, Dong, Zeng & He, 2014). Newspaper websites, including the New York Times website, use a Collaborative Topic Model that combines preferences and contents (Spangher, 2015). Authors such as Liu et al. (2010) also use a hybrid method. They devise a Bayesian framework to predict news preferences, combining content-based recommendations with existing collaborative filtering. Additionally, Ahn (2006) combines genre information and buyers' ratings expressing popularity into a hybrid method.

1.2. News Recommenders

The ever faster growth of the internet and its content presents some unique challenges (McMillan, 2000). One of the biggest challenges for news websites is to help readers find articles that are interesting for them (Liu et al. 2010).

This is complicated, because news preferences are trend-sensitive, users do not want to see more articles about one subject, and there often is not enough information to make reader profiles (Garcin, Dimitrakakis & Faltings, 2013). Considering articles expire quickly, collaborative filtering appears less suitable for news recommendation (Okura, Tagami, Ono & Tajima, 2017). Hence, this type of filtering is quite uncommon in news recommendation (Garcin et al. 2012). Word-based methods, using content filtering, have issues with synonyms and orthographical variants. It is difficult to infer the specific meaning of a word from context, and therefore word-based methods also appear less usable in the context of news recommendation (Okura et al. 2017).

1.3. The Netflix prize


others, used a combination of over a hundred algorithms to win the prize (Amatriain & Basilico, 2015). There were two algorithms that contributed a lot to the performance of the prize-winning ensemble. One of them is based on Matrix Factorization (MF).

The MF model projects both the items and the users in the same latent factor space and computes similarities between them (Kille et al., 2015). The winners, Koren and Volinsky, in collaboration with Hu (2008), later continued their work on MF for TV recommenders by making it able to perform on datasets with implicit feedback, rather than the explicit feedback most earlier work relied on. The difference lies in how the feedback data is acquired. Explicit feedback is information willingly provided by users, such as star ratings or thumbs-up. Implicit feedback is feedback that is not willingly provided by users, but where preference can be inferred from user behavior, e.g. viewing time (Jawaheer, Szomszor & Kostkova, 2010). Hu et al. (2008) state that up until then, most of the existing literature had focused on explicit feedback, probably due to its ease of use. However, in practice, considering it is more frequently available than explicit ratings, most RSs need to be designed to incorporate implicit feedback.

1.4 Problem statement and research question

Chung & Rao (2012) perform a comparative study between several types of filtering methods and find that SVD++ based on MF by Hu et al. (2008) performs exceptionally well.

To my knowledge, their method for implicit datasets remains to be tested against a news recommendation dataset. Because news websites rarely ask users to rate articles, explicit ratings are not readily available. There is, however, a lot of implicit feedback available – such as dwell time and scroll length – which are gathered on news websites. Moreover, the MF model can be recalculated frequently, to accommodate the quick expiration of articles. Lastly, MF is capable of discovering latent factors, which would require less information to detect reader profiles. Therefore, the aim of this thesis is to discover how implicit feedback can be used in designing a RS. In other words:

How can implicit feedback improve a news recommender by using matrix factorization (MF)?


of RSs (see e.g. Amatriain & Basilico, 2015; Garcin, 2014; ter Hoeve et al., 2018; Peska & Vojtas, 2015), hence the second and third sub-questions are posed.

RQ1: How can implicit feedback be captured in a variable in order to be used in a recommender?

RQ2: To what extent does implicit MF outperform benchmark models in an offline evaluation?

RQ3: To what extent is implicit MF able to outperform the benchmark in online evaluation?

1.5. Contributions

This work adds to the existing literature on RSs by addressing the usefulness of MF for news article recommenders. It evaluates what types of implicit feedback can be used and how they should be converted into a variable. Furthermore, the paper investigates how MF holds up against benchmark models in an offline and online evaluation.

1.6. Structure

The next parts of this paper are organized as follows:

- Section 2 is a literature review of related work. It describes the use of recommender systems over different industries with special attention to MF in collaborative filtering.
- Section 3 is the methodology section in which the data collection is described, as well as the preparation of the data for applying MF. In addition, it is described how a combination of offline and online testing will be used to test various versions of MF.
- Section 4 provides the analysis and evaluation of the performance of the different models tested offline and online.
- Section 5 consists of a discussion of the results from the offline and online testing.
- Section 6 provides an explanation and summary of the most important findings. Also, limitations are identified and suggestions for future work are provided.

2. Theoretical framework

2.1. Recommender systems in different products and services

RSs have sparked a lot of interest in the academic world and have been investigated in multiple research areas (Hu et al., 2008).


Smyth, 2006). Senecal & Nantel (2004) find that online recommendations have a stronger effect on consumers’ product choices than conventional recommendation sources such as other consumers and experts. This illustrates that online stores can benefit greatly from implementing good RSs on their websites. It is therefore not surprising that RSs are becoming more sophisticated over the years. For example, McGinty & Smyth (2006) use adaptive selection to increase diversity, and are able to dramatically improve the performance of conventional e-commerce recommenders. Felfernig et al. (2006) take it one step further by devising a knowledge-based recommender called CWAdvisor for online selling platforms.

Huang, Zeng & Chen (2007) show that online RSs are becoming more complex by using random graph modelling of sales transactions to uncover consumer behaviour patterns and outperform other representative collaborative filtering algorithms.

Bodapati (2008) uses a different, but no less sophisticated approach. He combines purchase data and recommendation response data and applies this to an e-commerce dataset. He devises a decision framework for recommendations that makes use of the distinction between awareness and satisfaction and outperforms the benchmark model.

The multimedia realm features research in, for example, movie or music recommenders. This field shows increased complexity as well, through the different data types used as well as more complex estimation models. Ansari et al. (2000) use five features to estimate a recommender for theatre releases and video rentals: expressed preferences or choices, preferences for product attributes, other people’s preferences, expert judgement and individual characteristics, combining these into a sophisticated model. They estimate the model using a regression-based Hierarchical Bayesian (HB) collaborative filtering model. Chung, Rust, & Wedel (2009) devise a recommender for digital audio players that produces real-time recommendations on large amounts of data. Even missing data is accounted for. Ying, Feinberg & Wedel (2006) build on their movie recommender work by including a comprehensive account of missing data. Julià, Sappa, Lumbreras, Serrat & López (2009) also account for missing data by using factorization and test their RSs on movies and books, among others.

2.2. News recommender systems


papers. This trend has continued since the beginning of the 21st century (Stempel III, Hargrove & Bernt, 2000). We now have the possibility to read news anywhere and everywhere: on our phones, laptops, tablets or even our smartwatches, and are not confined to newspapers only printed daily. News reading is shifting further and further towards an online activity (Kille et al. 2015).

The content of the news is also not the same as it was before. Hoffman (2006) found that online newspapers feature different content to attract a younger public. Also, online news is more audience-centered compared to the more conventional journalist-centered traditional newspapers (Boczkowski, 2004).

A couple of decades ago, news supply was rather limited and you had little to choose from. You would read the news that was dropped in your mailbox daily at 7 o’clock. Nowadays, news is offered to us all day, every day. It is offered in massive amounts via multiple channels, in which digital technology plays a big role. Therefore, also in the area of news, RSs have been developed.

News recommendation differs from product recommendation, such as movies and songs, in terms of users having preferences for latent concepts instead of actual items. For movies it is easier to know and specify what type of movie you like, but for news, this is harder to pin down. Moreover, news articles decrease in relevance as users become more aware of them, exhibit a much higher addition and deletion rate, and people seldom re-read articles, whereas e.g. movies are more often re-watched. News websites, contrary to for example Netflix or Amazon, seldom require users to log in to a profile before browsing the site, so users do not create consistent profiles. That is the reason why news websites often use session identifiers for profiling. However, these identifiers are often ambiguous or inadequate, due to cross-use of multiple devices, sharing computers, or blocking of monitoring such as cookies (Kille et al. 2015).

2.3. Simple methods

Simple methods are less advanced than information, collaborative or hybrid filtering. They rely on a straightforward way of ordering news articles and providing them as suggestions. Because these methods are based on algorithms that are intuitive and rather easy to grasp, simpler methods are also easier to implement (Kille et al., 2015).

Most popular methods rely on the assumption that if everyone likes the article, it is also relevant to the individual user.


Most recent methods take the variable recency, being creation time, as their core variable. Articles that are published last are most recent and will be featured on top of the list. The output is a list of articles sorted from most recent (high) to least recent (low). Considering that news is all about what is new, the intuition behind recency is not that bad.

Random methods present articles in a random fashion. There is no intuition behind it other than a random draw from the whole set of articles available. It produces a random list of articles until the specified number of articles is met. Being completely random, it possibly lacks relevance of articles. This is compensated to some extent, however, by the possibility of recommending surprising articles that are ‘out of the box’, being neither popular nor recent.

2.4. Collaborative filtering methods

Within collaborative filtering, a number of different methods exist. We can distinguish between memory-based and model-based methods (Kille et al., 2015). Memory-based methods use all available data for devising a model. These neighborhood methods aim to find relationships between items or users (Koren & Bell, 2015). A user model iterates over a list of other users and uses a predefined similarity function to find users that have similar preferences. In other words, it finds the users who are the nearest neighbors. An item model follows a similar approach, but instead finds the nearest neighbor, which in this case is an item.

Van Roy & Yan (2010) perform a comparative study within collaborative filtering methods and find that when it comes to robustness, linear and asymptotically linear algorithms are more robust to manipulation than nearest neighbor algorithms that are often used.


An example of a model-based method is the latent-factor model. MF has proven to be one of the most successful types of methods within collaborative filtering (Kille et al, 2015). The results from this method were found to be consistently better than results produced with a neighborhood model (Hu, Koren & Volinsky, 2008). The MF model projects both the items and the users in the same latent factor space and computes similarities between them. There is of course a history of modelling behind MF.

2.5. The history behind matrix decomposition

Having a high number of dimensions in your data makes it hard to grasp. In machine learning, it is not desirable to input a lot of features into the model, because that slows down computations. Hence, researchers refrain from feeding a large number of features into an algorithm. Also, data with a higher number of features or dimensions is believed to suffer from distorted distances. Therefore, researchers seek to reduce the dimensions in their data. Over the years, matrix decomposition techniques have evolved from Principal Component Analysis to complex forms of Matrix Factorization.

2.5.1. Principal Component Analysis

One way to reduce dimensions is by using Principal Component Analysis (PCA), originally described by Karl Pearson in 1901 (Tipping & Bishop, 1999). The idea is to transform raw data into a reduced dimensionality representation, by transforming a set of correlated variables into a set of orthogonal components. Tipping and Bishop (1999) suggest that PCA should be based on a probability model and update the original into a Probabilistic Principal Component Analysis (PPCA). Rice & Silverman (1991) and Silverman (1996) introduced smoothing of PCA coefficients and used regularization to penalize roughness on the eigenvectors. Bali, Boente, Tyler & Wang (2011) use a combination of different smoothing methods. Huang, Shen & Buja (2009) extend one-way PCA to two-way PCA, where both the rows and columns are structured, by using regularization on both the left and right singular vectors in the SVD.

2.5.2. Singular Value Decomposition


for items (left singular vectors), users (right singular vectors), and the rank (singular values), and taking the product of these matrices. By this simplification, the data is reduced (Feng & He, 2014). By applying weighted least squares the values can be found. According to Feng & He (2014), the main usage of SVD is to reduce data and SVD is commonly used in research to approximate data matrices with lower rank matrices. In their research, the authors note that the first singular value and vectors can be driven by a small number of outliers. Therefore, they propose a more robust alternative that moderates the effect of these outliers.

2.5.3. Different matrix factorization methods

A variety of different methods for MF have been developed and are described in the scientific literature. As explained before, the MF model projects items and the users in the matrix and calculates hidden factor scores to find similarities between users.

One of the first times factorization is used for a RS is in the work by Julià et al. (2009). Their adapted factorization outperforms SVD, giving better predictions while using less computational power. But as noted earlier, their main advantage is the ability to process missing data. Lee & Seung (1999) were the first to use non-negative matrix factorization (NMF) in their work on object recognition in the brain. By using non-negativity constraints, the algorithm is able to learn parts of faces and semantic features of text. It is different from PCA in that it learns a parts-based representation instead of a holistic representation.

Sun, Lebanon & Kidwell (2012) describe how NMF is used in recommenders by predicting ratings of items looking at the similarity between users. In this case NMF is used in a method similar to latent factor analysis, where the matrix is used to calculate and fill in the unobserved entries. The method of Sun et al. (2012) is different. Sun and co-authors constructed a full probabilistic model on preferences that is able to handle heterogeneous preference information, meaning that not all users have to provide an equal number of preference classes. This means that users do not have to provide scores for every item available, so incomplete lists can also be used. Johnson (2014) performs a more advanced type of MF called logistic MF, which models the probability that a user will like a specific item.


Chung & Rao find that the MF SVD++ model outperforms both the other CF models (neighborhood and baseline linear regression) and the attribute-based models from Ansari et al. (2002) and Ying et al. (2006). However, in the original dataset, their hybrid model with ‘virtual experts’ is able to outperform even the SVD++ model. Still, when applying the models to datasets varying in heterogeneity, results are mixed and SVD++ is able to outperform the hybrid model when the number of ratings is low. Nevertheless, Chung & Rao’s (2012) most important finding is that the performance of different models depends on the amount and type of data available. There is not one best model.

Cold-start problem

One of the main disadvantages of using MF, as well as other CF techniques, is that the model needs a sufficient amount of data to perform well. If there are no prior interactions between users and items, it is hard for the model to make appropriate recommendations. In the case of MF, this means new users, or new items without interactions, are hard to address properly (Hu et al., 2008; Garcin et al. 2012).

2.6. Matrix Factorization

The introduction section to this paper featured a special case in the RS industry. The Netflix prize gained a lot of attention in the academic world as well as in the industry and brought the two worlds together by competing in the same challenge. This challenge fueled the progress of recommender systems throughout the industry (Ricci, 2015). The challenge was set in 2006 and the first group of researchers who were able to outperform the old Netflix algorithm Cinematch would win a cash prize. The new model had to reduce the root mean squared error (RMSE) of the predicted ratings by 10 percent or more (Amatriain & Basilico, 2015).

It was the first time that the world of academia was presented the opportunity to work on a real life problem by using such an extensive dataset, containing 100 million movie ratings. This possibility attracted the attention of thousands of participants, ranging from students to industry professionals, creating a unique interaction between the academic and business worlds (Koren & Bell, 2015). The nature of the problem enabled thousands of different teams to focus on improving one single metric (Amatriain & Basilico, 2015).


that finally broke the 10 percent boundary and were awarded the 1 million dollar prize. However, using over 100 algorithms in a new recommender requires an enormous engineering effort to implement. So, selecting the most contributing methods and implementing those seems more appropriate. As mentioned before, one of the two methods in the ensemble that contributed most to its performance was the MF algorithm (Amatriain & Basilico, 2015).

2.7. Explicit feedback and implicit feedback

Feedback plays a vital part in many recommender systems (McGinty & Smyth, 2006). Explicit and implicit feedback are fairly different from each other and have different characteristics, but both types of feedback have great potential to increase recommendation quality. The two types of feedback provide a different degree of expressivity of liking (Jawaheer et al., 2010). Deciding what type of feedback should be used for the design of a recommender system depends on task, domain and users of the recommender (McGinty & Smyth, 2006).

Explicit feedback is feedback users actively choose to provide through a rating of some sort, using a mechanism to express interest in items (Jawaheer et al., 2010). Such mechanisms include star ratings and thumbs up or down buttons (Hu et al., 2008). People might be unwilling or unable to provide product ratings, thus much of the explicit feedback is ‘missing’ (Wedel & Kannan, 2016). Properly accounting for these missing values, among other things, increases recommendation quality substantially (Ying et al., 2006). Yet, explicit feedback is believed to be more accurate than its implicit counterpart. Also, explicit feedback can be positive and negative. Problems with attaining explicit feedback have prompted companies such as Amazon to use a different type of feedback, unobtrusively gathered from user behavior: implicit feedback (Wedel & Kannan, 2016).

Implicit feedback is gathered by a recommender without an actual choice of the user, by inferring preferences from observed user behavior.


Implicit feedback is inherently noisy, since we can only guess true intentions. It could be that the person who chose a product was dissatisfied or bought the product for someone else (Hu et al., 2008). Additionally, the link between user preference and implicit feedback is less clear than for the explicit kind (Peska & Vojtas, 2013). Luckily, the quality of modern methods using implicit feedback is comparable to those using explicit feedback (Lee & Brusilovsky, 2009). Both types of feedback have been studied to some extent in the literature. Lee (2009) discovered that the negative preferences that can be revealed in implicit feedback increase recommendation quality when it comes to job recommendation. Peska & Vojtas (2013) reach a similar conclusion on a travel agency dataset, where they discover negative implicit feedback to be of value for their recommender system. Jawaheer et al. (2010) conducted a study in which they compared the recommendation power of an implicit and an explicit dataset. Contrary to their prior beliefs, the two sets do not differ significantly in performance. Ilievski & Roy (2013) develop a select-watch-leave framework that incorporates implicit feedback in a news recommendation system. However, their system remains to be tested in an online environment. Of course, Hu et al. (2008) use implicit data to increase the performance of a normal MF model. Research on the subject has confirmed that implicit feedback can benefit a RS, especially if explicit feedback is not available.

2.8. Offline and Online experiments

To bring perspective to the second and third research questions of this thesis, concerning the offline and online performance of MF, offline and online testing will be addressed in the following section. Researchers agree that a combination of offline and online tests is the best method to evaluate the performance of a recommender system (ter Hoeve et al., 2018; Garcin, 2014). Offline experiments are used to test different algorithms against an existing public dataset, or a privately owned dataset (Ricci, 2015). Online experiments are usually conducted afterwards, to test the algorithms in real-time with actual user response in an online environment. Even though it is common practice to use a combination of offline and online experiments to test the performance of RSs, some RSs are only tested offline. Chung et al. (2009) conduct an offline experiment creating an adaptive personalization system for music playlists. They found that this personalization system leads to a greater number of songs listened to and to longer periods of listening. Julià et al. (2009) test their adapted factorization on offline recommender datasets on movies, books and jokes.


the model online (Amatriain & Basilico, 2015). Researchers at Blendle follow the same path (ter Hoeve et al., 2018). For the research in this thesis, the two-step procedure of first offline testing and then online testing is also followed.

In a two-step approach of designing a new recommender, offline experiments are often conducted first. Offline experiments are easier to conduct. What also makes offline experiments more appealing, is that they require no interaction with real users and thus enable comparison of performance of multiple algorithms at low costs. This can be used to tune the algorithm’s parameters and to reduce the total number of candidate algorithms (Gunawardana & Shani, 2015). The algorithms that perform best enter the next phase of online testing. In conclusion, offline tests are easier in terms of engineering involved and take less time to conduct: hours or days, versus weeks of an online A/B test (Amatriain & Basilico, 2015).

Online experiments are a precious resource. They enable accommodating recent events and provide user interaction via responses to tests in real-time. Online experiments come closest to the real world, enabling observations of actual interactions in real-time (Amatriain & Basilico, 2015).

2.10. Data and privacy

The more data can be gathered and combined into user profiles, the better personalization and recommendations can be made. However, users and consumers worry about the usage of their data with regard to their privacy. According to a survey, over three-quarters of consumers are not comfortable with the amount of data advertisers have on them, and consumers are even concerned websites do not adhere to the privacy laws in place (Wedel & Kannan, 2016). Those who are concerned have reasons to be concerned: Combining and fusing data from multiple sources generates a mosaic effect, creating user profiles that should have been private. Moreover, privacy laws have historically been lagging behind on technological advances in data collection (Wedel & Kannan, 2016).

However, two developments might decrease consumers' concerns. Governments are imposing more and stricter privacy laws, and businesses are more likely to govern themselves to regain trust and build stronger relationships with their customers (Wedel and Kannan, 2016).


3. Research Design

3.1. Data collection

The research was conducted at NDC Mediagroep. NDC Mediagroep is the publisher of the three biggest newspapers in the north of the Netherlands, as well as regional newspapers and weekly journals.

The research focuses on the biggest website of NDC Mediagroep, being ‘Dagblad van het Noorden’ (DvhN). Over the years, NDC Mediagroep has gathered a substantial amount of data on user behaviour via their websites, so a rather large dataset is available for analysis. However, I only use a selection of the full dataset to perform my offline analysis. Unfortunately, I cannot perform an analysis on the full set of data available, due to the vast amount of time required to pre-process the data appropriately. Using a sample allows researchers to have full control over the size, nature and completeness of the data in the sample (Wedel & Kannan, 2016). However, a sample being smaller than the full dataset also limits the ability to handle long-tail distributions and extreme observations. This could be problematic if I needed to explain rare events in the tail (Naik & Tsai, 2004), which is, however, not the case.

To obtain the sample dataset, a query is written in Splunk Query Language, which is a language similar to SQL. This query extracts a dataset from the online database of the DvhN website. Herein, we specify how big the dataset should be, based on the number of interactions over a number of days. For the offline evaluation, a training set is needed, as well as a validation set to analyze the models, and a test set. Regarding privacy, NDC ensures that the GDPR and other privacy rules and regulations are followed by implementing a cookie wall when users enter the website for the first time. The research in this thesis also aims to protect privacy. Therefore, the dataset was first anonymized to ensure no personal information could be linked to individual users.

3.2. Data description


enhances the user model that can be created, which in turn improves the performance of the recommender (Friedman, Knijnenburg, Vanhecke, Martens & Berkovsky, 2015). The dataset for designing and testing the recommender system will therefore be composed from the DvhN website alone.

3.3. Datasets

A total of three datasets is used for this research. This allows for the results of the offline evaluation to be benchmarked against different datasets, prior to online implementation of the model. However, only dataset one will be thoroughly described.

Dataset one consists of roughly one week of data containing the events on the webpage of DvhN from 12th of November 2018 12:00 until the 19th of November 2018 12:13. Dataset two consists of data from the webpage of DvhN from the 14th of October 2018 until the 20th of October 2018. Dataset three contains data from the webpage of DvhN from the 10th of December 2018 until the 10th of December 2018.

Dataset one has a total of 1,594,633 events, spread over three different types of events: pageviews, content views and clicks. The pageview event is created when a page is viewed. The content view event is created for every article when users scroll down on an article page and the ‘read more’ block is loaded. The click event is created when a user clicks on the ‘read more’ block. Every article has a ‘read more’ block on the bottom of the page, independent from the platform that people use to access the website. An example of the recommender box is shown below in figure 1. This figure shows the recommended six articles under an article regarding a fire in ‘Blijham’ from the 13th of January 2019.

A total of 22,355 unique articles were accessed in 643,137 different sessions by 345,105 unique users. The data also features columns with information on: the device type used (mobile, tablet or desktop), the article id, and the website URL.


Figure 1. Example of the ‘read more’ block

3.3.1. User data

Every time a new user enters the website of DvhN, a session-id is created and stored together with a unique user-id in a cookie. This cookie is sent to and stored in Splunk. Every time this user performs an action on the website, this event is stored under the session-id. When the same person enters the website again on a different occasion, a new session-id is created, but it is stored under the same user-id in his or her cookie. A session-id is terminated when no event has happened for 30 minutes. The cookie-id is deleted after two years of inactivity. There is no time count on a page, but this can be derived from the difference in time between events. The session-id contains information on what pages, or URLs, are visited and hence also what articles are read by the user.

3.3.2. Item data


Third, miscellaneous other reasons may hamper inclusion of an article reading in the database, such as the server being down, resulting in a number of article readings falling through the cracks.

3.3.3. Interaction

There is not a lot of information on items in Splunk, except for an article-id and the URL. The interaction between the user and the item is stored in Splunk. The cookie containing the user-id records what URLs are visited by the user. The URL and article-id allow us to stitch the two datasets together.

3.4. Data pre-processing

Before it can be entered into the MF model, three adjustments to the dataset have to be made. First, following a similar adjustment to the dataset as described by Hu et al. (2008), data of users having no clear preferences is dropped. Having only one interaction with an article does not provide enough insight into user preferences. Hence, data of users that only had one interaction is deleted. Further, articles that are read by a substantial share of the users may no longer be discriminating and lose predictive power. If everyone reads the article, one could argue that there is no clear user preference that can be inferred. Therefore, deleting items with more than 10,000 views is also considered.

Second, the dataset was converted into a sparse matrix where the rows correspond to the users and the columns to the items. The values are the implicit feedback scores as calculated by one of the implicit feedback variables above, multiplied by the constant alpha. Alpha places extra weight on the so-called non-zero entries – interactions that did occur in the dataset. This parameter will be further explained in section 3.6.2 concerning hyperparameters.
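As an illustration, a minimal sketch of this step is given below. It assumes the interactions sit in a pandas DataFrame with columns user_id, article_id and implicit_feedback; the column names and the alpha value are illustrative, not the actual Splunk field names or the tuned parameter.

```python
# Minimal sketch: build the sparse user-item matrix of implicit feedback
# scores weighted by the constant alpha (assumed column names).
import pandas as pd
from scipy.sparse import csr_matrix

def build_user_item_matrix(df: pd.DataFrame, alpha: float = 40.0) -> csr_matrix:
    # Map user and article ids to consecutive row and column indices.
    users = df["user_id"].astype("category").cat.codes.to_numpy()
    items = df["article_id"].astype("category").cat.codes.to_numpy()
    # Non-zero entries are the implicit feedback scores weighted by alpha;
    # interactions that never occurred are simply absent from the sparse matrix.
    values = df["implicit_feedback"].to_numpy() * alpha
    return csr_matrix((values, (users, items)),
                      shape=(users.max() + 1, items.max() + 1))
```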


Third, the dataset was split into the following datasets: the training set has 531,138 interactions, the validation set has 5,369 interactions, and the test set has 5,299 interactions. There was a possibility of more than 1.5 billion interactions (119,775 users * 13,144 articles). Out of these interactions only 0.03% occurred and turned out to be non-zero.

The other two datasets have been subjected to the same three alterations as mentioned in the section above. This resulted in separate datasets as described hereafter. For dataset two, the total dataset consisted of 500,059 interactions. The training set has 489,928 interactions, the validation set has 4,951 interactions, and the test set has 4,882 interactions. Dataset three consisted of 476,287 interactions. The training set has 466,571 interactions, the validation set has 4,713 interactions, and the test set has 4,643 interactions.

3.5. Variable creation

The following section features an explanation on the creation of the variables needed to utilize implicit feedback. This section aims to provide information on RQ1 on how implicit feedback can be captured in a variable in order to be used in a recommender.

To be able to use the implicit feedback as input for the model, a confidence level is needed. The confidence level consists of a combination of the total time spent on the article page, pagetime, as well as whether the user scrolled all the way down on the page, hitbottom. In order to normalize for the length of an article, another variable is needed that captures the number of words per article, or wordcount. An implicit feedback variable is created that captures this confidence level. In order to compose the implicit feedback variable, a number of additional variables must be created.

3.5.1. Page time variable

The website does not track the time a user is active on a webpage with an article. However, we can compute this variable by comparing the time between two subsequent events by one user and saving it into a ‘timedelta’ variable. If a user clicks on the webpage to enter a new page, he leaves the article page. By taking the difference in time between these two events, we create a variable for the viewing time on a webpage.


contentview event, the mean pagetime is used. The mean time spent on a page is computed at approximately 100.38 seconds. This value is imputed for all cases where the pagetime is missing or could not be computed. This resulted in a total of 174,572 imputations of the mean, or roughly 27.55% of the total rows, which is quite high.

Some pagetime values are quite long. Even for the long articles, it is unreasonable to think that people take over half an hour to read them. The largest article in the dataset has 3,965 words. Slow readers read approximately 200 words per minute, so they would need less than 20 minutes to read the article (Rayner, Slattery & Bélanger, 2010). Taking into consideration possible delay or distraction, to be on the safe side, we set the upper bound to 30 minutes, or 1,800 seconds. All values above 30 minutes are set to this maximum.
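A minimal sketch of this derivation is shown below, assuming an events DataFrame with a user_id column and a datetime timestamp column; session boundaries and event types are glossed over, and the column names are assumptions.

```python
# Minimal sketch: derive pagetime from the gap between consecutive events per
# user, impute missing values with the mean, and cap at 1,800 seconds.
import pandas as pd

def add_pagetime(events: pd.DataFrame, cap_seconds: float = 1800.0) -> pd.DataFrame:
    events = events.sort_values(["user_id", "timestamp"]).copy()
    # Time until the user's next event approximates the time spent on the page.
    gap = events.groupby("user_id")["timestamp"].diff(-1).abs()
    events["pagetime"] = gap.dt.total_seconds()
    # Impute missing pagetimes (e.g. a user's last event) with the mean value.
    events["pagetime"] = events["pagetime"].fillna(events["pagetime"].mean())
    # Cap implausibly long reading times at 30 minutes.
    events["pagetime"] = events["pagetime"].clip(upper=cap_seconds)
    return events
```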

3.5.2. Word count variable

In order to be able to link pagetime and the word count of an article to each other, a column for word count was created. Normally, Google Cloud Natural Language, a machine learning service from Google devised for text analysis, provides the word counts of articles. However, not all articles in the dataset were analyzed properly by this service. For these exceptions, the words in the article were counted separately, in the JSON file itself.

3.5.3. Hit bottom variable

The website does not track whether the user has hit the actual end of a page. This information is obtained by looking at a different tracker. Near the bottom of the article page there is a ‘read more’ content block. When the user scrolls down and the ‘read more’ block becomes visible, an event is created with the same session identifier. Therefore, we can use this event to discover whether the user has scrolled down to the bottom or not. The variable is binary, and for the sessions with the same article id the hitbottom value is set to 1.
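The sketch below illustrates one way to derive this variable, assuming the events carry session_id, article_id and event_type columns in which 'contentview' marks the ‘read more’ block becoming visible; these names are assumptions.

```python
# Minimal sketch: flag session/article combinations for which the 'read more'
# block was loaded, i.e. the user scrolled to the bottom of the article.
import pandas as pd

def add_hitbottom(events: pd.DataFrame) -> pd.DataFrame:
    # Session/article combinations with a recorded contentview event.
    reached = (events.loc[events["event_type"] == "contentview",
                          ["session_id", "article_id"]]
               .drop_duplicates()
               .assign(hitbottom=1))
    events = events.merge(reached, on=["session_id", "article_id"], how="left")
    events["hitbottom"] = events["hitbottom"].fillna(0).astype(int)
    return events
```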

3.5.4. Implicit feedback variable

The implicit feedback variable consists of all information gathered from an interaction between a user and an item that displays to what extent the user likes the item. It measures the engagement the user has with an item. This research uses pagetime, hitbottom and wordcount in different combinations to construct the implicit feedback variable.


that has to be attributed to hitbottom has to be tested. Also, if a user spends a large amount of time on a short article, it should have a higher score than if he spends the same amount of time on a longer article. Therefore, it is reasonable to divide the pagetime by the wordcount, in order to normalize the pagetime score. However, extensive testing must determine whether this actually increases model performance.

Overall, this means there are numerous combinations to be tested. In order to discover which version of implicit feedback performs best on the evaluation criteria, the following options were considered (a sketch of candidate scores follows the list):

• Including hitbottom versus not including hitbottom
• Attributing different weights to hitbottom:
  o Different versions of pagetime + hitbottom
  o Different versions of pagetime * hitbottom
• Whether to divide pagetime by wordcount
• Normalizing implicit feedback to fit in the 0 to 1 range
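The sketch below shows a few of these candidate scores in code; the hitbottom weight w, the additive form and the 0-1 normalization are assumptions that the offline evaluation has to settle, not fixed definitions.

```python
# Minimal sketch: candidate implicit feedback scores combining pagetime,
# wordcount and hitbottom (assumed column names).
import pandas as pd

def implicit_feedback(df: pd.DataFrame, w: float = 1.0, normalise: bool = True) -> pd.Series:
    # Reading intensity: time on page per word, so short and long articles are comparable.
    intensity = df["pagetime"] / df["wordcount"].clip(lower=1)
    # Additive variant; a multiplicative one would be intensity * (1 + w * hitbottom).
    score = intensity + w * df["hitbottom"]
    if normalise:
        # Rescale to the 0-1 range.
        score = (score - score.min()) / (score.max() - score.min())
    return score
```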

3.6. Methodology

3.6.1. The MF algorithm

MF uses latent factors to compute similarities between users. These latent factors are hidden features of an item that MF discovers by projecting items and the users in the same hidden factor space (Kille et al., 2015). Hence, MF requires a set of users and items and the known interactions between the two. Hu et al. (2008) explain MF as a cost function where the rating of user u for item i can be explained by a user factor of u transposed times an item factor of i. For explicit datasets, the ratings are the interaction between user u and item i. In other words, how user u has rated item i.

$p_u$ = user factor of user $u$
$q_i$ = item factor of item $i$
$r_{ui}$ = rating by $u$ for $i$

$r_{ui} = p_u^T q_i$  (eq. 1)


MF places all users and items in a matrix, with users as rows and items as columns. Every interaction or rating $r_{ui}$ a user has with an item will result in a score in the matrix. The matrix will have a lot of zero entries, since a lot of interactions did not happen. Therefore, the matrix is converted into a so-called sparse matrix, which only stores or ‘remembers’ the displayed ratings, and not the zero entries. These ratings are then used as an input for the model to calculate factor scores for users and items.

For implicit datasets, however, an alteration of this model is devised, because there are no explicit ratings that can be entered. Hu et al. (2008) state that the preference of user u for item i is not as simple as a rating score. If someone did not watch a movie, preference was believed to be 0, and if someone watched a movie, preference was set to 1. However, there can be a number of reasons for not watching movie i other than not liking it. A user can be unaware of, or unable to consume, item i. It could also be that a person accidentally clicked on a movie, even though he did not like it; the user then exited the movie almost immediately. Therefore, a better way of dealing with preference is a confidence level. As the confidence grows, it becomes closer to 1, because we can be more sure that a user liked that item.

Therefore, they transform the rating $r_{ui}$ into a confidence score as shown in equation 2.

$c_{ui} = 1 + \alpha r_{ui}$  (eq. 2)

The confidence is expressed by a set of variables $c_{ui}$, which measure our confidence in observing $p_{ui}$. For the case of news recommendation, within this confidence we can include implicit data such as scroll length and time on page and combine this with whether the article was viewed.

Hu et al. (2008) state that the most plausible choice for this $c_{ui}$ is the one stated in equation 2, but they also propose different ways of capturing the confidence score, such as using a logged version of $c_{ui}$.
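For illustration, the two confidence transformations can be written as follows; the values of alpha and the scaling constant eps are placeholders to be tuned, not values taken from the thesis.

```python
# Minimal sketch: linear confidence of equation 2 and the logged variant
# mentioned by Hu et al. (2008).
import numpy as np

def linear_confidence(r_ui, alpha=40.0):
    return 1.0 + alpha * r_ui                       # eq. 2

def log_confidence(r_ui, alpha=40.0, eps=1e-8):
    return 1.0 + alpha * np.log(1.0 + r_ui / eps)   # logged alternative
```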

MF aims to split up the user-item matrix into two separate matrices: one for users and one for items. When taking the inner product of these factor scores, a preference score $p_{ui}$ arises.

$p_{ui} = x_u^T y_i$  (eq. 3)


The factor scores are found by minimizing the squared difference between predicted values and observed values. This can be achieved by repeatedly running different versions of the model, comparing predicted values with observed values.

$\min_{x_*, y_*} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$  (eq. 4)

The first term is the weighted prediction error that is minimized; the second term is the regularization.

A regularization term λ is added in the second part of equation 4 to prevent overfitting. Overfitting is when a model starts to fit noise. With too many degrees of freedom, the model adapts itself too much to the training data set. If the model fits too well to the training data set, it fails to perform well on the validation set. The regularization term allows the model to be a rich model with a lot of modelling power when there is sufficient data. Where the data is sparse and there are only a few data points, the model remains simple. Hu et al. (2008) argue that direct optimization such as stochastic gradient descent cannot be used, due to computational problems. Therefore, they reduce their original model into a linear model that predicts preferences as a linear function of past actions ($r_{uj} > 0$), weighted by item-item similarity.
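Hu et al. (2008) minimize equation 4 with alternating least squares, alternating closed-form updates of the user and item factors. The sketch below is a clarity-first illustration of that procedure, assuming R holds the raw implicit feedback scores; it densifies the matrix and loops in Python, so it only suits toy-sized data. In practice an optimized implementation (for example the open-source implicit package) would be used.

```python
# Minimal sketch of the implicit-feedback ALS of Hu et al. (2008) for eq. 4.
import numpy as np
from scipy.sparse import csr_matrix

def implicit_als(R: csr_matrix, factors=50, reg=0.01, alpha=40.0, iterations=15, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = 0.01 * rng.standard_normal((n_users, factors))   # user factors x_u
    Y = 0.01 * rng.standard_normal((n_items, factors))   # item factors y_i
    C = alpha * R.toarray()                               # extra confidence, c_ui = 1 + C
    P = (C > 0).astype(float)                             # binary preferences p_ui
    I = np.eye(factors)
    for _ in range(iterations):
        # Fix the item factors and solve for the users, then the other way around.
        for U, V, Cm, Pm in ((X, Y, C, P), (Y, X, C.T, P.T)):
            VtV = V.T @ V
            for u in range(U.shape[0]):
                cu = Cm[u]                                 # extra confidence for row u
                A = VtV + (V.T * cu) @ V + reg * I         # Y^T C^u Y + lambda * I
                b = (V.T * (1.0 + cu)) @ Pm[u]             # Y^T C^u p(u)
                U[u] = np.linalg.solve(A, b)
    return X, Y
```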

3.6.2. Hyperparameters

A number of variables are included in the MF algorithm that need to be tuned: regularization, alpha and confidence and the number of factors.

Regularization lambda λ

In order to prevent overfitting, regularization is added to the algorithm. Hu et al. (2008) found that results without regularization (λ = 0) were better compared to a popularity model. The ideal value for λ is different for every dataset. Therefore, a grid search will be used to detect which value performs best for the dataset, varying the value of lambda from 0.0001 up to 100. A table of the different values of λ that are tested can be found in appendix A.

Alpha and confidence α


Alpha places more weight on the confidence values that are non-zero than on those that are zero. In other words, interactions that have an implicit feedback score will be attributed more importance than those that lack the implicit feedback scores, where no interaction between the user and the article has taken place. Again, a grid search will be used to detect which value performs best for the dataset. For α we use values ranging from 0.001 up to 100; the step sizes are specified in appendix A.

Factors

The number of user factors and item factors has a big influence on the performance of the model. Using more factors takes more time to calculate, but probably also results in better performance (Hu et al., 2008). Also, when using more factors, more training data is required as well as more iterations, which will also require more time. Therefore, we must find a number that is acceptable in terms of computation time as well as performance. Using more factors makes the algorithm slower and will require additional iterations and more data to train on. Hu et al. (2008) tried different factors ranging from 50 to 100. We can use the timeit function in Python to discover the amount of time a specific task takes. Grid search will reveal what number of factors is appropriate, without the calculations taking too much time. Appendix A features the factors that were used.

Iterations

According to Johnson (2014), MF does not always deliver the same results. Two MF calculations with the same hyperparameters do not necessarily produce the same factor scores every time. Since the model learns the factor scores from scratch, it might find other latent factors and factor scores. Therefore, the MF computation must be repeated and the results averaged to rule out any outliers performing exceptionally well or exceptionally poorly. Looking at previous work and taking into consideration time constraints, 15 runs were chosen to be performed for each set of parameters.
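A minimal sketch of this grid search is given below; the candidate values are placeholders for the grids listed in appendix A, implicit_als refers to the ALS sketch above, evaluate_mpr stands for the offline evaluation of section 3.6.4, and the 15 repeated runs per configuration are omitted for brevity.

```python
# Minimal sketch: grid search over lambda, alpha and the number of factors,
# keeping the configuration with the lowest MPR on the validation set.
from itertools import product

LAMBDAS = [0.0001, 0.01, 1, 10, 100]      # placeholder grid
ALPHAS = [0.001, 0.1, 1, 10, 100]         # placeholder grid
FACTORS = [20, 50, 100]                   # placeholder grid

def grid_search(train, validation, evaluate_mpr):
    best = None
    for reg, alpha, n_factors in product(LAMBDAS, ALPHAS, FACTORS):
        X, Y = implicit_als(train, factors=n_factors, reg=reg, alpha=alpha)
        mpr = evaluate_mpr(X, Y, validation)
        # Lower MPR is better, so keep the configuration with the smallest value.
        if best is None or mpr < best[0]:
            best = (mpr, (reg, alpha, n_factors))
    return best
```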

3.6.3. Implementation of the algorithm


the specified amount of factors. Fourth, the inner product of these matrices is calculated to generate a list for every user. Fifth, the performance can be evaluated by comparing the results with specified evaluation criteria.

Different versions of the algorithm, with different values for the hyperparameters, can be tested against each other.

3.6.4. Offline database testing of MF

Recommendation quality depends on the fit between the predicted behaviour or preferences of users and their actual displayed preferences (Kille et al. 2015). The users’ preferences can be discovered in a dataset with user-item interactions. Since we have the possibility to test the algorithms against our own collected dataset, this is preferred over testing against an open dataset. Hence, the offline testing will be performed on the dataset as described in chapter 3.

Offline evaluation allows testing against interactions, thereby comparing the performance of the different variations of the MF algorithm obtained by altering the weights and parameters. The articles suggested by the different algorithms are compared to the actual articles read by the users in the dataset. In order to evaluate the MF model, a validation and a test set are created. The performance of the model is assessed by comparing the MF model to benchmark models consisting of a random and a popular model, as featured in section 2.3. The comparison is based on the following evaluation criteria: ‘Mean Percentile Ranking’ and ‘Precision at K’.

3.6.4.1. Mean Percentile Ranking


An MPR of 0% means that there are no articles higher on the ranking, making it the most preferred article. An MPR of 100% means that there are no articles lower on the ranking, making it the least preferred article. For random predictions, the MPR is expected to be 50%, so when the rank of the model approximates 50%, it is not performing better than random.
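A minimal sketch of the MPR computation, following the rank-weighted definition of Hu et al. (2008), is shown below; it assumes a dense matrix of predicted scores (users by items) and a list of held-out (user, item, r_ui) interactions, both of which are stand-ins for the thesis's actual data structures.

```python
# Minimal sketch: Mean Percentile Ranking over held-out interactions,
# where 0% means the consumed article was ranked first for that user.
import numpy as np

def mean_percentile_ranking(scores: np.ndarray, heldout) -> float:
    n_items = scores.shape[1]
    weighted_rank, total_weight = 0.0, 0.0
    for user, item, r_ui in heldout:
        ranking = np.argsort(-scores[user])                  # best-scored item first
        rank_pct = np.where(ranking == item)[0][0] / (n_items - 1)
        weighted_rank += r_ui * rank_pct
        total_weight += r_ui
    return 100.0 * weighted_rank / total_weight
```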

3.6.4.2. Precision at K

Precision at K (P@K), also known as precision at N, consists of the number of articles that were predicted correctly in the top-K recommendations. Variable K represents the number of items to consider in the top-K set. The researcher is able to set the K value, but usually this is set at K = 10, being the top-ten set. There are hundreds of relevant articles for users, but users will not want to see all of them. Knowing that there are only a limited number of articles that can be presented in the ‘read more’ block, it is more appropriate to look only at the top of the recommended set than at the total recommended set (Gunawardana & Shani, 2015). Therefore, P@K is an evaluation criterion that is applicable in this case. It consists of the proportion of used items in the top-ten (K = 10) recommended set: in other words, the recommended and used items divided by the total recommended items (Gunawardana & Shani, 2015). The value of P@K lies between 0 and 1. A perfect score is 1.0, where all items were predicted correctly, whereas a score of 0 indicates that no items were predicted correctly.
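A minimal sketch of P@K for a single user is shown below; scores and read_items are assumed inputs, namely the predicted score matrix and the set of article indices the user actually read in the test set.

```python
# Minimal sketch: Precision at K as the share of the top-K recommended
# articles that the user actually read.
import numpy as np

def precision_at_k(scores: np.ndarray, read_items: set, user: int, k: int = 10) -> float:
    top_k = np.argsort(-scores[user])[:k]      # indices of the K highest-scored articles
    return len(set(top_k) & read_items) / k
```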

Two remarks are made on the use of P@K as an evaluation metric. First, P@K is able to conveniently illustrate to what extent the predictions match the actual article preferences of the user. However, it does not capture other related quality features, such as the diversity of the recommended articles. Hu et al. (2008) note that the goal of RSs is not to predict future behavior, but to direct users to items they would not have considered otherwise. Unfortunately, P@K is not able to capture this. Second, P@K seems less appropriate if there is a substantial number of users in the dataset who have read fewer than ten articles. For these users, it is impossible to acquire the perfect score of 1.0. Hence, this metric would benefit from correcting for these sporadic readers, by adjusting the score based on the number of articles read by users.

3.6.5. Online Real-time testing of MF

3.6.5.1. Online Evaluation criteria

The recommendation quality depends on the fit between the predicted behaviour or preferences of a user and their actual, displayed preferences (Kille et al., 2015). This can be evaluated by examining whether the reading behaviour of the tested sample changes when exposed to a condition (Ter Hoeve et al., 2018). Performance metrics are an appropriate way of measuring this change in behaviour. Throughout recommender academia and industry research, one metric stands out: Click-Through Rate (CTR). Apart from CTR, total page views is also considered an appropriate metric, and by some authors even a more appropriate one (Berman et al., 2018; Garcin et al., 2014).

Click-Through Rate

Click-Through Rate (CTR), or clicks, consists of the percentage of users that click on the webpage variation (Berman et al., 2018); in this research, this means clicking on the 'read more' block with recommended articles. According to Garcin et al. (2014), CTR is the industry-standard evaluation metric, as it is correlated with the advertising revenues generated on news websites. Several researchers have used CTR to evaluate online performance, among others Liu et al. (2010), Okura et al. (2017) and Peska & Vojtas (2013). However, the usage of CTR is not undisputed. Multiple authors have criticized the metric, suggesting it is not appropriate after all (see Zheng, Wang, Zhang, Li & Yang, 2010; Garcin et al., 2014). They argue that the recommended set may contain items that are clicked on but would have been picked anyhow, for example due to their popularity; such popular items generate a click-through regardless of whether they were recommended. Both studies confirm this effect (Garcin et al., 2014; Zheng et al., 2010). Therefore, looking at CTR alone is not deemed reliable.
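For reference, CTR can be computed directly from the logged events. The event fields used below (a condition label and flags for whether the 'read more' block was shown and clicked) are assumptions for illustration.

def click_through_rate(events, condition):
    # events: list of dicts with 'condition', 'block_shown' and 'block_clicked' fields
    shown = [e for e in events if e['condition'] == condition and e['block_shown']]
    clicked = [e for e in shown if e['block_clicked']]
    return len(clicked) / len(shown) if shown else 0.0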

Total article views per user

Apart from CTR, the total number of article pages viewed per user is also considered an appropriate metric (Berman et al., 2018), and by some authors even a more appropriate one (Garcin et al., 2014).

The difference in article views per user between the two conditions is assessed with Welch's two-sample t-test (Welch, 1951). Some authors favor it over other t-tests (Delacre, Lakens, & Leys, 2017). This test requires the mean, standard deviation and sample size of the two conditions. Two hypotheses are formulated to capture the aim of this test:

H0: The mean of article views per user of the baseline model is equal to that of the MF hybrid model.

H1: The mean of article views per user of the MF hybrid model is significantly different from that of the baseline model.

This test will be conducted as a two-tailed test, since I am not only interested in discovering whether the new model is able to outperform the old model, but also the other way around. In other words, the mean article views per user of the MF hybrid are suspected to differ from the mean article views per user of the old DvhN model.
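Assuming the article views per user are collected per condition, the two-tailed Welch's test can be carried out with scipy; the array names below are placeholders.

from scipy import stats

def welch_test(views_a, views_b):
    # views_a: article views per user under the baseline condition (A)
    # views_b: article views per user under the MF hybrid condition (B)
    # equal_var=False turns the independent-samples t-test into Welch's test (two-sided)
    t_stat, p_value = stats.ttest_ind(views_a, views_b, equal_var=False)
    return t_stat, p_value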

3.6.5.2. A/B-testing

The online testing is conducted by means of an A/B test. This is an online experiment in which two different versions of a website are tested against each other, while users are assigned randomly to these two versions (Berman, Pekelis, Scott & Van den Bulte, 2018). The sample is split into two groups: one part of the sample is attributed to condition A, a benchmark model, while the other part is attributed to condition B, the new MF model. I will only use a subset of 50% of the total amount of traffic, in order not to interfere too much with the website. A short pre-test on 10% of the traffic is performed to ensure that the model works and all data is gathered properly. Then, the differences in performance between the conditions are evaluated using the online evaluation criteria described in section 3.6.5.1. The expected change concerns the clicking behaviour of users and whether they use the 'read more' block.
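The random assignment can, for instance, be implemented with a deterministic hash of the user id, so that a returning user always sees the same condition. The 50% traffic share and the bucket logic below are a sketch of such a mechanism, not the actual implementation used on the website.

import hashlib

def assign_condition(user_id: str, traffic_share: float = 0.5):
    # map the user id to a stable bucket between 0 and 99
    bucket = int(hashlib.md5(user_id.encode('utf-8')).hexdigest(), 16) % 100
    if bucket >= traffic_share * 100:
        return None                                   # user is not part of the experiment
    # split the experimental traffic evenly over the two conditions
    return 'A' if bucket < traffic_share * 100 / 2 else 'B'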

3.6.5.3. Two conditions

For the A/B test, two conditions are formulated. Part of the users will be assigned to condition A, whereas the other part will be assigned to condition B. For the experiment, one condition must be set as a baseline, to which the other condition is compared (Berman et al., 2018).

Condition A

Condition A consists of the current 'read more' recommender on the DvhN website, which serves as the baseline. The author of an article has the possibility to select some articles he deems relevant to be put in the 'read more' box. If he enters fewer than six articles, the rest is filled with articles that fall under the same header on the website, such as Groningen, Drenthe or Economy. If the author enters no articles at all, the whole box is filled with these header articles.

Condition B

For the second condition, we use the version of the MF model that performed best on the offline evaluation criteria: the model with an alpha of 22, a regularization (lambda) of 144 and a total of 250 factors.

We follow an approach similar to Garcin et al. (2012) to tackle the cold-start problem. For new users, the model has not yet obtained interaction information on which to base the recommendations. Therefore, for these users, a different recommender is initiated that recommends the most popular articles of the dataset. Hereby the online MF model is transformed into a hybrid model, as explained in section 1.1. When a user has interacted with only one article, this user is also considered a new user and the popular model is initiated. Further, a fallback is implemented to prevent the user from not receiving any recommendations at all: if the calculations take too long, or the recommender does not provide articles, for instance because the server is down, the popular model is initiated as well. When either of these happens, an event is created to track which model is used, the MF or the popular model. In that way, the division between the MF and the popular model can be accounted for in the results.
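The serving logic for condition B can be summarised as follows. The object and function names (mf_model, popular_articles, log_event) are hypothetical and only illustrate the fallback behaviour described above.

def log_event(user_id, model_used):
    # placeholder for the event that tracks which model served the recommendations
    print(f"user={user_id} served_by={model_used}")

def recommend_condition_b(user_id, mf_model, popular_articles, history, n=6):
    # new users, or users with only a single interaction, get the popular model
    if len(history.get(user_id, [])) <= 1:
        log_event(user_id, 'popular')
        return popular_articles[:n]
    try:
        recs = mf_model.recommend(user_id, n)   # may raise if the server is slow or down
        log_event(user_id, 'mf')
        return recs
    except Exception:
        log_event(user_id, 'popular')           # fallback so the block is never empty
        return popular_articles[:n]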

3.6.5.4. Hypotheses for A/B-test

It is common practice to determine the success of an A/B test by the change in conversion rates (Berman et al., 2018). In the case of this research, the conversion consists of the CTR. Two hypotheses are formulated in order to assess the performance. The goal of the test is to discover whether the new recommender model is able to outperform the old model currently active on the website of DvhN; in other words, to what extent the MF hybrid is able to outperform the baseline model.

H0: The CTR of the 'read more' block under the baseline model is equal to that under the MF hybrid model.

H1: The CTR of the 'read more' block under the MF hybrid model is significantly different from that under the baseline model.

3.6.5.5. Sample size and test duration

It is risky to run the A/B test until the required significance is met. This form of p-hacking leads to false conclusions about significance and results (Berman et al., 2018). A/B testing relies on statistical principles, which prescribe a minimum sample size (Fung, 2014).

To calculate the required sample size, the power rule is applied, which requires an estimate of the expected change. The expected change is estimated based on three sources: the results of the offline testing of the algorithms, a search of the existing literature on RSs tested online, and a comparison to the results of a previous online test carried out on the website of DvhN.

Garcin et al. (2014) report different CTRs for models tested online. The popular models result in an increase in CTR of 4-7% for different visit lengths. The random model's increase in CTR is 7.7% for short visits, 8.9% for medium-length visits, and 9.7% for longer visits. The authors tested a Context Tree model against these baselines, which adapts to current trends and user preferences; for this model they report an increase in CTR of 10.8% for medium-length visits and 13.1% for long visits. In a previous test on the DvhN website, the original recommender scored a conversion, or CTR, of 13.12%, whereas the topic-sensitive PageRank scored 20.03% without weighted articles and 22.27% with weighted articles.

If 13.12% is considered the benchmark CTR, the expected CTR of the MF hybrid will therefore lie between 13% and 20%. Comparing the unique user IDs of the different datasets has shown that a substantial share of users is new every time, which entails that the popular model will be initiated often. Garcin et al. (2014) observe an increase in CTR of at most 7% for the popular model. Considering the performance of MF in the offline evaluation, I refrain from overestimating the power of the MF model. Hence, I reckon the MF part of the hybrid is able to increase the CTR further to around 15%, and I choose an expected change in CTR of 14.9%.

Once the test is running, the significance should not be checked repeatedly in the meantime. After all, if the H0 is true, it is still possible that at some stage the criterion seems significant, leading to wrong conclusions about significance (Armitage, McPherson & Rowe, 1969).

With a significance level of 5% and a power threshold of 80%, this requires 5,000 article views per condition. Knowing that condition B sometimes uses the popular model instead of the MF model, an extra margin is added to the required sample size to be certain. Therefore, I estimate the required sample size to be at least 5,500 article views per condition; 5,500 times the 'read more' block has to be loaded on an article page, displaying the recommended set of articles. The 'read more' block does not have to be loaded in order to assess the total article views per user. Therefore, the dataset used to evaluate this metric will likely be larger than the 5,500 article views specified for evaluating CTR.
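For transparency, a common normal-approximation form of the power rule for comparing two proportions is sketched below. The exact outcome depends on the formulation chosen (pooled versus unpooled variance, one- versus two-sided), so the sketch only illustrates the order of magnitude of the required sample rather than reproducing the reported figure exactly.

from scipy.stats import norm

p_baseline, p_expected = 0.1312, 0.149          # benchmark CTR and expected CTR
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)               # two-sided significance threshold
z_power = norm.ppf(power)
variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
n_per_condition = (z_alpha + z_power) ** 2 * variance / (p_baseline - p_expected) ** 2
print(round(n_per_condition))                   # required article views per condition under this formulation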

Looking at normal traffic over multiple weeks, I expect the test to run for approximately two days until the specified number of article views has been subjected to the conditions. Unfortunately, it is not possible to use a stratified sampling method similar to that of Ter Hoeve et al. (2018) to ensure that users displaying different types of reading behaviour are divided equally over the two conditions. However, after the test is finished, the details of the users subjected to the test can be inspected to check how the platforms used were divided over the conditions.

4. Results

This chapter presents the output of the multiple variations of the algorithm and its parameters. The first section features the offline evaluation. Here, the best-performing version of the implicit feedback variable is presented, as well as the optimal values for alpha, lambda and the number of factors. These optimal values are accompanied by the model's scores on P@K and MPR, and the corresponding MPR graph. The model is compared to a benchmark of a random and a popular model.

The second section features the results of the online evaluation. Here, the results of the online A/B test are interpreted, based on the online evaluation criteria of CTR and total article pages viewed per user.

4.1. Offline evaluation results

4.1.1. Implicit feedback variable
