
A user-centric evaluation of the Netflix recommender system

User experience beyond algorithm accuracy

Abstract

This study focuses on the user experience of recommender systems. User experience can be defined as how a user evaluates the interaction with a system or service. Recommender systems incorporate machine learning algorithms in order to provide recommendations from a large set of items. As a result, they are often evaluated by their objective recommendation accuracy. However, research on the topic states that user-centric subjective evaluations should always be considered as well. In this study, the user experience of a real-world recommender is analyzed: the Netflix platform. The goal of this study is to characterize Netflix's view on user experience and to compare it to the perspective of actual Netflix users. This is achieved by analyzing the Netflix recommender system, using existing literature on user experience in recommender systems, and by executing a controlled experiment among a selection of Netflix users. The controlled experiment features a task scenario for the participants, followed by a user experience survey. The results from this study provide insight into how the Netflix platform is evaluated by its users, what they believe to be important in their user experience, and how this compares to the vision of Netflix.

Robin Schouten 10743294

Supervisor: dr. J.A.C. Sandberg

Second examiner: drs. A.L. van Pappelendam

Bachelor thesis Information Science

(2)

Table of contents

1. Introduction
2. Theoretical framework
   2.1 User experience in recommender systems
   2.2 A user-centric framework for the evaluation of recommender systems
   2.3 Netflix's view on user experience
       2.3.1 Netflix's focus on algorithm accuracy
       2.3.2 The Netflix recommender system
   2.4 Hypotheses/predictions
3. Methodology
   3.1 Participants
   3.2 Task scenario
   3.3 User experience survey
   3.4 Procedure
4. Results
5. Conclusion and discussion
6. Limitations and further research
   6.1 Limitations
   6.2 Further research
References


1. Introduction

We live in a time where recommender systems are found everywhere, from social media to e-commerce to entertainment platforms. More than ever, people want fewer, but more personalized options, in order to save time and make better choices. Recommender systems help people achieve this by providing personalized recommendations from a large catalog of items, generated by intricate machine learning algorithms. For some time now, developers of recommender systems have made an effort to design increasingly accurate algorithms in order to provide users with recommendations that are ever more tailored to their preferences. As a result, it is often assumed that more accurate recommendations directly lead to a better user experience. In order to further explore this assumption, this study focuses on the user experience of a real-world recommender system: Netflix.

Within the research topic of the user experience of recommender systems, there has been a paradigm shift from mainly considering algorithm accuracy to acknowledging that user experience is a more complex concept. Users want more than just a precise representation of their preferences in their recommendations (Ricci, Rokach, & Shapira, 2015). Consequently, evaluating user experience is a challenging effort that requires a system incorporating all its relevant aspects (Konstan & Riedl, 2012).

Several researchers have made an effort to explore the dynamics of user experience in recommender systems. Knijnenburg, Willemsen, & Kobsa (2011) have created a pragmatic framework for the evaluation of recommender systems that goes beyond algorithm accuracy and puts an emphasis on the users' self-reported user experience. A similar framework has been designed by Pu, Chen, & Hu (2011), which also considers user experience in recommender systems as multidimensional, but lacks the inclusion of personal and situational characteristics in comparison to the aforementioned framework. Both frameworks provide methodologies for evaluating recommender systems from the perspective of the user.

The academic relevance of this study is that it applies a conceptual recommender system evaluation framework to a real-world recommender system (Netflix). The practical relevance of this study is that it may provide valuable findings for Netflix as a company by evaluating the user experience of its recommender system in a user-centric manner, going beyond algorithm accuracy.

The main goal is to characterize Netflix's view on user experience and to compare it to the experience of actual users of the platform. This is done by analyzing the Netflix recommender system using available literature and evaluating its user experience through a controlled experiment.

For this study, the following research questions have been formulated:

- How can Netflix's view on user experience be characterized?
- How do Netflix users evaluate the Netflix recommender system?
- How does Netflix's view on user experience compare to the experience of Netflix users?

The first question is answered in the theoretical framework. The second question is answered partly in the theoretical framework and partly through the results of a controlled experiment. After analyzing the results of the experiment, the final question can be answered.


2. Theoretical framework

2.1 User experience in recommender systems

A recommender system is a system designed to provide personalized recommendations to a user based on machine learning algorithms. In short, higher algorithm accuracy leads to more accurate recommendations. The purpose of a recommender is to help a user make better, faster and more relevant choices from a large set of items, such as a web shop product catalog or a movie database. Without a recommender, all users would be presented with the same catalog items. Most of the time the recommendations feature some form of personalization, often based on a user's previous interaction with the platform and the choices he or she has made in the past (Ricci et al., 2015). The moment a person starts using a platform with a recommender, a personal profile is built up, which keeps track of interaction and choice behavior. At the creation of the profile, there obviously is no past data available to base recommendations on. This is called the cold start problem, which platforms often try to mitigate by generalizing recommendations at first and steadily introducing more personalization as more user data becomes available (Chang, Harper, & Terveen, 2015).
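To make this mitigation strategy concrete, the following minimal sketch (the names, scores and ramp length are assumptions for illustration, not any platform's actual implementation) blends a generic popularity score with a personalized score, shifting weight toward personalization as the profile accumulates interactions:

```python
def blended_score(popularity, personalized, n_interactions, ramp=50):
    """Blend a generic popularity score with a personalized score.

    A brand-new profile (n_interactions == 0) is served pure popularity;
    past `ramp` interactions the weight shifts fully to personalization.
    The ramp length is an arbitrary choice for this sketch.
    """
    w = min(n_interactions / ramp, 1.0)
    return (1 - w) * popularity + w * personalized

print(blended_score(0.8, 0.2, n_interactions=0))   # 0.8 (cold start)
print(blended_score(0.8, 0.2, n_interactions=50))  # 0.2 (mature profile)
```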

In order to provide users with relevant, personalized recommendations, their preferences thus first have to be characterized. Platforms achieve this by implementing preference elicitation, a method of evoking feedback data from users. This can be done in two ways: implicitly and explicitly (Rashid, Karypis, & Riedl, 2008). Implicit feedback means that the system gathers behavioral user data, such as viewing history and clicking patterns, to estimate the user's preferences. Explicit feedback relies on users to express their preferences through interaction with a feedback system, by leaving reviews or ratings on the content of the recommender. The recommender system continually learns from feedback data and can make increasingly accurate predictions because of it.
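A minimal sketch of how implicit and explicit feedback might be combined into a preference profile (the event format, genres and weighting are assumptions of this sketch, not Netflix's actual method):

```python
from collections import defaultdict

def build_profile(implicit_events, explicit_ratings, implicit_weight=0.5):
    """Combine implicit and explicit feedback into per-genre preference scores.

    implicit_events:  (genre, fraction_watched) pairs, fractions in 0..1
    explicit_ratings: (genre, rating) pairs on a 1..5 scale
    Implicit signals are down-weighted here because they are noisier than
    deliberate ratings; the weighting is an assumption of this sketch.
    """
    profile = defaultdict(float)
    for genre, fraction in implicit_events:
        profile[genre] += implicit_weight * fraction
    for genre, rating in explicit_ratings:
        profile[genre] += (rating - 3) / 2  # map 1..5 onto -1..+1
    return dict(profile)

print(build_profile([("horror", 0.9), ("drama", 0.2)], [("horror", 5)]))
# {'horror': 1.45, 'drama': 0.1}
```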

There are two main types of recommender systems: collaborative filtering recommenders and content-based recommenders (Ricci et al., 2015). A combination of the two is called a hybrid recommender. The collaborative filtering recommender provides recommendations based on similar users' interaction with catalog items through implicit and/or explicit feedback. An example is that of e-commerce company Amazon, which suggests items based on the purchase history and/or reviews of similar customers. In contrast, a content-based recommender bases its recommendations not on the behavior of similar users, but rather on the similarity between the characteristics of the items in the content catalog. An example of this is the Internet Movie Database (IMDb), which recommends movies and series based on criteria such as overlapping cast members or having the same genre.
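The two paradigms, and their hybrid combination, can be sketched as follows (a toy illustration with assumed data structures, not any platform's actual implementation): the collaborative score is derived from similar users' ratings, the content-based score from feature similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(candidate, liked_profile, neighbour_ratings, alpha=0.5):
    """Blend collaborative and content-based evidence for one candidate item.

    candidate:         feature vector of the item to score (e.g. genre flags)
    liked_profile:     mean feature vector of items the user liked
    neighbour_ratings: ratings (0..1) that similar users gave the candidate
    """
    content = cosine(candidate, liked_profile)                 # content-based
    collab = (sum(neighbour_ratings) / len(neighbour_ratings)  # collaborative
              if neighbour_ratings else 0.0)
    return alpha * collab + (1 - alpha) * content

# A horror title scored for a user who mostly likes horror; features: [horror, comedy]
print(hybrid_score([1, 0], [0.9, 0.1], [0.8, 0.6, 1.0]))  # ~0.90
```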

User experience (UX) has been a difficult concept for researchers to define, which has resulted in diverse definitions being available. An early, general approach to defining UX is that of Hassenzahl (2008): "[UX is] a momentary, primarily evaluative feeling (good-bad) while interacting with a product or service." Knijnenburg, Willemsen, Gantner, Soncu, & Newell (2012) define user experience in recommender systems as "the users' evaluation of their interaction with the system". The definition of Konstan & Riedl (2012), "the delivery of the recommendations to the user and the interaction of the user with those recommendations", focuses more on the recommendations provided by a recommender system. For this study, the definition of Knijnenburg et al. is used, as it emphasizes the perspective of the user and how he or she experiences the recommender system.


2.2 A user-centric framework for the evaluation of recommender systems

Knijnenburg et al. (2012) state that UX should inherently be considered from the perspective of the user. Additionally, they stress that UX does not exist in a vacuum and that there are several more aspects that influence it. As a result, they have suggested a comprehensive user-centric framework for the evaluation of recommender systems (figure 1) (Knijnenburg, Willemsen, & Kobsa, 2011), in which all aspects of the framework (in)directly affect the UX and all other aspects (as shown by the arrows in figure 1).

Figure 1: A user-centric framework for the evaluation of recommender systems by Knijnenburg et al. (2011)

The main aspects of the framework are depicted as the six large (colored) boxes (figure 1). Each aspect contains one or several relevant elements that can be associated with a recommender system (shown as the square white boxes in figure 1). The aspects and their corresponding elements are described below.

The objective system aspects (OSA) are the technical specifications of the system, such as the recommender system algorithms, the composition of the content catalog and the features of the user interface. These can be identified as the rounded square boxes in figure 1. An example of an OSA is the objective recommendation accuracy that results from the algorithm accuracy.

The subjective system aspects (SSA) are the user’s interpretation of the OSA. These provide an extra layer between the system itself and the user experience, providing information about how the users perceive the technical aspects of the system. Perceived recommendation set variety and perceived recommendation quality are both SSA and describe how the user experiences the variety and quality of the recommendations (OSA) provided for them. Perceived recommendation quality includes how the user subjectively evaluates the objective recommendation accuracy and to what extent they appreciate the recommendations provided by the system.

The user experience (EXP) is the actual experience a user has through interaction (INT) with the system (i.e. clicking, making choices, etc.). Knijnenburg et al. (2012) distinguish three subcategories within EXP: system-EXP, process-EXP and outcome-EXP. System-EXP shows how the user views the system as a whole in terms of effectiveness and hedonic value. Process-EXP is how the user experiences interaction with the system. Outcome-EXP measures how the user feels about the result of the interaction with the system. Four elements are discerned within EXP: perceived system effectiveness and fun (system-EXP), usage effort and choice difficulty (process-EXP), and satisfaction with the chosen items (outcome-EXP).


The personal characteristics (PC) are all characteristics that describe the user as an individual. The elements within the PC are domain knowledge, gender and trust in technology. Domain knowledge refers to the level of expertise the user has regarding the topic and content of the recommender system. The situational characteristics (SC) concern the context of the user's interaction with the system. Included here are the system-specific privacy concerns.

The framework provides statements that can be used as questionnaire items to measure each of the elements within the aspects of the framework. These are shown in italic in figure 1.

2.3 Netflix's view on user experience

2.3.1 Netflix's focus on algorithm accuracy

In the paper by Gomez-Uribe & Hunt (2015), it becomes clear that Netflix puts its recommender system at the center of its business model. It is stated that the company believes that improving algorithm accuracy, and therefore recommendation accuracy, is key to expanding its business, as it keeps users engaged with the platform. It is also claimed that more accurate recommendations improve the user experience by making the platform more compelling to its users. As a result, Netflix puts a lot of effort into building different versions of its algorithms and A/B testing which performs better. This is done by comparing implicit behavioral data, such as item selection behavior, time spent on the platform and the continuation of subscription to the service. Gomez-Uribe & Hunt also claim that users generally fail to accurately identify how the UX could be improved for them. As an argument for this approach, they state that users cannot differentiate between good and better recommendations and that only the users' behavioral data is an accurate indicator of what works for them in the UX of Netflix.
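As a minimal illustration of such an A/B comparison on implicit engagement data (the metric and numbers are invented for this sketch; Netflix's actual test pipeline is far more elaborate):

```python
def ab_lift(hours_a, hours_b):
    """Point-estimate comparison of mean weekly viewing hours between two
    algorithm variants. A real A/B test would add a significance test over
    many engagement metrics (viewing time, retention, item selection)."""
    mean_a = sum(hours_a) / len(hours_a)
    mean_b = sum(hours_b) / len(hours_b)
    return {"mean_a": mean_a, "mean_b": mean_b,
            "relative_lift": (mean_b - mean_a) / mean_a}

# Hypothetical weekly viewing hours for users in each variant:
print(ab_lift([4.1, 5.0, 3.2, 4.4], [4.8, 5.5, 4.0, 5.1]))
```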

An early example of Netflix's emphasis on algorithm accuracy is the Netflix prize contest that took place in 2006, organized by the company itself. A dataset was provided to the contestants with over 100 million anonymous movie ratings – the largest recommendation dataset available at that time. The challenge was to create a recommender that scored higher in terms of algorithm accuracy than the company's own recommender "Cinematch". The winning team, if any, would be awarded $1,000,000 (Bennett & Lanning, 2007). After three years, in 2009, a team finally succeeded and took the prize money. However, the accuracy improvements found by the team were never adopted by the company, due to high implementation costs and a shift of focus regarding its own recommender algorithms (Amatriain & Basilico, 2018). Naturally, the Netflix prize attracted a vast amount of attention to the topic of collaborative filtering algorithms during this time period, not least in terms of research effort and the growing number of skilled experts in that field (Bell, Koren, & Volinsky, 2010).
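The accuracy metric used to judge the Netflix prize was the root mean squared error (RMSE) between predicted and actual ratings; winning required beating Cinematch's RMSE by at least 10%. A short sketch of the metric:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

# Lower is better; the winning team had to beat Cinematch's RMSE by >= 10%.
print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))  # ~0.645
```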

2.3.2 The Netflix recommender system

The Netflix platform features a hybrid recommender system, meaning that it incorporates both collaborative and content-based filtering algorithms. In practice this means that both the behavior of similar users and the similarity between the characteristics of the content catalog items are considered when generating recommendations.

The Netflix recommender system consists of a set of six algorithms, each specifically made for generating certain types of recommendations (Gomez-Uribe & Hunt, 2015). They incorporate personalization, popularity and temporality (i.e. how time affects content consumption) in order to present the user with a selection of items from the content catalog. These recommendations are shown as titles (or items) within their corresponding rows in the Netflix user interface. Some of the algorithms are rankers (choosing a specific order to show the recommendations in), while others serve a different purpose. The output of each of the algorithms is described briefly below. The technical details of the algorithms are beyond the scope of this study and therefore are not included.

The personalized video ranker (PVR) is responsible for making personalized recommendations by selecting subsets of content to show to the user. Often these are based on a certain genre or topic, such as "Horror movies" or "Award-winning series". The PVR also generates the "Popular" subset (as shown in figure 2) by incorporating the general popularity of content on the platform.

Figure 2: An example of recommendations made by the PVR.

The Top-N video ranker uses a more focused approach than the PVR by selecting highly personalized singular items and showing them to the user as a selection of the best picks for them (figure 3).

Figure 3: An example of recommendations made by the Top-N ranker.

The trending ranker combines temporal popularity with personalization to create a set of currently trending items as recommendations for the user (figure 4).

Figure 4: An example of recommendations made by the trending ranker.

The continue watching ranker shows a ranked selection of items that the user has partially watched. The algorithm tries to assess which items the user might want to continue watching and which they might not. Temporality plays an important part here. An example can be found in figure 5.

Figure 5: An example of recommendations made by the continue watching ranker.

The video-video similarity algorithm suggests a set of items from the content catalog that are similar to an item that a user has previously watched. This happens in a non-personalized manner by considering only item-based similarity. An example of this algorithm in the Netflix user interface can be found in figure 6.


Figure 6: An example of recommendations made by the video-video similarity algorithm.

The page generation algorithm is used to build up the home feed of Netflix, where most of the interaction with the platform takes place. The output of all aforementioned algorithms is shown in the form of rows in the user interface. These rows appear in a particular order based on personalization, popularity and temporality. Additionally, this algorithm selects an item to feature at the top of the Netflix home feed (figure 7).

Figure 7: A featured catalog item chosen by the page generation algorithm.
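A toy sketch of such row ordering (the weights and row scores are invented for illustration; the actual page generation algorithm is not public):

```python
def order_rows(rows, w_personal=0.5, w_popular=0.3, w_temporal=0.2):
    """Order home-feed rows by a weighted mix of personalization, popularity
    and temporality, the three signals the page generation algorithm is
    described as combining. The weights are invented for this sketch."""
    def score(row):
        return (w_personal * row["personalization"]
                + w_popular * row["popularity"]
                + w_temporal * row["temporality"])
    return sorted(rows, key=score, reverse=True)

rows = [
    {"title": "Trending now", "personalization": 0.3, "popularity": 0.8, "temporality": 0.9},
    {"title": "Top picks for you", "personalization": 0.9, "popularity": 0.4, "temporality": 0.3},
]
print([r["title"] for r in order_rows(rows)])  # ['Top picks for you', 'Trending now']
```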

Together these algorithms form the Netflix recommender system. There are several other algorithms included in the platform, but those serve only 20% of the total user interaction, while the Netflix recommender serves 80% (Gomez-Uribe & Hunt, 2015). Analyzing the different algorithms of the Netflix recommender system and its business model makes apparent that most of the Netflix user interface is made up of recommendations. It seems that the company has put great effort into keeping its platform free of other distractions.

2.4 Hypotheses/predictions

As stated in 2.2, UX (EXP in the framework) is fundamentally user-centric and should be evaluated as such. Furthermore, UX/EXP is not a stand-alone concept, but rather emerges as a result of influences from several other aspects that either directly or indirectly relate to it. The link between OSA and UX/EXP, for example recommendation accuracy (OSA) and satisfaction with the chosen item(s) (outcome-EXP), should not be regarded as a direct one, as it first has to be established how the user perceives the OSA. This can be done by using an SSA as a mediator between OSA and EXP; in this case perceived recommendation quality (SSA).

Netflix has shown a particular interest in algorithm accuracy, illustrated by the Netflix prize competition and the fact that the company centers its business model around its state-of-the-art recommender system, which is constantly in development. This means that Netflix focuses on recommendation accuracy – which is an OSA – and measures this through implicit behavioral user data and continuous A/B testing. Consequently, the company does not incorporate user-centric evaluations of the UX of its recommender system.

The approach of Netflix in regard to UX is not in line with the framework of Knijnenburg et al. (2012). As a result, two hypotheses/predictions may be constructed for the second research question:

1. High (or low) recommendation accuracy does not necessarily lead to a good (or bad) evaluation of the recommendations by Netflix users.

2. Netflix users find different aspects than recommendation accuracy to be important in the UX of the Netflix recommender system.


3. Methodology

3.1 Participants

A total of 19 participants (6 male, 13 female) were interviewed during a nine-day period. They were approached either through a messaging application or in person, after which a face-to-face interview at a location of their preference was planned. The ages of the participants ranged from 21 to 29 years. All of them had either previously completed or were currently enrolled in a form of higher education (i.e. HBO/university). Only people who had a personal Netflix profile, watched mainly individually and used Netflix at least one hour per week were selected.

3.2 Task scenario

A task scenario was created for this study, consisting of a set of subtasks that the participants were asked to complete within the Netflix platform. The task scenario formed the base of the controlled experiment and helped activate the Netflix user experience for the participants, so that they were engaged with the platform before completing the survey discussed in 3.3. Each subtask featured interaction with a part of the user interface in order to nudge the participants into engagement with the platform and its recommendation features. The task scenario was created based on the Netflix recommender system algorithms and their output in the user interface as described in 2.3.2.

When the participants were asked to "pick an item", they were expected to consider the recommendations in a specific part of the user interface and subsequently select an item that they might enjoy watching, but had not watched before. They were asked to voice their chosen items, which were then written down on a piece of paper, creating an overview for them to reflect on during the completion of the survey. This was done to assist the participants, as some questions concerned the actual choices made during the task scenario; without an overview, the participants would have been less likely to remember their chosen items and decision process.

Each of the six algorithms of the Netflix recommender system had one or two corresponding subtasks in the task scenario (table 1). The PVR had two subtasks, which asked the participants to pick an item from a row with a genre/topic that appealed to them and from the "Popular" row. The Top-N video ranker had one subtask that asked the participants to pick an item from the "Top picks for [user]" row. The trending ranker had one subtask, which asked the participants to pick an item from the "Trending now" row. The continue watching ranker featured one subtask in which the participants were asked to pick an item from the "Continue watching for [user]" row. For the video-video similarity algorithm, one subtask was created that asked the participants to pick an item from a "Because you watched [item]" row. Lastly, for the page generation algorithm, two subtasks were included in which the participants did not have to pick an item, but were asked to spend some time looking at the item featured at the top of the home feed and to skim through the entire home feed, looking at the different rows and recommendations that were generated for them.

It is important to note that due to the dynamic nature of the Netflix home feed (caused by the page generation algorithm), some subtasks were not possible to fulfill as a result of certain rows not being present at that moment in time. Those subtasks were skipped.


Personalized video ranker (PVR):
- Look for a row with a genre/topic that you might like and pick an item.
- Pick an item from the "Popular" row.

Top-N video ranker:
- Pick an item from the "Top picks for [user]" row.

Trending:
- Pick an item from the "Trending now" row.

Continue watching ranker:
- Pick an item from the "Continue watching" row.

Video-video similarity:
- Look for a "Because you watched [item]" row and pick an item.

Page generation:
- Take some time to look at the featured item at the top of the home feed.
- Take some time to scroll through the home feed and look at the different rows, row titles and the corresponding items for each row.

Table 1: The task scenario (subtasks per algorithm).

3.3 User experience survey

The framework as discussed in 2.2 was used to evaluate the Netflix recommender system from the perspective of the user. To achieve this, a user experience survey was created. The survey consisted of three parts.

The first part of the user experience survey consisted of the twelve statements provided by the framework that measure each framework element. An overview of these statements with their corresponding framework elements and aspects can be found in table 2.


Statement – Element (Aspect):

- "The recommendations contained a lot of variety." – Perceived recommendation set variety (SSA)
- "I like the items recommended by the system." – Perceived recommendation quality (SSA)
- "The recommended items fitted my preference." – Perceived recommendation quality (SSA)
- "I would recommend the system to others." – Perceived system effectiveness and fun (system-EXP)
- "The system is convenient." – Perceived system effectiveness and fun (system-EXP)
- "I have to invest a lot of effort in the system." – Usage effort (process-EXP)
- "Making a choice was an overwhelming task." – Choice difficulty (process-EXP)
- "I like the items I've chosen." – Satisfaction with the chosen items (outcome-EXP)
- "Technology never works." – Trust in technology (PC)
- "I'm less confident when I use technology." – Trust in technology (PC)
- "I'm afraid the system discloses private information about me." – System-specific privacy concerns (SC)
- "I like to give feedback on items." – Intention to provide feedback (INT)

Table 2: The statements provided by the framework of Knijnenburg et al. (2011) for the first part of the user experience survey.

The participants were asked to state on a Likert scale of 1 to 7 to what extent they agreed with each of the statements, with 1 being "completely disagree" and 7 being "completely agree". The results from this part of the survey provide data about the participants (PC, SC and INT) and how they evaluated the Netflix recommender system and its user experience (SSA and EXP).
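For illustration, tallying such Likert answers into the frequency rows reported in the tables of chapter 4 amounts to a simple count per scale point (a sketch of the aggregation, independent of how it was actually performed in this study):

```python
from collections import Counter

def likert_frequencies(responses, scale=range(1, 8)):
    """Tally 7-point Likert answers into one frequency per scale point."""
    counts = Counter(responses)
    return [counts.get(point, 0) for point in scale]

# The 19 answers to "I like the items I've chosen" (cf. table 4):
answers = [4, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7]
print(likert_frequencies(answers))  # [0, 0, 0, 2, 1, 10, 6]
```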

In the second part of the survey, the participants were expected to do two things. Firstly, they answered how important they considered each framework element to be in the Netflix recommender system UX. They did so on a Likert scale from 1 to 7, with 1 being "not important at all" and 7 being "very important". The framework elements were rephrased so that they were easier for the participants to understand (table 3). Only elements that concern the system and its UX were included (SSA, EXP).

Element – Rephrased as:

- Perceived recommendation variety (SSA) – The variety of the recommendations
- Perceived recommendation quality (SSA) – The quality of the recommendations
- Perceived system effectiveness and fun (system-EXP) – The effectiveness and fun of the interaction with the platform
- Usage effort (process-EXP) – The effort required to use the platform
- Choice difficulty (process-EXP) – The difficulty of choosing an item
- Satisfaction with the chosen items (outcome-EXP) – The satisfaction with the chosen items

Table 3: The rephrased framework elements for the second part of the user experience survey.


Secondly, the participants were asked to rank the six elements from first to sixth place based on their importance to them in the Netflix recommender system UX. Ranking the elements forced the participants to place one element above or below another.

The third and final part of the survey consisted of several demographic questions. The questions about age and domain knowledge were used to get an understanding of the personal characteristics of the participants in this study. Also included were two questions about their usage behavior of the platform, which can be classified as situational characteristics within the framework.

The complete user experience survey can be found in the appendix.

3.4 Procedure

The interviews were conducted in a one-on-one, face-to-face manner at a location of the participant's preference. Each participant partook in a controlled experiment on a laptop within the Netflix desktop application. To ensure a realistic and personalized experience, the participant was asked to log in to Netflix with his/her own credentials. The experiment consisted of a short introduction, the task scenario and the online user experience survey.

First, the participant was asked to take a seat in front of the laptop. Subsequently, the introduction was given, in which the procedure of the experiment was explained. After this, the participant was asked to log in to Netflix and select his/her personal profile. Once the participant was presented with the Netflix home feed, the task scenario was read to him/her one subtask at a time, each of which had to be completed. The task scenario was finished after the completion of the last subtask. At this point, the participant was asked to complete the user experience survey through a Google form on the laptop. A hand-written overview of the items that he/she had chosen during the task scenario was placed next to the laptop. This helped the participant reflect on his/her decisions when completing the survey.

With the completion of the online survey, the experiment was concluded. After being thanked for participating, the participant was free to leave.


4. Results

The results from the user experience survey are discussed below.

The results from the first part of the survey can be found in table 4.

Answer scale: 1 (completely disagree) – 7 (completely agree); frequencies per scale point 1-7.

- SSA – "The recommendations contain a lot of variety.": 0 1 2 3 7 4 2
- SSA – "I like the items recommended by the system.": 0 1 5 3 7 3 0
- SSA – "The recommended items fitted my preference.": 0 1 1 7 7 2 1
- system-EXP – "I would recommend the system to others.": 0 0 3 4 4 6 2
- system-EXP – "The system is convenient.": 0 1 3 3 3 8 1
- process-EXP – "I have to invest a lot of effort in the system.": 2 6 3 4 2 1 1
- process-EXP – "Making a choice was an overwhelming task.": 1 4 2 1 4 5 2
- outcome-EXP – "I like the items I've chosen.": 0 0 0 2 1 10 6
- PC – "Technology never works.": 7 11 0 1 0 0 0
- PC – "I'm less confident when I use technology.": 8 8 1 1 1 0 0
- SC – "I'm afraid the system discloses private information about me.": 4 5 4 1 3 2 0
- INT – "I like to give feedback on items.": 2 5 3 0 6 1 2

Table 4: Results from the first part of the user experience survey (statements) (frequencies, N=19).

The statement that the participants agreed with most is "I like the items I've chosen". A total of 17 participants answered in agreement, of which 16 assigned a score of either 6 or 7. This statement corresponds to the outcome-EXP aspect and the satisfaction with the chosen items element in the framework. In practice this means that the participants stated that they were highly satisfied with their chosen items.

Looking at the statements concerning the perceived recommendation quality (SSA), "I like the items recommended by the system" and "The recommended items fitted my preference", the participants answered slightly skewed to the right. For each of the statements, 10 participants answered in agreement (5, 6 or 7 on the answer scale). This shows that they experienced the recommendation quality as higher than neutral, but not overwhelmingly so. A difference between the two statements is that 6 participants answered the first in disagreement (1, 2 or 3 on the answer scale), while only 2 did so for the second. This indicates that the participants acknowledged that the items fitted their preferences, but that they did not always appreciate them to an equal degree.

The results from grading the importance of the elements during the second part of the survey can be found in table 5.

Answer scale: 1 (not important at all) – 7 (very important); frequencies per scale point 1-7.

- SSA – The variety of the recommendations: 0 2 1 2 4 7 3
- SSA – The quality of the recommendations: 0 0 0 0 2 8 9
- system-EXP – The effectiveness and fun of the interaction with the platform: 0 2 3 2 5 5 2
- process-EXP – The effort required to use the platform: 0 0 0 3 5 6 5
- process-EXP – The difficulty of choosing an item: 0 0 1 2 3 9 4
- outcome-EXP – The satisfaction with the chosen items: 0 0 0 0 0 8 11

Table 5: Results from the second part of the user experience survey (importance of elements, grading) (frequencies, N=19).

The results show that two elements stand out from the others in terms of importance to the participants. The first is the satisfaction with the chosen items (outcome-EXP), which all 19 participants rated with either a 6 or 7 on the answer scale. The second is the quality of the recommendations (SSA), which was given a score of 5, 6 or 7 by all 19 participants and a score of either 6 or 7 by 17 participants. While all elements were generally found to be important, these two scored the highest, showing that the participants cared most about their contentment with the chosen items and the quality of the recommendations, respectively.


The results from ranking the importance of the elements during the second part of the survey can be found in table 6.

Answer scale: ranking; 1 (most important) – 6 (least important); frequencies per rank 1-6.

- SSA – The variety of the recommendations: 2 0 3 7 4 3
- SSA – The quality of the recommendations: 4 6 2 1 3 3
- system-EXP – The effectiveness and fun of the interaction with the platform: 2 2 3 1 6 5
- process-EXP – The effort required to use the platform: 1 2 5 6 3 2
- process-EXP – The difficulty of choosing an item: 3 4 5 2 3 2
- outcome-EXP – The satisfaction with the chosen items: 7 5 1 2 0 4

Table 6: Results from the second part of the user experience survey (importance of elements, ranking) (frequencies, N=19).

Analyzing the results of the ranking, it becomes apparent that the satisfaction with the chosen items (outcome-EXP) was ranked in first place the most times, as 7 participants did so, while 5 participants ranked it in second place. This adds up to a total of 12 participants who ranked it in either first or second place, showing that the participants believed this element to be the most important. The quality of the recommendations was ranked in second place the most, as 6 participants did so, while 4 participants ranked it in first place. This adds up to a total of 10 participants who ranked it in either first or second place. As a result, this element was found to be the second most important by the participants. These results are in line with those of the grading.


The results from the demographics can be found in table 7.

- Age: 21 (2), 22 (2), 23 (7), 24 (5), 25 (2), 29 (1)
- Gender: male (6), female (13)
- Usage frequency: daily (9), once every few days (10), once a week (0)
- Usage/week: 1-3 hours (1), 3-5 hours (7), 5-7 hours (5), 7-9 hours (4), >9 hours (2)

Answer scale: 1 (much less knowledgeable) – 7 (much more knowledgeable); frequencies per scale point 1-7.

- "How knowledgeable are you in terms of video entertainment productions (series, movies, documentaries, etc.) compared to your peers?": 0 2 3 5 4 4 1

Table 7: Results from the third part of the user experience survey (demographics) (frequencies, N=19).

Looking at the demographics of the participants, two distinct groups can be identified: those who use Netflix on a daily basis (9 participants) and those who use the platform once every few days (10 participants). No participants indicated that they used the platform once a week. For the most part, the results from the two subgroups are the same as the combined results. However, two differences can be found, which are discussed below. For convenience, the daily group is referred to as 'group A' and the once-every-few-days group as 'group B'.

Answer scale: 1 (completely disagree) – 7 (completely agree); frequencies per scale point 1-7.
Statement: "I like the items recommended by the system."

- Daily, group A (N=9): 0 1 1 1 3 3 0
- Once every few days, group B (N=10): 0 0 4 2 4 0 0

Table 8: Grouped results for the statement "I like the items recommended by the system" (frequencies, N=19).

The first difference is that 66.7% (6/9) of group A agreed with the statement "I like the items recommended by the system" (score 5: 33.3% (3/9); score 6: 33.3% (3/9)), while only 40% (4/10) of group B did so (score 5: 40% (4/10)) (table 8). This shows that the participants who used Netflix daily liked the recommendations more than those who used the platform once every few days.


Answer scale: ranking; 1 (most important) – 6 (least important); frequencies per rank 1-6.
Element: The satisfaction with the chosen items.

- Daily, group A (N=9): 5 3 0 1 0 0
- Once every few days, group B (N=10): 2 2 1 1 0 4

Table 9: Grouped results for the ranking of the element "The satisfaction with the chosen items" (frequencies, N=19).

The second difference is that 55.6% (5/9) of group A ranked the importance of the element The satisfaction with the chosen items in first place, while this was only the case for 20% (2/10) of group B (table 9). Moreover, 88.9% (8/9) of group A ranked the element in either first or second place, while only 40% (4/10) of group B did so. In fact, 40% (4/10) of group B placed the element in last place, which none of the participants of group A did. This indicates that the participants who used Netflix on a daily basis found the satisfaction with the chosen items to be more important than those who did so once every few days.


5. Conclusion and discussion

Through analyzing the Netflix recommender system and relevant literature, it can be concluded that Netflix's view on UX has a substantial focus on algorithmic performance. Research has shown that the perspective of the user is crucial in the evaluation of the UX of recommender systems and that UX is a multidimensional concept that goes beyond recommendation accuracy alone. However, Netflix does not incorporate subjective user evaluations and focuses mainly on objective recommendation accuracy as a measure of performance.

The Netflix users in this study indicated that they were highly satisfied with the items they had chosen from the recommendations provided for them by the platform. However, the perceived quality of the recommendations was evaluated only slightly positively. In other words, the users indicated that the accuracy of the recommendations did not translate into an equally positive evaluation of the quality of the recommendations (table 4), which is in line with hypothesis 1.

Furthermore, the results showed (table 5) that the participants graded the satisfaction with the chosen items slightly higher than the actual quality of the recommendations in terms of importance to them. When they were forced to rank both the satisfaction with the chosen items and the quality of the recommendations (table 6), the satisfaction was again regarded as the most important, but this time a larger difference between the two was found. Apparently, the participants found the satisfaction with the chosen items to be a more important element of UX than the quality of the recommendations, which is in line with hypothesis 2.

An explanation for both of these findings could be that recommendations are always a set of items, while the user might only enjoy a few singular items. As a result, the larger set of recommendations is evaluated to be of lower quality and less importance than the smaller subset of the actually chosen items.

Interestingly, the participants who used the platform daily evaluated the recommendations as being of higher quality than those who did so once every few days. Additionally, the daily users found the satisfaction with the chosen items to be more important than the less frequent users did. This could be explained by the daily users having a more developed user profile on the platform, which resulted in more fitting recommendations being provided for them. Another explanation could be that the more frequent users were more experienced with the platform, which made them more efficient in their interaction with the recommender system.

Comparing Netflix's view on UX to that of Netflix users, the most striking difference is that Netflix does not consider user evaluations, meaning that there is currently no way for users to express their wishes and needs for the UX of the platform. Because the company relies solely on analyzing behavioral data, it has no insight into the context of a user's interaction. Illustrating the need for this, the Netflix users in this study indicated a discrepancy between their evaluation of the quality of the recommendations and their actual satisfaction with the items they ended up choosing. Moreover, they indicated that they did not find recommendation accuracy to be the most important element of UX, while this is exactly what the company focuses on.

A proposal for Netflix and future research is to create a system dedicated to eliciting user evaluations, complementing the behavioral data that the company already collects.


6. Limitations and further research

6.1 Limitations

Scheduling the face-to-face meetings with participants was a time-consuming effort. However, it was necessary in order to ensure that each participant completed the task scenario as instructed. Due to limitations in the time available for the execution of the methodology, fewer participants were interviewed than initially aimed for. Due to the small size and homogeneity of the sample group, it is difficult to say how far the results of this study generalize to all Netflix users. In the future, more data should be collected to allow stronger conclusions. Another consequence of the small sample size is that no statistical tests have been applied to the data. Therefore, the findings of this study lack statistical support. Moreover, additional correlations and meaning could possibly be found in the data by applying such tests.
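As an example of the kind of test that could be applied, a non-parametric Mann-Whitney U test (suitable for small ordinal samples) on the grouped answers from table 8 would look as follows (a sketch using SciPy; this study itself reports frequencies only):

```python
from scipy.stats import mannwhitneyu

# Answers to "I like the items recommended by the system", reconstructed
# from the frequencies in table 8 (group A: daily; group B: once every few days).
group_a = [2, 3, 4, 5, 5, 5, 6, 6, 6]
group_b = [3, 3, 3, 3, 4, 4, 5, 5, 5, 5]

stat, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")  # judge against the conventional p < .05 level
```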

The participants in this study were asked to base their answers to the user experience survey on the task scenario they had previously completed during the controlled experiment. However, the task scenario is not a realistic representation of the real-world usage behavior of Netflix users. For example, the actual behavior of users of the platform is unlikely to include selecting multiple items from different rows in the user interface during the same session. Moreover, many different task scenarios could be created, each with possibly different results, while this study only incorporated one.

In this study, only one specific recommender system was evaluated, for which a special methodology was created. Therefore, it should be mentioned that the findings may not be representative of other recommender systems. These would require their own evaluations and appropriate methodologies.

6.2 Further research

This study has found that there is a difference between what users want and what developers of recommender systems measure in terms of UX. Additionally, user evaluations could provide valuable context to the users’ behavioral data. However, companies such as Netflix that do not incorporate user evaluations still thrive. Further investigation is required to better understand in what way user evaluations should be implemented in order for them to be valuable in the evaluation of UX in real-world recommenders.

Evaluating a recommender system that is operating in production is a complicated effort. For example, the Netflix platform is not a static experience, but rather a dynamic one, due to the page generation algorithm and the fact that the company is constantly testing different versions of algorithms and user interfaces. As a result, the OSA varied between participants in this study. Of course, this is to be expected from a real-world recommender. Still, it is something that future research should take into account.

Even though this study has addressed the topic of implicit and explicit feedback, no analysis of the feedback system of the Netflix platform was made. One of the reasons for this is the company's lack of transparency about whether, and to what extent, it incorporates the data from its feedback system in its recommender algorithms. Further research on the dynamics of the feedback system should be done, as it could prove to be an important factor in the UX of the platform.


References

Amatriain, X. & Basilico, J. (2018). Netflix recommendations: Beyond the 5 stars (Part 1). Retrieved June 11, 2019, from https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429

Bell, R. M., Koren, Y. & Volinsky, C. (2010). All together now: A perspective on the Netflix Prize. CHANCE, 23(1), 24-29.

Bennett, J. & Lanning, S. (2007). The Netflix Prize. In Proceedings of KDD Cup and Workshop 2007, 35.

Chang, S., Harper, F. M. & Terveen, L. (2015). Using groups of items to bootstrap new users in recommender systems. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '15), 1258-1269.

Gomez-Uribe, C. A. & Hunt, N. (2015). The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems, 6(4), 1-19.

Hassenzahl, M. (2008). User experience (UX): Towards an experiential perspective on product quality. IHM, 8, 11-15.

Knijnenburg, B. P., Willemsen, M. C., Gantner, Z., Soncu, H. & Newell, C. (2012). Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22(4-5), 441-504.

Knijnenburg, B. P., Willemsen, M. C. & Kobsa, A. (2011). A pragmatic procedure to support the user-centric evaluation of recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, 321-324.

Konstan, J. A. & Riedl, J. (2012). Recommender systems: From algorithms to user experience. User Modeling and User-Adapted Interaction, 22(1-2), 101-123.

Pu, P., Chen, L. & Hu, R. (2011). A user-centric evaluation framework for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, 157-164.

Rashid, A. M., Karypis, G. & Riedl, J. (2008). Learning preferences of new users in recommender systems: An information theoretic approach. ACM SIGKDD Explorations Newsletter, 10(2), 90-100.

Ricci, F., Rokach, L. & Shapira, B. (2015). Recommender Systems Handbook (2nd ed.). New York, NY: Springer.

