High-Performance Recommender System for offering relevant Next-Best-Actions

N/A
N/A
Protected

Academic year: 2021

Share "High-Performance Recommender System for offering relevant Next-Best-Actions"

Copied!
90
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)


Computational Science Master Thesis

Submitted to the examination committee in September 2016, in partial fulfillment of the requirements for the degree of MSc in Computational Science.

Author: Christos Petropoulos

Supervisors: Dr. Drona Kandhai, MSc. Aziz Mohammadi

Committee Members: Dr. Drona Kandhai, Dr. Alfons Hoekstra, MSc. Aziz Mohammadi, MSc. Ioannis Anagnostou

University of Amsterdam
Faculty of Science, Graduate School of Informatics
Master Computational Science


Abstract

Recommender Systems in e-commerce and advertising have proved to be very useful and successful in providing relevant content to customers. Several techniques for achieving that have been proposed, with Collaborative Filtering being the most favoured of all. In this thesis project, we present two scalable machine learning frameworks for Collaborative Filtering, which are designed and implemented to deliver relevant Next-Best-Actions (NBAs) to the customers of ING Bank N.V. Specifically, we compare a User-based approach to a Latent-factor approach. We evaluate both in terms of accuracy and scalability. We experiment under different settings (of parameters and modeling techniques) and show that the User-based approach outperforms the predictive performance of the Latent-factor approach. The resulting system uses a significant amount of anonymous customer data and a cluster of computer nodes, in order to fulfil the requirement of daily NBA selection. The implementation is based on Apache Spark, one of the latest Map-Reduce frameworks introduced for large-scale data processing and machine learning tasks. With its use on the cluster, ING manages to achieve a solution equivalent (in terms of predictive and computing performance) to that of the original system, which runs on a powerful Netezza machine. The Netezza machine is designed by IBM and is used for multiple data processing tasks at ING, including the NBA selection. Our solution replaces that process, which in turn frees up resources on Netezza.


Contents

1 Introduction
  1.1 The Next-Best-Action Method
    1.1.1 Definition
    1.1.2 Process
    1.1.3 Strategy
  1.2 Recommender Systems
  1.3 Models
  1.4 Cluster Computing
  1.5 Use Cases
  1.6 Motivation & Goal
2 Literature Study
  2.1 Recommender Systems
    2.1.1 Methods
    2.1.2 Evaluation
    2.1.3 Applications on Personalized Advertising
  2.2 Cold-Start Problem
    2.2.1 Introduction
    2.2.2 Contextual Bandits
  2.3 Feature Selection
    2.3.1 Algorithms
    2.3.2 Criteria
    2.3.3 Methods
3 Data Availability and Features
  3.1 Data Availability
  3.2 Feature Engineering
    3.2.1 Categorical Features
    3.2.2 One-Hot-Encoding
    3.2.3 Normality & Scaling Transformations
  3.3 Feature Selection
    3.3.1 Chi-Squared Test of Independence
    3.3.2 CART Algorithm
    3.3.3 Results
  3.4 Privacy Preservation
4 Implementation
  4.1 Scalability
  4.2 Map-Reduce
  4.3 Spark
  4.4 System Overview
5 Experimental Setup & Results
  5.1 User-based Model
    5.1.1 Varying the Feature Space
    5.1.2 Stability of Modeling Techniques
    5.1.3 Parameters Tuning
    5.1.4 Effect of Sample Size
    5.1.5 Data Freshness
    5.1.6 Computing Performance
  5.2 Latent-Factor Model
    5.2.1 Parameters Tuning
    5.2.2 Data Freshness & Sub-sampling
  5.3 Comparison
Conclusions & Future Work
Acknowledgements


1 Introduction

The recent, sudden, and sharp increase in data availability (also known as big data) has motivated companies and organizations to build large-scale systems capable of processing terabytes of data, with the purpose of personalizing their services to the needs of their customers. Examples of that trend include many major technology companies, such as Google, Facebook and Microsoft. In general, Recommender Systems (RSs) in e-commerce and advertising have proved to be very useful and successful in providing relevant content to customers [1].

In this thesis project, we study the personalisation of online advertising and specifically, the design and implementation of a recommender system which lists the k most relevant offers to the customer. Each of those recommended offers can also be referred to as a Next-Best-Action (NBA).

To develop the strategy by which the top NBAs are selected, we adopt techniques from Machine Learning. Other than providing the actual lists of NBAs per customer, the main objective of the system, and of this study, is to find a strategy which maximizes the likelihood of a customer accepting a given offer, which translates to maximizing the probability of a customer clicking on the online banner displaying that offer. This probability is also known as the Click-Through-Rate (CTR).

To carry out our study, we make use of demographics and historical data of customers at ING Nederland. The developed strategy is used by ING Nederland to provide a ranked list of recommended actions across multiple channels of communication, such as the personal MijnING web page of each customer. The recommender system is designed and implemented to run on a cluster of computers, using the Map-Reduce framework of Spark.


1.1 The Next-Best-Action Method

1.1.1 Definition

The Next-Best-Action (NBA) Method is well known in the field of Decision Making and is widely used for marketing purposes. It specifies the available actions to choose from and which of those are best to follow from a customer-centric point of view. Such actions may relate to accepting or refusing an offer, proposition or service made by a company (or organisation). The method is used to estimate the set of products and services that the customer would be most interested in.

1.1.2 Process

With the aim of determining what the best actions are, a company needs to consider its customers' interests, but also the objectives and policies under which it operates. Due to the complexity of what that involves, it is a very common issue that the majority of the offered actions are either against the customer's interests or simply irrelevant to them. In order to limit the frequency of such occurrences, a company may define a process by which it determines a set of possible actions that would be beneficial to both parties at the same time. Applying such a process can lead to an increasing share of customers responding positively to a given offer.

1.1.3 Strategy

The key to applying the NBA method is to approach customers at the proper time and deliver to them an attractive set of possible offers across multiple channels of communication (such as web, e-mail, etc.).

Nowadays, it is typically easy for a company to reach out to its customers and to store and inspect their historical data (while always abiding by privacy laws and regulations). That acquired information can help to unveil possible patterns of consumer behaviour, which in turn can be used to effectively segment the customers into different groups, with each group receiving a unique collection of NBAs.

However, segmentation by itself is not effective enough, because of the large uncertainty about the general behaviour, needs and interests of each individual customer. Therefore, a next step in improving the strategy of the NBA method involves the development and use of predictive models that suggest a list of the most relevant NBAs per customer. Those modeling techniques compute the likelihood (as a score) of a particular customer being interested in a specific offer. Based on those scores, a list of offers is delivered to each customer.


The use of predictive analytics is very important for a company, since it provides the capability of delivering personalized offers in real time, without the need for any professional consultation. It can increase the response rate for product/service offers, but also reduce a company's financial costs, as traditional campaign management is becoming obsolete.

1.2 Recommender Systems

To automate the process of the NBA method, we implement a Recommender System (RS). An RS is a personalized information filtering technique [54], of which the objective is to develop a policy that produces a list of k ranked items according to the user's preferences and characteristics. In our case, it represents a function f that maps a user's context of information X to a subset of k items which are selected from the set of all possible items I. Each item is associated with a score that is used to indicate its rank. The rank represents the position of an item in the list of recommendations, which is sorted according to those scores. To produce a list of k ranked items, we select the k items with the highest score values, so that f : X → {i1, i2, ..., ik}, where ij ∈ I for all j and score(ij−1) ≥ score(ij) for all j ∈ [2, k].
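As an illustration, the top-k selection described above can be sketched as follows; the offer names and the scoring function are hypothetical stand-ins, not the actual ING data or models:

```python
from typing import Callable, Dict, List

def recommend_top_k(
    user_context: Dict[str, float],
    items: List[str],
    score: Callable[[Dict[str, float], str], float],
    k: int,
) -> List[str]:
    """Return the k items with the highest scores, sorted in descending order,
    so that score(i_{j-1}) >= score(i_j) holds for every adjacent pair."""
    ranked = sorted(items, key=lambda item: score(user_context, item), reverse=True)
    return ranked[:k]

# Toy example with an invented scoring table standing in for a trained model.
offers = ["savings_account", "mortgage", "credit_card", "insurance"]
scores = {"savings_account": 0.9, "mortgage": 0.2, "credit_card": 0.7, "insurance": 0.4}
top2 = recommend_top_k({}, offers, lambda ctx, item: scores[item], k=2)
print(top2)  # ['savings_account', 'credit_card']
```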

There are many approaches to the development of an RS, and those are mainly distinguished from one another by their functionality and the class of information they make use of. In the next chapter, the Literature Study, we discuss in further detail what those approaches are and which of them are more useful in each case.

1.3 Models

There has been significant work in the literature regarding the use of machine learning for developing models that predict a user's response to a displayed offer [105][106][107]. Those models define, update and improve the strategy by which the Recommender System (RS) decides which of the offers are more relevant to the customer. That process typically consists of two main stages.

At the first stage, a set of data instances is collected, where each instance is represented by a set of feature values f = (f1, f2, ..., fn) that is associated with one of the possible classes c1, c2, ..., ck. Our case is a binary classification problem (k = 2), that is: predicting the customer as either being interested in a given offer or not. Specifically, ING receives implicit feedback that indicates whether the customer clicked on the displayed offer. That feedback represents the class of each instance.

The second stage of the process involves the training of a model with the collected data. Given a set of feature values, the model estimates the conditional probabilities P(C | F = f) for each possible class c ∈ C. Our first model is based on Logistic Regression, which is very commonly used for RSs, due to the ease of parallelizing it and handling large-scale problems [3]. In addition to that model, we also experiment with Random Forests, which, compared to Logistic Regression, are able to capture non-linear correlations. Both of the previous modeling techniques are used to create an RS that is categorized as User-based. A User-based RS makes recommendations to a customer based on what similar customers like. Thus, the feature values f in this case represent the characteristics and the behaviour of the customer. Moreover, we also experiment with another approach called Latent-factor. A Latent-factor RS makes recommendations based on hidden user/item factors. For more details regarding the RS approaches, please refer to the Literature Study chapter.
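To illustrate the difference between the two techniques, the following sketch trains both classifiers on synthetic data with a deliberately non-linear (XOR-like) decision rule. The data and setup are illustrative only and use scikit-learn, not the Spark implementation described later:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic "customer features" and click labels; the label depends on a
# non-linear interaction of two features, which a linear model cannot capture.
X = rng.uniform(-1, 1, size=(2000, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like rule

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

lr = LogisticRegression().fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# predict_proba estimates P(click | features); ranking offers by this
# probability corresponds to the NBA selection described in the text.
lr_acc = lr.score(X_test, y_test)
rf_acc = rf.score(X_test, y_test)
print(f"logistic regression accuracy: {lr_acc:.2f}")
print(f"random forest accuracy:       {rf_acc:.2f}")
```

On this data the Random Forest clearly outperforms Logistic Regression, precisely because the decision boundary is non-linear; with a linearly separable label the gap would largely disappear.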

1.4 Cluster Computing

The use of clusters (a type of distributed system) has been a common practice, and even a default prerequisite in the industry, for setting up and building a production-level system capable of handling and processing loads of streaming data. Companies need to be proactive and respond in real time (or at least within a small time window) to their customers' requests and changes of behaviour. Therefore, a high-performance system is required.

A computer cluster usually consists of a large number of nodes which are interconnected under a low-latency network. Special treatment of the algorithms and tweaking is necessary, so that the full power and utilization of the system is reached. Most of the tweaks concern the balanced distribution of work among the nodes and how that can be achieved with the least need for communication (amount of transferred data).

Under the scope of this project, we make use of a 28-node cluster. We are also using Spark which is a Map-Reduce framework for large-scale data processing and distributed machine learning algorithms. It is also a framework that retains the fault-tolerance and scalability properties that a Map-Reduce framework is supposed to provide [4]. In Chapter 4, we discuss in further detail the specifications of the cluster and how we can take full advantage of its computational power by the use of Spark.
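As a minimal illustration of the Map-Reduce model that Spark generalizes, the classic word count can be simulated in plain Python. This is a conceptual sketch of the map, shuffle and reduce phases, not Spark code:

```python
from collections import defaultdict
from itertools import chain

# Input records, standing in for partitions of a distributed dataset.
records = ["spark handles big data", "big data needs big clusters"]

def map_phase(record):
    """Map: emit a (key, value) pair for every word in a record."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done over the network in a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values of each key independently (and in parallel)."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(chain.from_iterable(map_phase(r) for r in records)))
print(counts["big"])  # 3
```

In Spark the same computation is a chain of transformations on a resilient distributed dataset; the key point is that both the map and the reduce phases are embarrassingly parallel, and only the shuffle requires communication between nodes.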

Running the tasks of the RS in a parallel fashion is important. It gives the computational capability of providing offers that are frequently updated (usually on a daily basis) using the latest history of the users. In this way, ING expects to offer a more personalized experience to its customers, by becoming more relevant and aware of their needs.


1.5 Use Cases

ING Nederland uses the RS on the channels of Internet Banking (MijnING), Mobile Banking and Service calls. Those channels include several cases of personalized offers, which are either viewed on a banner, or are suggested on the phone (during service calls).

(a) Internet Banking Case 1: Marked area contains personalized banners which users see when they login to their MijnING account.

(b) Internet Banking Case 2: Marked area contains a personalized banner for the users that have just logged out of their MijnING account.


(c) Mobile Banking Case 1: Marked area contains a personalized banner for the user of the mobile app.

Figure 1.1: Use cases of the RS for personalized banners.

1.6 Motivation & Goal

The motivation behind this thesis project is to compare different Machine Learning techniques and approaches for predicting a user's response to a displayed offer. The goal is to create a scalable RS that delivers relevant NBAs to the customers of ING. We experiment under different settings, parameters and techniques, in order to evaluate which combination of those achieves the best possible solution. To create a scalable system, we use a cluster of computer nodes and the Map-Reduce framework of Spark. An additional goal of the thesis is to obtain a solution which is equivalent to that of the original system, which runs on a powerful Netezza machine. The Netezza machine is designed by IBM and is used for multiple data processing tasks at ING, including the NBA selection. By creating an equivalent solution on the cluster, we can replace that process and thus increase the limited resource availability on Netezza.


2 Literature Study

In this chapter we explain and discuss some of the most well-known methods and techniques for developing and evaluating a recommender system. Additionally, we include literature that is devoted to applications of recommender systems in advertising and to common problems that can appear during their development.

2.1 Recommender Systems

Recommender Systems (RSs) were introduced nearly twenty years ago (mid-1990s) [24][25] and have been evolving and improving since that time. Today, they are massively used to develop products and services which adjust to the characteristics and likings of the person using them. Their main functionality consists of filtering large amounts of stored information in order to provide relevant knowledge and personalised suggestions, which can range from health-care decision topics [23] to books [52], movies [53][51], music [57] or even financial products and services (such as bonds, mortgages, etc.). A great number of methods for RSs have been researched and studied over the last few years, due to that wide range of applications and the sudden growth of large-scale computing. An RS applies different methods and techniques depending on the domain of use and the availability, sparsity and dimensionality of the data [54][56].

2.1.1 Methods

To decide which method to use, we first need to identify and understand the way the user interacts with the system. That is, defining what information it is possible for the system to extract from the feedback of its users. It is essential to use that information in order to improve the performance (i.e. the relevance of the recommendations). The feedback can be either explicit or implicit. Explicit feedback refers to a direct response from the user to the system, specified by a rating within a predefined range (e.g. rating of a movie, artist, etc.) or a Boolean value (i.e. a positive or negative response of a user to an item). On the other hand, implicit feedback refers to a response that is reflected by the user's behaviour (e.g. which items the user bought, or searched for, etc.).

Secondly, but equally important, is the nature of the available data with regard to the items that are being recommended (item-based data), but also the users that those recommendations relate to (user-based data). All that information can be summarized into a collection of features f that describe the basic properties of an item (e.g. category, price, etc.) and the characteristics of a user (e.g. demographic information, user history, etc.). The capabilities and techniques of an RS are clearly dependent on the availability and quality (i.e. sparsity, missing values, outliers, etc.) of those features, which essentially operate as an input to a function that maps a user u into a collection of items i1, i2, ..., ik ∈ I, where I is the set of all available items in the system. Given the above, an RS may use a method that falls into one of the following main categories [56]:

• Content-Based Methods

• Collaborative-Filtering Methods

• Hybrid Methods

Content-Based Methods

A content-based method makes recommendations by considering the preferences of a user u and how those match each of the existing items I = {i1, i2, ..., in} of the RS. The preferences are determined by a subset of items (let us call it S) for which the user indicated interest in the past. Suppose that every item i is paired with a vector of descriptive features (i, f). Then, the RS searches through the set {I − S} and selects the items which are most similar to those of S based on their feature values.

There have been many proposed metrics for measuring the similarity between a pair of items. The most well known are the Cosine-Based Similarity [54], TF-IDF [56] and the Jaccard Similarity [61]. There are other metrics that use a distance measure across multi-dimensional distributions [60], or a learning algorithm that estimates the similarity [59] or category of an item by an underlying model such as a linear classifier or Naive Bayes [86].

• Cosine-Based Similarity

Suppose that we have a pair of items (i1, i2) and their respective properties expressed as vectors, f1 and f2. Then, their Cosine-Based Similarity is measured by computing the cosine of the angle between those vectors. That is:

sim(f1, f2) = cos(f1, f2) = (f1 · f2) / (||f1|| ||f2||)   (2.1)
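As a minimal sketch, the Cosine-Based Similarity above can be computed directly from two feature vectors:

```python
import math

def cosine_similarity(f1, f2):
    """Cosine of the angle between two feature vectors:
    dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(f1, f2))
    norm1 = math.sqrt(sum(a * a for a in f1))
    norm2 = math.sqrt(sum(b * b for b in f2))
    return dot / (norm1 * norm2)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
```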


• TF-IDF

The TF-IDF (term frequency-inverse document frequency) measure was first introduced in information retrieval for finding relevant documents (Salton and McGill, 1983), and it has turned out to be a very useful and common similarity measure for RSs [56][62], where keywords are involved in the description of items (i.e. a vector of features containing the frequency of each keyword). Suppose there is a description di of an item i and that there is a finite collection of possible keywords K = {k1, ..., kn}. Then, for each possible keyword kj ∈ K, the TF-IDF of an item's description di is defined as:

TFIDF(i, j) = fi,j / max_{k∈K} fi,k   (2.2)

where fi,j is the frequency of keyword kj in the description di of item i and max_{k∈K} fi,k is the maximum frequency of any keyword in that description. Each combination (i, j) constitutes a feature descriptor for the particular item. To measure the distance (or similarity) between a pair of items with certain frequencies of keywords, we can use the cosine-based similarity. The authors of [56] suggest also including the weight factor log(N/cj) (where N is the total number of items, and cj is the number of items whose description contains keyword kj) in the computation of TF-IDF, in order to avoid using very common keywords for measuring the similarity.
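A minimal sketch of the TF-IDF descriptor of Eq. (2.2), including the log(N/cj) down-weighting of common keywords; the item descriptions are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vector(description, all_descriptions, vocabulary):
    """TF-IDF per Eq. (2.2) with the log(N/c_j) weight suggested in [56]:
    TF is the keyword count normalized by the most frequent keyword in the
    description; the IDF factor down-weights keywords common across items."""
    counts = Counter(description)
    max_freq = max(counts.values())
    N = len(all_descriptions)
    vec = []
    for kw in vocabulary:
        tf = counts[kw] / max_freq
        c_j = sum(1 for d in all_descriptions if kw in d)  # items containing kw
        idf = math.log(N / c_j) if c_j else 0.0
        vec.append(tf * idf)
    return vec

# Toy item descriptions as lists of keywords.
docs = [["loan", "fixed", "rate"], ["loan", "flexible"], ["savings", "rate"]]
vocab = sorted({kw for d in docs for kw in d})
print(tfidf_vector(docs[0], docs, vocab))
```

Keywords shared by every item get an IDF of log(1) = 0 and thus never contribute to a similarity comparison, which is exactly the intent of the weight factor.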

• Jaccard Similarity

The Jaccard Similarity is mostly used with items that can be described by a set of binary variables. For example, in a movie recommendation system, the properties of an item can be represented by a set of binary values indicating the genres to which a movie belongs (e.g. horror = 1, comedy = 0, adventure = 1, etc.). The Jaccard Similarity between two items (i1, i2) with corresponding binary vectors f1 and f2 is:

sim(f1, f2) = Σ_j min(f1j, f2j) / Σ_j max(f1j, f2j)   (2.3)

A content-based method assumes that the user prefers to receive items which are similar to each other. It is a method that can incorporate both explicit and implicit feedback. It can be quite accurate [54], but it does not scale well when the number of items grows considerably (the similarity computations increase). Another drawback of content-based methods is that when a new user enters the system, there is not a sufficient amount of information to provide the user with relevant recommendations. Finally, it is a method that is impossible to use for RSs where the items cannot be described by a set of features; it is therefore limited to a subset of RS applications.
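The Jaccard measure of Eq. (2.3) is straightforward to sketch; the genre vectors below are invented for illustration:

```python
def jaccard_similarity(f1, f2):
    """Eq. (2.3): sum of element-wise minima over sum of element-wise maxima.
    For binary vectors this is |intersection| / |union| of the active features."""
    num = sum(min(a, b) for a, b in zip(f1, f2))
    den = sum(max(a, b) for a, b in zip(f1, f2))
    return num / den if den else 0.0

# Genre indicator vectors: [horror, comedy, adventure]
movie_a = [1, 0, 1]
movie_b = [1, 1, 0]
print(jaccard_similarity(movie_a, movie_b))  # 1 shared genre / 3 distinct -> 0.333...
```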


Collaborative-Filtering Methods

In contrast to a content-based method, Collaborative Filtering (CF) is able to make recommendations using the entire user community and their behavioural data. The assumption in this case is that similar users have similar likings and thus their opinions towards an item should be similar. Suppose that a user uA with characteristics cA favours an item i. Then, with high probability, a user uB with similar characteristics cB will favour that item as well. There are two classes of algorithms for determining item/user similarities: the memory-based and the model-based [56][24].

The memory-based algorithms incorporate the use of heuristics and metrics such as the Cosine-Based Similarity or a Correlation-based Similarity. They aggregate the similarity over all the available users and the items to which those users have responded. For example, suppose that we want to evaluate the interest of a user uk ∈ U (where U is the set of all users) with characteristics ck towards an item i. Also, suppose that Ui ⊆ U is the set of users who have indicated their negative or positive interest towards item i. Then, the interest of the user uk can be estimated as:

interest(uk, i) = Σ_{uj∈Ui} sim(uk, uj) × interest(uj, i)   (2.4)

where sim(uk, uj) is the user similarity function acting as a weight (i.e. the opinion of similar users is more important) and interest(uj, i) is the interest of user uj towards item i, specified by a value scaled to the range [0, 1]. The similarity of a pair of users can be computed based on their personality features (age, gender, etc.) and the items to which both of them have responded. Suppose that IA is the set of items that user uA has rated and IB the set of items that user uB has rated, respectively. Then, their similarity is measured across the vectors containing their personality features and their responses to the items of IA ∩ IB. Similarly to the case of the TF-IDF measure, the authors of [24] argue that items of great popularity among the users should have no impact on the similarity measure. Therefore, they introduce the term log(n/ni) in the ratings (or interests) similarity function:

RS(IA, IB) = Σ_{i∈IA∩IB} log(n/ni) (1 − |rating(uA, i) − rating(uB, i)|)   (2.5)

where n is the total number of users and ni is the number of users that have responded to item i. The total similarity between two users can then be measured by:

sim(uA, uB) = RS(IA, IB) + CS(cA, cB)   (2.6)

where CS(cA, cB) is the characteristics similarity between the two users (computed by a similarity measure over their feature vectors, such as the cosine-based similarity).
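The memory-based estimate of Eqs. (2.4)-(2.6) can be sketched on toy data as follows; the ratings and the precomputed characteristics similarities CS are hypothetical, not real customer data:

```python
import math

# Toy ratings scaled to [0, 1]; CS(uA, uB) is assumed precomputed elsewhere.
ratings = {
    "alice": {"mortgage": 1.0, "savings": 0.8},
    "bob":   {"mortgage": 0.9, "savings": 0.7, "insurance": 1.0},
    "carol": {"mortgage": 0.1, "insurance": 0.2},
}

def rating_similarity(ua, ub, all_ratings):
    """RS(IA, IB): sum over shared items of log(n/ni) * (1 - |rA - rB|), Eq. (2.5).
    log(n/ni) is 0 for items everyone rated, so universally popular items drop out."""
    n = len(all_ratings)
    shared = set(all_ratings[ua]) & set(all_ratings[ub])
    total = 0.0
    for i in shared:
        ni = sum(1 for u in all_ratings if i in all_ratings[u])
        total += math.log(n / ni) * (1 - abs(all_ratings[ua][i] - all_ratings[ub][i]))
    return total

def interest(uk, item, all_ratings, cs):
    """Eq. (2.4) with sim = RS + CS per Eq. (2.6): sum the ratings of all users
    who responded to the item, weighted by their similarity to uk."""
    score = 0.0
    for uj, r in all_ratings.items():
        if uj != uk and item in r:
            sim = rating_similarity(uk, uj, all_ratings) + cs.get((uk, uj), 0.0)
            score += sim * r[item]
    return score

cs = {("alice", "bob"): 0.5, ("alice", "carol"): 0.1}
print(interest("alice", "insurance", ratings, cs))
```

Note that "mortgage" is rated by all three users, so its log(n/ni) weight is log(1) = 0 and it contributes nothing to the similarity, exactly as intended.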


Collaborative filtering can be either user-based or item-based. In the user-based case, the RS recommends items which are preferred by users with similar feature values. Such features can represent characteristics of the user (demographics, user-related information, etc.) and ratings of items. On the other hand, an item-based RS recommends items by computing their similarities and by detecting possible patterns occurring in the behaviour of the users (i.e. users who have preferred an item ij seem to also prefer item ik). The interest metric that was defined above falls into the user-based category, as it uses the similarity across the users' characteristics and preferences to determine which items to recommend. The following similarity measure is based on conditional probabilities [54] and can be used for item-based RSs.

• Conditional Probability-Based Similarity

This measure collects the positive responses as evidence for proposing new items to the user. Suppose a user u has responded positively to an item ik and we need to estimate the interest of that user in an item ij given the response to ik. The author of [54] suggests finding the percentage of users who have given a positive response to both items. Thus, the similarity between them can be defined by the total amount of evidence supporting that users who are interested in item ik are also interested in ij. That can be expressed by the following conditional probability:

P(ij | ik) = Interested(ij, ik) / Interested(ik)   (2.7)

where Interested(i) is the number of users that have shown interest in item i, and Interested(ij, ik) the number that have shown interest in both. Although that measure is quite effective, it comes with two potential pitfalls: first, it is asymmetric (i.e. P(i2|i1) ≠ P(i1|i2)), and secondly, it tends to be biased towards items of high popularity (i.e. as Interested(i2) grows). The authors have proposed adding an extra term to address the second problem:

sim(i2, i1) = Interested(i1, i2) / (Interested(i1) × Interested(i2)^α)   (2.8)

where α is a free parameter that can range from 0 to 1.
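Eqs. (2.7) and (2.8) can be sketched on a toy click history; the users and items below are invented for illustration:

```python
# Each set holds the items a user responded to positively.
positive = {
    "u1": {"credit_card", "insurance"},
    "u2": {"credit_card", "insurance"},
    "u3": {"credit_card"},
    "u4": {"insurance"},
    "u5": {"insurance"},
}

def interested(*items):
    """Number of users who showed interest in all the given items."""
    return sum(1 for liked in positive.values() if all(i in liked for i in items))

def cond_prob(ij, ik):
    """Eq. (2.7): P(ij | ik) = Interested(ij, ik) / Interested(ik)."""
    return interested(ij, ik) / interested(ik)

def damped_sim(i2, i1, alpha=0.5):
    """Eq. (2.8): dividing by Interested(i2)**alpha damps the popularity bias."""
    return interested(i1, i2) / (interested(i1) * interested(i2) ** alpha)

print(cond_prob("insurance", "credit_card"))  # 2 of 3 credit_card users
print(cond_prob("credit_card", "insurance"))  # 2 of 4 insurance users
```

The two printed values differ, illustrating the asymmetry P(i2|i1) ≠ P(i1|i2) mentioned above.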

The authors of [24] have proposed some additional modifications to the above memory-based similarity measures. They claim that their performance can be quite poor in cases where the intersection IA ∩ IB contains a very small number of items. Therefore, they introduce an extension called Default Voting, which assumes a default value of interest (or rating) for each item i that has not yet been rated by either user uA or user uB. The default value should indicate either a neutral or negative preference towards the unrated item. Moreover, the authors introduce a transformation of the similarity values called Case Amplification, which penalizes low values of similarity and favours only the highest (i.e. those which are closest to 1). That transformation is used to reduce the noise of the data. Some additional measures that should be mentioned are Similarity Fusion, which is a probabilistic fusion framework [63], and the use of Weighted-Majority as a memory-based algorithm [64].

So far we have discussed the memory-based algorithms for CF, which estimate the ratings of a user based on similarity metrics and heuristics. The model-based approach, on the other hand, uses the available data to construct machine learning models that predict a user's unobserved rating. The models try to recognize possible patterns that may lie underneath the behaviour of the users towards a collection of items. Such patterns usually relate to the responses of the user, their characteristics (i.e. certain types of users prefer item i1) and interests (i.e. users who are interested in i1 seem to also like item i2). There are many available model-based techniques. The most notable are the Cluster models, Bayesian models, Latent-factor models and Matrix Factorization models [24][56][75][76].

The Cluster models, also known as Neighborhood models, select the nearest neighborhood that an item/user belongs to. The selected neighborhood then acts as the ratings predictor for a certain pair of a user u and an item i. The neighborhoods (or clusters) are formed by using a clustering algorithm such as the well-known k-means algorithm (James MacQueen, 1967), which groups the items/users into k different clusters given their characteristics and their distance to the center of each cluster. Suppose that a user u belongs to a cluster C of users {u1, u2, ..., uk}. Then, the rating of a particular item i can be estimated based on the average ratings of those users for that item (or for items similar to it), as illustrated for example in the work of Herlocker et al. [77] and Ungar et al. [80]. Many metrics can be used to define the distance of a user/item to a cluster, either a memory-based algorithm like the Cosine-based Similarity or any other multi-dimensional distance measure. The authors of [77] suggest using the Pearson correlation or Spearman's rank correlation for this. In addition, the authors of [78] suggest applying Principal Component Analysis (PCA) prior to the clustering technique, in order to reduce the dimensions of the ratings matrix, which can help to avoid sparsity problems. For item-clustering, suppose that each item i can be explained by a set of features fi (or properties); then each cluster C contains a collection of items which are similar to each other. Thus, in that case, a rating of an item i can be estimated based on past ratings of the user u for items similar to i. There are some general issues concerning both approaches. One is that most of the similarity measures are inconsistent, as they do not take into account the existence of other neighbours. Secondly, there is the possibility that there are no close neighbours, so that the rating is predicted based on very distant 'neighbours', which can lead to bad estimations. For that reason, the author of [76] introduces the use of interpolation weights for item clustering (which can also be applied to user clustering), which helps to resolve those two issues.
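A minimal sketch of the neighborhood idea, assuming the clusters have already been formed (e.g. by k-means on user characteristics); the users, clusters and ratings below are invented:

```python
# Precomputed user clusters, standing in for the output of a clustering step.
clusters = {
    "young_savers": ["alice", "bob"],
    "homeowners": ["carol", "dave"],
}
user_cluster = {u: c for c, members in clusters.items() for u in members}

ratings = {
    "alice": {"savings": 0.9},
    "bob": {"savings": 0.8, "mortgage": 0.2},
    "carol": {"mortgage": 0.9},
    "dave": {"mortgage": 0.7, "savings": 0.1},
}

def predict_rating(user, item):
    """Estimate a missing rating as the average observed rating for `item`
    among the user's cluster neighbours (None if no neighbour rated it)."""
    members = clusters[user_cluster[user]]
    observed = [ratings[m][item] for m in members if m != user and item in ratings[m]]
    return sum(observed) / len(observed) if observed else None

print(predict_rating("alice", "mortgage"))  # only bob's 0.2 is available
print(predict_rating("carol", "savings"))   # only dave's 0.1 is available
```

The second issue raised in the text is visible here: when a cluster has no member who rated the item, the prediction falls back to nothing (or, in a larger system, to distant neighbours), which is what interpolation weights are meant to address.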


An alternative to Clustering models are Bayesian models. A Bayesian model uses the past ratings E = {R_j, R_{j+1}, ..., R_N} as evidence to estimate the probability of a certain item i getting a rating r, that is P(R_i = r | E). Such a model is the Bayesian Network Model introduced by Heckerman et al. (1997) [81], where each item is represented by a node in a network. The model assigns a state to each node which corresponds to the rating of the user. Then, the model searches for possible dependencies between the items (nodes) and creates a tree-structured network that outputs a probability for an item's rating based on its dependencies and the states of the rest of the network. That can be expressed as P(R_i = r | E) = P(R_i = r | x_0, x_1, ..., x_N, β, θ), where x_0, x_1, ..., x_N are the variables describing the states of each node, β represents the structure of the network and θ includes the parameters of the local probability distributions (i.e. the probability of each state per node). Another common Bayesian approach is the Naive Bayes [75], where each item with properties X_0, X_1, ..., X_N is assigned a probability, given the distribution of each class (rating) with respect to the properties. The Naive Bayes selects the class (rating) with the highest probability. Suppose that a rating ranges in [0, k] and we need to estimate the rating of an item i with a set of properties X = {X_0 = x_0, X_1 = x_1, ..., X_N = x_N}.

Then,

R_i = rating(i, X) = \arg\max_{r \in [0,k]} P(R = r) \prod_{j=0}^{N} P(X_j = x_j \mid R = r) \quad (2.9)

Thus, the Naive Bayes considers two factors: the probability of the user assigning a rating r and the joint probability of an item i with properties X having that rating. These probabilities are estimated from the past ratings of the user.
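As an illustration of eq. (2.9), the following sketch (the helper name and data layout are illustrative, not from the thesis) predicts a rating from a user's history of (properties, rating) pairs, working in log-space with add-one smoothing to avoid zero probabilities:

```python
import math
from collections import defaultdict

def naive_bayes_rating(history, item_props, ratings=(0, 1)):
    """Predict a rating for an item described by `item_props`,
    given a user's `history`: a list of (properties, rating) pairs."""
    prior = defaultdict(int)  # how often this user assigns each rating
    for props, r in history:
        prior[r] += 1
    n = len(history)

    best_r, best_score = None, -math.inf
    for r in ratings:
        # Log-space avoids underflow; add-one (Laplace) smoothing avoids log(0).
        score = math.log((prior[r] + 1) / (n + len(ratings)))
        for p in item_props:
            count = sum(1 for props, rr in history if rr == r and p in props)
            score += math.log((count + 1) / (prior[r] + 2))
        if score > best_score:
            best_r, best_score = r, score
    return best_r
```

The smoothing terms are a standard practical addition; the thesis's formula uses the raw empirical frequencies.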

A Latent Model assumes the existence of some hidden factors z, known as latent class variables, that are discovered by the model and are used to characterize both the users and the items of a system, as well as the interactions between them. A latent model for CF has been introduced by Hofmann and Puzicha [82], where each latent class variable z is associated with a pair (x, y), where x is a user and y is a binary indicator of an event, such as a user being interested in an item i. The authors use the EM (expectation maximization) algorithm to fit a probabilistic model that makes predictions by the following formula:

P(x, y) = \sum_{z \in Z} P(z) P(x \mid z) P(y \mid z) \quad (2.10)

The EM algorithm re-calculates the terms P(x|z) and P(y|z) to maximize the likelihood, given the latest parameters which are updated during the algorithm's execution. The authors conclude that their latent model performs significantly better than the corresponding clustering model. However, the work of Koren (2008) [83] demonstrates that the combination of both latent models and clustering models


can achieve an even better performance. Another method that uses latent models is the Latent Dirichlet Allocation (LDA) [84], where the interest probability of a user towards an item is estimated by a mixture over an underlying set of other users [84]. LDA works as an extension of Hofmann's latent model and is capable of categorizing a text/document based on the underlying probabilities of topics across different documents containing different word distributions. The authors extend their technique so that it can be used for CF: it works similarly, by replacing the documents with users and the collection of words with a collection of items. In addition to that method, there are others based on Matrix Factorization techniques, which have recently proved to be quite useful for CF [51][79]. The main idea behind them is to create a matrix that captures the interactions between the users and the items:

ru,i = qiTpu (2.11)

where q_i is a vector that indicates which of the factors item i possesses, while p_u is the corresponding vector for the factors of the user u. The interactions are captured by the dot product of those two vectors.

Figure 2.1: Visualization of the matrix factorization.

The number of latent factors needs to be determined prior to creating the model. The factors are learned by fitting a regularized model that minimizes the squared error over the known ratings K:

\min_{q^*, p^*} \sum_{(u,i) \in K} (r_{ui} - q_i^T p_u)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2) \quad (2.12)

where λ is the regularization parameter that controls the intensity of the regularization, which reduces over-fitting and increases the generalization of the model. In such models, filling missing ratings with techniques like default voting does not work well, because it increases the density of the resulting matrix, which can make the training process quite expensive. The above equation can be solved either by stochastic gradient descent (SGD) or by the alternating least squares (ALS) algorithm. The authors of [51] note that SGD is much faster and easier to implement.


On the other hand, ALS can work in a parallel setting, since each factor can be computed independently of the others. Thus, it can become much faster if a multi-core/multi-processor system is available. Finally, it is a method that better addresses the implicit feedback of the users. For more details regarding the implementation of ALS, please refer to [85]. In general, ALS handles personal preferences very well (since those are stored per user), which may change over time, and is also able to extract information regarding possible patterns (i.e. users who like i1 usually also like i2).
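A minimal single-machine sketch of ALS for eq. (2.12) is shown below, in plain NumPy (this is not the Spark implementation used in this thesis; the function and variable names are illustrative). Each outer iteration fixes one factor matrix and solves a small regularized least-squares problem per user, then per item:

```python
import numpy as np

def als(R, mask, k=3, lam=0.1, iters=10, seed=0):
    """Minimal ALS for eq. (2.12): R is the rating matrix,
    mask marks which entries are observed (the set K)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors p_u
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors q_i
    I = lam * np.eye(k)
    for _ in range(iters):
        # Fix Q, solve a regularized least-squares problem per user.
        for u in range(n_users):
            idx = mask[u].nonzero()[0]
            A = Q[idx].T @ Q[idx] + I
            P[u] = np.linalg.solve(A, Q[idx].T @ R[u, idx])
        # Fix P, solve per item (this loop is embarrassingly parallel).
        for i in range(n_items):
            idx = mask[:, i].nonzero()[0]
            A = P[idx].T @ P[idx] + I
            Q[i] = np.linalg.solve(A, P[idx].T @ R[idx, i])
    return P, Q
```

The per-item (and per-user) solves are independent, which is exactly the property that lets ALS be distributed over a cluster.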

However, ALS also comes with some issues. The most common is related to the sparsity of the matrices, which usually grows as the number of users/items does. The sparsity reduces the capability of the system to find users with similar interests. In addition, ALS can also be exposed to Shilling attacks, which essentially means that a user's biased opinion towards a certain collection of items (e.g. a brand of products) can have a bad influence on the performance of the RS.

To avoid the above issues, some authors build models which treat the rating prediction problem as a regression or classification problem. Those are usually implemented by the use of Neural Networks, Logistic Regression or any other supervised learning technique [75].

Hybrid Methods

A Hybrid Method is a collection of multiple RS techniques which are combined together to overcome the weaknesses that each of them has at an individual level. The techniques are either combined separately (i.e by merging their predictions) or are integrated together by incorporating some of their characteristics (e.g by inserting features of the Content-based method into the CF).

A Hybrid Method is used to overcome certain limitations of each recommendation technique. For example, in User-based CF, it is difficult to estimate the likeness of new items that appear in the system (known as cold-start problem). However, a Content-based method can easily handle them, since it can use past data of correlated items to make recommendations. Another problem of CF is what is called the gray sheep problem [73]. The gray sheep problem refers to a small percentage of users who can barely be distinguished into a single segment or group, because their likings are very diverse. The authors of [73] suggest to use a mix of Content-based techniques with CF to overcome that problem. Specifically, the authors of [37] apply that approach to create a RS that provides recommendation services (e.g movie reviews, restaurant guides). In addition, Yehuda Koren introduces another hybrid method that incorporates the use of a Latent-factor Model together with a neighbourhood model [36]. That hybrid method is also able to overcome the cold-start problem.


Any possible hybrid method can be distinguished into one of the following categories [74]:

• Cascade

It works as a filtering technique where the recommendations of a RS are filtered through another RS.

• Feature augmentation

Feature augmentation is similar to Cascade: it uses the output of one RS as a feature in another RS.

• Feature combination

Features from multiple RSs are combined to create a single RS.

• Switching

Depending on the possible state of a prediction problem (e.g gray-sheep, new-user problem) the hybrid system switches to the technique which is the most suitable for that state.

• Weighted

It uses a linear combination of outputs of different RSs (multiplied by a weight) to define the likeness of each item.

Knowledge-based Methods

A knowledge-based method provides recommendations as a response to a query defined by a user's set of requirements, which operate as knowledge for the RS [65]. For example, an application that recommends restaurants to the user given their location (Bhargava, Sridhar & Herrick, 1999), or recommending products and services that can explicitly solve a user's reported problem [66]. The authors of [67] suggest a hybrid RS integrating a knowledge-based method with collaborative filtering, in order to obtain a more robust system that uses a two-phase filter. However, the authors mention that combining those two methods proved to be a quite difficult task.

Community-based Methods

A community-based method takes advantage of a user's social network to estimate item ratings. The authors of [14] mention the famous phrase "Tell me who your friends are, and I will tell you who you are" to describe the main idea behind it. A lot of work has been done in this field recently, since the latest rise of social networks. For example, the authors of [68] have proposed two different regularization terms within the context of social networks, which they use as constraints for the objective function


of matrix factorization (used for predicting ratings). Additionally, the authors of [69] suggest a method that incorporates information from social networks to create a probabilistic model that gives higher weight to immediate connections, but also to users with similar preferences. The authors of [72] have concluded that users prefer recommendations from people they know, and that the main aim of using social networks is to connect with like-minded people. Therefore, the use of data coming from social networks can have a huge impact on the effectiveness of RSs. That can also be seen in the work of [68][70][71], where the authors combine CF with information from social networks, which proves quite useful for increasing the system's performance.

2.1.2 Evaluation

The evaluation and measurement of a RS's performance is a quite challenging and important step for determining which techniques to use and for understanding the strengths and weaknesses of each of them. The possible evaluation metrics for RSs are extensively discussed in the literature [98][99][101], due to the difficulty of distinguishing which metrics are more suitable and representative for estimating the performance (in an offline setting) that is closest to the true performance of the system. That is, an uplift in the model's offline performance should indicate a system that performs better in an online setting. The chosen evaluation metric should always be defined according to the properties and applications of the RS: the size of the data, its sparsity and the rating scale [99]. A poorly chosen metric can lead to misleading results and interpretations, which in turn can lead to a badly designed RS. The evaluation metrics can be categorized into two classes: Statistical accuracy metrics and Decision support accuracy metrics [55].

A Statistical accuracy metric compares the predicted scores or probabilities to the recorded ratings or binary values of implicit/explicit feedback. The overall accuracy depends on the total error induced by the set of the predictions. The most common metrics in this category are the following:

• Mean Absolute Error (MAE)

The Mean Absolute Error (MAE) is one of the most common metrics for RS evaluation [75]. It measures the average difference between the predicted ratings/responses p_ui and their true values indicated by the users. Suppose that we have a collection of predictions p_ij ∈ P (where |P| = N) and the corresponding collection of the users' recorded ratings/responses r_ij. Then, the MAE is measured by summing up the equally weighted errors between the predictions and the actual recorded values:

MAE = \frac{1}{N} \sum_{(i,j) \in P} |p_{ij} - r_{ij}| \quad (2.13)


• Root Mean Squared Error (RMSE)

The Root Mean Squared Error (RMSE) is the square root of the Mean Squared Error (MSE) and can easily be interpreted, as it uses the same measurement unit as the predictions. The MSE is computed similarly to the MAE, but each difference is squared instead. That is, the MSE penalizes larger differences (i.e. deviations/errors) more and insignificant ones less. The RMSE is defined as:

RMSE = \sqrt{MSE} = \sqrt{\frac{\sum_{(i,j) \in P} (p_{ij} - r_{ij})^2}{N}} \quad (2.14)

The RMSE is typically used for evaluating RSs that use explicit feedback (e.g ratings for products, movies etc.). In that case, the measure estimates the average deviation of the predicted ratings to the true ratings.

• Logistic Loss (LL)

The Logistic Loss (also known as Log-Loss or Cross-Entropy) evaluates the goodness of the probability estimates made by a binary classifier. The LL penalizes wrong predictions made with high confidence (e.g. predicting with probability 0.9 that a positive sample is negative). The LL is defined as:

LL = -\frac{1}{N} \sum_{(i,j) \in P} \left( r_{ij} \log(p_{ij}) + (1 - r_{ij}) \log(1 - p_{ij}) \right) \quad (2.15)

To avoid evaluating log(0), we clip each probability to the interval [ε, 1 − ε], where ε is slightly higher than absolute zero (e.g. 10^{-8}). Thus, the input to the logarithmic functions is always filtered, so that no value is exactly zero or one. In general, the LL is quite common for evaluating a binary classifier and thus also for evaluating the predictions of implicit feedback. In addition, the LL is a useful measure for calibrating the probabilities of a binary classifier [100].
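The three statistical accuracy metrics above (MAE, RMSE and Log-Loss) can be sketched directly in a few lines; the clipping constant `eps` follows the text, and the function names are illustrative:

```python
import math

def mae(preds, actual):
    # MAE: mean absolute deviation over the prediction set P.
    return sum(abs(p - r) for p, r in zip(preds, actual)) / len(preds)

def rmse(preds, actual):
    # RMSE: square root of the mean squared error.
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(preds, actual)) / len(preds))

def log_loss(preds, actual, eps=1e-8):
    # Log-Loss: probabilities are clipped to [eps, 1 - eps]
    # so that log(0) is never evaluated.
    total = 0.0
    for p, r in zip(preds, actual):
        p = min(max(p, eps), 1 - eps)
        total += r * math.log(p) + (1 - r) * math.log(1 - p)
    return -total / len(preds)
```

Note how a confidently wrong probability dominates the log-loss, which is exactly the behaviour described above.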

On the other hand, a Decision support accuracy metric computes the overall perfor-mance and quality of recommendations by measuring how much those deviate from the user’s preferences. Some metrics also consider the ranking of each item to evaluate the performance. The most common metrics in this case are the following:

• Recall

The recall metric was first introduced in Information Retrieval [103] for measuring the relevance of the documents that are retrieved for a specific task. It is used in RSs to measure the fraction of relevant items that are selected out of all the


items which are considered to be relevant for the user. This measure is typically used in RSs where the user can view many recommendations in an ordered manner. Suppose that the RS contains a total of N items that are considered to be relevant, then the recall is computed as:

recall = \frac{tp}{tp + fn} = \frac{|\text{relevant recommended items}|}{|\text{relevant items}|} \quad (2.16)

where tp is the number of true positives, fn is the number of false negatives and essentially N = tp + fn. A higher recall value indicates a RS that has a higher probability of selecting a relevant item. The task of finding the set of all relevant items is quite difficult, especially when it comes to defining what relevance is; thus the recall is typically approximated by other methods [98]. The recall is computed per user, and its final value is defined by averaging all the individual recalls.

• Precision

In addition to the recall metric, there is also the precision metric, which is also adopted from Information Retrieval. The precision measures the fraction of relevant items out of all the recommended items. Thus, suppose that the RS recommends a total of N items at a time; then the precision is computed by:

precision = \frac{tp}{tp + fp} = \frac{|\text{relevant recommended items}|}{|\text{recommended items}|} \quad (2.17)

where tp is the number of true positives, fp is the number of false positives and essentially N = tp + fp. A higher precision value indicates a RS that has a higher probability of selecting items which are relevant. Similarly to the recall metric, we need to take the average of all the individual precision values (i.e. precision per user).

• F1 Score

The F1 Score is a quite common performance measure for RSs. It is a combination of the two information retrieval metrics, precision and recall, and can be more balanced due to the use of both metrics with equal weight. Specifically:

F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \quad (2.18)

The term balanced refers to the fact that the two metrics are complementary to each other. For example, if N is increased, then the total recall increases too, while the precision decreases. Again, in order to calculate the final F1 score, we need to compute the average F1 Score over all users. The F1 score is a quite useful metric for training purposes when the data contains skewed classes [101].


• Area under ROC (AUROC)

The Area under ROC (also known as AUC) is a metric oriented towards the evaluation of a binary classifier. It computes the confusion matrix of the predictions (i.e. the number of True/False Positives/Negatives) for several threshold values. The measurement at each threshold is represented by a point in the two-dimensional space, with the x-axis representing the False Positive Rate (FPR) and the y-axis representing the True Positive Rate (TPR). The AUC is typically measured by calculating the actual area under the curve that is formed by the threshold points. That area can also be estimated by the following formula:

AUC = \frac{\sum_{i=1}^{N_1} r_i - N_1(N_1 + 1)/2}{N_0 N_1} \quad (2.19)

where N_0 is the number of negative samples, N_1 is the number of positive samples and the term \sum_{i=1}^{N_1} r_i is the sum of the ranks of the positive samples (ordered by predicted score among all samples). Essentially

the AUC uses the ranking of the positive samples to determine the prediction performance. The estimator with the best AUC is considered to be the one that ranks all of the positive samples higher than the negatives. The AUC is a very common measure for binary classification and is an alternative to the referenced information retrieval metrics. Furthermore, the AUC is a well-established metric grounded in solid statistical theory [101]. For more details and a better understanding of AUROC, please refer to [118].

Figure 2.2: Example of a ROC curve.
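The decision support metrics above can be sketched as follows: per-user precision/recall/F1 as in eqs. (2.16)-(2.18), and the AUC via the rank formula of eq. (2.19). Ties in the predicted scores are not handled in this sketch, and the function names are illustrative:

```python
def prf1(recommended, relevant):
    """Precision, recall and F1 for a single user;
    the system-level value is the average over all users."""
    tp = len(set(recommended) & set(relevant))
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

def auc(scores, labels):
    """AUC via the rank formula: sum the ranks of the positive
    samples when all samples are ordered by predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank + 1 for rank, i in enumerate(order) if labels[i] == 1)
    n1 = sum(labels)
    n0 = len(labels) - n1
    return (rank_sum - n1 * (n1 + 1) / 2) / (n0 * n1)
```

The AUC of 1.0 is reached exactly when every positive sample is scored above every negative one, as described in the text.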

In addition to the above measures, there are also ranking measures. A ranking measure evaluates the relevance of the ranking of a list of items that is presented to the user. That list is usually long and can be extended as long as the user requests it.


A common ranking measure for evaluating the relevance of those lists is the Normalized Discounted Cumulative Gain (NDCG) [104].

• Normalized Discounted Cumulative Gain (NDCG)

The NDCG ranges in the interval [0, 1], with the value of 1 representing a perfect ordering. Two main values need to be computed: the current score of the ranked list (DCG) and the maximum possible score that can be achieved (DCG*). Suppose that the ranked list consists of N items; then the DCG is defined as:

DCG_N = \sum_{i=1}^{N} \frac{2^{r_i} - 1}{\log_2(i + 1)} \quad (2.20)

where r_i represents the relevance of the item appearing at position i. The relevance

can either be a rating given by the user or a binary value indicating the interest of that user in the item. The maximum value DCG* is computed by placing the most relevant items at the top of the list and then recomputing the DCG. The NDCG is defined as:

NDCG_N = \frac{DCG_N}{DCG^*_N} \quad (2.21)

The NDCG is more common for applications in which the user searches for a collection of relevant items.
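A short sketch of eqs. (2.20) and (2.21) (illustrative names, graded relevance scores):

```python
import math

def dcg(relevances):
    # Eq. (2.20): gain 2^r - 1 at position i, discounted by log2(i + 1);
    # positions are 1-based, hence the i + 2 for a 0-based enumerate.
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(relevances))

def ndcg(relevances):
    # Eq. (2.21): DCG of the list divided by the DCG of the ideal ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A list sorted by decreasing relevance scores exactly 1.0; any other ordering scores strictly less.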

In contrast to offline evaluation, there is also online evaluation. Online evaluation refers to using the recommendations of the system in real cases and then directly evaluating them. The experiment consists of several A/B tests: each group receives recommendations through a different model, and each model is evaluated according to the feedback of the users. There are two common measures for online evaluation: the Click-Through-Rate (CTR) and the Life-Time-Value (LTV). The CTR measures the fraction of viewed items which seemed relevant to the users. In this case the relevance is represented by the user's action of clicking on a certain item. The CTR is defined as:

CTR = \frac{\#\text{Items Clicked}}{\#\text{Items Viewed}} \quad (2.22)

The LTV is similar to the CTR, but more suitable for a long-term evaluation of the RS, because it does not count multiple clicks of the same user on a certain item. It is defined as:

LTV = \frac{\#\text{Items Clicked}}{\#\text{Viewers}} \quad (2.23)

The reason for using online evaluation is that a model with a significant gain on offline metrics may not perform as well in an online environment [101]. We should note that the previous measures are also applicable to online evaluation, but are more difficult to compute (since we need to store the predictions of the RS). For more details regarding the pros and cons of each measure please refer to [98][99][101].


2.1.3 Applications on Personalized Advertising

The recent rise of RSs has evolved the field of personalized advertising. Advertisements (ads) or NBAs are now considered items that are offered to the user according to their relevance and clicking probability. For example, the authors of [19] suggest the use of an algorithm based on a Bayesian approach, with the aim of offering ads on the search web-page of Bing. The algorithm predicts the click-through-rate (CTR) of an ad (as a probability), given a set of features. Each feature can be related to the properties of the ad itself (i.e. description, group etc.), the keywords of the search or the user's context of information (history, location, time etc.). The authors use the Logistic Loss and AUC (Area under the Curve) as their evaluation metrics.

In addition to that, the authors of [106] suggest another method of personalized advertising on a search engine. They use the properties of the ads (terms, length, advertisers) to create a Logistic Regression model that predicts the CTR of a new ad.

The authors of [20] present an application of personalized advertising on the platform of Facebook. The authors use a hybrid model that combines the machine learning techniques of Decision Trees and Logistic Regression. In this case, the offered advertisement is not selected according to a search query, but based only on the demographic information of the user. The authors show that the hybrid model outperforms the individual models it consists of.

Similarly to the previous work, the authors of [3] present a scalable solution for predicting the user's response to a displayed ad by the use of a machine learning framework based on Logistic Regression. Logistic Regression is a very common modeling technique for predicting clicking probabilities, due to the model's simplicity and computational performance [105].

A RS for contextualized mobile advertising is presented in [21]. The authors use a two-level Neural Network model and study its performance under several parameters. The authors' aim is to create a RS that selects an advertisement related to the user's profile and geo-location. Each advertisement is represented by a binary vector that indicates the categories it belongs to. The authors train their model given those vectors and the contextual information and feedback of the users.

The authors of [105] conduct several experiments related to the practice of personalized advertising through modeling techniques. They present some tricks for reducing memory usage and provide several methods for computing the confidence interval of the estimated probabilities and possible ways to calibrate them.

For more related applications, search in the literature using the keywords of: display ad-vertising, machine learning, click prediction, click-through-rate (CTR), ad-recommendation, behavioural targeting.


2.2 Cold-Start Problem

2.2.1 Introduction

The cold-start problem is very common in RSs, and it raises issues also in other research fields in which the data to analyse or model contains extreme sparsity or very little information (i.e. low entropy). Cases of this problem can be seen in health care research, in which several patients receive a treatment A, but not a new alternative treatment B, due to the lack of evidence supporting its effectiveness. In that case, physicians may be able to obtain several responses to treatment A, but not as many for treatment B. Thus, it becomes very difficult to study and compare those two treatment groups, or, for example, to identify which genes are responsible for a rare disease [18], in which case only a small group of samples can be used for gene comparison. Under the scope of RSs, the cold-start problem essentially means that there is no available information regarding the preferences and responses of a user to a new item that has just been added to the RS. That is very common for systems of collaborative filtering. As a result, it is not possible to create a model that predicts a user's response to that item, since the training data is missing. The authors of [28] suggest a probabilistic method that avoids this by using the preference information of items which are similar or associated to the newly added item. It fits an aspect model that hypothesizes the existence of a hidden or latent cause z that motivates the user to give a positive or negative response to an item i. Notice that, in order to use this method, it is required that items can be described by a set of properties. Thus, this technique is essentially a content-based method.


Figure 2.3: Graphical representation of the aspect model. A latent variable z (type of content) causes the user u to respond positively/negatively to an item i with content c.

The method is created as an extension to Hofmann's folding-in algorithm [29], which is based on a two-step procedure known as the EM algorithm (Dempster et al., 1977) [38]. In the first step, called expectation, we find the expected class or category z that best fits the pair of item i with its content c (that is, argmax_z P(z|c, i)). In the second step, called maximization, we estimate the parameters, such as mean values (by likelihood maximization). Finally, the algorithm makes recommendations using the probability


estimations of P (u|i):

P(u \mid i) = \sum_{z} P(u \mid z) P(z \mid i) \quad (2.24)

where P(u|z) is the Bayesian probability of a user's preference over a given category of an item and P(z|i) is the probability of the item belonging to that category.
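Eq. (2.24) amounts to a matrix product of the two learned probability tables. A toy example with made-up (not learned) values:

```python
import numpy as np

# Rows of P_u_given_z: users; columns: latent classes z (toy values).
P_u_given_z = np.array([[0.7, 0.2],
                        [0.3, 0.8]])
# Rows of P_z_given_i: latent classes z; columns: items (toy values).
P_z_given_i = np.array([[0.9, 0.4],
                        [0.1, 0.6]])

# Eq. (2.24): P(u|i) = sum over z of P(u|z) * P(z|i).
P_u_given_i = P_u_given_z @ P_z_given_i
```

Each column of the result is still a probability distribution over users, since both input tables are column-normalized.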

In addition to the technique of aspect models, the authors also introduce another technique which adopts a Naive-Bayes approach, creating a model per user based only on the content of items (i.e. no collaborative information is used in this case). The following formula computes the likeness of a new item j (i.e. the rate at which users respond positively to it), given a collection of items I that the user has already indicated interest in.

P(l_j \mid I) = \frac{P(l_j)}{P(I)} \prod_{i=1}^{|I|} P(c_i \mid l_j) \quad (2.25)

where c_i is the content category to which item i belongs, I is the collection of items and l_j is the likeness of item j. According to the authors, both of the methods

seem to be quite effective for treating the cold-start problem. Similar methods have also been adopted by [30], [35], [76], while other authors use information filtering techniques [31] (particularly filterbots [33]), but also pair regression techniques that make predictions for new users/items [32].

Despite the above solutions, the cold-start problem remains an issue for a RS that does not use any item-based context information (i.e. items are not comparable to each other). In this case it becomes impossible to directly bridge a user's list of preferences to a collection of items that share similar characteristics. The only way to address this problem is to create a model per item i that predicts the interest of a user with characteristics c based on the indicated interest of other users in that item. However, in order to obtain such information, the item i must be shown to some users even though there is no indication regarding their potential interest in it. The RS has to decide either to take the risk and show an item that it has no evidence about (a cold item), or to show an item that is more probably interesting to the user. Thus, there is a trade-off between accuracy and obtaining new knowledge. That choice is known in the literature as the exploitation vs. exploration dilemma.

2.2.2 Contextual Bandits

The multi-armed bandit is a classic problem of probability theory (introduced by Robbins, 1952) [42], in which a gambler tries to maximize his profit given a collection of K slot machines. The gambler defines a policy π that decides which of the arms is best to pull for drawing the next reward. That policy is determined by the gambler's knowledge of the distribution of rewards P(R_j) for each machine j, 1 ≤ j ≤ K, where every reward R_j ∈ [0, 1]. Once a new reward is received, the gambler reevaluates his knowledge and redefines the policy by which an arm is selected. The same technique can be applied to a RS with a set of possible actions A = {a_1, a_2, ..., a_K}. An example of those reward distributions is shown in Figure 2.4 below.

Figure 2.4: Estimated distribution of rewards for three different actions. Each distribution contains a different level of uncertainty and average level of reward. The action a_1 is in this case the most probable to give a high reward.

After multiple draws the uncertainty reduces, and thus it becomes feasible to get better estimates of which action a_i to choose. That can be derived from the fact that the standard error of the estimated mean (SEM) is reduced as the number of samples increases:

SEM(s_x, n_x) = \frac{s_x}{\sqrt{n_x}} \;\Rightarrow\; \lim_{n_x \to \infty} SEM(s_x, n_x) = 0

where s_x is the sample standard deviation of the random variable x and n_x is the number of samples (i.e. draws).

Figure 2.5: Estimated distribution of rewards for the three different actions after multiple draws of rewards (compared to Figure 2.4). Uncertainty has been reduced, resulting in narrower distributions.


To choose the optimal action in a multi-armed bandit problem, several algorithms have been proposed, known as contextual bandit algorithms. The term bandit refers to the fact that the available information is limited to the received feedback of a chosen action a_j. Each of those algorithms may use a different strategy for its decision policy. To formally define a contextual bandit algorithm in a RS setting, suppose that a system contains a collection of items I = {i_1, i_2, ..., i_K} that can be displayed to a user u given his context of characteristics c ∈ R^d. When an item i_j is displayed to that user, his characteristics are observed and then registered in association with the collected reward R(c, π_t) of the action a_j (i.e. displaying the item i_j), which was determined by the user's characteristics c and the algorithm's current policy π_t. Once the reward is collected, the algorithm computes a new policy π_{t+1} based on the pair (c, R(c, π_t)). To evaluate the effectiveness of an algorithm's policy π_t (where t refers to the number of applied updates to that policy, i.e. the number of received feedbacks), we use a performance measure called regret. The regret measures by how much the current policy π_t deviates from the ideal policy π*, that is:

Regret(N) = \sum_{t=1}^{N} R(c_t, \pi^*) - \sum_{t=1}^{N} R(c_t, \pi_t) \quad (2.26)

where N is the total number of received feedbacks, R(c_t, π*) is the reward received by applying the action determined by the policy π*, and R(c_t, π_t) is the corresponding reward for the action determined by the policy π_t. The term R(c_t, π*) refers to the maximum possible reward that can be received given the characteristics c_t of the user and the available actions to choose from. A contextual bandit algorithm has to at least guarantee that the cumulative regret after N updates is bounded by a linear factor, i.e. Regret(N) ≤ O(N). In the context of RSs, when an item is presented to a user, the system receives a reward of 1 if the user clicks on that item and a reward of 0 otherwise. Thus, minimizing the regret is equivalent to maximizing the CTR (click-through rate), while also reducing the uncertainty contained within the reward distribution P(R_j) of each item i_j. It is a common practice to use a policy π in

combination with a binary classifier [43][44] to determine that probability (CTR) for a user u with characteristics c. The most notable policies are UCB1 [45], ε-greedy (Sutton & Barto, 1998), LinUCB [43] and Exp3 [46]. Each of these policies guarantees a different worst-case cumulative regret. UCB1 (Upper Confidence Bound) is the most popular among them and comes with strong theoretical proofs regarding its regret bound guarantee of O(√(KN log N)), achieved by using the index-based policy of Agrawal (1995). The UCB1 algorithm is derived from the Chernoff-Hoeffding inequality, which states that for a collection of n random variables x_j ∈ [0, 1] with empirical mean X̄ and true mean µ, the probability that X̄ underestimates µ by a constant α is bounded by:

P(X̄ + α < µ) ≤ e^{−2nα²}    (2.27)
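As a quick numeric check (not part of the thesis), the right-hand side of Eq. (2.27) can be evaluated directly; the sample size n and deviation α below are purely illustrative:

```python
import math

def hoeffding_bound(n, alpha):
    """Right-hand side of Eq. (2.27): an upper bound on the probability that
    the empirical mean of n rewards in [0, 1] undershoots the true mean by alpha."""
    return math.exp(-2 * n * alpha ** 2)

# e.g. with n = 100 samples and alpha = 0.1 the bound equals e^-2
print(round(hoeffding_bound(100, 0.1), 3))  # -> 0.135
```

The bound decays exponentially in n, which is exactly why the UCB1 confidence term shrinks as an arm accumulates more samples.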

The UCB1 algorithm can be summarized by the following pseudo-code:

Algorithm 1 UCB1 (I)
1: Initialize: choose all of the possible actions at least once.
2: loop:
3:   Select the action (item) from I: a_j ← argmax_j ( E[R_{a_j}] + √(2 ln(N) / n_j) )
4:   Pick a new reward R'(a_j) for the selected action a_j.
5:   Update E[R_{a_j}] based on that reward.

Figure 2.6: Pseudo-code of UCB1. E[R_{a_j}] is the estimated mean of the reward towards the action a_j, N is the total number of selections made so far, and n_j is the number of times that the action a_j has been selected.

A simple explanation of the UCB1 algorithm is that we choose the action with the highest estimated reward (CTR) within a one-sided confidence interval, which is expressed by the second term. The beliefs of the system change as more rewards are collected.
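The pseudo-code of Figure 2.6 can be turned into a short, self-contained Python sketch. This is not part of the thesis implementation; the three Bernoulli arms and their hidden click probabilities are purely illustrative:

```python
import math
import random

def ucb1(n_arms, pull, rounds):
    """UCB1 as in Figure 2.6: play every arm once, then repeatedly pick the
    arm maximising its empirical mean plus an exploration bonus."""
    counts = [0] * n_arms   # n_j: how often arm j was selected
    means = [0.0] * n_arms  # E[R_{a_j}]: empirical mean reward of arm j
    for t in range(rounds):
        if t < n_arms:      # initialisation: choose every action once
            j = t
        else:
            total = sum(counts)  # N: total number of selections so far
            j = max(range(n_arms),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(total) / counts[a]))
        r = pull(j)              # observe reward R'(a_j) in [0, 1]
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]  # incremental mean update
    return means, counts

# usage: three simulated arms whose click probabilities are hypothetical CTRs
random.seed(0)
ctrs = [0.2, 0.5, 0.8]
means, counts = ucb1(3, lambda j: 1.0 if random.random() < ctrs[j] else 0.0, 2000)
```

After 2,000 simulated feedbacks the arm with the highest CTR should dominate the selection counts, which is the exploration/exploitation trade-off the bonus term controls.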

The ε-greedy policy is also fairly simple and is one of the first policies to introduce randomness into the decision of which item to select. Suppose that we have a collection of items I = {i_1, i_2, ..., i_K}; the one with the highest average reward E[R_{a_j}] is selected with probability 1 − ε, while with probability ε a random item is selected instead (with all items i_j having equal probability). With a fixed ε this policy achieves a worst-case regret of O(N).

Algorithm 2 ε-greedy (I, ε)
1: Initialize: choose all of the possible actions at least once.
2: loop:
3:   Select the current item i_j ∈ I with the highest average reward E[R_{a_j}].
4:   With probability 1 − ε set the selected item s ← i_j.
     Else, pick a random item and assign it as the selected item.
5:   Get the reward R(a_s) of the selected item and update its average E[R_{a_s}].

Figure 2.7: Pseudo-code of the ε-greedy policy.

The authors of [45] have suggested a mild modification to ε-greedy by adding an update procedure for the parameter ε. The update runs at each iterative step of the algorithm and is defined as:

ε_n = min{ 1, cK / (d²n) }

where n is the current iteration, K is the total number of items, and c, d are two constant parameters that define the rate at which ε decreases over time. As n grows, ε_n decays towards zero, so the policy explores less as the uncertainty about each action is reduced, which results in a worst-case regret of O(log N).
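A minimal Python sketch of ε_n-greedy with a decaying schedule of this form follows; the two Bernoulli arms and the constants c and d are illustrative assumptions, not values from the thesis:

```python
import random

def epsilon_n(n, K, c, d):
    """Decaying exploration rate eps_n = min(1, cK / (d^2 n)) from [45]."""
    return min(1.0, c * K / (d * d * n))

def eps_greedy(n_arms, pull, rounds, c, d):
    """eps_n-greedy: exploit the best empirical mean with probability
    1 - eps_n, explore uniformly at random with probability eps_n."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for n in range(1, rounds + 1):
        eps = epsilon_n(n, n_arms, c, d)
        if random.random() < eps:
            j = random.randrange(n_arms)                    # explore
        else:
            j = max(range(n_arms), key=lambda a: means[a])  # exploit
        r = pull(j)
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]
    return means, counts

# usage: with c = 2, d = 0.5 the early rounds have eps_n = 1 (pure exploration),
# which also covers the "choose every action at least once" initialisation
random.seed(1)
ctrs = [0.1, 0.6]
means, counts = eps_greedy(2, lambda j: 1.0 if random.random() < ctrs[j] else 0.0,
                           3000, c=2.0, d=0.5)
```

Because ε_n shrinks like 1/n, exploration tapers off once the empirical means are reliable, and the arm with the higher CTR ends up selected most of the time.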

LinUCB is a policy that works as an extension of UCB1. It fits a linear model to the reward samples R_{a_j} of each action a_j using feature vectors x_j that combine information about both the user and the action involved in the reward sample. That extra information carried by x_j can help relate certain kinds of users to certain kinds of actions. The worst-case regret of this policy is O(√(KdN)), where d is the dimension of the feature vectors. We should note once more that this policy assumes that each bandit (item) can be described by a set of features. It does not directly solve the problem of unlabeled items (i.e. items which cannot be described by a set of features), but it is still an interesting approach worth mentioning.

Algorithm 3 LinUCB (I, x)
1: Initialize: for every i ∈ I:
     A_i ← I_d (identity matrix)
     b_i ← 0_{d×1} (zero vector)
2: loop:
3:   For every i ∈ I:
4:     θ_i ← A_i⁻¹ b_i
5:     p_i ← θ_iᵀ x_i + √(x_iᵀ A_i⁻¹ x_i)
6:   Select the action (item): a_j ← argmax_j p_j
7:   Pick a new reward R'(a_j) for the selected action a_j.
8:   A_j ← A_j + x_j x_jᵀ
9:   b_j ← b_j + R'(a_j) x_j

Figure 2.8: Pseudo-code of the LinUCB algorithm. Note that x_i represents the feature vector combining information about the user and the action i.
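The per-round scoring and the rank-one updates of Figure 2.8 can be sketched in Python with NumPy. The two arms and their 3-dimensional feature vectors are hypothetical; the exploration scale α is an added generalization, and setting α = 1 recovers exactly the formula in the figure:

```python
import numpy as np

def linucb_select(A, b, xs, alpha=1.0):
    """One LinUCB scoring pass (lines 3-6 of Figure 2.8); with alpha = 1
    this is p_i = theta_i^T x_i + sqrt(x_i^T A_i^-1 x_i)."""
    scores = []
    for A_i, b_i, x_i in zip(A, b, xs):
        theta_i = np.linalg.solve(A_i, b_i)   # theta_i = A_i^-1 b_i
        p_i = theta_i @ x_i + alpha * np.sqrt(x_i @ np.linalg.solve(A_i, x_i))
        scores.append(p_i)
    return int(np.argmax(scores))             # a_j = argmax_j p_j

def linucb_update(A, b, j, x_j, reward):
    """Rank-one update for the chosen arm (lines 8-9 of Figure 2.8)."""
    A[j] += np.outer(x_j, x_j)
    b[j] += reward * x_j

# usage on two hypothetical arms with 3-dimensional user/action features
d, K = 3, 2
A = [np.eye(d) for _ in range(K)]    # A_i <- I_d
b = [np.zeros(d) for _ in range(K)]  # b_i <- 0
xs = [np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.5])]
j = linucb_select(A, b, xs)
linucb_update(A, b, j, xs[j], reward=1.0)
```

Solving A_i θ = b_i instead of explicitly inverting A_i is the standard numerically stable choice; in a production setting the Sherman-Morrison identity would keep the per-round cost down.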


Exp3 (which stands for "Exponential-weight algorithm for Exploration and Exploitation") [46] is a policy similar to LinUCB in the sense that both fit a model to predict the correct probabilities, and it is based on the exponential weight algorithm. That algorithm assigns a weight to each action's (item's) probability of reward. All of the weights are initially set to 1 and are then adjusted by an exponential factor whose magnitude depends on the error between the predicted outcomes and the actual results. Finally, Exp3 is proven to have a worst-case regret of O(√N), which is the lowest among the policies presented so far.

Algorithm 4 Exp3 (I, γ)
1: Initialize: t ← 0; for every i ∈ I: w_i(t) ← 1
2: loop:
3:   For every i ∈ I:
4:     p_i(t) = (1 − γ) · w_i(t) / Σ_{j=1}^{K} w_j(t) + γ/K
5:   Select an action (item) a_j randomly based on the current probabilities p_1(t), p_2(t), ..., p_K(t).
6:   Pick a new reward R'(a_j) for the selected action a_j.
7:   w_j(t + 1) = w_j(t) · e^{γ R'(a_j) / (p_j(t) K)}
8:   t = t + 1

Figure 2.9: Pseudo-code of the Exp3 algorithm. γ is a free parameter that the user can tune depending on the use case.
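Figure 2.9 translates almost line for line into Python. The two simulated Bernoulli arms and the choice γ = 0.1 are illustrative assumptions, not from the thesis:

```python
import math
import random

def exp3(K, pull, rounds, gamma=0.1):
    """Exp3 (Figure 2.9): exponential weights mixed with gamma-uniform
    exploration; the chosen arm's weight gets an importance-weighted boost."""
    w = [1.0] * K        # w_i(0) <- 1
    picks = [0] * K
    for _ in range(rounds):
        total = sum(w)
        p = [(1 - gamma) * w[j] / total + gamma / K for j in range(K)]
        j = random.choices(range(K), weights=p)[0]  # sample by p_1..p_K
        r = pull(j)                                 # reward R'(a_j) in [0, 1]
        w[j] *= math.exp(gamma * r / (p[j] * K))    # line 7 of Figure 2.9
        picks[j] += 1
    return w, picks

# usage: two hypothetical arms; the better arm's weight grows exponentially
random.seed(2)
ctrs = [0.2, 0.7]
w, picks = exp3(2, lambda j: 1.0 if random.random() < ctrs[j] else 0.0, 2000)
```

Dividing the reward by p_j(t) makes the weight update an unbiased estimate of each arm's reward, which is what lets Exp3 handle even adversarially chosen rewards.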

Several other modifications based on the above policies have been suggested in the literature [47][48], but it is not within the scope of this study to cover them all. Other work, such as [49], suggests using similarity information among the bandits (items) so that less exploration is required. The authors of [50] suggest a Bayesian approach to the UCB1 algorithm. Each of those algorithms can help to increase the item space coverage, which refers to the diversity of items that are recommended to the users. Two metrics for item diversity have already been introduced in the section describing the evaluation of RSs.

To conclude, the cold-start problem is very challenging and critical to solve, since it can seriously affect both the performance of the system and our confidence in its recommendations. Depending on the data availability and the design of the RS, a different policy may be suitable for each case.


2.3 Feature Selection

Big Data and cluster computing are powerful factors for improving today's prediction models. However, they alone are not sufficient to produce a model of high accuracy. Noise and the presence of irrelevant or redundant information can make the modeling process quite challenging.

The purpose of Feature Selection (also known as Variable Selection) is to deal with that problem. It searches for a subset of features that provides the same, or even better, prediction performance than using all the available features. Including redundant information in the training process can lead to a model that contains additional noise and overfits the training data. Similar issues have been discussed extensively under the Curse of Dimensionality (introduced by Richard E. Bellman), which states that every additional feature requires more data to explain its properties (the sample size should grow at an exponential rate as the number of features increases). Feature Selection also embraces the principle of Occam's Razor, which states that one should avoid a complex model and instead use a simpler one that contains only what is necessary to achieve an acceptable level of prediction performance. That allows the model to generalize and capture the properties of the data.

Another possible reason for reducing the feature space is that fewer features need to be measured, which may lead to lower financial costs, but also lower computational costs, since the predictor may become significantly faster to train and to evaluate. Finally, it gives better insight into which features are most relevant to the nature of the problem. For example, feature selection is commonly used in microarray analysis for accurate classification of phenotypes [6], or for gene selection in a typical classification task such as separating healthy patients from cancer patients [12][13].

In general, feature selection refers to selecting the features that are most relevant to the target variable (usually called good features [10]). Relevance can be defined in terms of the correlation or mutual information between the feature and the target variable.
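A minimal "filter"-style selector in this spirit ranks features by the absolute value of their Pearson correlation with the target and keeps the top k. The toy data below is purely illustrative and not from the thesis:

```python
import numpy as np

def correlation_filter(X, y, k):
    """Filter-style selection: score each column of X by |Pearson correlation|
    with the target y and return the indices of the top k features."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(np.argsort(scores)[-k:].tolist())

# toy data: features 0 and 2 track the target, feature 1 is pure noise
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),    # relevant
                     rng.normal(size=200),              # irrelevant noise
                     -y + 0.1 * rng.normal(size=200)])  # relevant (negative)
selected = correlation_filter(X, y, 2)
```

Taking the absolute value matters: a strongly negative correlation (feature 2 here) is just as informative as a positive one, while mutual information would additionally capture non-linear dependence that Pearson correlation misses.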

In the case of RSs, the feature selection is useful in terms of improving the scalability (since the amount of customer data keeps increasing), but also the quality of recommen-dations [22].

2.3.1 Algorithms

There are various algorithms that use different principles and approaches to search for an optimal subset of features. Most of them are used as a pre-processing step and can be placed into one of two categories, known as wrappers and filters (John, Kohavi and Pfleger, 1994) [2]. The wrappers estimate the overall accuracy of the produced model
