UNIVERSITY OF AMSTERDAM

MASTER THESIS

Text and Social-enhanced Business Recommendation

Author: Ioannis-Giounous AIVALIS
Supervisor: Dr. Maarten DE RIJKE

June 5, 2017


University of Amsterdam

Abstract

Artificial Intelligence

Master's Artificial Intelligence

Text and Social-enhanced Business Recommendation by Ioannis-Giounous AIVALIS

Recommender systems are a very popular field of study both in academia and industry. These systems usually rely on user product ratings to make their personalized predictions. However, platforms on which recommendation is applicable often have additional contextual information that can be useful for improving recommendation. Such information may include the social structures which users are part of, or text that expresses the opinions of the users. In this work we leverage this additional information in order to investigate the impact that social circles and natural language expressing users' opinions have on the recommendation scenario. This research question has been addressed before using the social information and language processing facets separately; we combine these two facets to improve an automated recommender system. We extract features from these two facets and apply state-of-the-art machine learning algorithms that use these features to improve recommendation. We evaluate our method on a dataset from a commercial business rating platform. The experimental results demonstrate the effectiveness of our methods and show an improvement over the baselines when both facets are used in a hybrid method.


Acknowledgements

I would like to thank my supervisor Maarten De Rijke for giving me the opportunity to work with his team on interesting research problems.

Special thanks to Zhaochun Ren for providing on-point insights throughout the whole project, in particular valuable guidance on the experiment formulation, and remote help when needed.

I would also like to express my gratitude to Tom Kenter for putting me on track and to Nikos Voskarides for all the insightful comments and help on writing up this thesis.

Finally, I thank Maarten de Rijke and Evangelos Kanoulas for agreeing to be members of the examination committee.


Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Research Questions
  1.2 Contributions

2 Background
  2.1 Definition of Recommender Systems
  2.2 Why Recommender Systems?
  2.3 Categorization of Recommender Systems
    2.3.1 Content-Based Filtering
    2.3.2 Collaborative Filtering
    2.3.3 Hybrid Methods
  2.4 Natural Language Processing
    2.4.1 Topic Modeling
    2.4.2 Word Embeddings

3 Related Work
  3.1 Approaches
    3.1.1 Rating Prediction
    3.1.2 Learning to Rank
  3.2 Text-Based Recommendation
  3.3 Social-Based Recommendation

4 Text and Social-enhanced Business Recommendation
  4.1 Text-enhanced Recommendation
    4.1.1 Topic Modeling
      4.1.1.1 LDA notation
      4.1.1.2 LDA using Gibbs Sampling
      4.1.1.3 From LDA to recommendation
    4.1.2 Word Embedding
      4.1.2.1 Training the word2vec model
      4.1.2.2 From word2vec to recommendation
  4.2 Social-enhanced Recommendation
  4.3 Text-Social hybrid Recommendation

5 Experimental Setup
  5.1 Dataset selection
  5.2 Yelp academic dataset
    5.2.1 Review Text
      5.2.1.1 Text preprocessing
  5.3 Baseline Methods and Comparisons
    5.3.1 Global Average
    5.3.2 Majority
    5.3.3 Friends Average
    5.3.4 Text-enhanced Recommendation
      5.3.4.1 Topic Modeling
      5.3.4.2 Word Embeddings
    5.3.5 Social-enhanced Recommendation
    5.3.6 Text-Social hybrid Recommendation
    5.3.7 Item Recommendation
  5.4 Evaluation metrics
    5.4.1 Rating prediction
    5.4.2 Item recommendation

6 Results and Discussion
  6.1 Rating Prediction
    6.1.1 Social-enhanced recommendation
    6.1.2 Text-enhanced recommendation
    6.1.3 Discussion
  6.2 Item Recommendation
    6.2.1 Text-enhanced recommendation
    6.2.2 Hybrid recommendation
    6.2.3 Proof of concept on a denser dataset
  6.3 Textual model qualitative analysis

7 Conclusion and Future Work


List of Figures

2.1 Recommendation methods
4.1 Toy example social graph
5.1 Word cloud
5.2 Review length per review score
5.3 Friends average RMSE


List of Tables

4.1 Toy example trust matrix
5.1 Overview of Yelp academic dataset
5.2 User-Item Rating Matrix statistics
5.3 Review text statistics
5.4 All method overview
6.1 Rating prediction result overview
6.2 Item recommendation addressing RQ1.1
6.3 Item recommendation addressing RQ1.1
6.4 Item recommendation results for dense data subset
6.5 Example user-documents


Chapter 1

Introduction

In the age of information overload, the availability of an abundance of information, products and services, despite the obvious freedom of choice it presents to the consumer, also poses a problem [44]. The complexity of the labor required in order to weed out the irrelevant items has pushed the industry toward finding smart automated ways to solve this problem. Recommender systems (RSs) are here to aid in this problem by lifting the weight off of the consumer's shoulders. RSs can be found in a variety of modern e-commerce (among other) applications such as Amazon¹ and Netflix².

Industrial applications aside, research on recommender systems has proven to be a very prominent academic topic over the past years [49, 39, 54, 24, 30]. The task of a recommender system can be approached in two ways: rating prediction and ranked list generation. We discuss those approaches in detail in Section 3.1. The main difference between the two lies in the fact that prediction is concerned with only the observed ratings, whereas ranking accounts for all items in a given collection, whether they have been rated or not [58]. In this work we propose solutions for both approaches.

1.1 Research Questions

This thesis focuses on the development and evaluation of a recommender system using data that relates to social features as well as textual information. In the traditional setup, the task of such a system is limited to predicting user ratings given previous ratings, with the predictions being a product of mathematical models applied on the item ratings [14]. In this work, we go beyond the standard scenario where ratings are the only signal and investigate how additional features extracted from different types of data affect the task of recommendation. The data we use comes from a commercial business rating platform and comprises text and social-network-related information. By testing the performance of the system using different feature sets we aim to evaluate how well each feature performs for this task, and we hope to discover the best possible combination of automatically extracted features.

¹ http://www.amazon.com
² http://www.netflix.com

The main research question for this work is whether we can enhance recommendation in its two approaches, rating prediction and ranked list generation, using textual and social information. In order to address this question, we answer the following sub-questions:

Research question 1. Can the performance of a traditional recommender system be improved using a connected graph of users, which defines their social circle in the form of friendships on an online platform?

Research question 2. Can the performance of a traditional recommender system be improved using textual information which expresses users' complementary comments on item reviews?

1.2 Contributions

The main contributions of this work are:

∙ a pipeline that boosts recommendation using social and textual information from a business rating platform;

∙ insights into how the performance varies using different feature sets;

∙ proof of concept regarding the performance of the system when not addressing the cold-start problem.

The rest of this thesis is organized as follows. Chapter 2 lays the foundations of the task of recommendation; this provides the basic insights on which the rest of this work builds. Chapter 3 provides extensive insights into the current state-of-the-art regarding recommender systems. Chapter 4 presents the main body of this work and details the methodology we introduce. In Chapter 5 we describe the experimental setup, and in Chapter 6 we report and analyze our results. Finally, in Chapter 7 we summarize this work and discuss future research directions.


Chapter 2

Background

The purpose of this chapter is to provide background knowledge about recommender systems. To achieve that, we briefly introduce some of the fundamentals that have been developed in this research area over the course of the last years.

2.1 Definition of Recommender Systems

One of the first definitions for recommender systems was presented by Paul Resnick and Hal R. Varian in 1997. In their work they state that “in a typical recommender system, people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients” [49].

Recommender systems (RSs) are typically classified into non-personalized and personalized methods. As their names suggest, the former do not consider the individual user, while the latter build on the so-called user profile in order to make recommendations. Personalized recommenders are further divided into persistent (or long-term) and ephemeral (or short-term) [43]. Those notions are illustrated in Figure 2.1.


Examples of non-personalized methods include wisdom-of-the-crowd-oriented approaches such as recommending the "most popular" items [48]. Obviously this type of recommendation is trivial and does not involve machine learning.

In this work we are interested in the personalized scenario. Briefly put, personalized recommender systems output recommendations that are targeted at specific individuals based on their feedback, expressed explicitly or implicitly. This feedback forms the user profile, which is then used to generate recommendations. An example of information that can be included in a user profile is user demographics [36].

The formal definition of a recommender system can be expressed as follows: given a set of users, a set of items, and reviews that characterize the users' opinions towards a set of items, the system aims at pointing a user towards new, not-yet-experienced items that may be relevant to the user's current task [50].

2.2 Why Recommender Systems?

As the CEO of Amazon put it, “If I have 3 million customers on the Web, I should have 3 million stores on the Web” [56]. With this, Jeff Bezos stresses that a successful recommendation system can dramatically improve the customer experience of a web service and therefore the service's overall success.

The general purpose of such a system is to satisfy the customer by providing them with what they want to see when browsing through a service. Research indicates that it is much more cost-efficient to keep a current customer than to find a new one [57]; it is therefore important to keep the user experience satisfactory by showing users content related to their interests. RSs help in providing personalized, engaging content for services such as web stores with low overhead, which is why a lot of research has been focused on this area.

2.3 Categorization of Recommender Systems

There are various ways to categorize recommender systems. In this section we briefly present a taxonomy that is relevant to the way we approach the problem.

2.3.1 Content-Based Filtering

We start off with the simplest form of filtering, Content-Based Filtering (CBF). In CBF, user activity is constantly monitored, and the system provides the user with items that match the items she has reviewed positively in the past. This type of recommendation is often applied for suggesting documents, web pages and other mostly text-based items.


CBF is currently not very widely used, since it adheres to a specific set of features and tends to overspecialize, due to the fact that the items previously rated by the individual user are deemed the most important resource for recommendation [44, 2].

A positive remark for this type of system is the simplicity of implementation, since in order to generate a user profile it is sufficient to traverse only the items rated by the specific user.

2.3.2 Collaborative Filtering

The most mature and most widely applied method for recommender systems is Collaborative Filtering (CF) [12]. CF is a term first coined by the developers of Tapestry [19], the first ever recommender system. In this approach, recommendations are again deduced from user-item ratings; however, here the recommendations originate from users that are evaluated as similar to the recommendation subject. Contrary to the systems mentioned above, CF requires a user profile that is formed by accumulated information regarding the user. This user profile consists of a vector of item ratings that associate the given user with a set of items from the item pool. These ratings can contain any type of information but are usually limited to a boolean value (like/dislike) or, in a more fine-grained resolution, a rating (e.g., a number in the range [1, 5]). An example of a platform that uses this method is Amazon, with a method called item-to-item collaborative filtering [34]. This algorithm produces high-quality real-time recommendations and copes well with the scaling required for a big dataset such as Amazon's. It produces recommendations in the spirit of “users who bought item x also bought item y”, in an attempt to direct users to a new purchase based on which items other users found useful to combine with item x.

The main difference of CF compared to CBF is that the system is more versatile, since it can offer a bigger variety of recommendation items by deducing user similarities and predicting user preferences from other users' past activity.

In the CF scenario, recommendation consists of the following steps: first, determining a user neighborhood, which is usually calculated by observing user ratings [44, 12]; second, determining a set of recommendations based on this neighborhood by extracting the subset that best matches the given user profile. This remains one of the most effective out-of-the-box solutions to recommendation to this day.

2.3.3 Hybrid Methods

Research on RSs has matured enough to show that strictly adhering to the methods introduced previously comes with the costs we illustrated in Sections 2.3.1 and 2.3.2. To tackle such issues, various hybrid methods have been introduced which combine the best of both worlds in order to improve performance. Such is the nature of most of the methods that we introduce in detail in Chapter 3.


2.4 Natural Language Processing

As stated in Chapter 1, part of our proposed method depends on processing the text that accompanies user reviews. In order to leverage the text in our data we employ two methods: topic modeling and word embeddings. In machine learning and natural language processing, a topic model is a statistical model for discovering the abstract topics that occur in a collection of documents. A topic is a distribution over a fixed vocabulary.

2.4.1 Topic Modeling

We can now formulate the problem as that of extracting topics from all the user-generated text and then assigning users to those topics in a stochastic manner. Topic modeling has proven [1, 15] to be a good way to encode text to enhance recommendation; in this work we investigate how well it performs compared to word embedding. We choose to use Latent Dirichlet Allocation (LDA), a state-of-the-art solution introduced by [8]. The hypothesis LDA makes is that each document consists of a probabilistic mixture of multiple topics, where each topic has a vocabulary of tokens that it produces with a given probability. Furthermore, each term of each document itself has a marginal contribution to the mixture, since it can also be expressed as a mixture of the same topics. LDA is most easily described by its generative process, the imaginary random process by which the model assumes the documents arose. First, K priors are set which represent the topic distributions as multinomials containing V elements each, where V is the number of tokens in the corpus. Let β_i represent the multinomial for the i-th topic, so that |β_i| = V.

Given these distributions, the LDA generative process is defined as follows. For each document:

∙ randomly choose a distribution over topics (a multinomial of length K) θ_d

∙ for each word in the document:

  – probabilistically draw one of the K topics from the distribution θ_d, say topic β_j

  – probabilistically draw one of the V words from β_j

This generative model emphasizes that documents comprise multiple topics. The first step of the process reflects that each document contains topics in different proportions. The second step shows that each word in the document is drawn from one of the K topics in proportion to the document's distribution over topics, as determined in the first step. Note that this model does not make any assumption about the order of the words; this is known as the bag-of-words assumption.
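To make the generative story concrete, here is a minimal simulation of it in Python (illustrative only; the topic count, vocabulary size and prior values are arbitrary, not those used in this thesis):

```python
import numpy as np

rng = np.random.default_rng(42)

K, V = 3, 8              # number of topics, vocabulary size
alpha, beta = 0.1, 0.01  # symmetric Dirichlet priors (arbitrary values)

# One multinomial over the vocabulary per topic: the beta_j distributions.
topic_word = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(n_words):
    """Run the LDA generative story for one document."""
    # Step 1: randomly choose the document's distribution over topics.
    theta_d = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta_d)        # draw a topic for this word
        w = rng.choice(V, p=topic_word[z])  # draw a word from that topic
        words.append(w)
    return words

print(generate_document(10))  # ten word ids drawn from the topic mixture
```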

Sampling-based algorithms attempt to collect samples from the posterior to approximate it with an empirical distribution. The most commonly used sampling algorithm for topic modeling, and the one we use, is Gibbs sampling, where a Markov chain is constructed, each state dependent on the previous, whose limiting distribution is the posterior. Further details on Gibbs sampling and why we picked it are presented in Chapter 4.

2.4.2 Word Embeddings

The second method we utilize to vectorize text is word embedding, a well-studied and widely reported-on method among machine learning and data mining techniques [40, 23, 33]. Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or pieces of text are mapped to dense vectors of real numbers. The term originates from the mathematical concept of embedding from a space with one dimension per word to a continuous vector space of much lower dimension.

Mikolov et al. [41] introduced a popular [20, 51] framework, called word2vec, which solves the problem of text vectorization by utilizing the context in which words appear in text. Word2vec comprises two distinct models: continuous bag-of-words (CBOW) and continuous skip-gram. In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. According to the authors, CBOW is faster, while skip-gram is slower but handles infrequent words better.

Word2vec uses a fully connected neural network with a single hidden layer. The input and output layers have as many neurons as there are words in the training vocabulary, and the hidden layer size defines the dimensionality of the resulting word vectors. The output layer is the same size as the input layer.
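As an illustration of training such a model, a minimal sketch using the gensim library (an assumption for illustration; the thesis does not state which word2vec implementation it uses, and the toy corpus is invented):

```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized review per entry (invented data).
sentences = [
    ["great", "pizza", "friendly", "staff"],
    ["terrible", "service", "cold", "pizza"],
    ["friendly", "staff", "great", "atmosphere"],
]

# sg=1 selects the skip-gram architecture; vector_size is the hidden
# layer size, i.e. the dimensionality of the resulting word vectors.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)

vec = model.wv["pizza"]                        # dense 50-dim word vector
print(model.wv.most_similar("pizza", topn=2))  # nearest words by cosine
```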

We have laid out some of the basic methodology introduced to address the task of recommendation. Chapter 3 presents some state-of-the-art methods that motivated our work.


Chapter 3

Related Work

In this chapter we first describe the two approaches that have been defined in the literature to tackle the recommendation task (Section 3.1). Next, we introduce work on RSs that is of interest to the research questions formed in Chapter 1. To the best of our knowledge, there is little work combining the social aspect with text-based recommendation, which is the main motivation for this work. That is why the last two sections are dedicated to work that has addressed the two domains separately: text-based (Section 3.2) and social-based recommendation (Section 3.3).

3.1 Approaches

3.1.1 Rating Prediction

Rating prediction is a straightforward way to approach recommendation that is popular in academia [53, 52, 37, 29, 28, 45]. The task is cast as a prediction problem in which the rating a user would give an item is the target label. Recommendation accuracy in this case is quantified in terms of error with respect to the actual values, usually using root mean squared error (RMSE). For our experiments we use some state-of-the-art rating prediction frameworks, namely BiasedMF, Probabilistic Matrix Factorization (PMF), Bayesian PMF, and RegSVD. The difference between those methods lies in the way the mathematical model used to generate recommendations is built.

Those methods' input consists of the ratings matrix R. Given M users and N items, the ratings matrix is defined as an M × N matrix where each row corresponds to a user, each column to an item, and each entry to a rating. The goal of Matrix Factorization is to reveal attitudes or preferences of users through a small number of unobserved factors, obtained by decomposing the R matrix.

Simple matrix factorization (MF) performs better if user and item biases are taken into account. A user bias is defined as the systematic difference between users in their rating behavior for similar items. Incorporating these biases before applying matrix factorization improves the outcome of MF; the resulting method is called BiasedMF.
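For our experiments we rely on existing implementations of these frameworks; purely to illustrate the BiasedMF idea, a minimal SGD sketch (hyperparameters and helper names are arbitrary):

```python
import numpy as np

def biased_mf(R, mask, k=10, lr=0.01, reg=0.05, epochs=50, seed=0):
    """SGD training of biased matrix factorization.

    R:    users x items ratings matrix
    mask: boolean matrix, True where a rating is observed
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    mu = R[mask].mean()                            # global rating average
    b_u = np.zeros(n_users)                        # user biases
    b_i = np.zeros(n_items)                        # item biases
    P = rng.normal(scale=0.1, size=(n_users, k))   # latent user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))   # latent item factors

    users, items = np.where(mask)
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = R[u, i] - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_u = P[u].copy()                      # keep old value for Q's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return mu, b_u, b_i, P, Q
```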


PMF is a collaborative filtering approach introduced by Salakhutdinov and Mnih [53]. Unlike previous work, their method scales linearly with the number of observations. Furthermore, whereas other state-of-the-art algorithms have trouble making accurate predictions for users who have very few ratings, PMF performs well on very sparse and imbalanced datasets, such as the Netflix dataset.

BayesianPMF [52] reduced the overfitting that PMF is prone to and showed a further improvement over its performance. That is achieved by training PMF models using Markov chain Monte Carlo methods as opposed to the maximum a posteriori (MAP) estimation used by PMF models.

Singular value decomposition (SVD) is another matrix decomposition algorithm. Regularized SVD [45] predicts ratings based on item features and user preferences inferred by the SVD. It trains each feature trying to minimize the error between the actual rating and the prediction, and provides a significant improvement over the reference algorithm, Netflix Cinematch.

3.1.2 Learning to Rank

Learning to rank (LTR) is a machine learning framework whose task is to order a set of results (documents, images or other) by relevance based on a query. Accuracy is measured by metrics like precision and recall, among others. This framework has received plenty of attention in recent work [9, 17, 25, 61, 62, 10].

There are three main approaches to LTR: pointwise (such as Random Forests [9] and MART [17]), pairwise (such as RankBoost [25] and RankNet [10]) and listwise (such as LambdaMART [61]). The difference between these approaches lies in how the error metric is calculated at training time.

Pointwise methods try to approximate the true label of each document. The labels of documents are real numbers and reflect the position at which they should appear in the ranked list.

Pairwise approaches cast the ranking problem as pair ordering. The quality of the ranking is measured in terms of correctly ordered pairs of documents throughout the produced ranked lists.

Listwise methods extend the pairwise approach to a broader perspective, addressing the ranking problem by taking the ordering of all the documents into account when calculating the error function.

In this work, we experimented with plenty of LTR methods, but chose to report on LambdaMART [61], a listwise method. LambdaMART uses an approximation to the gradient of the cost by modeling the gradient for each document in the dataset with lambda functions, called λ-gradients. LambdaMART extends MART, the main difference being that LambdaMART computes the derivatives the way LambdaRank does. At each iteration, the tree being constructed models the λ-gradients over the entire dataset.
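As an illustration of how such a listwise model is trained in practice, a sketch using LightGBM's lambdarank objective (an assumed modern implementation, not necessarily the one used in this thesis; all data here is synthetic):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((8, 2))          # 2 features per user-item candidate
y = rng.integers(0, 5, size=8)  # graded relevance labels
group = [4, 4]                  # rows 0-3 belong to query 1, rows 4-7 to query 2

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50,
                        min_child_samples=1)
ranker.fit(X, y, group=group)

scores = ranker.predict(X[:4])  # ranking scores for the first query
print(np.argsort(-scores))      # candidate order, best first
```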


3.2 Text-Based Recommendation

A well-studied field is that of text-based recommendation in the microblogging sphere [1, 35, 15, 27, 46]. Most of the work we present in this section is somewhat related to microblogging.

Abel et al. [1] address the task of news recommendation using Twitter, incorporating temporal information in order to recommend news articles to users. One of their methods successfully employs topic modeling on text for recommendation, which we investigate in our methodology. In particular, they use topic modeling in order to represent both users and news article categories. Their topic modeling approach has three variants: hashtag-based, category-based (assigning each news article to one of 18 broad categories, such as sports and politics), and entity-based (using entities extracted from external sources such as Wikipedia). In our work we capitalize on the results obtained by Abel et al. [1] by replacing the laborious (and expensive) task of manual topic modeling with Latent Dirichlet Allocation.

Lu, Lam, and Zhang [35], much like us, choose to vectorize text in order to generate recommendations. They also make use of the social aspect, utilizing a microblogging platform, Twitter. For the task of mapping a Twitter post to a set of concepts, they employ Explicit Semantic Analysis (ESA) [18], an algorithm for vectorizing text using a weighted Wikipedia concept vector. These vectors comprise the features used in their experiments, which demonstrate their effectiveness for tweet recommendation. This work also addresses the task of recommendation using vectors to represent text; however, calling on a knowledge base like Wikipedia for vectorizing words effectively narrows the nature of the texts that can be modeled. Another shortcoming of this method is that for free text in the form of a general opinion or sentiment, ESA is a less reliable way to vectorize text. In our work we try to broaden their perspective and generalize it using vectors learned from any training set.

In the same domain, Diao and Jiang [15] work with Twitter data in order to identify timely events. Event detection is a topic that has received attention in recent years [46, 60, 4]. They too leverage topic modeling, employing Latent Dirichlet Allocation (LDA) in order to detect events. What differs drastically in their setup compared to ours is that the temporal aspect of their work is deemed the most important one, since their goal is to identify real-time events; in our work, time is not considered.

3.3 Social-Based Recommendation

When humans make decisions they tend to follow the example of others in their social surroundings who have had to make similar decisions [32]. Furthermore, using social structures for recommendation eases traditional CF's need for explicit feedback by enriching user profiles based on previous user activity. These are some of the observations that recent work has capitalized on in order to boost recommendation performance.

As stated in Chapter 2, RSs confront a set of challenges crucial to their performance, some of which are data sparsity and scalability. Ma et al. [38] address those issues by utilizing the social aspect that can be extracted from user-to-user relations. They base their proposed method (SoRec) on the assumption that users are not identically distributed, and build on interactions and connections in order to boost recommendation quality. To achieve this, they first construct a social network graph by pulling information about their users from Facebook¹; the graph connects users in a weighted manner, each weight quantifying the level of trust between users. Upon constructing the ratings matrix, they use the graph extracted from the social relations to enrich the ratings matrix, inducing missing matrix values using the weighted edges of the graph. Finally, they apply Matrix Factorization [31], a standard method that essentially applies Singular Value Decomposition (SVD) to the enriched ratings matrix, to generate predictions. The results they report outperform other state-of-the-art CF algorithms and produce a competitive system that is scalable to large datasets. Their results are indicative of the added value of the social aspect in recommendation. For their work they had to pull the social features from a complex social media platform, which adds to the complexity of their implementation. In this work, we investigate how elaborate the trust measure needs to be to produce sufficiently good results by trying different measures.

Guy et al. [22] use a social software application suite which provides them with five social applications: profiles, activities, social bookmarks, blogs and communities. They focus on creating recommendations for the last three of those, and their main area of focus is the gain that is achieved by capitalizing on user-to-user familiarity over user-to-user similarity. Their results show a clear superiority of the familiarity network they build from their data for predicting interests. This work is built on a platform that offers a variety of social-related features and therefore cannot be directly compared to our work. However, its important contribution is the boosting of recommendation using the social aspect.

In this work, we combine methods and ideas inspired by the work described in this chapter to build a system that addresses our research questions. Section 4.1 provides details of our proposed method.

¹ http://www.facebook.com

Chapter 4

Text and Social-enhanced Business Recommendation

In this chapter we discuss the methods we introduce to solve the task of business recommendation using complementary textual and social data gathered from a business reviewing platform. To that end, we utilize machine learning and learning-to-rank techniques with features extracted from our dataset. We extract social and textual features and combine them in order to find the best performing rating prediction and item recommendation models. Sections 4.1 and 4.2 introduce the features that address the research questions we introduced in Chapter 1, and Section 4.3 introduces a hybrid method that combines the two.

4.1 Text-enhanced Recommendation

As stated previously, in this work we aim to leverage the free text that accompanies reviews to boost recommendation. Our hypothesis is that the textual comments of a user contain information that represents the user's interests in a more elaborate way than the plain numeric value of a business rating. In order to work with text and extract similarity from any given text, we encode it in a way that makes it easy to process. We employ two methods to vectorize text, namely topic modeling and word embedding. The building blocks for this method are vectors that encode user-documents and item-documents. Utilizing those vectors, we seek to characterize pairs of aggregated user review texts and aggregated item review texts. We then use similarity metrics to quantify the similarity of the vectorized text for any user-item pair, which we interpret as the likelihood of the user-item pair being a good match. These similarities are used to generate the feature sets from which we learn to generate recommendations.


4.1.1 Topic Modeling

The first method we use to achieve text vectorization is topic modeling through Latent Dirichlet Allocation (LDA). In this section we go through the details of calculating the feature that quantifies the similarity between a user and an item using topic modeling.

4.1.1.1 LDA notation

For the rest of this document we define the following terms:

∙ A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, . . . , V}. We represent words using one-hot encoding: vectors of length V that have a single component equal to one and all other components equal to zero. Thus, using superscripts to denote components, the v-th word in the vocabulary is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v.

∙ A document is a collection of N words denoted by d = (w_1, w_2, . . . , w_N). When we speak of documents in our implementation, we are referring to either a user document, which consists of all of the text a specific user has placed in business reviews, or an item document, which consists of all of the text that has been used by all users to characterize a specific item.

∙ A corpus is a collection of M documents denoted by D = {d_1, d_2, . . . , d_M}. In our case, we define two corpora: one that consists of all the user-documents and another of all the item-documents.

4.1.1.2 LDA using Gibbs Sampling

As described in Section 2.4, LDA is a generative probabilistic model of a corpus. The idea is that documents are represented as mixtures over latent topics, where each topic is characterized by a distribution over words. We previously mentioned (Section 2.4) that LDA assumes a generative process for each document w in a corpus D, so if we have T topics we can express the probability of the i-th word in a given document as:

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j), \qquad (4.1)$$

where z_i is a latent variable indicating the topic from which the i-th word was drawn, and P(w_i | z_i = j) is the probability of the word w_i under the j-th topic. P(z_i = j) gives the probability of choosing a word from topic j in the current document, which will vary across different documents. Intuitively, P(w|z) indicates which words are important to a topic, whereas P(z) is the prevalence of those topics within a document.


Viewing documents as mixtures of probabilistic topics makes it possible to formulate the problem of discovering the set of topics that are used in a collection of documents. Given D documents containing T topics expressed over W unique words, we can represent P(w|z) with a set of T multinomial distributions φ over the W words, such that P(w|z = j) = φ^(j), and P(z) with a set of D multinomial distributions θ over the T topics, such that for a word in document d, P(z = j) = θ_j^(d). To discover the set of topics used in a corpus D = {d_1, d_2, . . . , d_M}, where each w_i belongs to some document d_i, we want to obtain an estimate of φ that gives high probability to the words that appear in the corpus.

Blei, Ng, and Jordan [8] gave an algorithm for obtaining approximate maximum-likelihood estimates for φ^(j) and the hyperparameters of the prior on θ^(d_i), defining this procedure as LDA. We make use of Gibbs sampling, which uses a symmetric Dirichlet(α) prior on θ^(d_i) for all documents, a symmetric Dirichlet(β) prior on φ^(j) for all topics, and Markov chain Monte Carlo for inference. An advantage of Gibbs sampling is that we do not need to explicitly represent the model parameters: we can integrate out θ and φ, defining the model simply in terms of the assignments of words to topics indicated by the z_i. Since we are not performing inference on the Dirichlet hyperparameters, using Gibbs sampling is not necessarily going to lead to the same results as the original LDA. In particular, the symmetric prior on topics is likely to mean that there is probably little variation in the way topics are used, although the extent to which this is true will be influenced by the choice of α. An empirical Bayes procedure could be used to estimate asymmetric α parameters, resulting in an approach closer to LDA. We discuss the choice of those parameters in Chapter 5.

4.1.1.3 From LDA to recommendation

In order to generate recommendations we leverage LDA to quantify the similarity between pairs of user documents and item documents. Using LDA we aim to encode the documents from the two corpora as vectors, from which we can then determine their similarity.

Having defined the two corpora (user-documents, item-documents), we can now input them to the LDA pipeline to extract topic vectors from the corpus. We choose the item documents to train the LDA model. The intuition behind choosing the item documents as our LDA corpus is that we expect each item to have a narrower vocabulary in its documents. Furthermore, the product itself does not allow for as much richness as the users' documents; users tend to have a broader vocabulary since their tastes vary across the items they review. This intuition was confirmed upon exploration of the dataset; we discuss this further in Section 5.1.

We fix the number of LDA topics to T, and the output of LDA leaves us with, for each token of the vocabulary, a vector of probabilities over the topics (the length of each vector is thus T). Using those vectors, we can now easily encode any given piece of text by utilizing the bag-of-words approach and calculating the mean of the probability vectors of the words it comprises.


For any given piece of text defined as a collection of N words d = (w_1, w_2, . . . , w_N), we define its topic vector t(d) of length T as follows:

$$t(d) = \frac{1}{N} \sum_{i=1}^{N} \theta_{w_i}, \qquad (4.2)$$

where θ_{w_i} represents the distribution over topics for the word w_i.

LDA outputs the following distributions: i) θ (topic-document), which characterizes the probability of each topic given every one of the input documents (item-documents), and ii) φ (word-topic), which characterizes the probability of each topic given the words in the vocabulary. Given the φ distributions, we can now estimate θ for any given document by treating it in a bag-of-words manner, as mentioned previously, and calculating the mean over the words it comprises. This is the process we follow in order to encode the user-documents. Using this process we have automated the task of encoding any user document and item document as a mixture of topics; the resulting vectors are denoted by T_u and T_i, respectively.

The last step in order to extract a user-item similarity is to compare those vectors to each other, for which we use cosine similarity. The feature extracted from this process for a user-item pair (u, i) characterizes the similarity between the user u and the item i based on the text that accompanies each. It is calculated as follows:

$$sim_{LDA}(u, i) = \frac{T_u \cdot T_i}{\lVert T_u \rVert \, \lVert T_i \rVert}. \qquad (4.3)$$

4.1.2 Word Embedding

The second form of text vectorization we employ is word embedding using the word2vec framework. In our experiments we use the continuous skip-gram architecture, in which, as stated in Section 2.4, the model uses the current word to predict the surrounding window of context words. This method differs from LDA in the sense that documents are not processed under the bag-of-words assumption; rather, context is taken into account. The processing of the resulting vectors is done similarly to LDA; the process is as follows.

4.1.2.1 Training the word2vec model

First, all the words of the vocabulary are converted into one-hot index vectors of length V, which then comprise the input of the neural network.

In the case of CBOW, the input-to-hidden-layer connection is replicated C times (C denotes the number of context words), and a divide-by-C operation is added prior to the hidden layer. That way, the context of multiple words is taken into consideration at training time. The output layer remains the same and training is done in the manner discussed in Section 2.4. The output of this procedure is the set of word vectors, which will be used to encode documents.

4.1.2.2 From word2vec to recommendation

Having obtained the vector representation w for each word in the vocabulary, we can encode documents similarly to the LDA case: using the bag-of-words approach, we define the vector of each user document and item document as the mean of the vectors of the words it comprises. For each user document of length K, the user vector U is defined as:

$$U = \frac{1}{K} \sum_{j=1}^{K} w_j, \qquad (4.4)$$

and for every item document of length L the item vector is given by:

$$I = \frac{1}{L} \sum_{j=1}^{L} w_j, \qquad (4.5)$$

where U and I stand for the user and item vector, respectively. We can now define the similarity feature extracted using the word2vec method as:

$$sim_{w2v}(u, i) = \frac{U \cdot I}{\lVert U \rVert \, \lVert I \rVert}. \qquad (4.6)$$

4.2 Social-enhanced Recommendation

The method presented in this section addresses RQ1.1: “Can the performance of a traditional recommender system be improved using a connected graph of users, which defines their social circle in the form of friendships on an online platform?” Through the work presented in this section we will be able to reflect on the gain we achieve by incorporating the social aspect into the task of recommendation. The intuition behind social-enhanced recommendation is that users' decision making tends to be influenced by their social surroundings. The hypothesis motivating social-enhanced recommendation is the following: “if user A is related to user B, then A's behavior is influenced by, and can therefore be predicted by observing, user B's interests.” Further, we will compare against the gain we obtain by incorporating the global rating averages linearly, and argue that by weighting the user's neighborhood more, we achieve better results.

[FIGURE 4.1: Toy example social graph; its trust matrix is shown in Table 4.1.]

      U1   U2   U3   U4   U5   U6
U1    -    1    1    1    0.5  0.5
U2    1    -    0.5  0.5  0    0
U3    1    0.5  -    0.5  0    0
U4    1    0.5  0.5  -    1    1
U5    0.5  0    0    1    -    0.5
U6    0.5  0    0    1    0.5  -

TABLE 4.1: Toy example trust matrix

The way we incorporate the social aspect in our data requires a graph that characterizes the social connections between users in a weighted manner. Using the social graph we seek to quantify user-to-user relations in the form of a trust measure.

User-to-user friendship connections are compiled into a social network graph in which each user is represented by a node and edges connecting users represent friendships. Each edge is accompanied by a weight w_ij which quantifies the trust that characterizes the familiarity between users u_i and u_j.

To create the trust measure, we traverse the graph for every user with an edge depth of 2. This means that for each user, we look at her immediate connections and also at the connections of her connections, in order to expand further over the given social graph. The trust weights we assign to the edges are 1 for the immediate connections and 0.5 for the second-rank connections. The product of this step is the trust matrix, in which each entry is characterized by two user ids and a trust value of 1 or 0.5. This helps to fight the sparsity of the ratings matrix by expanding it to connected users in the graph.

A toy example is illustrated in Figure 4.1. User U1 has three immediate connections, to users U2, U3 and U4. The trust matrix is formed as shown in Table 4.1. We can see that even though no direct connections exist for the user pairs (U1, U5), (U1, U6), (U2, U3), (U2, U4), (U3, U4), (U5, U6), we deduce a trust measure for those pairs thanks to the graph expansion discussed previously.

Having established the trust matrix, we can define the feature that is the output of this method by multiplying the trust measurement by the rating the trusted user has given the item. Therefore, each trusted user that has rated the item contributes a weighted amount to the predicted rating feature, which will be used to aid in the prediction task. For a user-item pair (u, i), the feature extracted from this process is the following:

$$sim_{social}(u, i) = \frac{1}{M} \sum_{j=1}^{M} r(j, i), \qquad (4.7)$$

where the given user u has M social connections and r(j, i) is the score each user j from the social vicinity of u has given to item i.
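A minimal sketch of the 2-hop trust expansion and the resulting feature (the data structures are hypothetical, not the thesis implementation):

```python
from collections import defaultdict

def build_trust(friends):
    """2-hop trust expansion over the friendship graph.

    friends: dict mapping user id -> set of immediate friends.
    Returns trust[u][v] = 1.0 for direct friends, 0.5 for friends-of-friends.
    """
    trust = defaultdict(dict)
    for u, direct in friends.items():
        for v in direct:
            trust[u][v] = 1.0             # depth 1: immediate connections
        for v in direct:
            for w in friends.get(v, ()):  # depth 2: connections of connections
                if w != u and w not in trust[u]:
                    trust[u][w] = 0.5
    return trust

def sim_social(u, item, trust, ratings):
    """Average rating of `item` by users in u's social vicinity (Eq. 4.7).

    ratings: dict mapping (user, item) -> star rating.
    """
    scores = [ratings[(j, item)] for j in trust.get(u, {})
              if (j, item) in ratings]
    return sum(scores) / len(scores) if scores else None
```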


4.3 Text-Social hybrid Recommendation

Having established the previous two methods, we now combine them into one method which incorporates both aspects and addresses both research questions in combination. This method is a text-enhanced recommender that takes the social circles into account. In this implementation, we make use of the vectors extracted with the method explained in Section 4.1 and the social weights extracted using the methodology from Section 4.2.

In this method, however, we take advantage of the social graph and, rather than the ratings, we incorporate the user's neighborhood's review text into the user vector, weighting it as social-enhanced recommendation suggests. To elaborate, the user vector now consists of a weighted average of the given user's text and the user's social neighborhood's text. A formal representation of the updated user vector is:

$$T_u' = w_1 T_u + w_2 \frac{1}{M} \sum_{j=1}^{M} T_j, \qquad (4.8)$$

where user u has M social connections in the graph, and w_1 and w_2 are the weights assigned to each of the contributing factors.

For our experiments we employ linear regression to optimize the two weights. The feature extracted from this process is then used as in LDA-enhanced recommendation, and is formalized as:

$$sim_{hybrid}(u, i) = \frac{T_u' \cdot T_i}{\lVert T_u' \rVert \, \lVert T_i \rVert}. \qquad (4.9)$$

Having defined the features that we can pick from, we proceed to experiment in order to find their best performing combination.


Chapter 5

Experimental Setup

In this chapter we first introduce the dataset we use through some relevant statistics. Next, we go through each method we implement and the specifics of our implementation. Finally, we present the metrics that we use to evaluate our methods.

5.1 Dataset selection

We saw in Chapter 3 that a very common application of RSs is on datasets that include user ratings of items. In this work, motivated by our two research questions, we lay out some requirements that a dataset needs to meet to enable us to answer those questions. Namely, we require the dataset to contain two features: first, a social feature which reflects users' social circles, and second, feedback in the form of free text that accompanies the ratings. A variety of datasets are employed in the bibliography¹ [42, 1, 35, 55]; the one that meets our needs and that we used is the Yelp dataset [63]. This dataset has been provided by a commercial website and therefore includes a lot of information that is irrelevant to our work; however, it satisfies our needs for social and textual information.

5.2 Yelp academic dataset

Yelp is a company which provides a business reviewing platform. Users can rate and comment on services provided by businesses. Furthermore, the service provides users with the ability to incorporate a social aspect into their profiles by adding people as friends and thereby following their activity on the platform. Businesses in turn are provided with a platform which encourages their clientele to leave constructive feedback, enabling them to improve on aspects that their clients mention in their reviews.

The Yelp academic dataset consists of five different types of information, namely “business”, “review”, “user”, “check-in” and “tip”. Table 5.1 provides a brief overview.

¹ Movielens, Twitter etc.

Table     Description                                                                Entry count
business  Object of the recommendation process (item, i)                             15,584
user      Subject of the recommendation process (user, u)                            70,816
review    User generated feedback (rating, r_u,i)                                    335,021
check-in  Holds the count of users that visited a business on an hourly basis        11,433
tip       A short message left by a user regarding a business for future reference   113,992

TABLE 5.1: An overview of the Yelp academic dataset

The business data type refers to what we have been calling an item throughout this document. The information enclosed in this table consists of general information about the business, such as “type”, “id”, “address”, “review count”, “stars”³ (rounded to 0.5) and “opening hours”. This table contains 15,584 entries.

The user table contains user credentials (“user name”), “review count”, “average stars” and “member since”. This table also contains information about the given user's social circle: the field “friends” lists all of the user's friend ids, enabling us to construct a social network graph with user-to-user relations. The “user” table consists of 70,816 user instances.

Lastly, the review table is the most important resource used in our implementation, since the recommendations we produce are largely based on the feedback that the user has left on other items. Each entry of this table contains information about a single review of a user for an item; more specifically, the information contained is as follows: “business id”, “user id”, “stars” (an integer in the range [1, 5]), “text”, “date” and “votes”. In the dataset we have 335,021 review entries.

To reduce noise we filtered out users who have not left at least two reviews, considering that the main goal of this work is not to address the cold-start problem. This way we make sure that the users we are left with have a consistent flow of reviews, and therefore we can proceed to make recommendations to them based on their previous activity on the platform. In addition, since we want to incorporate social circles in our recommendations, we selected users who have at least two friends in their circles. These actions left us with a subset database consisting of a total of 14,311 businesses in “business_subset”, 14,352 users in “user_subset” and 193,995 reviews in “review_subset” to work with.

In their work, Adomavicius and Zhang [3] confirm the correlation between data density and recommendation accuracy. Having introduced the dataset, we can now compute the density of the user-item matrix:

$$\frac{193{,}995}{14{,}352 \times 14{,}311} \approx 0.0945\% \qquad (5.1)$$

³ As in other similar platforms, the rating of a business is given with a star rating system in the range [1, 5].

To put this into perspective, the densities of some similarly popular collaborative filtering datasets in the research community, Movielens 100K (100,000 ratings, 943 users and 1,682 items) and Netflix [5] (3,000 users, 3,000 items and 105,256 ratings), are 6.30% and 1.17%, respectively [3]. In particular, in the Movielens dataset all users are guaranteed to have voted on at least 20 items, whereas in our case we use 2 as a threshold. It is safe to say that, compared to others, our user-item matrix is a rather sparse one. In the Yelp dataset, however, we are given the opportunity to exploit social network structures, which are not available in the two previously mentioned examples. The statistics of the Yelp user-item ratings matrix are summarized in Table 5.2.

Statistics               User    Item
Min. number of ratings   3       3
Max. number of ratings   3,286   1,170
Avg. number of ratings   107     24

TABLE 5.2: Statistics of the rating matrix

5.2.1 Review Text

In this section we discuss the “text” field of the “review_subset” table. This field is deemed the most important part of the review itself, since the main assumption behind our methods is that the user's taste is expressed in the review text. Therefore, deciphering what the comment contains should ideally boost the performance over a method that utilizes only the user rating.

5.2.1.1 Text preprocessing

Prior to using the review text, some preprocessing needs to be applied. This preprocessing consists of three steps, as per traditional Information Retrieval, namely: text normalization, stop-word removal and stemming.

The goal of the normalization step is to map terms in the text into some universal form; for example, we would like tokens in capital letters to match the same tokens in lowercase. Stop-words are words whose appearance in text is so frequent that their contribution to selecting and matching documents is minimal; some examples are a, and, at, by, etc. All such words are therefore removed from the user comments.

Finally, during stemming, words are converted in a heuristic process that approximates lemmatization. Lemmatization is a linguistic, language-specific process which reduces tokens to their dictionary headword form, otherwise called the lemma of the word. For example, the words am, are and is would, following the linguistic rules, all reduce to be. In our implementation we made use of the Porter Stemmer implementation from the NLTK package [6].
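A minimal sketch of this three-step pipeline with NLTK (the exact tokenization and stop-word list used in the thesis are not specified, so details here are assumptions):

```python
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# requires: nltk.download("punkt"); nltk.download("stopwords")
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(review_text):
    """Normalize, remove stop-words, and stem one review."""
    tokens = word_tokenize(review_text.lower())   # normalization
    tokens = [t for t in tokens
              if t not in stop_words and t not in string.punctuation]
    return [stemmer.stem(t) for t in tokens]      # stemming

print(preprocess("The staff WAS friendly and the food was amazing!"))
# -> ['staff', 'friendli', 'food', 'amaz']
```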

FIGURE 5.1: Word cloud generated from text in user reviews

In Figure 5.1 we observe the 250 most commonly used words among the user comments (after the preprocessing steps, hence some are stripped of their suffixes). The size of each word corresponds to the frequency of its appearance.

Other interesting statistics of the review text data include the average token count per review and the average sentence count per review. These statistics can be found in Table 5.3.

Statistic                           Value
average word count per review       150
average sentence count per review   8

TABLE 5.3: Statistics regarding the review text

Finally, Figure 5.2 depicts the average comment length (in tokens) per rating. We can see that users tend to be more verbose on average when they are dissatisfied by the services provided. On the contrary, a high rating seems to be an adequate indicator of satisfaction, resulting in shorter item reviews.

5.3 Baseline Methods and Comparisons

In this section we discuss the methods we used as baselines for the recommendation task as well as the methods implemented. Each subsection presents a more thorough view on each specific method.


FIGURE 5.2: Average length of review per score.

In Table 5.4 we present an overview of this section: all the methods implemented and tested on our dataset, as well as the abbreviations with which they will be referenced in the rest of the paper.

There exist a plethora of RSs that base their predictions strictly on the ratings matrix. For the purposes of this experiment we pick a handful of state-of-the-art methods and compare them against our methods. The algorithms we picked are Probabilistic Matrix Factorization (PMF), Bayesian PMF, Social Recommendation (SoRec)⁵, BiasedMF and Regularized Singular Value Decomposition (RegSVD). We introduced those methods in Section 3.1.

5.3.1 Global Average

The most intuitive and simplistic way to generate a recommendation is by using the average ratings of items. This method obviously does not incorporate any type of personalization in the results since the recommendations will be the same for any given user. For any given user 𝑢, the predicted rating for an item 𝑖 is given by:

$$r_{u,i} = mean(r_i), \qquad (5.2)$$

where mean(r_i) stands for the average star rating of the item i over all reviews concerning this item. The predicted rating for a user-item pair thus consists solely of the overall average rating for that item, without making any assumption about the user.

5

SoRecwhich introduces a trust based social recommendation. For the purpose of this paper we alternate between different variants for the user-to-user trust measure and keep the rest of the implementation intact as provided by [38]


    Acronym                             Gloss                                                            Reference

    Naive Methods
    Global Average                      Proof of concept naive method                                    -
    Majority                            Proof of concept naive method                                    -

    Baselines
    Friends Average                     Extension of Global Average towards personalization              This thesis
    PMF                                 Probabilistic Matrix Factorization                               [53]
    Bayesian PMF                        Bayesian probabilistic matrix factorization
                                        using Markov Chain Monte Carlo                                   [52]
    BiasedMF                            Implicit Recommender Systems: Biased Matrix Factorization        [29]
    SoRec                               Social recommendation using probabilistic matrix factorization   [37]
    RegSVD                              Regularized Singular Value Decomposition                         [28]
    LambdaMART                          Adapting Boosting for Information Retrieval Measures             [61]

    Our Contribution
    Text-enhanced Recommendation        Leverages the textual information                                This thesis
    Social-enhanced Recommendation      Leverages the social information                                 This thesis
    Text-Social hybrid Recommendation   Hybrid method that leverages both textual
                                        and social features                                              This thesis

Table 5.4: An overview of the methods implemented and used in this thesis.

where mean(r_i) stands for the average star rating of item i across all reviews concerning this item. The predicted rating for a user-item pair thus consists solely of the overall average rating for that item, without making any assumption about the user.

5.3.2 Majority

For proof of concept we provide another naive method, one that is based on the majority vote per item. This is yet another unpersonalized method, since the value per item does not vary depending on the user. For any given user u, the predicted rating for an item i is given by:

r_{u,i} = MR(i) \qquad (5.3)

where MR(i) stands for the most common rating a given item i has received throughout the review corpus.
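For illustration, both naive predictors can be sketched as follows (the rating triples are fabricated stand-ins for the review dataset):

    from collections import Counter, defaultdict
    from statistics import mean

    ratings = [("u1", "i1", 4), ("u2", "i1", 5), ("u3", "i1", 4), ("u1", "i2", 2)]

    by_item = defaultdict(list)
    for user, item, r in ratings:
        by_item[item].append(r)

    def global_average(item):
        # Eq. (5.2): the item's mean rating, identical for every user.
        return mean(by_item[item])

    def majority(item):
        # Eq. (5.3): the item's most common rating, identical for every user.
        return Counter(by_item[item]).most_common(1)[0][0]

    print(global_average("i1"), majority("i1"))  # 4.33..., 4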

5.3.3 Friends Average

In this implementation we take the basic idea of the global average a step further towards personalization. The main difference is that here we are interested in the opinion of the given user's social circle. To incorporate that into the recommendation, we generate predictions that combine the ratings of the user's social circle with the global average rating of the items.


To elaborate, we introduce a linear combination of the two values to deduce the predicted rating. The formula used for predicting a rating is given below:

r_{u,i} =
\begin{cases}
  w_1 \cdot FM(r_i) + w_2 \cdot \mathrm{mean}(r_i), & \text{if } FM(r_i) \text{ is defined} \\
  \mathrm{mean}(r_i), & \text{otherwise}
\end{cases}
\qquad (5.4)

where FM(r_i) stands for the average rating that the friends of the given user u have given to item i. In case FM(r_i) is not defined (the user's friends have left no review for item i), a simple fall-back function is called upon, which uses the global average (as in 5.3.1). As seen in Equation 5.4, the weights w_1 and w_2 are parameters that need to be optimized to obtain the best score. The process followed to obtain the optimal weighting scheme is as follows: we sampled the test set N = 60 times, selecting s = test_set_size/8 samples at a time, and then estimated the RMSE score for each sample. In Figure 5.3 we can see the range of values the error metric takes for the various values of w_1 (x-axis).
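A minimal sketch of the predictor in Equation 5.4 is given below; the friends mapping and the ratings_by lookup are hypothetical structures, and the weights are those found optimal in Figure 5.3:

    from statistics import mean

    W1, W2 = 0.25, 0.75  # optimal weights, per Figure 5.3

    def friends_average(user, item, friends, ratings_by):
        # ratings_by[(u, i)] holds the rating user u gave item i, if any.
        friend_ratings = [ratings_by[(f, item)] for f in friends.get(user, [])
                          if (f, item) in ratings_by]
        global_avg = mean(r for (u, i), r in ratings_by.items() if i == item)
        if not friend_ratings:  # FM(r_i) undefined: fall back to Eq. (5.2)
            return global_avg
        return W1 * mean(friend_ratings) + W2 * global_avg

    friends = {"u1": ["u2", "u3"]}
    ratings_by = {("u2", "i1"): 5, ("u3", "i1"): 4, ("u4", "i1"): 2}
    print(friends_average("u1", "i1", friends, ratings_by))
    # 0.25 * 4.5 + 0.75 * (11/3) = 3.875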


5.3.4 Text-enhanced Recommendation

5.3.4.1 Topic Modeling

In our methods we make use of JGibbLDA⁶ [47, 7], a Java implementation that likewise applies Gibbs sampling for parameter estimation and inference. In Section 4.1, the values of the Dirichlet parameters were assumed to be known. These hyperparameters, however, significantly influence the behavior of the LDA model.

Heuristically, sufficiently good model quality has been reported [21] for α = 50/K and β = 0.1. Since our initial experiments assumed K = 50, we used α = 1 and β = 0.1, which proved to provide good results; we kept these values after changing the number of topics to K = 25.
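For illustration, a roughly equivalent configuration in Python's gensim is sketched below; note that gensim's LdaModel uses variational inference rather than the Gibbs sampling of JGibbLDA, and the toy documents are fabricated:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Stand-ins for the preprocessed (stemmed) review corpus.
    docs = [["waiter", "friendli", "food"], ["pizza", "crust", "chees"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    K = 25
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K,
                   alpha=1.0,  # the alpha used in our experiments (50/K at K=50)
                   eta=0.1)    # beta in the thesis notation
    doc_topics = lda.get_document_topics(corpus[0])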

5.3.4.2 Word Embeddings

To extract the word2vec similarity feature we make use of the official Google framework⁷. This implementation allows for the tuning of several parameters that affect training speed and quality. The first of these parameters prunes the dictionary: words that rarely appear in the corpus are deemed uninteresting, and, especially in our context, typos and garbage can be very frequent. We therefore define a lower threshold on the word appearance count below which words are disregarded; we set this pruning count to 10. Another parameter is the size of the neural network layers, which corresponds to the "degrees" of freedom the training algorithm has. Bigger sizes require more training data, but can lead to better (more accurate) models; reasonable values are in the tens to hundreds. We set the layer size to 50, which results in vector representations of length 50.
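The sketch below shows the corresponding setup in gensim's Word2Vec (the experiments used the original C tool; parameter names follow gensim 4.x, and the toy corpus merely stands in for the full review text):

    from gensim.models import Word2Vec

    # Stand-in for the full tokenized review corpus; repeated so the toy
    # words survive the min_count threshold.
    sentences = [["waiter", "friendli", "food", "amaz"]] * 20

    model = Word2Vec(sentences,
                     vector_size=50,  # NN layer size -> 50-dim word vectors
                     min_count=10,    # prune words seen fewer than 10 times
                     workers=4)
    # Pairwise similarities between words (and, by aggregation, between
    # reviews) can then be derived from the learned vectors:
    print(model.wv.similarity("waiter", "food"))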

5.3.5 Social-enhanced Recommendation

For this method, we make use of the previously mentioned Global Average feature and combine it with the social graph that we extracted using the methodology from Section 4.2.

5.3.6 Text-Social hybrid Recommendation

For the hybrid implementation we used the parameter tuning from the two methods mentioned above and combined their results as described in Section 4.3.

⁶ http://jgibblda.sourceforge.net/


5.3.7 Item Recommendation

For the task of item recommendation we experimented with a variety of learning to rank algorithms from the RankLib⁸ package, namely LambdaMART, Random Forests and Coordinate Ascent. We picked LambdaMART, a combination of LambdaRank and MART, to report on: it is a widely used learning to rank method and proved to be the most robust performer. LambdaMART, described in "Adapting Boosting for Information Retrieval Measures" by Wu and Burges [61], is an ensemble method consisting of boosted regression trees (MART) [17] in combination with LambdaRank [11]. The algorithm was tuned to use 1000 trees with 10 leaves each, the default setup of this method in the provider package. We use ERR@10 as the error metric during training.
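For reference, an invocation of this setup through RankLib's command-line interface would look roughly as follows (file names are placeholders; ranker id 6 selects LambdaMART):

    java -jar RankLib.jar -train train.txt -ranker 6 -metric2t ERR@10 \
         -tree 1000 -leaf 10 -save lambdamart.model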

5.4 Evaluation metrics

The task of recommendation is often divided into two subtasks. One is rating prediction, which, as its name suggests, is the procedure of predicting a rating ^r_{i,j} for a user i and item j. The second subtask is item recommendation, in which case the objective is to present the user with a ranked list of items considered most relevant to that user's taste. These two tasks require different metrics to measure their performance. The evaluation metrics used in this work adhere to the standards widely used in the recommender systems scenario; we divide them into two categories, depending on the use case.

5.4.1 Rating prediction

The accuracy for this task is measured by the ability to predict the rating a user would give to an item. Since the dataset used provides a numerical (star) value for the review of a user towards an item, it is most common to consider the proximity of the prediction to the actual value as the most significant aspect of the evaluation. In our implementation the main evaluation metric we optimize for is Root Mean Squared Error (RMSE), given below:

RMSE = \sqrt{\frac{1}{N} \sum_{i,j} (r_{i,j} - \hat{r}_{i,j})^2}, \qquad (5.5)

where r_{i,j} denotes the rating user i has given to item j, \hat{r}_{i,j} denotes the predicted rating, and N is the number of predicted ratings.


To accompany RMSE we make use of another metric, namely Mean Absolute Error (MAE), whose formula is as follows:

MAE = \frac{1}{N} \sum_{i,j} |r_{i,j} - \hat{r}_{i,j}|, \qquad (5.6)

the notation being the same as described above.

The "stricter" nature of RMSE becomes apparent when comparing the two formulas; however, it is one of the most common metrics used in this type of setup, and hence the one we optimized for.
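Both metrics are straightforward to compute; the minimal sketch below also illustrates RMSE's harsher treatment of large errors:

    from math import sqrt

    def rmse(actual, predicted):
        return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                    / len(actual))

    def mae(actual, predicted):
        return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

    # One small and one large error: RMSE (2.24) exceeds MAE (2.0).
    print(rmse([5, 1], [4, 4]), mae([5, 1], [4, 4]))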

5.4.2 Item recommendation

The task of item recommendation differs from rating prediction in that the objective is not to predict a rating for a given user-item pair, but to generate a list of items for a given user that places the items most relevant to that user towards the top. It becomes evident that the metrics presented in Subsection 5.4.1 are no longer applicable, since the objective is not to predict ratings but to generate ranked lists. The most broadly used evaluation metric, applied primarily in Information Retrieval tasks whose goal is to produce ranked lists of relevant documents, is Normalized Discounted Cumulative Gain (NDCG) [26]. The task of item recommendation approximates that of Information Retrieval by assuming that each user u has a "gain" g_{ui} from being recommended an item i. The average Discounted Cumulative Gain (DCG) for a list of items is defined as:

DCG = \frac{1}{N} \sum_{u=1}^{U} \sum_{i=1}^{I} \frac{g_{ui}}{\max(1, \log_b i)} \qquad (5.7)

where the logarithm base b is a free parameter, typically set to 2 (as it is in this work). The normalized DCG is given by:

NDCG = \frac{DCG}{DCG^*}, \qquad (5.8)

where DCG^* denotes the DCG of the ideal ranking, i.e., the maximum attainable DCG.
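A minimal sketch of both quantities for a single ranked list of gains, with log base 2 as in Equation 5.7:

    from math import log2

    def dcg(gains):
        # Gains are given in rank order (position 1 first).
        return sum(g / max(1.0, log2(i)) for i, g in enumerate(gains, start=1))

    def ndcg(gains):
        ideal = dcg(sorted(gains, reverse=True))  # DCG* of the ideal ordering
        return dcg(gains) / ideal if ideal > 0 else 0.0

    print(ndcg([3, 2, 0, 1]))  # ~0.977; equals 1.0 for the ideal ordering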


Chapter 6

Results and Discussion

We now evaluate the newly proposed methods and discuss our results alongside those obtained by the state-of-the-art methods we reproduced on our dataset. The results are split into two sections: one addressing the rating prediction task and another focusing directly on the item recommendation task. We conclude this chapter with a section dedicated to a qualitative analysis of the topic modeling, using examples drawn from our dataset.

6.1 Rating Prediction

We first reproduce the simplistic approaches as well as some state-of-the-art methods that have been covered in Section 5.3. We compare the results we obtain for state-of-the-art methods to those achieved by incorporating social and textual information.

The performance of the Global Average and Majority methods is shown in the Naive Baselines segment of Table 6.1. As stated previously, these methods carry no intelligence; they exist only to provide an intuition for the contents of our ratings matrix and to present something to improve upon. We notice, however, that despite their simplicity they yield particularly good results on the dataset.

The Global Average's performance clearly exceeds that of Majority. The better naive method achieves an impressive score below 1.0 on MAE and very close to the 1.0 mark for RMSE, which sets the bar high in terms of performance for the rest of the implemented methods.

As discussed in Section 5.3, we applied a variety of machine learning algorithms to our dataset in order to provide the baseline performances that we want to improve upon. The results yielded by each of the baseline methods are outlined in the State-of-the-art section of Table 6.1.

At first glance we can see that the best performance is achieved by BiasedMF for both metrics, by far outperforming all the rest. The second best performing method is RegSVD. Probabilistic Matrix Factorization has the worst performance of the six methods, yielding a poor 3.03 for RMSE and close to the 3 mark again for MAE, which renders it the least suitable for the given experiment.

Lastly for the task of rating prediction, we look at the performance of the models we implemented. Table 6.1 shows the collective performance of all the methods we reproduced, with the last segment dedicated to the newly introduced methods. As we saw in the analysis of the state-of-the-art methods, the method that outperformed the rest was BiasedMF, which achieved top performance in both error metrics.

6.1.1 Social-enhanced recommendation

We evaluate three variations of SoRec, all having the user-to-user trust weighted by the extracted similarity measure as described in Chapter 4. Unlike Ma et al. [38], who had an elaborate way of determining social trust through their platform, we determine trust using the similarity features described in Chapter 4. In the SoRec experimentation we notice that the best performance is achieved by the hybrid variant, which determines user trust by weighing in both the social network and the textual information.

Looking at the Weighted Friends Average with its optimal weighting of w_1 = 0.25, w_2 = 0.75 (see Figure 5.3), which takes advantage of the social factor, we see that it outperforms all the previously introduced methods by far in the case of MAE, while losing only 0.025696 in RMSE to the best performer, BiasedMF. The similarity between the two methods' performance can be explained by the fact that both resort to the global average for rating prediction when addressing the cold-start problem.

Weighted Friends Average achieves an improvement over all variations of SoRec as well, showing that weighing in friends' ratings along with the Global Average is a good formula for tackling the problem by maximally utilizing the wisdom of the crowds.

6.1.2 Text-enhanced recommendation

The second method we present is the LDA Similarity based Regression, in which we use the features extracted from the LDA similarity of the comments to predict the ratings. As presented in the table, a simple regression over the carefully selected LDA features makes for a very effective implementation that stands very close to the advanced mathematical models used by the state-of-the-art methods presented previously.
