Multi-modal Sentiment Analysis for Social Venue Recommendation

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Bouke Hendriks

10992197

MASTER INFORMATION STUDIES

HUMAN-CENTERED MULTIMEDIA

FACULTY OF SCIENCE

UNIVERSITY OF AMSTERDAM

March 8, 2017

1st Supervisor: Dr. Masoud Mazloom

2nd Supervisor: Dr. Stevan Rudinac

Bouke Hendriks

University of Amsterdam

boukehendriks@gmail.com

supervisor: Masoud Mazloom

ABSTRACT

Understanding a user’s preference towards a particular venue is a key problem in venue recommendation systems. People often express their opinions about visiting a particular venue through social media. Leveraging this abundance of social media posts to understand users’ opinions in this setting is called social venue recommendation. Common methods often ignore the visual and textual sentiment of posts. In this thesis we propose a novel multi-modal sentiment analysis framework that jointly considers textual and visual information to predict user ratings. Using multi-modal features selected by our framework, a sentiment-driven user-item matrix can be constructed to infer a user’s ratings of items. Using a data set collected from Instagram with over one million posts related to venues, we verify the effectiveness of our proposed method for social venue recommendation. For each modality, we examine the contribution of sentiment analysis to the effectiveness of venue recommendations. In particular, we show that visual and textual sentiment analysis are complementary for creating a user-item matrix for the purpose of generating social travel recommendations.

Keywords

Multimedia analysis; Sentiment analysis; Recommendation system

1. INTRODUCTION

Nowadays, many social media services, such as Twitter and Instagram, provide an easy form of communication that enables users to broadcast and share personal opinions by posting both textual (comments/hashtags) and visual (images/videos) information [2]. Recommender systems are now pervasive in consumers’ lives; their aim is to help users find items or points of interest that they would like to buy, visit, or consider, based on the huge amounts of data collected. Parsing this data to predict a user’s preference is the core of a recommender system. Once we have collected data about users and items and have created a matrix of user-item ratings, an algorithm is used to predict the rating for a user who has not rated the item yet. A rating for an item can be predicted from the ratings given to the item by users who are similar in taste to the given user. In this thesis, we present a multi-modal sentiment analysis for creating a matrix of user-item ratings. Inspired by [24], which shows the success of using both visual and textual sentiment for predicting the popularity of product-related posts, we utilise features taken from both modalities to extract sentiment, and use them as ratings in user-item matrix creation.

One of the active research problems in this area is venue recommendation, which suggests places of interest for a user to visit. With the increasing number of travelers, online travel recommender systems such as TripAdvisor are in greater demand. Typically, classic venue recommenders depend on an explicit rating-based user-item matrix that matches users to venues. Thus, they require active users to rate different venues before recommendations can start. As a result, a binary or numeric assignment of users to items makes up the user-item matrix. Nowadays, however, users prefer to show their opinion about visiting a venue by mentioning their feelings about it and sharing posts with friends, family, and followers on social networks. If they are happy about visiting a place they express positive emotions about it; otherwise they express neutral or negative emotions. Feelings such as happiness, sadness, or neutrality are beyond the scope of the binary assignment of the user to the item in the user-item matrix and are hard to capture without content analysis. Yet at the same time users implicitly express their opinions through the mood of the pictures they post. As previously mentioned, only a content-based analysis can reveal those hidden opinions. Thus, the applicability of classic recommender systems might be limited. In this thesis, we present a content-based venue recommendation approach which uses both the image and the text associated with user posts to create a user-item matrix. To this end, we crawled a large real-world dataset of Instagram posts related to predefined points of interest (venues) in the city of Amsterdam. Then, we extract visual and textual sentiment from the user posts to compute a rating score for the venue and create a user-item matrix. We then evaluate whether this user-item matrix can successfully be utilised in venue recommendation. The pipeline of our multi-modal venue recommendation is conceptually depicted in Figure 1.

Figure 1: Global overview of our proposal for a multi-modal recommendation system. Phase one describes the data crawling, where we collect Instagram posts. Phase two describes the visual, textual and combined sentiment extraction. Phase three describes our user-item matrix creation, where the extracted sentiment is taken as the rating a user gives to a certain venue.

In this thesis we make the following main contributions:

• We obtain a new data set containing Instagram posts related to venues in Amsterdam;

• We propose a novel framework to predict user ratings by extracting visual, textual and combined sentiment of user posts;

• We create sentiment-based user-item matrices and evaluate the performance of the recommendation system on various algorithms.

We organize the remainder of this thesis as follows. We begin by mentioning related work in Section 2. Section 3 describes our problem formulation. We describe the process of data gathering in Section 4. In Section 5, we elaborate on our methodology. We introduce the experimental setup on our dataset in Section 6. Results are presented in Section 7 and a discussion of several possible directions for future work is given in Section 8. Finally, Section 9 concludes with a summary of our findings.

2. RELATED WORK

In this section, we detail relevant related work on recommender systems, and multimedia sentiment analysis.

2.1 Recommender systems

The core idea behind a recommender system is that it can automatically predict user responses to certain options [18]. In recent years, recommender systems have attracted increased attention. Previous work on recommendation can be divided into content-based recommendation and collaborative filtering.

Early content-based recommendation, as an outgrowth and continuation of information retrieval [3], recommends items to a user by analyzing the user’s profile [19]. A typical content-based recommendation consists of three steps: item feature representation, user profiling and content filtering [45]. To model each item, bag-of-words feature representations, e.g., TF-IDF, Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA), have been applied to extract item-specific features [34]. Agarwal and Chen [1] propose a latent factor model which combines document features through their topic intensities to predict user preferences. Several classification models have been successfully applied to infer a user’s personal interest [32, 34]. Unlike content-based filtering strategies [20], which predict ratings by analyzing user profiles, collaborative filtering (CF) methods [36], either memory-based or model-based, predict ratings using user-item rating matrices. Early CF-based methods apply memory-based techniques. The most widely used memory-based CF methods include user-based methods [31] and item-based methods [33]. Among the model-based CF methods, latent factor models [16] have become very popular as they show state-of-the-art performance on multiple datasets. Aimed at factorizing a rating matrix into the product of a user-specific matrix and an item-specific matrix, matrix factorization based methods [16, 17, 25] are widely used. In recent years, to tackle the “cold-start” problem, more and more researchers have started to consider combining content-based recommendation with collaborative filtering. Wang and Blei [42] apply topic models to the explainable recommendation problem to discover explainable latent factors in probabilistic matrix factorization.

In recent years, recommendation on Twitter has attracted increased attention. Collaborative filtering is being applied to social media recommendation tasks [21, 22, 43]. Yang et al. [44] address recommendation and link prediction tasks based on a joint-propagation model between social friendship and interests. Chen et al. [8] propose a collaborative filtering method to generate personalized recommendations on Twitter through a collaborative ranking procedure. Similarly, Pennacchiotti et al. [28] propose a method to recommend “novel” tweets to users by following users’ interests and using the tweet content. However, as far as we know, multimedia recommendation on social media is still an ongoing research topic. Recent work on location-based multimedia recommendation using community-contributed content has been proposed [47]. Pang et al. [27] apply user-generated travelogues to select representative Flickr images for a particular tourist destination. Kofler et al. [15] propose a system analyzing users’ captured Flickr images to provide personalized point-of-interest recommendations. An interactive city explorer has been proposed to navigate New York venues through the eyes of New Yorkers having a similar taste to the interacting user [47].

Table 1: Dataset statistics. The datasets $D_x$ represent the different datasets we obtained through our research. The table shows the number of posts each dataset holds. Statistics marked with ∗ are obtained after clean-up.

Term                                  Size
tagList                               472
venues with > 50 posts                152
$D_{initial}$                         3,129,709
$D_{clean}$                           1,837,219∗
unique users in $D_{clean}$           726,238
$D_{text}$                            726,238∗
$D_{visual}$                          599,756∗
Dataset                               599,756∗

Unlike previous work on social media recommendation, social travel recommendation focuses on providing reasonable travel destinations to a specific user by analyzing the user’s interests from social media streams. In this thesis we propose to use the visual and textual sentiment associated with a post to compute the user’s interest in venues, create a new user-item matrix, and apply it in a social venue recommendation system.

2.2 Multimedia sentiment analysis

In recent years, textual sentiment analysis has received a lot of attention as a core task in text mining. As a fundamental task in sentiment analysis, sentiment classification [41] is crucial to understanding user-generated content in product reviews. Lexicon-based methods [10, 40, 41] utilize a lexicon of sentiment words to predict sentiment labels, whereas corpus-based methods [14, 26, 48] classify sentences into sentiment polarities using corpora that are labeled with sentiment labels. With the recent success of deep neural networks [4, 12], more and more approaches to the sentiment classification task learn low-dimensional feature vectors. Tang et al. [37] propose a sentiment-specific word embedding method for short-text sentiment classification in social media. For encoding relations between sentences in documents, a recurrent neural network has been proposed to learn representations of documents for sentiment classification [38]. By taking user information into account, Tang et al. [39] present a user-vector composition model to predict user ratings.

Recently, social media users are increasingly using images and videos to express their opinions and share their experiences. The analysis of emotion and sentiment from visual content has become an exciting area in the multimedia community, making it possible to build an opinion miner [5]. Borth et al. [5] propose SentiBank, a set of 1,200 trained visual concept detectors for predicting the sentiment reflected in visual content. Inspired by the recent successes of deep learning, several works develop effective deep convolutional network architectures for visual sentiment analysis [9, 46]. Mazloom et al. [24] show that prediction of sentiment from visual content is complementary to textual sentiment analysis in brand popularity prediction on social networks. Inspired by [24], we employ features taken from both the visual and textual modalities of user posts to extract the sentiments and emotions of users for rating different venues.

Algorithm 1: General algorithm used to create the three different user-item matrices. The function get.userID(p) returns the user id related to post p.

Require: n holds the number of venues and m holds the number of posts of the current venue

function CREATE USER-ITEM MATRIX
    visualUI ← ∅, textualUI ← ∅, combiUI ← ∅
    for each venue v ← 1 to n do
        for each post p ← 1 to m do
            venue_id = v
            user_id = get.userID(POST[p])
            SV = SentiScore(visual content of POST[p])
            ST = SentiScore(textual content of POST[p])
            g = mean(SV, ST)

            visualUI = visualUI ∪ (user_id, venue_id, SV)
            textualUI = textualUI ∪ (user_id, venue_id, ST)
            combiUI = combiUI ∪ (user_id, venue_id, g)
    return visualUI, textualUI, combiUI

3. PROBLEM FORMULATION

Before introducing our proposed framework for multi-modal venue recommendation, we formally introduce our notation and key concepts in this section.

We assume that there are $U$ users denoted by $\mathcal{U} = \{u_1, \ldots, u_U\}$, $I$ items denoted by $\mathcal{I} = \{i_1, i_2, \ldots, i_I\}$, and a set of observed indices $Q = \{(u, i)\}$, where each pair $(u, i) \in \mathcal{U} \times \mathcal{I}$ indicates an observed rating $r_{u,i} > 0$, the rating which user $u$ gives to item $i$. In this thesis, we assume that the ratings $r_{u,i}$ can be estimated by extracting the feeling and sentiment of user $u$ towards item $i$ by analyzing $d_{u,i}$, a social media post which user $u$ has posted about item $i$. We assume each $d_{u,i}$ posted by a user $u \in \mathcal{U}$ about an item $i \in \mathcal{I}$ consists of two modalities: image and text related to item $i$. We suppose we can infer the rating $r_{u,i}$ by combining the sentiment extracted from both modalities. Formally, given a set of users $\mathcal{U}$ and a set of candidate items $\mathcal{I}$, during recommendation we need to learn a function $f : \mathcal{U} \times \mathcal{I} \to R$, where $R$ indicates the set of ratings between users and items. Thus, given a user $u \in \mathcal{U}$, the target of the recommendation process is to find a proper item $v \in \mathcal{I}$, so that:

\[
v = \arg\max_{v' \in \mathcal{I}} f(u, v') \tag{1}
\]

Next, we define the task of collaborative filtering recommendation. Following the definition used by Shi et al. [35], given the ratings matrix $R$ between users $\mathcal{U}$ and items $\mathcal{I}$, the collaborative filtering task is to recommend to each user a list of items that are ranked in descending order of relevance to the user.
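As a minimal illustration of Equation (1) and of the ranked-list view of collaborative filtering, the sketch below sorts candidate items by a scoring function. The function name `recommend`, the toy scores, and the venue labels are hypothetical and only stand in for a learned function f.

```python
from typing import Callable, Hashable, Iterable, List


def recommend(user: Hashable,
              items: Iterable[Hashable],
              f: Callable[[Hashable, Hashable], float],
              top_n: int = 3) -> List[Hashable]:
    """Rank candidate items for `user` by predicted rating, best first."""
    return sorted(items, key=lambda v: f(user, v), reverse=True)[:top_n]


# Toy stand-in for the learned function f (values are made up).
toy_scores = {("u1", "museum"): 8.5, ("u1", "park"): 6.0, ("u1", "station"): 3.5}


def toy_f(u, v):
    return toy_scores[(u, v)]


ranking = recommend("u1", ["park", "station", "museum"], toy_f, top_n=2)
best = ranking[0]  # the first entry is the arg max of Equation (1)
```

The first element of the returned list corresponds to the arg max, while the full list corresponds to the descending order of relevance used in the collaborative filtering definition.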

Finally, we define the notion of the multi-modal venue recommendation that we propose in this thesis. We aim to develop a system that can predict the ratings of users for certain points of interest (venues) located in a large urban area. We assume that the viewpoint distribution $\theta^v_u$ is derived by a finite mixture over a personalized base distribution $\theta^0_{u,v}$ and the viewpoint distributions of $u$’s trusted relations. Given a user $u$ and an item $i$, we set a multinomial distribution $f_{u,i}$, which derives from the viewpoint distribution $\pi_i$ for item $i$, to reflect the viewpoint chosen by $u$ for their rating of item $i$. If a user $u$ writes a user review $d_{u,i}$ for item $i$, there is a corresponding rating $r_{u,i} \in [1, R]$ derived from a multinomial distribution over $\theta_{u f_{u,i}}$. We propose to compute the rating $r_{u,i}$ in $R$ for each $(u, i)$ by the function $g(u, i)$, which is given by:

\[
\begin{aligned}
g(u, i) &= g(V_{u,i}, T_{u,i}) = \mathrm{mean}(S_V, S_T),\\
S_V &= \mathrm{SentiScore}(V_{u,i}),\\
S_T &= \mathrm{SentiScore}(T_{u,i}),
\end{aligned}
\tag{2}
\]

where $V_{u,i}$ is the visual content which user $u$ posted related to item $i$ and $T_{u,i}$ holds the textual content. Here $S_V$ and $S_T$ denote the visual and textual sentiment scores, respectively.

4. DATASET

To evaluate the effectiveness of our proposed solution, in this section we first elaborate on our dataset.

Instagram is chosen as the data collection platform, as it has a strong focus on self-expression and offers a vast amount of publicly available multi-modal data. As we target city-specific solutions, we collect data from one particular city, namely Amsterdam. Our focus lies on data related to specific points of interest (venues) found within the Amsterdam metropolitan area. Below, we describe our method for constructing our dataset.

The first step involves defining a list of venues found within the Amsterdam metropolitan area. We define a list of venues denoted as $tagList = \{tag_1, tag_2, \ldots, tag_n\}$ ($n = 472$), where $tag_i$ holds the title of venue $i$ (e.g., “anne frank huis”) defined by Amsterdam Open Data [23]. We utilise the Instagram API [13] to crawl the Instagram platform for posts related to hashtags, where the hashtags are derived from the tags in $tagList$. As a hashtag on Instagram can only consist of a single string of characters, hashtags are often a concatenation of multiple words. Special characters and white spaces are therefore removed from every $tag_i$ in $tagList$ before starting the crawling process (our example tag becomes “annefrankhuis”).
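The tag normalisation step can be sketched as follows. This is only an illustration: the helper name `venue_to_hashtag` is hypothetical, lower-casing is an added assumption that the text does not state, and the second venue title is made up.

```python
def venue_to_hashtag(title: str) -> str:
    """Turn a venue title into a single hashtag string.

    Keeps letters and digits only, which drops white space and special
    characters. Lower-casing is an assumption, not stated in the text.
    """
    return "".join(ch for ch in title if ch.isalnum()).lower()


# First title is the example from the text; the second is a made-up title.
tag_list = ["anne frank huis", "Example Venue (2nd Floor)!"]
hashtags = [venue_to_hashtag(title) for title in tag_list]
```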

We then proceed to mine our dataset by querying the Instagram API [13] for posts related to every hashtag in $tagList$. The returned data is denoted as $D_{initial}$, which holds all posts related to every $tag_i$. For every $tag_i$ we collect a set of posts, denoted by $P_{tag_i} = \{P_{i1}, P_{i2}, \ldots, P_{im}\}$, where $m$ is the number of posts found for $tag_i$. The oldest post in the dataset is dated 28-Oct-2010 and the most recent post is dated 07-Apr-2016.

We clean our data set by removing both duplicate and corrupted posts. Corrupted posts may be caused by changes to the Instagram terms of use, or because the post was recently removed by its user. In our experiments, not all tags are included in our user-item matrix. To avoid sparsely represented venues, we limit our data to tags that have a minimum of 50 posts and whose posts contain both visual and textual information. Statistics of our dataset are shown in Table 1; we reach a final dataset $D_{clean}$ containing a total of 1,837,219 posts. $D_{clean}$ holds all posts related to every $tag_i$, yet only holds posts of those tags where $P_{im}$ has both visual and textual content and $m > 50$. In our dataset, the posts made by 726,238 unique users represent the sentiment these users expressed towards a total of $n = 152$ different venues. As we see in Table 1, out of the 726,238 posts of unique users, 599,756 posts have visual data; we refer to these as Dataset in our experimental setup.
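The clean-up rules can be sketched as a small filter. The post fields (`id`, `tag`, `image`, `text`) are hypothetical, duplicates are assumed to be detected per tag by post id, and the threshold is applied after the modality check using the strict form $m > 50$; the text does not pin down any of these choices.

```python
from collections import defaultdict


def clean_posts(posts, min_posts=50):
    """Drop duplicates and posts missing a modality, then drop sparse tags.

    `posts` is an iterable of dicts with hypothetical keys
    'id', 'tag', 'image' and 'text'.
    """
    seen = set()
    by_tag = defaultdict(list)
    for post in posts:
        key = (post["tag"], post["id"])
        if key in seen:  # duplicate of an already kept post
            continue
        seen.add(key)
        if not post.get("image") or not post.get("text"):  # missing modality
            continue
        by_tag[post["tag"]].append(post)
    # Keep only tags with more than `min_posts` remaining posts (m > 50).
    return {tag: kept for tag, kept in by_tag.items() if len(kept) > min_posts}


toy_posts = [
    {"id": 1, "tag": "a", "image": "i1", "text": "t1"},
    {"id": 2, "tag": "a", "image": "i2", "text": "t2"},
    {"id": 3, "tag": "a", "image": "i3", "text": "t3"},
    {"id": 1, "tag": "a", "image": "i1", "text": "t1"},  # duplicate
    {"id": 4, "tag": "a", "image": "i4", "text": ""},    # no text
    {"id": 5, "tag": "b", "image": "i5", "text": "t5"},
    {"id": 6, "tag": "b", "image": "i6", "text": "t6"},
]
cleaned = clean_posts(toy_posts, min_posts=2)  # small threshold for the toy data
```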

Figure 2: Example of computing user-item rating by visual and textual sentiment analysis. This post was crawled in relation to #amsterdamcentraal.

5. METHOD

In this section we give a general overview of how we predict the ratings of users for venues based on the analysis of Instagram posts. After crawling a Dataset of user posts related to specific venues, which include visual and textual content, we create user-item matrices using textual (textualUI) and visual (visualUI) sentiment analysis. We also create a matrix by combining visualUI and textualUI, which we call combiUI. Models trained on the information in these user-item matrices are then used to predict ratings for new users.

5.1 User-item matrix creation

As social media is often used as a platform where users express their opinions, posts carry sentiment. On Instagram specifically, posts contain information in two modalities: textual and visual. Users often post a picture together with a caption and some hashtags. We hypothesize that the sentiment extracted from both channels can be utilised as a rating in a user-item matrix for use in a recommendation system. To the best of our knowledge, there has been no work on utilising extracted sentiment as a rating in a user-item matrix for recommendation.

5.1.1 Textual user-item matrix

To extract sentiment from textual data for creating the textualUI matrix, we utilise SentiStrength [40]. SentiStrength draws on psychology research to determine both negative and positive sentiment in the short informal text of post $d_{u,i}$, giving a near-human-accuracy estimate of sentiment (see Figure 2). The overall sentiment of the post $d_{u,i}$ is found by averaging these scores, giving us a sentiment score. We convert textual sentiment scores to a rating by normalizing the scores to the scale $[1, 11]$ (denoted as $r_{u,i} = S_T = \mathrm{SentiScore}(T_{u,i})$), to prevent negative ratings in recommender algorithms. We follow this procedure to compute a textual sentiment score for all user posts in our dataset. In this way we create the textual user-item matrix textualUI.
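The conversion from SentiStrength output to a rating can be sketched as below. The sketch assumes the standard SentiStrength output of a positive strength in 1..5 and a negative strength in -5..-1, so that their average lies in $[-2, 2]$, and it assumes a linear (min-max) rescaling to $[1, 11]$; the exact normalisation is not specified in the text, and the function names are hypothetical.

```python
def scale_to_rating(score, lo=-2.0, hi=2.0, new_lo=1.0, new_hi=11.0):
    """Linearly map a sentiment score from [lo, hi] to [new_lo, new_hi]."""
    return new_lo + (score - lo) * (new_hi - new_lo) / (hi - lo)


def text_rating(positive, negative):
    """Average the two SentiStrength strengths and rescale to [1, 11].

    Assumes positive in 1..5 and negative in -5..-1.
    """
    if not (1 <= positive <= 5 and -5 <= negative <= -1):
        raise ValueError("expected positive in 1..5 and negative in -5..-1")
    return scale_to_rating((positive + negative) / 2)


very_positive = text_rating(5, -1)  # average  2.0
neutral = text_rating(1, -1)        # average  0.0
very_negative = text_rating(1, -5)  # average -2.0
```

Under these assumptions a neutral caption lands in the middle of the rating scale, and no rating can fall below 1.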

5.1.2 Visual user-item matrix

We are inspired by the work of Borth et al. [6] for creating visualUI, which addresses the challenge of sentiment analysis for visual content and developed SentiBank. SentiBank is a visual concept library that can be used to detect the presence of 1,200 adjective noun pairs within an image. SentiBank is developed utilising a visual sentiment ontology built from user-generated content, and provides us with a large-scale VSO (Visual Sentiment Ontology) founded on a psychological model (Plutchik’s Wheel of Emotions [29]). By using the extracted adjective noun pairs, we are able to infer the sentiment found within an image.

To extract visual sentiment for post $d_{u,i}$, visual sentiment features are first extracted by running the SentiBank visual concept detector [6] over the image of $d_{u,i}$. SentiBank computes a 1,200-dimensional vector by mapping each image to probability scores on 1,200 adjective noun pairs, with each adjective noun pair being related to a sentiment score provided by the ontology. We take the top-$k$ adjective noun pairs and compute the overall sentiment of a post by average pooling their related sentiment scores (denoted as $r_{u,i} = S_V = \mathrm{SentiScore}(V_{u,i})$), as can be seen in Figure 2. Sentiment scores given by SentiBank are scaled to fall within the interval $[-2, 2]$. As with textual sentiment, we convert the sentiment scores to a rating by normalizing them to the scale $[1, 11]$. We follow this procedure to compute a visual sentiment score for all user posts in our dataset. Then, we create the visual user-item matrix visualUI.
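The top-$k$ average pooling can be sketched as follows. It assumes that the top-$k$ adjective noun pairs are those with the highest detector probability and that their ontology sentiment scores are averaged without weighting; the function name and the four-entry toy vectors are hypothetical (the real detector output has 1,200 entries).

```python
def visual_sentiment(anp_probs, anp_sentiments, k=10):
    """Average the sentiment of the k most probable adjective noun pairs.

    anp_probs[j] is the detector probability of pair j and
    anp_sentiments[j] its ontology sentiment score in [-2, 2].
    """
    if len(anp_probs) != len(anp_sentiments) or not anp_probs:
        raise ValueError("need two non-empty lists of equal length")
    top = sorted(range(len(anp_probs)),
                 key=lambda j: anp_probs[j], reverse=True)[:k]
    return sum(anp_sentiments[j] for j in top) / len(top)


# Toy detector output for four adjective noun pairs (made-up numbers).
probs = [0.9, 0.1, 0.8, 0.3]
sentiments = [2.0, -2.0, 1.0, 0.0]
score_top2 = visual_sentiment(probs, sentiments, k=2)
score_top3 = visual_sentiment(probs, sentiments, k=3)
```

The resulting score stays in $[-2, 2]$ and would then be normalised to $[1, 11]$ as described in the text.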

5.1.3 Combined visual and textual user-item matrix

After creating the textualUI and visualUI matrices for our dataset, we compute a new user-item matrix, combiUI, by averaging the ratings for each user-item pair found in both the visual, $S_V$, and textual, $S_T$, matrices; denoted as $r_{u,i} = \mathrm{mean}(S_V, S_T)$.

The pseudocode in Algorithm 1 shows the process of creating different user-item matrices from user posts.
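A runnable reading of Algorithm 1 could look like the sketch below. The post fields (`user_id`, `image`, `text`) and the two scorer callables are hypothetical placeholders for the crawled data and for the visual and textual sentiment extractors; the stub scorers return made-up ratings.

```python
from statistics import mean


def create_user_item_matrices(posts_by_venue, senti_visual, senti_text):
    """Build the visual, textual and combined matrices as (user, venue, rating) triples."""
    visual_ui, textual_ui, combi_ui = [], [], []
    for venue_id, posts in posts_by_venue.items():
        for post in posts:
            user_id = post["user_id"]
            s_v = senti_visual(post["image"])  # visual rating S_V
            s_t = senti_text(post["text"])     # textual rating S_T
            visual_ui.append((user_id, venue_id, s_v))
            textual_ui.append((user_id, venue_id, s_t))
            combi_ui.append((user_id, venue_id, mean([s_v, s_t])))
    return visual_ui, textual_ui, combi_ui


# Toy data and stub scorers (all values are made up).
toy_posts = {
    "annefrankhuis": [{"user_id": "u1", "image": "img-1", "text": "moving visit"}],
    "amsterdamcentraal": [{"user_id": "u2", "image": "img-2", "text": "train delayed"}],
}


def stub_visual(image):
    return 8.0 if image == "img-1" else 5.0


def stub_text(text):
    return 6.0 if "moving" in text else 3.0


visual_ui, textual_ui, combi_ui = create_user_item_matrices(
    toy_posts, stub_visual, stub_text)
```

How several posts by the same user about the same venue are merged is not described in the text, so the sketch simply keeps one triple per post.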

5.2 Rating prediction

In the final phase we utilise the created user-item matrices to predict item ratings for new users. Predictions are computed with the data crawled from Instagram, where we assume that similar users share similar interests. We adopt Factorization Machine (FM) models from both the LibFM [30] and LibRec [11] packages.

The software tool LibFM gave us the opportunity to apply three learning methods: Stochastic Gradient Descent (SGD) and Alternating Least-Squares (ALS) optimization, as well as Bayesian inference using Markov Chain Monte Carlo (MCMC), as described in [30]. The LibRec toolkit gave us the opportunity to apply the SVD++ and biasedMF algorithms.

[Figure 3 plots: RMSE (y-axis, 0 to 1) of MCMC, ALS, SGD, SVD++ and BiasedMF against (a) number of iterations, (b) learning rate, (c) number of factors.]

Figure 3: Recommendation performance in terms of RMSE using the textual user-item matrix with various algorithms. (a) shows the results of the different algorithms when the number of iterations is varied while the other two parameters are fixed. (b) shows the results when the learning rate is varied, using the number of iterations found in (a) and a fixed number of factors. (c) shows the results when the number of factors is varied, using the best number of iterations and learning rate found in (a) and (b).

variables, and thus has a high prediction quality in sparse settings such as our UI matrices. Matrix factorization models such as these learn low rank representations (latent factors) of users and items from the information in the UI matrix. These latent factors are then utilised to predict new scores between users and items [35]. LibFM iteratively estimates the weights of the interactions between ratings, by minimizing a cost function at every step. Optimality of model parameters is defined with a loss function L where the task is to minimize the sum of losses over the observed data D [30].

Every user-item matrix is split into a training and a test set (20-80% division). For each user-item matrix, we train a model on the training set. The model is then used to predict ratings for all user-item pairs in the test set.
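To make the idea of learning latent factors from a user-item matrix concrete, here is a small biased matrix factorization trained with SGD in plain Python. It is a stand-in written for illustration, not the LibFM or LibRec implementation, and the toy ratings, the regularisation weight, and the default hyperparameters are assumptions. For brevity the error is measured on the training triples rather than on a held-out test set.

```python
import math
import random


def train_biased_mf(ratings, n_factors=5, n_iters=200, lr=0.05, reg=0.02, seed=0):
    """Fit r ~ mu + b_u + b_i + <p_u, q_i> by SGD and return a predict(u, i) function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_user, b_item, p, q = {}, {}, {}, {}
    for u, i, _ in ratings:
        b_user.setdefault(u, 0.0)
        b_item.setdefault(i, 0.0)
        p.setdefault(u, [rng.gauss(0, 0.1) for _ in range(n_factors)])
        q.setdefault(i, [rng.gauss(0, 0.1) for _ in range(n_factors)])

    def predict(u, i):
        # Unknown users or items fall back to the mean plus any known bias.
        dot = sum(a * b for a, b in zip(p.get(u, []), q.get(i, [])))
        return mu + b_user.get(u, 0.0) + b_item.get(i, 0.0) + dot

    for _ in range(n_iters):
        for u, i, r in ratings:
            err = r - predict(u, i)
            b_user[u] += lr * (err - reg * b_user[u])
            b_item[i] += lr * (err - reg * b_item[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict


# Toy (user, venue, rating) triples on the [1, 11] scale (made up).
toy = [("a", "x", 9.0), ("a", "y", 8.0), ("b", "x", 4.0),
       ("b", "y", 3.0), ("c", "x", 7.0), ("c", "y", 6.0)]


def rmse_of(fn):
    return math.sqrt(sum((r - fn(u, i)) ** 2 for u, i, r in toy) / len(toy))


predict = train_biased_mf(toy)
mean_rating = sum(r for _, _, r in toy) / len(toy)
model_rmse = rmse_of(predict)
baseline_rmse = rmse_of(lambda u, i: mean_rating)  # always predict the mean
```

The three hyperparameters `n_iters`, `lr` and `n_factors` correspond to the iterations, learning rate and number of factors that are tuned in the experiments.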

6. EXPERIMENTAL SETUP

6.1 Dataset

To evaluate our multi-modal venue recommendation we split our Dataset, i.e., the user-item matrices (see Section 5.1), equally into a training and a test set. We train a model on the training set and use it to predict a rating score per user-item pair at test time.

6.2 Implementation details

We describe detailed parameter setting and tuning approaches for user-item matrices used in the thesis.

Visual user-item matrix parameter. To find an optimal value of $k$ for the top-$k$ adjective noun pairs, whose sentiment scores we average to obtain the overall rating score $S_V$ of an image (Section 5.1.2), we vary $k$ over {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. In other words, we create ten visual user-item matrices, visualUI, to find the optimum value of $k$.

Learning algorithm. We investigate the effectiveness of using our extracted sentiment as ratings in venue recommendation by implementing a variety of learning algorithms: SGD, ALS, MCMC, biasedMF, and SVD++.

Model parameters. To find the optimum values for the parameters of our learning models, namely the number of iterations ($I$), the learning rate ($LR$), and the number of factors ($F$), we vary $I$ over {25, 50, 75, 100, 150, 250, 500, 1000}, $LR$ over {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05}, and $F$ over {5, 10, 15, 20, 25, 30, 40, 50, 75, 100}. In each step we fix the values of two of these parameters to find the optimal value for the third.

[Figure 4 plot: RMSE (y-axis, 0 to 1) against k, the number of top ANPs (x-axis, 10 to 100).]

Figure 4: Recommendation performance in terms of RMSE using a visual user-item matrix based on different numbers of adjective noun pairs. We use the SVD++ algorithm with the best parameters found in Figure 5 (a), (b), and (c).
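The one-parameter-at-a-time tuning described under "Model parameters" can be sketched as a simple coordinate-wise search. The function `tune_one_at_a_time` and the toy error function are hypothetical; in practice `evaluate` would train a model and return its test RMSE. The starting values for the learning rate and the number of factors follow the values the text reports as fixed (0.001 and 20), while the starting number of iterations is an arbitrary assumption.

```python
def tune_one_at_a_time(evaluate, grids, start):
    """Tune each parameter in turn while keeping the others fixed.

    `evaluate(**params)` returns an error to minimise (e.g. RMSE).
    Parameters are tuned in the order in which they appear in `grids`.
    """
    best = dict(start)
    for name, values in grids.items():
        scores = {value: evaluate(**{**best, name: value}) for value in values}
        best[name] = min(scores, key=scores.get)
    return best


grids = {
    "iterations": [25, 50, 75, 100, 150, 250, 500, 1000],
    "learning_rate": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05],
    "factors": [5, 10, 15, 20, 25, 30, 40, 50, 75, 100],
}
start = {"iterations": 25, "learning_rate": 0.001, "factors": 20}


def toy_rmse(iterations, learning_rate, factors):
    # Made-up error surface with its minimum at (50, 0.005, 15).
    return (0.6 + abs(iterations - 50) / 1000
            + abs(learning_rate - 0.005) + abs(factors - 15) / 100)


best = tune_one_at_a_time(toy_rmse, grids, start)
```

This search evaluates 8 + 6 + 10 = 24 configurations instead of the 480 of a full grid, at the cost of possibly missing interactions between the parameters.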

6.3 Evaluation metrics

Both the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) are regularly employed as standard statistical metrics to measure model performance. Both scores measure the differences between the predicted ratings and our ground truth ratings in the test set. While the MAE gives the same weight to all errors, the RMSE penalizes variance, as it gives errors with larger absolute values more weight than errors with smaller absolute values [7]. The best MAE and RMSE score is 0.0, where all predictions are exactly correct.
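Both metrics are short enough to write out directly. The sketch below uses made-up ratings in which a single prediction is off by 4, which shows how the RMSE weights a large error more heavily than the MAE does.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long rating lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mae(predicted, actual):
    """Mean absolute error between two equally long rating lists."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# One large error of 4 among otherwise perfect predictions (toy values).
predicted = [6.0, 8.0, 3.0, 9.0]
actual = [6.0, 8.0, 3.0, 5.0]
rmse_value = rmse(predicted, actual)  # sqrt(16 / 4) = 2.0
mae_value = mae(predicted, actual)    # 4 / 4 = 1.0
```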

6.4 Baseline

During our research we were unable to find existing baselines for recommendation algorithms where implicit user ratings are extracted from social media posts. Therefore we create a baseline constructed from the number of likes received by posts related to a certain venue. We consider the number of likes a post received as a rating in the process of user-item matrix creation. We call this matrix voteUI. We set $r_{u,i} = \log_2(n)$, where $n$ is the number of likes which post $d_{u,i}$ received. By taking logarithmic scale for

[Figure 5 plots: RMSE (y-axis, 0 to 1) of MCMC, ALS, SGD, BiasedMF and SVD++ against (a) number of iterations, (b) learning rate (MCMC not shown), (c) number of factors.]

Figure 5: Recommendation performance in terms of RMSE using the visual user-item matrix. (a), (b), and (c) show the effect of changing the value of iteration, learning rate, and number of factors respectively; similar to Figure 3.
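The like-based baseline of Section 6.4, $r_{u,i} = \log_2(n)$, can be sketched as follows. How posts without likes are treated is not recoverable from the text, so skipping them is purely an assumption here; the function name, the toy posts, and the tag "examplevenue" are hypothetical.

```python
import math


def vote_rating(likes):
    """Baseline rating log2(n) for a post with n likes.

    Returns None for posts without likes, where log2 is undefined
    (an assumption; the text does not say how these are handled).
    """
    if likes < 1:
        return None
    return math.log2(likes)


# Toy (user, venue, likes) records with made-up values.
toy_posts = [("u1", "annefrankhuis", 8), ("u2", "annefrankhuis", 1),
             ("u3", "examplevenue", 0)]
vote_ui = [(u, v, vote_rating(n)) for u, v, n in toy_posts
           if vote_rating(n) is not None]
```

Note that a post with a single like receives a rating of 0 under this formula.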

6.5 Experiments

In order to establish the effectiveness of our multi-modal sentiment analysis for predicting the ratings of users for venues, we perform several experiments to answer the following research questions:

• RQ1: How does each of our user-item matrices, using sentiment extracted from social media posts, perform in venue recommendation?

• RQ2: Can we find optimal model parameters for each algorithm, for every created user-item matrix?

To answer these research questions we perform experiments to see the effect of using a textual user-item matrix, textualUI, a visual user-item matrix, visualUI, a combined user-item matrix, combiUI, and our baseline, voteUI, for venue recommendation.

6.5.1 Textual user-item matrix for recommendation

To see the effect of textual sentiment analysis on recommendation, we run experiments with the different algorithms described in Section 6.2. We vary the parameters $I$, $LR$, and $F$ to find the optimum values.

6.5.2 Visual user-item matrix for recommendation

In this experiment we attempt to find the best recommendation result by considering the different algorithms mentioned in Section 6.2. In all cases we fix $k = 10$, the number of top-$k$ visual adjective noun pairs used for creating visualUI. We vary all three parameters of the algorithms, $I$, $LR$, and $F$, to find the optimum values.

Also, to explore the effect of $k$ on recommendation, we change the value of $k$ to make new visualUI matrices and use the best algorithm and optimum parameters found in the previous step.

6.5.3 Combined matrix for recommendation

We create a third user-item matrix by combining the visual user-item matrix and the textual user-item matrix. We use the best-performing visual user-item matrix, with the value of $k$ for which the best performance was found in the previous experiment. To determine the effect of multi-modal sentiment analysis on recommendation effectiveness, we perform experiments using the different algorithms described in Section 6.2 by varying the $I$, $LR$, and $F$ parameters.

6.5.4 Sentiment-based versus non-sentiment-based

In this experiment we compare the accuracy of our multi-modal sentiment analyses against the baseline, the non-sentiment-based method described in Section 6.4, in venue recommendation. Optimum model parameters for voteUI are found in the same fashion as described above.

7. RESULTS

We plot the results of our proposed multi-modal sentiment analysis for recommending venues to users in Figures 3, 4, 5, and 6. Since the learning rate parameter is absent in the MCMC algorithm, we do not report this algorithm in the learning rate tuning experiments. Table 2 summarizes our best achieved results.

7.1 Textual user-item matrix results

We plot the results of using the textual user-item matrix, textualUI, in Figure 3. Figure 3 (a) shows that during iteration tuning we reach the best result, 0.590 RMSE, using the MCMC algorithm with 50 iterations, where the number of factors is fixed to 20. Figure 3 (b) shows the effect of the learning rate on the accuracy of the recommendation system using different algorithms. As we can see, by increasing the learning rate from 0.001 to 0.1 the result improves from 0.704 RMSE to 0.594 RMSE when using SGD. We depict the effect of the number of factors on accuracy in Figure 3 (c). After our third experiment we reach our best result on textualUI, 0.578 RMSE, using the MCMC algorithm with 100 factors and 50 iterations.

7.2 Visual user-item matrix results

Figure 5 (a), (b), and (c) show the results of visual sentiment analysis for creating the user-item matrix, using the top 10 adjective noun pairs to create visualU I. Figure 5 (a) shows the effect of the iteration parameter for the different algorithms, where we fix the learning rate and the number of factors to 0.001 and 20, respectively. We reach the best result, 0.618 RMSE, using SVD++ with just 25 iterations. Figure 5 (b) shows the effect of different learning rates on the accuracy of our recommendation system. The results get worse as the learning rate increases; with SVD++ in particular, the error rises to 0.682 RMSE. Figure 5 (c) shows the effect of the number of factors for the different algorithms. In this experiment we use the best values for the iteration and learning rate parameters found in the previous experiments. Changing the number of factors does not have a dramatic effect on the results. Our best result on visualU I, 0.618 RMSE, was achieved with the SVD++ algorithm using 25 iterations, a learning rate of 0.001, and 15 factors.

[Figure 6: three line plots with RMSE (0 to 1) on the vertical axis. (a) Iterations, 25 to 1000: MCMC, ALS, SGD, BiasedMF, SVD++. (b) Learning rate, 0.001 to 0.1: ALS, SGD, BiasedMF, SVD++. (c) Factors, 5 to 100: ALS, MCMC, SGD, BiasedMF, SVD++.]

Figure 6: Recommendation performance in terms of RMSE using the combined user-item matrix. In creating the combined user-item matrix we use the visual user-item matrix obtained with k = 10. (a), (b), and (c) depict the effect of changing the number of iterations, the learning rate, and the number of factors, respectively.

Figure 4 shows the effect of using different top-k adjective noun pairs for creating the visual user-item matrix, where we use the SVD++ algorithm with the optimum number of iterations, 25, learning rate, 0.001, and number of factors, 15. The figure shows that the result improves as k increases; we reach the best result when using the k = 100 top adjective noun pairs for creating the visual user-item matrix (RMSE is 0.432). However, this may be explained by the way we verify the performance of our algorithms, namely by RMSE score. As we average sentiment over a larger number k of ANP scores, our extracted rating naturally approaches the average sentiment score of all 1200 adjective noun pairs. We also take less important sentiment concepts (with lower probabilities) into consideration, which gives a more general rating. More generalized ratings have less spread between them. Naturally, the model then attains lower error scores, as it is easier to predict a rating close to its real value.
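The shrinking spread can be illustrated with a small simulation. The sketch below is not our pipeline: the per-ANP sentiment scores in [-1, 1], the random detector outputs, and the number of posts are all synthetic assumptions. It only demonstrates that averaging over more top-k ANPs pulls every rating towards the overall mean.

```python
import random
import statistics

random.seed(0)
NUM_ANPS = 1200  # size of the ANP vocabulary
NUM_POSTS = 300  # hypothetical number of posts

# Hypothetical fixed sentiment score per ANP, in [-1, 1].
anp_sentiment = [random.uniform(-1, 1) for _ in range(NUM_ANPS)]


def rating_from_top_k(probabilities, k):
    """Average the sentiment of the k ANPs with the highest probability."""
    order = sorted(range(NUM_ANPS), key=lambda i: probabilities[i], reverse=True)
    return sum(anp_sentiment[i] for i in order[:k]) / k


# Random detector outputs stand in for a real visual sentiment model.
posts = [[random.random() for _ in range(NUM_ANPS)] for _ in range(NUM_POSTS)]

# Standard deviation of the extracted ratings for several values of k.
spread = {
    k: statistics.pstdev([rating_from_top_k(p, k) for p in posts])
    for k in (10, 100, 1200)
}
```

In this synthetic setting the spread for k = 100 is smaller than for k = 10, and for k = 1200 every post receives essentially the same rating, so a constant predictor would already achieve a near-zero RMSE.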

7.3 Combined matrix results

Next, we report the results of multi-modal recommendation in Figure 6, where we use the combined user-item matrix, combiU I, for recommending venues to users. As Figure 6 (a) shows, the best result in iteration tuning, 0.465 RMSE, was achieved with 25 iterations of the MCMC algorithm. Figure 6 (b) shows that in learning rate tuning we achieved our best results with the SGD algorithm, where changing the learning rate from 0.001 to 0.1 changes the result from 0.507 RMSE to 0.501 RMSE. Finally, Figure 6 (c) depicts the best result achieved on combiU I, 0.464 RMSE, using the MCMC algorithm with 75 factors and 25 iterations.

7.4 Non-sentiment based results

Finally, we also run all algorithms on the voteU I matrix. The best results we achieve are 0.540, 0.543, 0.552, 0.833, and 0.733 RMSE using MCMC, ALS, SGD, BiasedMF, and SVD++, respectively.

7.5 Summarized results

We summarize the best results achieved on all user-item matrices, textualU I, visualU I, combiU I, and voteU I, in terms of two metrics, RMSE and MAE, in Table 2. We explain the good results of our multi-modal sentiment analysis for venue recommendation through several observations.
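For reference, the two error metrics can be computed as in the following minimal sketch. The function names and the toy rating values are chosen for illustration and are not taken from our experiments.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error: every error counts proportionally."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Toy ratings (hypothetical values).
actual = [1.0, 0.5, 0.0, -0.5]
predicted = [0.5, 0.5, 0.5, -0.5]

toy_rmse = rmse(actual, predicted)  # sqrt(0.125), roughly 0.354
toy_mae = mae(actual, predicted)    # 0.25
```

RMSE is never smaller than MAE on the same predictions, and the gap between them grows when the errors are unevenly distributed.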

First, we observe that the differences between the results of visualU I and textualU I are very small. This indicates that both modalities are comparable when used to build a user-item matrix, and that users indeed express their feelings about a venue through both textual and visual channels.

Table 2: Comparison of performance (RMSE and MAE) of each user-item matrix. The results show that our multi-modal user-item matrix outperforms the visual, textual, and baseline matrices.

User-item matrix    RMSE     MAE
visualU I           0.618    0.313
textualU I          0.578    0.296
combiU I            0.464    0.228
voteU I             0.540    0.373

Secondly, as Table 2 shows, our combined sentiment-based user-item matrix, combiU I, performs better than the non-sentiment baseline, voteU I, independent of the chosen algorithm. This could be explained by the fact that a post generated by a user expresses the opinion that this specific user has towards a certain item. By considering the number of likes we do not utilise content generated by this specific user; instead, we assume that the actions of a user’s followers imply something about the user himself. Our results contradict this assumption.

Furthermore, our results indicate that combining visual and textual features, as in combiU I, gives a better representation of the sentiment expressed within a post. The results show that textual and visual sentiments are indeed complementary when used to extract ratings.

The results of our experiments confirm that user rating prediction accuracy benefits from sentiment analysis on both the visual and textual channels of user posts. Moreover, it is beneficial to represent the user-item matrix, which is the core of a recommendation system, using a combination of visual and textual sentiment scores. We achieved the best performance in venue recommendation using combiU I.

8. FUTURE WORK

In this section we describe several possible directions on which future work could focus.

First, due to the nature of our matrices (created through a subjective method of deriving sentiment scores), we have performed our research on artificially created user-item matrices. Comparing results based on artificially obtained user-item matrices is generally not ideal. Ideally, an additional user-item matrix is required against which to compare the results of each individual modality: one that consists of explicitly given user ratings, provided by the exact same users.

Moreover, to establish whether one modality outperforms the others, we have to change our evaluation from an error metric (such as RMSE and MAE) to a ranking metric. Creating a ranked list for each user would be necessary in order to compute precision and recall for the given recommendations. In this way, future work can focus on finding whether combined ratings give a better representation of user ratings. An important question here is which of the recommended venues should be considered "relevant" to the user. We cannot simply create a top-k recommendation list and compare it to the most positively rated venues for specific users. Correctly predicting a negatively rated venue could be just as relevant to a user as predicting a positively rated venue. After all, knowing which venues to avoid is also valuable information.
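As a sketch of what such a ranking-based evaluation could look like, precision and recall at a cut-off k can be computed per user as follows. The function name and the toy venues are hypothetical, and the set of relevant venues is assumed to be given; how to define that set is exactly the open question raised above.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(
    ranked: List[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision and recall over the first k recommended venues."""
    hits = sum(1 for venue in ranked[:k] if venue in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example for one user (hypothetical venues).
ranked_venues = ["venue_a", "venue_b", "venue_c", "venue_d", "venue_e"]
relevant_venues = {"venue_a", "venue_c", "venue_f"}
precision, recall = precision_recall_at_k(ranked_venues, relevant_venues, k=3)
```

In this toy case two of the three recommended venues are relevant and two of the three relevant venues are retrieved, so both precision and recall equal 2/3.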

Furthermore, the algorithms we implemented in our research base their results on the assumption that similar users share similar interests. This implies that there is no real individualism, as every person is always considered similar to some other individual or group of individuals, based upon historical data. Future work could focus on a different approach, or extend current algorithms to describe real-world scenarios more accurately. People are different and every person has his or her own specific interests; every individual user has their own set of beliefs, opinions, and attitudes, which shape the recommendations they wish to receive. A downside of current algorithms is that they facilitate the "information bubble" that we currently see in recommendation scenarios throughout the web. Users tend to receive recommendations that correlate with their previously found interests, and we therefore miss the opportunity to broaden the user’s horizon with new information.

Lastly, future work could improve our current model in several ways. Final recommendations could be improved by incorporating inner-item similarity into our model, where a trade-off must be made between accuracy and diversity. Categorical data (museum, gallery, zoo, etcetera) could be added to the model, so that it does not recommend two venues of the same type. This would lead to higher diversity in the final recommendations, which is especially useful in our use case, where users are unlikely to visit multiple venues of the same category during their stay. A further focus could lie on incorporating trend recognition. Venues constantly evolve and renew themselves. When a sudden peak is found in the amount of discussion about a certain venue on social media (relative to the average amount of discussion about this venue), this venue might have introduced something new, and its ratings might have to be re-evaluated. Optionally, "rating fatigue" (or weighted ratings) could be used, where current ratings are more important than ratings given longer ago. Time could also play a role when we would like to re-rank recommendations based upon the real-time crowdedness of venues.
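One simple way to realise the category constraint is a greedy re-ranking pass, sketched below. This is only one possible design and not part of our model: the function name `diversify`, the venues, their categories, and the predicted ratings are all hypothetical.

```python
from typing import Dict, List, Tuple


def diversify(
    predictions: List[Tuple[str, float]], category: Dict[str, str], n: int
) -> List[str]:
    """Pick up to n venues by predicted rating, at most one per category."""
    chosen: List[str] = []
    seen = set()
    for venue, _rating in sorted(predictions, key=lambda p: p[1], reverse=True):
        if category[venue] in seen:
            continue  # a higher-rated venue of this category was already chosen
        chosen.append(venue)
        seen.add(category[venue])
        if len(chosen) == n:
            break
    return chosen


# Toy predictions for one user (hypothetical venues and ratings).
predictions = [
    ("museum_a", 0.90),
    ("museum_b", 0.85),
    ("zoo_a", 0.70),
    ("gallery_a", 0.60),
]
category = {
    "museum_a": "museum",
    "museum_b": "museum",
    "zoo_a": "zoo",
    "gallery_a": "gallery",
}
top3 = diversify(predictions, category, n=3)
```

Here the second museum is skipped even though its predicted rating is higher than that of the zoo and the gallery, which makes the trade-off between accuracy and diversity explicit.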

9. CONCLUSION

We have proposed a novel multi-modal sentiment analysis framework that considers both visual and textual information to extract the sentiment expressed in social media posts as a user-item rating. A large dataset was collected, with posts relating to predefined venues in Amsterdam.

We have shown how the extracted sentiment was used to create user-item matrices that model users’ ratings of items. During experiments, we verified the applicability of our multi-modal sentiment extraction to various item recommendation algorithms for predicting the ratings of users for certain venues.

Experiments have shown that our constructed user-item matrices perform well in a recommender that uses the matrices to predict user ratings. Most importantly, we have found that textual and visual aspects are complementary and should both be considered when analyzing expressed sentiment. Our combined user-item matrix, which jointly considers the extracted textual and visual sentiment, performed better than matrices created using sentiment extracted from a single modality. Moreover, we show that our proposed method outperforms a user-item matrix created by considering the number of likes (voting). However, additional research is required to establish conclusively which modality (visual, textual, or combined) performs best.


References

[1] D. Agarwal and B.-C. Chen. flda: matrix factorization through latent dirichlet allocation. In WSDM, 2010.

[2] Y. Bae and H. Lee. Sentiment analysis of twitter audiences: Measuring the positive or negative influence of popular twitterers. Journal of the American Society for Information Science and Technology, 63(12):2521–2535, 2012.

[3] N. J. Belkin and W. B. Croft. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29–38, 1992.

[4] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828, 2013.

[5] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In MM, 2013.

[6] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM international conference on Multimedia, pages 223–232. ACM, 2013.

[7] T. Chai and R. R. Draxler. Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[8] K. Chen, T. Chen, G. Zheng, O. Jin, E. Yao, and Y. Yu. Collaborative personalized tweet recommendation. In SIGIR, 2012.

[9] T. Chen, D. Borth, T. Darrell, and S.-F. Chang. Deepsentibank: Visual sentiment concept classification with deep convolutional neural networks. abs/1410.8586, 2014.

[10] X. Ding, B. Liu, and P. S. Yu. A holistic lexicon-based approach to opinion mining. In WSDM, 2008.

[11] G. Guo, J. Zhang, Z. Sun, and N. Yorke-Smith. Librec: A java library for recommender systems. In Posters, Demos, Late-breaking Results and Workshop Proceedings of the 23rd International Conference on User Modeling, Adaptation and Personalization, 2015.

[12] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[13] I. Inc. Instagram developer api, 2016. URL https://www.instagram.com/developer/. Online; accessed: 15-July-2016.

[14] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A sentence model based on convolutional neural networks. In ACL, 2014.

[15] C. Kofler, L. Caballero, M. Menendez, V. Occhialini, and M. Larson. Near2me: An authentic and personalized social media-based recommender for travel destinations. In Proceedings of the 3rd ACM SIGMM international workshop on Social media, 2011.

[16] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

[17] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.

[18] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of massive datasets. Cambridge University Press, 2014.

[19] H. Lieberman et al. Letizia: An agent that assists web browsing. IJCAI, 1995.

[20] P. Lops, M. De Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In Recommender systems handbook, pages 73–105. Springer, 2011.

[21] H. Ma, I. King, and M. Lyu. Learning to recommend with social trust ensemble. In SIGIR, pages 203–210, 2009.

[22] H. Ma, D. Zhou, C. Liu, M. Lyu, and I. King. Recommender systems with social regularization. In WSDM, pages 287–296, 2011.

[23] A. Marketing. Amsterdam open data - attractions, 2016. URL https://data.amsterdam.nl/dataset/attracties. Online; accessed: 15-July-2016.

[24] M. Mazloom, R. Rietveld, S. Rudinac, M. Worring, and W. van Dolen. Multimodal popularity prediction of brand-related social media posts. In MM, 2016.

[25] A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In NIPS, 2007.

[26] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In EMNLP, 2002.

[27] Y. Pang, Q. Hao, Y. Yuan, T. Hu, R. Cai, and L. Zhang. Summarizing tourist destinations by mining user-generated travelogues and photos. Computer Vision and Image Understanding, 115(3):352–363, 2011.

[28] M. Pennacchiotti, F. Silvestri, H. Vahabi, and R. Venturini. Making your interests follow you on twitter. In CIKM, 2012.

[29] R. Plutchik. Emotion: A psychoevolutionary synthesis. Harpercollins College Division, 1980.

[30] S. Rendle. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012. ISSN 2157-6904.

[31] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. Grouplens: an open architecture for collaborative filtering of netnews. In CSCW, 1994.

[32] F. Ricci, L. Rokach, and B. Shapira. Introduction to recommender systems handbook. Springer, 2011.

[33] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, 2001.

[34] G. Shani and A. Gunawardana. Evaluating recommendation systems. In Recommender systems handbook, pages 257–297. Springer, 2011.

[35] Y. Shi, M. Larson, and A. Hanjalic. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR), 47(1):3, 2014.

[36] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in artificial intelligence, 2009:4, 2009.

[37] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In ACL, 2014.

[38] D. Tang, B. Qin, and T. Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, 2015.

[39] D. Tang, B. Qin, T. Liu, and Y. Yang. User modeling with neural network for review rating prediction. In IJCAI, 2015.

[40] M. Thelwall, K. Buckley, and G. Paltoglou. Sentiment strength detection for the social web. J. Am. Soc. Inf. Sci. Technol., 63(1):163–173, 2012.

[41] P. D. Turney. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In ACL, 2002.

[42] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In KDD, 2011.

[43] B. Yang, Y. Lei, D. Liu, and J. Liu. Social collaborative filtering by trust. In IJCAI, 2013.

[44] S. Yang, B. Long, A. Smola, N. Sadagopan, Z. Zheng, and H. Zha. Like like alike: joint friendship and interest propagation in social networks. In WWW 2011, pages 537–546, 2011.

[45] P. Yin, P. Luo, W.-C. Lee, and M. Wang. App recommendation: a contest between satisfaction and temptation. In WSDM, 2013.

[46] Q. You, J. Luo, H. Jin, and J. Yang. Robust image sentiment analysis using progressively trained and domain transferred deep networks. In AAAI, 2015.

[47] J. Zahálka, S. Rudinac, and M. Worring. New yorker melange: interactive brew of personalized venue recommendations. In MM, 2014.

[48] J. Zhao, L. Dong, J. Wu, and K. Xu. Moodlens: an emoticon-based sentiment analysis system for chinese tweets. In KDD, 2012.