
Recommendations for post popularity prediction in social media

submitted in partial fulfillment for the degree of

master of science

Iliana Pappi

11394153

master information studies

data science

faculty of science

university of amsterdam

2017-07-12

Internal Supervisor: Dr Masoud Mazloom
Affiliation: UvA, FNWI, ISIS
Email: m.mazloom@uva.nl


Recommendations for post popularity prediction in social media

Iliana Pappi

University of Amsterdam iliana.n.pappi@gmail.com

Masoud Mazloom

University of Amsterdam masoud.mazloom@gmail.com

ABSTRACT

The importance of social media in everyday life has grown rapidly over the last few years, with users worldwide sharing their thoughts and experiences at every moment. To address users' desire to make their posts more popular, several studies have tried to answer the question of how post popularity can be defined and predicted in order to increase user satisfaction. Two very powerful elements for drawing attention in social media are image and text. A significant amount of recent research has investigated which visual attributes, combined with social cues and textual data, are most appropriate for predicting increased post popularity. In this master thesis, the impact of certain visual and textual features on post popularity is investigated, adding value to the current scientific research by categorizing post content. Most of the existing research in this area focuses on popularity prediction of a post without considering its semantic content. In contrast, the aim of this work is to propose a new framework which includes all the aforementioned features for posts that are categorized as action, scene, people and animals, or brand. The ultimate outcome is to make recommendations to users in order to make their posts more popular in social media and increase user engagement.1

1 INTRODUCTION

Defining what makes an image, and subsequently a user's post with image content and social cues, popular has proven to be a challenging problem. A huge amount of visual and textual information is posted every day on social media such as Twitter, Instagram, Flickr and Facebook. It takes only a few seconds to share social activity in the form of images, videos, comments, tags and mentions, at any place and any time, with a simple internet connection on a device, e.g. a mobile phone, computer or tablet. As a result, every day more and more users get involved in social media posting and sharing social activity, creating more or less popular posts carrying multimodal information.

Thus, predicting how popular a user's post will be among other users in their network or in public has become interesting for marketing, business, political and economic sciences, and for the decision-making strategies of campaigns targeting social media crowds. Moreover, predicting post popularity is important for the self-evolution of social media. Every user would like to know the best way to interact or get noticed on a social media platform, concerning both shared images and quotes or comments.

However, feature selection, and the consequent recommendations with a maximum possible correlation with popularity, has proved to be a difficult task with many aspects. Many approaches have been proposed in the literature, some focusing merely on visual low- and high-level features [11], [17], others considering users' tags and comments, analyzing the language content of a user's post along with the image, or the interaction with one's network [13], [15], [19], [21]. Indications of post popularity are the number of likes, number of views, number of re-tweets etc., depending on the social network. Another important measure that may be used on top of the aforementioned ones is the number of followers, if available, which also indicates the size of the user's network.

1Github repository: https://github.com/ilianapappi/masterthesis

Despite the broad attention around research related to post popularity prediction, most relevant works have analyzed general-content datasets crawled from social media or other well-defined datasets built for research purposes, e.g. the MIRFlickr dataset [16]. In contrast, handling raw data coming from social media posts without well-defined content and categorization, as well as investigating the effect of specific categories of user content on popularity prediction, are two challenges faced jointly in this thesis. Raw data preprocessing is important, as it allows insight into the user's psychology and sentiment and facilitates feature selection. Another element studied in this work is how human action present in the image, sceneries or backgrounds, and the presence of people, animals and brands affect the popularity of a post. In particular, identifying action described through visual features in still images is especially challenging, since action detection is broadly covered in the literature for videos, while fewer works have been dedicated to action representation in static images.

This thesis proposes a new multimodal framework for post popularity prediction and recommendations, especially when action, sceneries, people, animals and brands appear in users' posts. The methodology followed is inspired by works such as the study of Habibian et al. [14], which introduced video representation using vocabularies composed of concept detectors for event recognition. The vocabularies are diverse, covering object, action, scene, people, animal and attribute concepts. A new dataset was created from scratch for the purposes of this research, containing posts crawled from Instagram, a broadly used social network with emphasis on visual content. The data were gathered according to hashtags related to the aforementioned categories. Visual and textual features were extracted for each sample, i.e., each Instagram post, and regression models were used to predict the normalized number of likes of posts from all the extracted features. Two types of analysis were employed, the first concerning each hashtag-related subset and the second concerning each larger sub-group of hashtags for action, scene, people - pets, and brand. The pipeline followed in this thesis is illustrated in Figure 1. The reported evaluation metric between the ground truth and the prediction vector of each regression learning model employed attempts to answer the research question formulated below:

How can we define which features affect post popularity in social media, in order to make recommendations to the users? More specific sub-questions addressed in this work are:


Figure 1: Outline of the Proposed Methodology for Popularity Prediction Inside A Category

(1) What is the role of low- and high-level visual features, along with visual features with specific characteristics, e.g. action and scenery, in correlation with post popularity?

(2) How can visual centrality in a user's post be combined with textual and numerical data in order to define and predict post popularity?

(3) Can we develop a unified model, taking into consideration features from different categories, in order to make recommendations for social media users to make their posts more popular?

The main contributions of this work, as an answer to the research questions formulated, are summarized below:

• The employed methodology is used to investigate which features, both visual and textual, are more related to popularity prediction for different content categories. The experiments resulted in new findings of specific features contributing more to specific categories, which led us to novel recommendations.

• The effect of features with a semantic meaning is investigated by correlating high-level features with popularity prediction for different content categories. Thus, meaningful recommendations for users are also made, especially for intriguing visual content such as actions and scenes.

• All the given information in a user's post is exploited in order to predict popularity, by combining the features with the best outcome in terms of post popularity prediction coming from both visual and textual content. The complementarity of the multimodality for the investigated categories is noted in the respective experiments.

• Useful recommendations for users were made by proposing some visual and textual characteristics that would increase the popularity of their future posts.

• Certain findings for feature selection per category are, to the best of the authors' knowledge, reported for the first time in this work.

2 RELATED WORK

Popularity prediction is a key problem in social media networks for analyzing information diffusion. There are studies in the literature addressing the post popularity prediction problem based on textual or visual attributes, or a combination of them. A study on predicting popularity on Twitter by Hong et al. [15] formulates the task as a classification problem, investigating a wide spectrum of features based on the content of the messages. Bae and Lee [7] analyzed Twitter posts, categorized the followers of a limited number of influential users into a positive and a negative audience, correlated the sentiment of the followers with the textual content of their posts, and based on that defined a measure of influence.

Apart from textual content, visual features are also investigated for their correlation with popularity prediction. Cappallo et al. [11] developed a model for popularity prediction in social media based only on visual content. In this paper, a latent ranking approach was proposed, which takes into account not only the distinctive visual cues in popular images, but also those in unpopular images. The experiments investigated factors of the ranking model, the level of user engagement in scoring popularity, and whether the discovered senses are meaningful. One of the first comments of the authors of this paper is that popularity is a difficult quantity to measure, predict, or even define. It is also claimed that this paper builds on the findings of Khosla et al. [17], who attempted to answer the question of what makes an image popular. In that paper, the importance of image cues such as color, gradients, deep learning features and the set of objects present is highlighted, as well as the importance of various social cues, such as the number of friends or the number of photos uploaded, that lead to high or low popularity of images. An indication of how challenging the problem of popularity prediction is, especially when based mainly on visual features, is the final outcome of Cappallo's paper: their best result, expressed by their evaluation metric, Spearman's Rank Correlation Coefficient, is 0.345, versus 0.315 for Khosla's paper with the same evaluation metric.


Image popularity prediction in a "cold start" scenario (i.e. where no, or limited, textual/interaction data exist), considering image context, visual appearance and user context, was investigated by McParlane et al. [20]. In this paper the authors boosted popularity prediction results to 76% accuracy with another approach, casting the problem as a classification task between highly popular and unpopular images. They used the Pareto principle to select the threshold splitting images with high (20%) and low (80%) comments and views, training an SVM classifier with a Radial Basis Function (RBF) kernel. Mazloom et al. [19] presented an approach for identifying what aspects of posts determine their popularity. The proposed model was based on the hypothesis that brand-related posts may be popular due to several cues related to factual information, sentiment, vividness and entertainment parameters about the brand, called engagement parameters. Experiments on a collection of fast-food brand-related user posts crawled from Instagram showed that visual and textual features are complementary in predicting the popularity of a post. The rank correlation of late fusion over all introduced engagement parameters reaches 0.551 for certain brands in that paper.

In the paper of Gelli et al. [13], image popularity prediction in social media is based on sentiment and context features. More specifically, the authors use object (high-level) features for images, combined with context features, e.g. the number of tags and the length of the image title, and user features, e.g. the mean views of the images of the user.

Finally, Overgoor et al. [21] investigated brand popularity prediction in a spatio-temporal category representation framework. The experiments of this work confirm the complementarity of visual and textual features for predicting post popularity, especially for posts related exclusively to a brand. Furthermore, a post-level training model is employed, as well as a category representation of the brand, both having a positive impact on the results. Last, it is shown that incorporating the temporal dimension into the representation increases the predictability of brand popularity.

3 METHODOLOGY

3.1 Post Popularity Measurement

In this section the methodology followed in this work is described. The post popularity prediction model is fitted on a dataset $D$, consisting of users' posts of all considered categories. The dataset consists of subsets corresponding to each different category, expressed by $D = [D_1, D_2, \dots, D_m]$. Each subset $D_i$ contains $n_i$ posts, $D_i = \{(P_{i1}, y_{i1}), (P_{i2}, y_{i2}), \dots, (P_{in_i}, y_{in_i})\}$, where $P_{ij}$ is a post carrying multimodal information and $y_{ij}$ is its corresponding number of likes.

Each post has visual and textual information. Both visual and textual features are extracted for each post in order to construct sample-feature matrices for each subset in the respective analysis. The normalized number of likes of each post $P_{ij}$ is considered as the $y$-vector to be predicted by a regression model. Thus, each subset $D_i$ corresponds to a vector of normalized numbers of likes, expressed by the following formula: $\bar{Y}_i = \log([y_1, \dots, y_{n_i}] + \bar{1})$. The log function is chosen to deal with the large variation in the number of likes of posts: the number of likes per post follows a power-law distribution in social media, where a minority of posts get a large number of likes and the majority get few or no likes.

For each post $P_{ij}$, $K$ feature vectors are extracted covering both the visual and textual characteristics. Each feature extraction procedure links a post $P_{ij}$ to a corresponding feature vector $\bar{x}_k(P_{ij})$. The dimension of each feature vector depends on the feature type, resulting from the model or procedure used for extraction.

The feature extraction procedure is concluded with the creation of feature matrices for all samples in each subset $D_i$. Each category of posts is represented by a matrix $X(D_i)$, used to train a regression model for predicting a normalized number of likes, $\hat{y}_i$, for the $D_i$ subset.

In this work, similar to state-of-the-art approaches such as the ones in [17], [13] and [19], Support Vector Regression (SVR) is used to predict post popularity. More specifically, SVR with an RBF kernel and l1 regularization [25] is used, due to its good computational speed on the used dataset, straightforward tuning and better predictive capability compared to the linear SVR.
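A minimal sketch of this regression step is given below, assuming scikit-learn's SVR; the feature matrix and like counts are placeholders, and reading the "l1" step as per-sample l1 normalization of the features is an assumption rather than the thesis' exact setup.

```python
# Sketch: RBF-kernel SVR on log-normalized like counts (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.svm import SVR

X = np.random.rand(200, 1000)              # placeholder sample-feature matrix
likes = np.random.randint(0, 5000, 200)    # placeholder like counts

y = np.log(likes + 1)                      # log-normalized popularity target

# l1-normalize each sample's feature vector (one reading of "l1" here; assumption)
X = normalize(X, norm="l1")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = SVR(kernel="rbf", C=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)             # prediction vector used later for evaluation
```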

The learning procedure is applied to every category of the extracted features. The regression model results in a prediction vector $\hat{y}_i$ for each subset $D_i$ and for every feature category. These prediction vectors participate in a late fusion, which is described as follows:

$$\text{AvgPool} = \frac{1}{K}\left(\hat{y}_{i1} + \hat{y}_{i2} + \dots + \hat{y}_{iK}\right) \quad (1)$$

$$\text{MaxPool} = \max\left(\hat{y}_{i1}, \hat{y}_{i2}, \dots, \hat{y}_{iK}\right) \quad (2)$$
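A minimal sketch of the late fusion of Eqs. (1) and (2), assuming the per-feature prediction vectors are available as NumPy arrays:

```python
# Sketch: average and max pooling of K per-feature prediction vectors for one subset.
import numpy as np

preds = [np.random.rand(50) for _ in range(3)]  # placeholder predictions, K = 3 feature types

stacked = np.vstack(preds)         # shape (K, n_test)
avg_pool = stacked.mean(axis=0)    # Eq. (1): element-wise average over the K predictions
max_pool = stacked.max(axis=0)     # Eq. (2): element-wise maximum over the K predictions
```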

3.2 Evaluation Metric

Subsequently, the post popularity prediction model is evaluated by Spearman's Rank Correlation Coefficient (SRCC), a statistical metric showing the monotonic relation between two vectors, in this case the prediction vector resulting from the model and the ground truth vector, expressed by the following equation:

$$\rho = 1 - \frac{6 \sum_{j} \left( r(\hat{y}_{ij}) - r(\bar{y}_{ij}) \right)^{2}}{n\,(n^{2} - 1)} \quad (3)$$

where $n$ is the number of data points and the sum term measures the squared difference between the predicted rank and the ground-truth rank over all examples $j$. The coefficient $\rho$ takes values in the range $[-1, 1]$, with 1 denoting perfect correlation.
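A minimal sketch of this evaluation step, using SciPy's spearmanr on placeholder prediction and ground-truth vectors rather than a hand-rolled implementation of Eq. (3):

```python
# Sketch: Spearman's rank correlation between predictions and ground truth.
import numpy as np
from scipy.stats import spearmanr

y_true = np.random.rand(50)   # placeholder ground-truth (log-normalized likes)
y_pred = np.random.rand(50)   # placeholder model predictions

rho, p_value = spearmanr(y_pred, y_true)
print(f"SRCC = {rho:.3f} (p = {p_value:.3g})")
```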

4 EXPERIMENTAL SETUP

4.1 Dataset

The effectiveness of the proposed methodology is tested on a dataset crawled from the Instagram social network for the purposes of this thesis, using the Instagram API [2]. The dataset consists of approximately 40k Instagram posts with visual and textual content as well as metadata. The information available from the crawler is the post ID on Instagram, the image link, the user description/title of the post, the Instagram username, the number of likes of the post, and, for a small percentage of the posts, the geolocation of the user. The crawling of the dataset lasted roughly three weeks, from 1/4/2017 to 25/4/2017, since the crawling speed was approximately 2500 posts per day. The posts were crawled according to hashtags related to human action and places/sceneries, and also to the presence of people, pets and brands. The hashtags used for the Instagram crawler are, for places/scenes: #art_gallery, #bar, #beach, #bedroom, #cafe, #canals, #fields, #forest, #home, #kitchen, #street, #swimming_pool, #urban; for actions: #playing_music, #running, #basketball, #surfing, #ski, #climbing, #cycling, #dance, #football, #horseriding, #hug, #kiss; and for pets/people: #pets, #selfie. Brand-related posts refer to the fast food brand #Wendys. The data statistics of the whole dataset, after preprocessing, are reported in Table 1.

4.2 Data Preprocessing

The next step after crawling the raw dataset was preprocessing it in order to reach a clean form of the data, ready for the subsequent analysis. Initially, duplicate posts were dropped for each hashtag-related subset, using Python's Data Analysis Library (Pandas 19.2v) [4] for organizing .csv or .xlsx files into dataframes. The Pandas library was also used throughout the analysis for several data handling tasks. Duplicate posts amounted to about 15% of the originally crawled dataset. Next, the image content of each post was downloaded from the media link in the crawled posts. The final cleaning of the dataset included discarding samples with corrupted or removed image content, as well as posts with no textual content.
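A minimal sketch of this deduplication and cleaning step with pandas; the file and column names (post_id, caption) are hypothetical, not the thesis' actual schema:

```python
# Sketch: drop duplicate posts and posts without text for one hashtag-related subset.
import pandas as pd

df = pd.read_csv("posts_basketball.csv")        # hypothetical per-hashtag crawl file

df = df.drop_duplicates(subset="post_id")       # remove duplicate posts (~15% of the crawl)
df = df.dropna(subset=["caption"])              # drop posts with no textual content
df = df[df["caption"].str.strip() != ""]        # drop empty captions as well

df.to_csv("posts_basketball_clean.csv", index=False)
```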

Python's Natural Language Toolkit (NLTK 3.2.4v) library [8] was employed at this point for the text preprocessing of the posts. Non-ASCII character words were removed from the posts, and the remaining words of each post were compared against an English word dictionary. The words were tokenized and stripped of mention and hashtag symbols, emoticons and other symbols not relevant to textual analysis. English stopwords were also excluded. The result is a list of tokenized English words for each post, ready for the subsequent textual analysis. At this point, all the remaining clean text accompanying the image description in a post was pooled, due to the small amount of available text in each crawled post. After the preprocessing phase, the dataset was in a clean form, ready for feature extraction.
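A minimal sketch of the described text preprocessing with NLTK; the exact symbol filters and dictionary used in the thesis may differ:

```python
# Sketch: tokenize a caption, strip hashtag/mention symbols and non-ASCII tokens,
# keep English dictionary words, and remove stopwords.
import nltk
from nltk.corpus import stopwords, words
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("words")

english_vocab = set(w.lower() for w in words.words())
stop = set(stopwords.words("english"))

def preprocess(caption):
    tokens = word_tokenize(caption.replace("#", " ").replace("@", " ").lower())
    tokens = [t for t in tokens if t.isascii() and t.isalpha()]  # drop emoji/symbols
    return [t for t in tokens if t in english_vocab and t not in stop]

print(preprocess("Sunset #surfing with @friend at the beach!!"))
```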

Table 1: Statistics of Data per Category in the Experiment

Action: 13205
  #basketball 1055, #climbing 1008, #cycling 1786, #dance 473, #football 1549, #horseriding 1258, #hug 1290, #kiss 909, #playing_music 185, #running 1953, #ski 1267, #surfing 472

Scene: 15794
  #art_gallery 499, #bar 1039, #beach 1222, #bedroom 1240, #cafe 1568, #canals 882, #fields 1264, #forest 1789, #home 1306, #kitchen 1620, #street 1408, #swimming_pool 415, #urban 1542

People/Pets: 1467
  #selfie 714, #pets 753

Brand: 8522
  #Wendys 8522

All Samples: 38988

4.3 Feature Extraction

4.3.1 Visual Feature Extraction. The first group of features extracted from each sample in the dataset is related to visual content, in this case the image each post contains. High-level visual features are connected to a semantic meaning, such as objects and concepts related to natural and urban environments as well as animals and humans. These features are very important, as the method's final goal is the formulation of meaningful recommendations to users in order to make their posts more popular. Low-level visual features are also examined for their relation with popularity, as the colors, intensity, contrast and texture of a picture can determine whether users notice it on a social media platform. Last but not least, visual sentiment, positive or negative, is another indicator researched for its correlation with the popularity of an image online.

• High-level features related to concepts: A deep neural network, namely the GoogleNet Inception V3 [26], was utilized to extract high-level features. Python's deep learning library Keras [3] with a TensorFlow [5] backend was used for the implementation. A 1000-dimensional feature vector from the softmax output layer of the network was extracted for each image. Each element of the vector corresponds to the probability of one of the 1000 ImageNet [24] concepts on which the deep network was pretrained. (A feature-extraction sketch follows this list.)

• Low-level features: A 2048-dimensional low-level feature vector was extracted for each image from the max pooling of the Convolutional Pool 8x8 layer of the GoogleNet Inception V3 deep network architecture.

• Visual Sentiment Features: The sentiment in the visual content of each post was expressed through 1200-dimensional visual sentiment vectors, utilizing the SentiBank detectors [12] of the Visual Sentiment Ontology (VSO) [9], in a MATLAB implementation. The construction of the VSO was based on the psychology theory behind a wheel of 24 emotions, extracting Adjective-Noun Pairs (ANPs) such as "beautiful flowers" or "sad eyes" as visual tags.
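A minimal sketch of the visual feature extraction described in the list above, using Keras' InceptionV3 pretrained on ImageNet; taking the globally max-pooled output of the last convolutional block as the 2048-dimensional low-level feature is an approximation of the layer named above, not necessarily the thesis' exact layer:

```python
# Sketch: 1000-dim softmax (high-level) and 2048-dim pooled (low-level) features per image.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

full_model = InceptionV3(weights="imagenet")                        # softmax over 1000 concepts
pooled_model = InceptionV3(weights="imagenet", include_top=False,   # global max pooling of the
                           pooling="max")                           # last conv block -> 2048 dims

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    high_level = full_model.predict(x)[0]    # shape (1000,)
    low_level = pooled_model.predict(x)[0]   # shape (2048,)
    return high_level, low_level
```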

4.3.2 Textual Feature Extraction. The second group of features extracted for each post is related to its textual content. According to the relevant literature, textual features can be of significant value when combined with visual features for the prediction of post popularity. Some of the textual features can also lead to meaningful recommendations for users, pointing out the most common words associated with high popularity when used in social media posts. The types of extracted features are listed below:

• Word-to-Vec Features: The word2vec (W2V) model [23] produces vector representations of words learned with word embeddings, where each word is represented by a 300-dimensional vector. Each word in a post is mapped to a W2V vector, and the final representation of the textual content of each post is obtained by average pooling all of its word vectors into one 300-dimensional vector. The W2V implementation of Python's Gensim library [1] was used for the extraction, with the W2V model pretrained on part of the Google News dataset, a collection of about 100 billion words. (A sketch of the textual feature extraction follows this list.)


• Bag-of-Words Features: The idea behind the Bag-of-Words (BoW) representation is to convert the textual content of each post into a sparse vector of the counts of each word in the post, with respect to a pre-constructed vocabulary, i.e., a sorted list of all the unique words in the dataset. In the first step of this analysis the vocabulary was constructed, containing the 19166 unique words appearing in the dataset, and it was subsequently used to extract a 19166-dimensional sparse vector for each post, using the CountVectorizer class of Python's Scikit-learn 18.2v library [22].

• Textual Sentiment Features: In this last feature extraction procedure, a 2-dimensional feature vector represents the positive and negative sentiment probability of the textual content of each post. Python's TextBlob 12.0v library [6] was utilized for the task, based on NLTK's NaiveBayesAnalyzer.
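A minimal sketch of the W2V and BoW feature extraction described above, assuming gensim and scikit-learn; the pretrained-vector path and the toy posts are placeholders:

```python
# Sketch: average-pooled word2vec embeddings and bag-of-words count vectors per post.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import CountVectorizer

posts = [["sunset", "beach", "surfing"], ["cute", "kitten", "selfie"]]  # tokenized posts

# Word2Vec features: 300-dim average of the word vectors present in the vocabulary
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def w2v_feature(tokens):
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

X_w2v = np.vstack([w2v_feature(p) for p in posts])

# Bag-of-Words features: sparse count vectors over the dataset vocabulary
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(" ".join(p) for p in posts)
```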

4.4 Experiments

4.4.1 Experiment 1: Category-Mix Analysis. The first experiment for post popularity prediction concerns the category-mix analysis. Category-mix subsets are formed by concatenating all the posts of the related hashtags into action, scene, people - pets, and brand datasets. Finally, the analysis is repeated for the whole dataset. A 70% - 30% split into training set and test set was used for each subset. Initially, the Scikit-learn SVR model [25] with an RBF kernel was used as the regression model, with l1-normalization, to predict log-normalized post popularity. The C hyperparameter of the SVR was tuned by evaluating the accuracy of each model in a 5-fold cross validation over the values C = [0.01, 0.1, 1, 10, 100, 1000].
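A minimal sketch of the C tuning step, using scikit-learn's GridSearchCV with 5-fold cross validation over the listed grid; the data are placeholders:

```python
# Sketch: 5-fold cross-validated grid search over C for the RBF-kernel SVR.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X_train = np.random.rand(100, 300)   # placeholder training features
y_train = np.random.rand(100)        # placeholder log-normalized likes

grid = GridSearchCV(SVR(kernel="rbf"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100, 1000]},
                    cv=5)
grid.fit(X_train, y_train)
print("best C:", grid.best_params_["C"])
```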

For the sake of comparison, three more regression models from Python's Scikit-learn were used for the same task: a Random Forest Regressor (RF) [10] with 100 tree estimators, the optimal parameter giving the best results; a Multi-layer Perceptron (MLP) regressor using an artificial neural network, with the scikit-learn default settings [18] giving the best results in the shortest computational time; and a linear SVR with l2-normalization. The linear SVR model, apart from serving the purpose of comparing the results, was also used for storing the weights of the model vector in order to use them for recommendations in feature categories with a semantic meaning. This procedure is not possible for the non-linear SVR-RBF regressor used in this analysis, because its φ(·) mapping function cannot be determined analytically. Ranking the coefficients of the model vector leads to a choice of the top-10 semantic recommendations correlated with popularity.
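A minimal sketch of this weight-ranking step, assuming scikit-learn's LinearSVR; the concept names and data are placeholders standing in for the ImageNet concepts or BoW vocabulary:

```python
# Sketch: rank linear-SVR weights and map the top-10 coefficients back to feature names.
import numpy as np
from sklearn.svm import LinearSVR

X = np.random.rand(500, 1000)               # placeholder high-level feature matrix
y = np.random.rand(500)                     # placeholder log-normalized likes
concept_names = [f"concept_{i}" for i in range(1000)]  # hypothetical concept labels

model = LinearSVR().fit(X, y)
top10 = np.argsort(model.coef_)[::-1][:10]  # indices of the 10 largest weights
print([concept_names[i] for i in top10])    # top-10 concepts most correlated with popularity
```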

Two types of analysis were employed:

• Visual Category-Mix analysis: The regression was repeated for all visual feature categories: high-level, low-level and visual sentiment. The rank correlation between the prediction vector and the test set ground truth was reported, as well as between the average or max pooling of the predictions and the ground truth of each category subset, for each feature and each model employed.

• Textual Category-Mix analysis: Similarly to the previous analysis, the regression was repeated for all textual feature categories: W2V, BoW and textual sentiment. The rank correlation was again reported for each feature type, regression model, and their late fusion.

Finally, late fusion is employed for multimodal features, combining the results of the visual and textual analysis.

The results of this experiment are reported in Section 5.1.

4.4.2 Experiment 2: Category-Specific Analysis. The second experiment for post popularity prediction refers to the Category-Specific analysis, i.e., to each hashtag-related subset in the dataset. A 70% - 30% split into training set and test set was used for each hashtag subset. For this type of analysis, which requires a large number of regressions, the Scikit-learn SVR-RBF model was chosen as the regression model, with l1-normalization and C hyperparameter tuning over the values C = [0.01, 0.1, 1, 10, 100, 1000] in a 5-fold cross validation.

Experiment 1 (Category-Mix), as described in Section 5.1 of the results, showed that each regression model performed differently on different tasks. First, the l2-regularized linear SVR model demonstrates the lowest prediction capability, while in terms of computational time the MLP regression model is a slow regressor, although it demonstrates high prediction capability for different subset categories or features. However, its gain is often small compared with other models while the difference in computational burden is significant, which is why it is not used further in Experiment 2 (Category-Specific). A final choice had to be made between the SVR-RBF and the RF regression model, which both offer some of the best results. Although RF can be boosted in terms of performance by parallelization in its Scikit-learn implementation, it is still slower than SVR-RBF when tested in a 10-fold cross validation on 10% of the whole dataset. Reducing the number of estimators to improve computational time, on the other hand, leads it to be outperformed by SVR-RBF anyway. Moreover, the tuning of the model is complex, which does not make it the best choice for a large number of regressions, and, more significantly, it is prone to significant variance in the results, depending on the random-state initialization of the estimators. For the aforementioned reasons, SVR-RBF was chosen for Experiment 2, due to its fast and relatively simple tuning, good computational performance and stability in terms of variance in the results.

Two types of Category-Specific analysis were employed, in a similar manner to the Category-Mix analysis: a visual and a textual category-specific analysis. The procedure is as described in Section 4.4.1. This experiment leads to recommendations concerning which hashtag-related subsets are more correlated with popularity prediction.

The results of this experiment are reported in Section 5.2.

4.4.3 Experiment 3: Concept-Specific Analysis. In this last experiment, the 1000-dimensional high-level feature vectors previously extracted for each image in the dataset were divided into smaller vectors according to their index matching with specific concept categories: action and objects related to action, scene and objects related to scene, people, animals, and objects (general). The 1000 ImageNet concepts were manually labelled for the aforementioned 5 categories, and the index of the 1000-vector coordinates was matched with the index position of a specific category element. The outcome was 50-dimensional feature vectors for action/action-related objects within the 1000 ImageNet concepts, 151-dimensional vectors for scene/scenery objects, 8-dimensional vectors for people, 406-dimensional vectors for animals and 525-dimensional vectors for general objects, respectively. Many of the concepts had to be double-labelled, and as a result participate in two categories, as a strict division of concepts was not always possible in terms of semantic meaning.

Table 2: Experiment 1 - Popularity Prediction for Category-Mix using Visual Features

Subset        Model          High-Level  Low-Level  VisSent  AvgPool  MaxPool
action        SVR(rbf,l1)    0.286       0.317      0.152    0.289    0.28
              SVR(lin,l2)    0.231       0.163      0.146    0.241    0.212
              RFReg(100)     0.274       0.279      0.158    0.302*   0.288
              MLPReg         0.183       0.266      0.211    0.254    0.253
scene         SVR(rbf,l1)    0.169       0.199      0.136    0.187    0.155
              SVR(lin,l2)    0.143       0.144      0.117    0.198    0.161
              RFReg(100)     0.221       0.201      0.153    0.250*   0.231
              MLPReg         0.142       0.227      0.133    0.202    0.193
people/pets   SVR(rbf,l1)    0.187       0.233      0.193    0.227    0.245*
              SVR(lin,l2)    0.174       0.053      0.043    0.077    0.090
              RFReg(100)     0.142       0.209      0.137    0.215    0.220
              MLPReg         0.033       0.250      0.198    0.038    0.086
brand         SVR(rbf,l1)    0.167       0.174      0.146    0.188    0.187
              SVR(lin,l2)    0.104       0.105      0.115    0.143    0.113
              RFReg(100)     0.187       0.190      0.193    0.244*   0.226
              MLPReg         0.121       0.226      0.189    0.202    0.210
all           SVR(rbf,l1)    0.221       0.188      0.095    0.218    0.202
              SVR(lin,l2)    0.214       0.193      0.166    0.253    0.217
              RFReg(100)     0.232       0.218      0.174    0.260*   0.235
              MLPReg         0.155       0.247      0.164    0.230    0.205

The purpose of this analysis was to study the outcome of a model when a specific high-level concept category is chosen for predicting post popularity for each of the subsets related to action, scene, people/pets and brand. The regression model and its settings were the same as in Experiment 2. Finally, the impact of every concept category was studied for the whole dataset. The results are reported in Section 5.3.
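A minimal sketch of this concept slicing, assuming the manual index labelling is available; the scene-related indices below are randomly generated placeholders:

```python
# Sketch: slice the 1000-dim high-level vectors down to one manually labelled concept category.
import numpy as np

X_highlevel = np.random.rand(500, 1000)   # placeholder 1000-dim concept probabilities per post

# Hypothetical manual labelling: indices of the 151 scene/scenery-object concepts
scene_idx = np.array(sorted(np.random.choice(1000, 151, replace=False)))

X_scene = X_highlevel[:, scene_idx]       # 151-dimensional category-specific features
print(X_scene.shape)                      # (500, 151)
```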

5 RESULTS

5.1 Category-Mix Results

In this section, the results of the Category-Mix analysis, as described in Section 4.4.1, are discussed.

5.1.1 Visual Category-Mix Results. The results for the action, scene, people and pets, and brand subsets are presented in Table 2, for all visual features and all regression models used. In each column of the table, corresponding to a feature category, the best performance for the specific feature and subset is highlighted in bold among the different regressors. The best rank correlations across the visual features are reported for the action dataset. The highest SRCC scores for high- and low-level features are 0.286 and 0.317 respectively, for the SVR-RBF model. The highest SRCC score for visual sentiment features is 0.211, for the action dataset and the MLP regressor. It is observed from the category-mix analysis that posts related to action are more correlated with popularity in terms of visual features, even when compared with the same analysis performed for all samples in the dataset. As a result, the highest reported score for late fusion is also observed for the action dataset, reaching 0.302. Low-level visual features have the highest correlation with popularity prediction.

The best reported rank correlation for the late fusion in the whole dataset is 0.260, for the RF regression model.

Table 3: Experiment 1 - Popularity Prediction for Category-Mix using Textual Features

Subset        Model          Word2Vec  BoW    TextSent  AvgPool  MaxPool
action        SVR(rbf,l1)    0.417     0.438  0.192     0.433*   0.383
              SVR(lin,l2)    0.406     0.178  0.185     0.296    0.278
              RFReg(100)     0.417     0.460  0.109     0.399    0.343
              MLPReg         0.428     0.262  0.185     0.379    0.338
scene         SVR(rbf,l1)    0.362     0.430  0.207     0.413    0.380
              SVR(lin,l2)    0.378     0.199  0.159     0.302    0.269
              RFReg(100)     0.413     0.479  0.090     0.412    0.342
              MLPReg         0.446     0.310  0.165     0.415*   0.390
people/pets   SVR(rbf,l1)    0.203     0.194  0.056     0.166*   0.142
              SVR(lin,l2)    0.182     0.047  0.080     0.095    0.113
              RFReg(100)     0.102     0.162  0.042     0.143    0.100
              MLPReg         0.153     0.061  0.053     0.098    0.129
brand         SVR(rbf,l1)    0.236     0.167  0.087     0.228*   0.198
              SVR(lin,l2)    0.233     0.086  0.047     0.142    0.144
              RFReg(100)     0.205     0.197  0.075     0.193    0.166
              MLPReg         0.244     0.092  0.078     0.159    0.160
all           SVR(rbf,l1)    0.328     0.415  0.073     0.331    0.349
              SVR(lin,l2)    0.338     0.402  0.073     0.390    0.356
              RFReg(100)     0.370     0.448  0.065     0.372    0.292
              MLPReg         0.409     0.320  0.073     0.397*   0.370

• Recommendations from ImageNet: The weights of the model-vector elements of the linear SVR were ranked for the high-level feature analysis and for each of the category-mix subsets. The ranking was used to return the top-10 elements of the vector contributing most to popularity prediction, corresponding to the 10 most important ImageNet concepts for each category. This enables us to make recommendations for the most popular visual concepts, which appear in Table 4. For the action and scene subsets, natural environments are clearly dominant, both in some outdoor-sports hashtags and in outdoor sceneries, since most of the concepts are wild animals. Of course, this is also due to the fact that ImageNet has a large percentage of wildlife concept categories. The top-10 concepts change for people - pets and brand, where mostly indoor backgrounds appear in the images, and some home items appear in these last two subsets.

Table 4: Top-10 ImageNet Concepts - Category-Mix

action: European gallinule, white stork, ringlet, jay, brain coral, ruffed grouse, dragonfly, mantis, hognose snake, scorpion

scene: coho, cocker spaniel, king snake, eft, hyena, goose, tailed frog, ladybug, brambling, black widow

people/pets: sunglass, marimba, jeep, shovel, ballpoint, whistle, bassoon, frying pan, vault, alp

brand: green mamba, mongoose, armadillo, brambling, Leonberg, electric locomotive, Great Pyrenees, admiral, radio, grey whale

Figure 2: Most important visual ANPs for (a) action, (b) scene, (c) people-pets and (d) brand

• Recommendations from ANPs: The weights of the model-vector elements of the linear SVR were also ranked for the visual sentiment feature analysis and for each of the category-mix subsets. The top-10 adjective-noun pairs of the VSO most correlated with popularity are returned for each category-mix subset. The results are visualized in 4 wordclouds, shown in Figure 2, where the size of each ANP in the cloud corresponds to its ranking in the top-10 list. The most prominent examples are stupid face for action; relaxing cruise, clean pool and colorful garden for scene; pretty kitty and attractive face for people and pets; and ancient farm and relaxing beach for brand.

5.1.2 Textual Category-Mix Results. Finally, the results of the category-mix analysis for textual features are presented in Table 3. The analysis is similar to the one made for the visual features. Different regression models are tested on the subsets of action, scene, people and pets, and brand, and for all the textual features. The best performance per subset and feature is highlighted in bold among the different regressors. The scenes subset is observed to have the highest rank correlations with all textual features, more specifically: 0.446 for W2V, 0.479 for BoW and 0.207 for textual sentiment. In contrast with the action dataset being the most correlated with visual features, textual content related to scenes appears to be the most correlated with post popularity prediction. In general, SRCC receives higher values for the textual features than for the visual features, underlining the importance of textual analysis for post popularity prediction.

The best reported SRCC for the late fusion in the whole dataset is 0.397, for the MLP regressor.

Figure 3: Most important words in BoW representation for (a) action, (b) scene, (c) people-pets and (d) brand

• Recommendations from BoW: The textual representation that comes in handy for meaningful recommendations to users is the BoW representation. The weights of the linear SVR model vector for action, scene, people and pets, and brand were ranked for the BoW features. The ranking resulted in detecting the top-10 words most correlated with popularity prediction, by matching their indices with the precomputed sorted BoW vocabulary. The results are shown in Figure 3. The wordcloud visualizations use bigger words for the higher-ranked words among the top-10. The highest-ranked words for action, like aw, guru and goaltender, are also associated with sports. For scene, places like pharmacy or Kuala Lumpur are top-ranked. The top-ranked words for people/pets are pussycat, personas and sleeping, and for brand they are giggle, moving, consume and froze, also associated with fast-food brand campaigns.

5.1.3 Multimodal Fusion. The last procedure in Experiment 1 was to fuse the prediction vectors of all types of features, both visual and textual, for each category-related subset. The results of the multimodal fusion appear in Table 5 and underline the complementarity of using a combination of both visual and textual features for popularity prediction. In social media like Instagram, where image centrality is the most important factor for drawing other users' attention, the hidden information in the image contributes to the reactions in the network and thus has to be analyzed. However, fusing visual features alone does not boost popularity prediction performance as much as when the image caption and the textual information accompanying the image are also considered to support the prediction.

Table 5: Experiment 1 - Multimodal Fusion

                action            scene             people/pets       brand             all
Model           AvgPool  MaxPool  AvgPool  MaxPool  AvgPool  MaxPool  AvgPool  MaxPool  AvgPool  MaxPool
SVR(rbf,l1)     0.467    0.394    0.414    0.334    0.166    0.154    0.268    0.218    0.354    0.330
SVR(lin,l2)     0.340    0.241    0.316    0.213    0.114    0.094    0.184    0.120    0.391    0.306
RFReg(100)      0.443    0.347    0.430    0.323    0.208    0.101    0.267    0.202    0.409    0.299
MLPReg          0.397    0.316    0.407    0.356    0.074    0.043    0.215    0.196    0.407    0.329

Figure 4: Experiment 2 - Popularity Prediction for Category-Specific using Visual Features

5.2 Category-Specific Results

The first group of results from the Category-Specific analysis, described in Section 4.4.2, are depicted in Figure 4 and Figure 5, for the visual and textual analysis respectively. Spearman's Rank Correlation Coefficient is reported between each prediction vector and the test set ground truth for each hashtag-related subset. The two figures report the results as heatmap visualizations, where the cells with the highest correlation between a feature and a hashtag are depicted in dark red, while the lowest, or in some cases negative, correlations are depicted in dark blue. The legend on the right side of each heatmap shows the color gradation according to the correlation score.

5.2.1 Visual Category-Specific. It is observed in Figure 4 that two of the hashtags highly correlated with all visual features are #dance and #swimming_pool. The #dance subset scores an SRCC of 0.4 for the high-level, 0.321 for the low-level and 0.188 for the visual sentiment features, while average and max pooling reach 0.4 and 0.403 respectively. The #swimming_pool subset also has high SRCC scores for all visual features, with 0.495 for high-level, 0.598 for low-level, and 0.309 for visual sentiment. Average and max pooling over all visual features together reach SRCC scores of 0.503 and 0.535 respectively.

The results are remarkably high for these two hashtags, showing the effect on popularity of pleasant everyday actions and places hidden in the visual content, and exceeding the state-of-the-art scores reported in the related literature for popularity prediction based on visual features alone.

For action-related hashtags, significant positive correlation is also reported for e.g. #basketball, #horseriding, #hug, #climbing and #cycling, while among the scene-related hashtags #art_gallery, #bar and #forest perform well. The presence of a person in the image is also important, as #selfie has a significant positive correlation, as does brand presence for the #Wendys subset.

• Recommendations: The results in this case show that users including action in their images are more likely to make their future posts popular, particularly when it has to do with dancing or some sports activity. Scenes in their images are a bit less likely to be popular, unless they are connected with pleasant activities like swimming or having drinks. However, the low-level feature analysis for scene shows that background colors, contrasts etc. are important for making images popular in social media. Lastly, face and brand centrality are important for increasing popularity as well.

Figure 5: Experiment 2 - Popularity Prediction for Category-Specific using Textual Features

5.2.2 Textual Category-Specific. The observations from Figure 5 show that the most important subsets are #dance and #swimming_pool, as was also observed for the visual part, making these the hashtags most strongly correlated with predicted popularity from multimodal features. Regarding textual features in action-related hashtags, #ski is also reported with a high performance, as well as #climbing and #hug, while among the scene-related hashtags #bedroom and #forest stand out.

The #dance subset scores 0.475 for W2V, 0.48 for BoW, 0.291 for textual sentiment, and 0.419 and 0.47 for average and max pooling respectively. The #ski subset reports 0.414 for W2V, 0.443 for BoW, 0.231 for textual sentiment, and 0.439 and 0.445 for average and max pooling respectively. The most impressive results are the ones related to #swimming_pool, which scores 0.675 for W2V, 0.669 for BoW, 0.549 for textual sentiment, and 0.652 and 0.653 for average and max pooling respectively. These are also the highest scores in this thesis. Textual features have a stronger correlation with popularity prediction than visual features in the category-specific analysis.

• Recommendations: Sports- and entertainment-related words in social media posts are very important for post popularity prediction. Dominant are the words used to describe actions like dancing, skiing, climbing and hugging, while some indoor or outdoor places like the bedroom, and especially the swimming pool, related to relaxation and positive sentiments, are also highly correlated with the popularity of future posts.

5.3 Concept-Specific Results

In this section, the presented results are relevant to the experiment described in Section 4.4.3 and are shown in Figure 6. For all subsets except the scene subset, all 1000 ImageNet concepts have more descriptive power for the visual content of the subsets than a percentage of them correlated mostly with the specific subset alone. This observation reinforces the remarks of the authors in [14] for video concept vocabularies. The only case where all 1000 concepts are less effective for popularity prediction than specific ones is the scene subset, which is a new research outcome in terms of popularity prediction with visual features. In this case, the scene subset is observed to have the highest rank correlation with the concepts related only to scene/scenery objects, forming 151-dimensional high-level feature vectors of 151 such concepts. The reported correlation is 0.195, versus 0.169 for all concepts.

The conclusion of the experiment is that higher-dimensional feature vectors contribute more to popularity prediction, since they provide as many descriptors as possible for a subset category of posts.


Figure 6: Correlation between types of high-level concepts and subset categories

6 CONCLUSIONS

To conclude this thesis, the proposed pipeline generated results that provide answers to the research questions formulated in the context of this work.

All three experiments conducted provided significant results for recommendations to users and recommendations for content. A number of features were tested with regression learning models for post popularity prediction and proved adequate to lead to recommendations. The highest rank correlations among both visual and textual features indicated human joyful activities and fun places that could make a post popular if they appear in its content. Furthermore, visual features, especially low-level ones, are more powerful when predicting the popularity of action content. On the other hand, textual features, especially bag-of-words representations, are more powerful for popularity prediction of scene content. Another observation is that high-level concepts related to scene or indoor/outdoor objects have the descriptive power with the highest correlation with popularity prediction in scenery datasets. Human faces and animals are also important for popularity prediction, as the ranked weights of the adjective-noun pairs and bag-of-words representations indicate. Action in post content also presents the highest correlation with popularity for the multimodal fusion of all features.

Finally, both visual and textual features are important for predicting popularity, and combining the best selected features, as a future target of this research, will be the next step towards optimizing popularity prediction models with multimodal features.

7 ACKNOWLEDGMENTS

This research was completed successfully with the immense support and guidance of Dr. Masoud Mazloom, the inspiring and mentoring remarks of Dr. Efstratios Gavves, the technical help with the UvA Sesame servers from Bouke Hendriks and Dr. Dennis Koelma, and the moral support of Dr. Nikolaos Paterakis, who are all very much thanked for their help and contribution in any way possible.

REFERENCES

[1] Gensim: Deep learning with word2vec. http://radimrehurek.com/gensim/models/word2vec.html. Accessed: 2017-04-25.

[2] Instagram application programming interface. https://www.instagram.com/developer/. Accessed: 2017-04-01.

[3] Keras: The python deep learning library. https://keras.io/. Accessed: 2017-04-25.

[4] Python data analysis library - pandas. http://pandas.pydata.org/. Accessed: 2017-04-01.

[5] TensorFlow: An open-source software library for machine intelligence. https://www.tensorflow.org/. Accessed: 2017-04-25.

[6] TextBlob: Simplified text processing. https://textblob.readthedocs.io/en/dev/index.html. Accessed: 2017-04-25.

[7] Younggue Bae and Hongchul Lee. Sentiment analysis of twitter audiences: Measuring the positive or negative influence of popular twitterers. Journal of the American Society for Information Science and Technology, 63(12):2521–2535, 2012.

[8] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 1st edition, 2009.

[9] Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM International Conference on Multimedia, MM ’13, pages 223–232, New York, NY, USA, 2013. ACM.

[10] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001.

[11] Spencer Cappallo, Thomas Mensink, and Cees G.M. Snoek. Latent factors of visual popularity prediction. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, ICMR '15, pages 195–202, New York, NY, USA, 2015. ACM.

[12] Tao Chen, Damian Borth, Trevor Darrell, and Shih-Fu Chang. Deepsentibank: Visual sentiment concept classification with deep convolutional neural networks. CoRR, abs/1410.8586, 2014.

[13] Francesco Gelli, Tiberio Uricchio, Marco Bertini, Alberto Del Bimbo, and Shih-Fu Chang. Image popularity prediction in social media using sentiment and context features. In Proceedings of the 23rd ACM International Conference on Multimedia, MM ’15, pages 907–910, New York, NY, USA, 2015. ACM.

[14] Amirhossein Habibian, Koen E.A. van de Sande, and Cees G.M. Snoek. Recommendations for video event recognition using concept vocabularies. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, ICMR '13, pages 89–96, New York, NY, USA, 2013. ACM.

[15] Liangjie Hong, Ovidiu Dan, and Brian D. Davison. Predicting popular messages in twitter. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 57–58, New York, NY, USA, 2011. ACM.

[16] Mark J. Huiskes and Michael S. Lew. The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA, 2008. ACM.

[17] Aditya Khosla, Atish Das Sarma, and Raffay Hamid. What makes an image popular? In International World Wide Web Conference (WWW), Seoul, Korea, April 2014.

[18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[19] Masoud Mazloom, Robert Rietveld, Stevan Rudinac, Marcel Worring, and Willemijn van Dolen. Multimodal popularity prediction of brand-related social media posts. In Proceedings of the 2016 ACM on Multimedia Conference, MM ’16, pages 197–201, New York, NY, USA, 2016. ACM.

[20] Philip J. McParlane, Yashar Moshfeghi, and Joemon M. Jose. "nobody comes here anymore, it’s too crowded"; predicting image popularity on flickr. In Proceedings of International Conference on Multimedia Retrieval, ICMR ’14, pages 385:385– 385:391, New York, NY, USA, 2014. ACM.

[21] Gijs Overgoor, Masoud Mazloom, Marcel Worring, Robert Rietveld, and Willemijn van Dolen. A spatio-temporal category representation for brand popularity prediction. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, ICMR 2017, Bucharest, Romania, June 6-9, 2017, pages 233–241, 2017.

[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[23] Xin Rong. word2vec parameter learning explained. CoRR, abs/1411.2738, 2014.

[24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[25] Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, Aug 2004.

[26] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
