Detection of popular issues on Twitter

Submitted in fulfillment of the requirements for the degree of Master of Science

George Gousios
Student number: 11836156

Master Information Studies: Data Science
Faculty of Science, University of Amsterdam
Defense date: 2018-07-04

Internal Supervisor: Dr Maarten Marx (UvA)
External Supervisor: Wouter Koot, BSc. (OBI4Wan)


Contents

Abstract
1 Research Goal
2 Introduction
3 Related Work
  3.1 Measuring Popularity
  3.2 The power of Social Media
  3.3 Different approaches
  3.4 Classification Features explored
4 Basic Definitions
5 Methodology
  5.1 Preparing the Data
    5.1.1 Choosing the features
    5.1.2 Dealing with missing values
    5.1.3 Preprocessing
    5.1.4 Plotting Tweet features
    5.1.5 Check for Correlations
    5.1.6 Deriving Aggregate Features
    5.1.7 Labeling
    5.1.8 Dealing with Class Imbalance
  5.2 Evaluating the baseline model
  5.3 Machine Learning
    5.3.1 Logistic Regression
    5.3.2 Decision Tree
    5.3.3 Random Forest
    5.3.4 SVM (Linear kernel)
    5.3.5 SVM (RBF kernel)
    5.3.6 ANN
  5.4 Results & Conclusion
6 Discussion
References
7 Appendix


ABSTRACT

This thesis focuses on predicting whether a topic will become popular on Twitter, using a dataset of Twitter data and various classifiers. The task falls under the category of short-term prediction and, in particular, binary classification. Content and social features were extracted and used to train various machine learning classifiers. The best performing classifiers outperform the baseline model marginally, providing an improvement of 2-3% in terms of F1 score.

KEYWORDS

Social Media Analysis, Popularity Prediction, Twitter, Feature Extraction, Feature Selection, Machine Learning, Deep Learning

1 RESEARCH GOAL

The aim of this project is to improve the following task: given a set of tweets about a specific topic, classify this set as popular or not. Popularity is measured in terms of future tweets, and will be further explained later on. The baseline is the currently implemented method at OBI4wan, which achieves an F1 score of 72.8%.

2 INTRODUCTION

The project focuses on the binary short-term classification (popular/non-popular) of aggregated Twitter data from April 2018. Specifically, only published tweets from non-private accounts are used; direct messages and private tweets are excluded.

In order to determine the most suitable evaluation criterion, it is important to understand the purpose of the task under discussion. OBI4wan sends alert notifications to its clients based on the model’s classification results, so, apart from correctly predicting true positives, minimizing false positives and detecting true negatives is also essential, so that clients do not receive false alerts. For that reason, precision and recall are appropriate metrics for model evaluation, and a metric that harmonically combines both is the F1 score, which is the metric of choice here. That decision is supported by my colleague Matt Chapman’s research on the same matter [8].

3 RELATED WORK

3.1 Measuring Popularity

Throughout the past decade, efforts have been made to address the problem of popularity on social media. Popularity of content is quite an ambiguous term, so it is logical that different approaches use different, usually task-specific, definitions. Szabo and Huberman [37] used views for Digg (diggs) and Youtube videos to define popularity, and tried to make predictions 7 and 30 days ahead. F. Figueiredo [12] used multi-class classifiers to predict Youtube/Vimeo popularity classes, which were derived from a time-series clustering algorithm. Additionally, Cheng-Te Li et al. [24] defined Youtube popularity as the number of shares of the videos, and performed multi-class classification on data labeled with self-defined thresholds.

On a more relevant note, numerous efforts have been made for social media as well. In recent classification approaches, F. Gelli et al. [14] and A. Khosla [19] consider Flickr views a good measure of popularity, while Philip J. McParlane [26] considers both views and comments as factors that define his labeling. H. Lakkaraju [20] used comments as a measure of Facebook post popularity and defined 5 classes before performing his classification task.

As far as Twitter is concerned, Z. Ma et al. [25] try to predict hashtag popularity (defined as the volume of tweets containing each hashtag) using multi-class classification, after choosing values for time windows and post volume thresholds to define their labeling. Lastly, L. Hong et al. [16] use reshares (retweets) as a metric for popular tweets, and train classifiers to predict 5 popularity classes, defined by self-chosen threshold values.

The projects mentioned above either used arbitrary thresholds to define the data’s labels, or used the Pareto Principle [40]. In the scope of this project, the latter is chosen.

3.2 The power of Social Media

The predictive potential of social media has been shown to be large in work such as “Predictive Analytics with Social Media Data" [21], where several different predictive models are tested, demonstrating the strong predictive power of a model that takes social media activity into account to predict, impressively accurately, the sales of iPhones and H&M clothing products.

Furthermore, Asur et al. [1] try to forecast box-office revenues for movies based on, among other features, sentiments extracted from tweets, using a simple regression model. Their model outperformed in accuracy the Hollywood Stock Exchange (which could be considered the “gold standard" at the time of that research).

The use of social media can be extended to a large panoply of topics, ranging from the future rating of products to agenda setting and even election outcomes [13]. At a deeper level, this work shows how social media expresses a collective wisdom which, when properly tapped, can yield an extremely powerful and accurate indicator of future outcomes [1].

3.3 Different approaches

There have been numerous different approaches to predicting popularity throughout the years, the most fundamental of which are time-series analysis and feature-based classification/regression. The former is based on early popularity fluctuation, under the assumption that patterns early in the data tell the story about its future popularity [17]. Factors such as growth, bursts and decay are the main focus, and their classification can be considered an extended application of trajectory clustering. Existing virality prediction algorithms try to forecast time series based on past values [22].

Furthermore, a colleague from OBI4wan, Matt Chapman, tackled this subject in his own Master’s thesis using time-series analysis, in particular applying different change point detection techniques to make predictions on both real-world and simulated data, only to discover that they do a poor job of providing timely and accurate predictions, and, as a result, more advanced techniques need to be implemented [8].

When it comes to feature-based classification, there have been many supervised machine learning efforts to achieve the desired goal. Some have focused more on community structure (network-bound) or speed of growth, while others have studied innate appeal/emotion [3] or social influence [31]. Nevertheless, the most popular features include user actions such as resharing [6] and hashtag count/retweetability [23]. Interestingly, research has also shown that the structure of the network can play a significant role in content virality [39].

Furthermore, the absence of labels greatly influences the methodology, as it dictates that one should either use unsupervised learning methods or resort to manual labeling. Fortunately, predictions based on features derived through manual labeling of data have shown highly accurate results [34].

3.4 Classification Features explored

In terms of our classification task, related literature provides some valuable insights into which features are more likely to possess significant predictive power for machine learning techniques. F. Gelli [14] indicated that sentiment is a valuable feature for popularity. On the other hand, J. Berger’s research [4] showed that positive sentiment often leads to popular content, but that some negative sentiments (such as awe, anxiety, anger) may also contribute in this direction. As a result, an initial intuition dictates that sentiment polarity (the fraction of non-neutral-sentiment tweets) can be a valuable feature for the task at hand.

Additionally, L. Hong [16] used content (e.g. TF-IDF) and temporal features, among others. The latter is also backed up by H. Lakkaraju [20], while A. Khosla [19] found that the significance of these feature categories greatly differs depending on the dataset. Nevertheless, in the scope of this research, those features are considered relevant and will be explored. Other content features, such as average views/comments/likes/favorites rates, have been examined as well [12].

On the other hand, social features like Impact are known to play an important role in the classification feature arsenal, given that, according to common intuition, the more impact an entity or a person has, the more likely it is that the content they refer to is going to be popular. This intuition is also verified by Lakkaraju’s research [20]. Accordingly, the number of followers/contacts has been investigated by J. McParlane [26].

4 BASIC DEFINITIONS

• A keyword is a word (string) from the keywords field (see Methodology below) of a tweet. These strings are used as a topic indicator, and along with temporal information (postingDate) they are used to group tweets together in the aggregation stage of the methodology.

• A time window is a 60-min time interval, representing the time structure according to which tweets are partitioned, as in [25]. The duration of the time windows, although consistent/fixed, is configurable, but we find that 60 minutes suits our need for short-term predictions well. Time windows are consecutive (15:00-16:00, 16:00-17:00, etc.) and discrete (they do not overlap).

• An issue I(T,n) is defined as the set of all tweets about the same topic T (keyword) during a specific time window n. A tweet may belong to more than one issue in the same time window n, as, depending on its “keywords" field, it may match multiple topics. Issues can overlap in terms of time, as multiple issues can belong to the same time window. “Given a topic T and the n-th consecutive time window n, I(T, n) is defined as the issue in time window n on topic T."

• In the scope of this research, an issue is considered popular if it contains more than N tweets, where N is a threshold specified by the Pareto principle [40] (as indicated in [26]). The Pareto Principle goes back to the Italian economist Vilfredo Pareto, who in 1896 observed that 80% of the wealth in Italy was owned by 20% of the individuals. The principle penetrated business and other fields through the years as a means of directing focus towards what is more important. The generalized Pareto Principle is the observation that most things are not evenly distributed, in the sense that 20% of the inputs create 80% of the results.

The principle does not require an 80/20 distribution, though; the values always depend on the underlying task and generally describe the skewness of the distribution, in the sense that a distribution is more skewed if a smaller part of the causes produces a larger part of the results (the values may also not add up to 100, e.g. 90/20, 70/25, 80/10) [40].

5 METHODOLOGY

5.1 Preparing the Data

5.1.1 Choosing the features. The initial dataset was harvested from an ElasticSearch cluster that belongs to the company. It contains one month’s worth of Twitter data (April 2018), approximately 5.5 million rows (tweets). As can easily be guessed, the company holds detailed information on posts, including both raw (e.g. text content) and derived (e.g. Keywords) features. Table 1 shows all features divided into these categories.

Notably, according to OBI4wan’s published whitepaper [? ], Impact is defined as a weight factor that corresponds to the source’s category (categories of sources derive from a custom OBI4wan algorithm), and is used as a metric for the potential impact an information source (in this case, a tweet) can have on the public.

Raw features come straight from the source itself (in our case Twitter), while derived features are a result of either manual assignment or some internal OBI4wan algorithm. As we can see, there are features that are not of any particular use for our research.

First of all, the "Body" field of each tweet is purposely empty for all tweets, so it can be ignored. Likewise, “PostingType" is always “post", so it does not provide any value for our task. SourceURL and SourceSite both refer to Twitter for all rows in our data, so they can be excluded as well. Projects is a field that reflects OBI4wan’s mechanism for matching content to specific existing queries via the parser/matcher service, and, again, is of no use in the scope of our research.


Feature | Explanation | Raw/Derived
Body | Empty field (used in other platforms) | Raw
Head | Contains all tweet content | Raw
Reach | Poster's follower count | Raw
Impact | Normalized Impact of the Source | Derived
Likes | Tweet's like count | Raw
Retweets | Tweet's retweet count | Raw
Username | Poster's username | Raw
Keywords | List of all strings of words in tweet | Derived
PostingDate | Date the tweet was tweeted | Derived
Language | The language of the tweet | Derived
Country | Country from which the tweet was posted | Raw
SourceURL | URL of the site (in our case, Twitter's URL) | Raw
SourceSite | URL of the site (in our case, Twitter) | Raw
PostingType | Categorical: Print/post | Derived
Projects | List of projects associated with a tweet | Derived

Table 1: Data Features

On a different note, a visualization of the country field shows the following results:

Figure 1: Country feature Plot

It is obvious that the country with code number 22 (which corresponds to ’International’) is the dominant class in this categorical feature. As a matter of fact, it includes over 91% of the dataset, with countries with code numbers 13 and 3 sharing the remaining 9%, so this feature can safely be excluded from our feature set.

The language field consists exclusively of instances with value “2", which corresponds to the Dutch language in the OBI4wan system, as expected from the initial description of the dataset. It can therefore be neglected.

To sum up, our initial feature selection includes some content features (tweet length, sentiment polarity) and some social features (impact, reach), as suggested by the relevant literature [2], but is going to be further enriched with some more features during our feature engineering phase right after the appropriate tweet aggregations.

5.1.2 Dealing with missing values. The next step is checking the data for missing values column-wise. For example, should a certain feature have more than a certain percentage of missing values, those values can be imputed in some way, or, in extreme cases, the feature should be omitted. Luckily, the only column with a significant share of NaN values (100% NaN) is Body (which makes perfect sense for Twitter data), and it has already been excluded.

For a row-wise check, a simple yet effective way to identify missing values in all columns of the data is a compact visualization from the Seaborn python library [33] (see Figure 2).

Figure 2: Visualize Missing Values

As can be seen from Figure 2, the most missing values are found in the columns “likesfavorites" and “sharesretweets", yet the amount is by all means insignificant. Given the fact that our dataset is large enough, we can afford to drop these rows, as the vast majority of the information is still retained. As a result, after dropping those rows, a mere 0.65% of our initial data is lost.
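To make the check above concrete, the following is a minimal sketch of how such a column- and row-wise inspection could be done with pandas and Seaborn; the DataFrame name df and the exact column names are assumptions, not the thesis code.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Column-wise: share of missing values per feature (flags fully empty columns such as Body)
print(df.isnull().mean().sort_values(ascending=False))

# Row-wise: one cell per value, coloured where data is missing (compare Figure 2)
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# Drop the few rows missing likes/retweets; roughly 0.65% of the data is lost
df = df.dropna(subset=["LikesFavorites", "SharesRetweets"])
```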

5.1.3 Preprocessing. At first, our dataset included data at a discrete post level (tweets). The main idea was to discard unusable features, deal with NaN values and perform some preprocessing steps that enable grouping the tweets into issues. After the first steps, the data is left with the following features:

Data Features: PostingDate, LikesFavorites, PosterUsername, SharesRetweets, Keywords, Impact, Reach, Head

Table 2: Features

Notably, the “Head" field contains the text of each tweet (due to the company’s design choices), hence the reason we keep it. It is also worth remembering that “Keywords" is the field that our topics rely on. Due to memory issues in the preprocessing steps that follow, the dataset is randomly downsampled to 15%, leaving about 820 thousand rows of tweets. The following preprocessing steps are applied:

• Calculating TweetLength: each tweet is lowercased, cleaned of stopwords and some other minor artifacts that would otherwise be counted as words (such as the RT annotation that stands for retweet). The resulting text is counted in terms of words.

• A conversion of the “Keywords" list from a list containing Unicode objects to a list of python strings that can later be extracted for grouping purposes. The strings of the list are lowercased, lemmatized, filtered against NLTK’s list of stopwords [27], and stripped of single-character strings and strings containing non-European symbols, in order to keep the keywords (and thus the future issues’ topics) as noise-free as possible. Additionally, words with more than 3 identical consecutive characters were removed, as there was still considerable noise from strings such as “whaaaaaaat", “yeyyyyy" and “aaaaaahhhh". The results of this process can be observed in figures 3 and 4:

Figure 3: Tweet before list parsing

Figure 4: Tweet after list parsing

• A calculation of a sentiment polarity score for each tweet, using the “pattern" python library. Focus is put on whether tweets are infused with sentiment or are neutral, rather than what sentiment this is. Each tweet gets a float value in the range of (-1,1), where -1 is the maximum negative sentiment score, +1 is the maximum positive sentiment score and 0 is neutral. Afterwards, the recommended threshold of 0.1 (see [28]) is used to create two classes of sentiment-infused (1) or not (0) tweets.

• An “explosion" of the keywords list that is necessary for the grouping (so that a tweet can be part of multiple issues in the same time window, but on different topics). This action refers to the operation shown below in figures 5 and 6; a code sketch of these preprocessing steps follows the figures.

Figure 5: Keywords list before “explosion"

Figure 6: Keywords list after “explosion"
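The list above can be summarized in code. The sketch below assumes a pandas DataFrame df with the column names of Table 2, plus NLTK and the pattern library installed; it illustrates roughly how the tweet length, the sentiment flag and the keyword "explosion" could be computed, and is not the exact thesis implementation.

```python
import pandas as pd
from nltk.corpus import stopwords
from pattern.nl import sentiment  # tweets are Dutch; pattern.en exposes the same call

STOPWORDS = set(stopwords.words("dutch"))

def tweet_length(text):
    # Lowercase, drop stopwords and the "RT" retweet marker, count remaining words
    words = [w for w in text.lower().split() if w not in STOPWORDS and w != "rt"]
    return len(words)

df["tweetlength"] = df["Head"].apply(tweet_length)

# Sentiment polarity in (-1, 1); |polarity| >= 0.1 marks a sentiment-infused tweet
df["sentimentpolarity"] = df["Head"].apply(
    lambda text: int(abs(sentiment(text)[0]) >= 0.1)
)

# "Explode" the cleaned keyword list: one row per (tweet, keyword) pair,
# so a tweet can join several issues within the same time window
df = df.explode("Keywords")
```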

Figure 7: Tweet Features’ Distributions

5.1.4 Plotting Tweet features. The distributions of Likes, Retweets, Impact, Reach, tweetlength and sentimentpolarity are plotted (figure 7).

For all features except the polarity, the distributions are right skewed, as there are few tweets that contain very large values for some of those features, while the vast majority of them contain small values. The sentiment polarity feature seems to be quite balanced, in a way that the classes of tweets with and without any sentiment are comparable in size.

5.1.5 Check for Correlations. At this point, the data is checked for collinearity among the features, even though the features are not to be used in their current form. Calculating the Pearson correlation and visualizing it can be performed using the Seaborn python library, as shown in figure 8.

It is obvious that the strongest correlation is between Sentiment Polarity and Tweetlength (0.22 Pearson correlation score), followed by Reach and Impact (which has already been investigated by [2]). Correlations below 0.3 are not considered significant, so this is not an alarming discovery.
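A minimal sketch of this check, again assuming a pandas DataFrame df with the (assumed) tweet-level feature columns:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tweet_features = ["LikesFavorites", "SharesRetweets", "Impact", "Reach",
                  "tweetlength", "sentimentpolarity"]

# Pearson correlation matrix of the tweet-level features (compare Figure 8)
corr = df[tweet_features].corr(method="pearson")
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```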

Figure 8: Correlations between tweet features

5.1.6 Deriving Aggregate Features. Tweets are aggregated into 60-min time windows based on matching topics (matched by keyword) and PostingDate, to form our issues data, and the aggregated features of the issues were calculated as follows:

Data Feature | Aggregation
LikesFavorites | Average
PosterUsername | Sum of unique values (users)
Impact | Average
Sentiment Polarity | Average
TweetLength | Average
TweetCount | Count of all tweets in each issue
TweetCount2 | TweetCount merged with SharesRetweets

Table 3: Issue Features

Next, “TweetCount2" is added as the number of tweets in a discussion merged with the number of retweets (TweetCount+SharesRetweets). The last feature additions include calculating the rates at which the amount of Likes and Retweets grow during each issue's time window.

To facilitate the manual labeling process, for each issue I(T,n), nexttweetcount is calculated as the tweet volume of I(T,n+1), which is the tweet volume of the same topic in the next time window.
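A hedged sketch of this aggregation step with pandas follows; the grouping keys and aggregate names mirror Table 3, but the column names, the derived feature names and the exact window handling are assumptions.

```python
import pandas as pd

# 60-minute, non-overlapping windows derived from the posting timestamp
df["PostingDate"] = pd.to_datetime(df["PostingDate"])
df["window"] = df["PostingDate"].dt.floor("60min")

issues = (
    df.groupby(["Keywords", "window"])
      .agg(
          likes_avg=("LikesFavorites", "mean"),
          unique_users=("PosterUsername", "nunique"),
          impact_avg=("Impact", "mean"),
          polarity_avg=("sentimentpolarity", "mean"),
          tweetlength_avg=("tweetlength", "mean"),
          tweetcount=("Head", "size"),
          retweets_sum=("SharesRetweets", "sum"),
      )
      .reset_index()
)

# TweetCount2: tweets plus retweets observed in the window
issues["tweetcount2"] = issues["tweetcount"] + issues["retweets_sum"]

# nexttweetcount: tweet volume of the same topic in the following window (0 if absent)
nxt = issues[["Keywords", "window", "tweetcount"]].copy()
nxt["window"] -= pd.Timedelta(hours=1)
nxt = nxt.rename(columns={"tweetcount": "nexttweetcount"})
issues = issues.merge(nxt, on=["Keywords", "window"], how="left")
issues["nexttweetcount"] = issues["nexttweetcount"].fillna(0)
```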

5.1.7 Labeling. Lastly, as mentioned earlier in this document, each I(T,n) of the 1.5M issues produced was labeled as popular (1) or not popular (0) based on the tweet volume of I(T,n+1), using the Pareto Principle [40].

To elaborate, at this point it can easily be observed that 80% of the total volume of tweets exists in the top 10% of the issues (in terms of tweet volume). In the scope of this research, the Pareto Principle will be used to select a threshold to split between popular and not popular issues, as has been done in existing work ([26], [7]). Consequently, the popularity threshold for issues in terms of tweetcount is defined as:

“N is the minimum number of tweets that exist in the top 10% of the issues in the data"

Notably, the information about each topic’s progress (tweet volume of I(T,n+1) for every I(T,n)) already exists in the data for all issues except those that belong to the very last time window of the dataset. The latter were discarded, which resulted in an insignificant loss of less than 0.1% of the data. For the former, we extract this information and compare it to the threshold defined above to determine each issue's class label. Applying the rule above indicates that the 10% mentioned above corresponds to all issues with at least 10 tweets, thus our resulting threshold is defined as N=10.
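Under those definitions, the labeling step could look roughly like the sketch below, continuing the hypothetical issues DataFrame from the previous sketch; the quantile-based threshold is one way to express the "top 10% of issues" rule and is an assumption about the exact implementation.

```python
# Pareto-style threshold: N is the smallest tweet count among the top 10% of
# issues ranked by tweet volume; on this dataset the thesis reports N = 10
N = int(issues["tweetcount"].quantile(0.90))

# Issues in the very last time window have no known next window: discard them
issues = issues[issues["window"] < issues["window"].max()]

# Popular (1) if the topic reaches the threshold in the next time window
issues["label"] = (issues["nexttweetcount"] >= N).astype(int)
```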

The final labeled dataset was checked for Pearson correlations among the predictors in figure 9:

Figure 9: Correlations between Issue features

From figure 9, the high Pearson correlation values indicate that there are some serious collinearity issues here, but that will be addressed in the following sections. It is noteworthy that tweetcount and nexttweetcount are strongly correlated (94%). That means that there is a strong linear relationship between the number of tweets of issues I(T,n) and I(T,n+1). That is of vital importance for this project, for reasons that will be explained later on.

5.1.8 Dealing with Class Imbalance. The next step is splitting the data into training and test sets in a 70-30 manner. There is a huge class imbalance, which means that in a regular random split the minority class (popular) might be greatly underrepresented. This can lead to two problems. First, if the minority class is underrepresented in the training set, the classifiers will not learn enough to predict it correctly and test performance will be poor, as most instances of the poorly-learned minority class end up in the test set. Secondly, if the minority class is underrepresented in the test set, the generalization error indicated by the test performance will be unrealistically good, as very few instances of the minority class exist in it. As a countermeasure, a stratified split, as proposed by Scikit Learn [32], is used, which ensures that the class ratios in the train and test sets are approximately the same as in the data before the split (roughly 10% popular), while the instances of both classes are still randomly assigned to the two sets.

Afterwards, 10-fold stratified cross-validation is chosen over a single validation set, for two main reasons. First, we do not actually ’lose’ any data (as in a train/validation split), which means that we do not risk losing important patterns/trends in the data, which in turn would increase the error induced by bias. Secondly, the error estimation is averaged over all 10 folds to estimate the overall effectiveness of our model, which reduces both bias and variance. As for the stratified version of cross-validation, the same reasons apply as for the stratified train-test split described above ([9], [15]).
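With scikit-learn, the stratified split and the 10-fold stratified cross-validation described above can be set up as in the sketch below; the feature and label column names are assumptions carried over from the earlier sketches.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

feature_cols = ["likes_avg", "unique_users", "impact_avg", "polarity_avg",
                "tweetlength_avg", "tweetcount", "tweetcount2"]
X, y = issues[feature_cols], issues["label"]

# 70/30 stratified split keeps the ~10% popular-class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# 10-fold stratified cross-validation for model selection on the training set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
```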

5.2 Evaluating the baseline model

We compare against the following simple Markov-like baseline model, which makes naive predictions based on its retrospective nature. Given C(I(T, n)), the actual popularity class of topic T in time window n, and C’(I(T, n+1)), the predicted popularity class of topic T in time window n+1, the predicted classification is:

“C’(I(T, n+1)) = C(I(T, n))"

In other words, at the end of time window n, it is automatically assumed that the popularity pattern for each topic T is going to be the same for the next time window as it actually was in the one that just ended. Specifically, if the actual number of tweets of issue I(T, n) exceeds the predefined threshold (thus making it a popular issue), the prediction for I(T, n+1) is positive (popular). Otherwise, I(T, n+1) is predicted as not popular. If topic T does not exist in time window n (the tweet volume of I(T,n) is 0), then C’(I(T,n+1)) yields 0 as well.
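A compact sketch of this baseline, reusing the issue-level columns and the threshold N assumed in the earlier sketches:

```python
from sklearn.metrics import f1_score

# Markov-like baseline: I(T, n+1) is predicted to keep the class of I(T, n);
# a topic absent from window n (tweetcount = 0) is predicted not popular
issues["baseline_pred"] = (issues["tweetcount"] >= N).astype(int)

print(f1_score(issues["label"], issues["baseline_pred"]))  # thesis reports ~0.728
```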

Due to the strong correlation between tweetcount and nexttweetcount, as mentioned in section 5.1.6, tweetcount has great predictive power over nexttweetcount (in other words, the current popularity of a topic is a great indicator of its future popularity in this dataset), which is the main reason the baseline model scores so well. Subsequently, the baseline model provides an F1 score of 72.8%.

Further analysis of the data was performed to investigate why such a naive model can achieve such a high level of prediction performance. As it turned out, in the dataset used, 95% of the topics had the same popularity class in consecutive time windows. More importantly, approximately 73% of the popular topics remained popular in the next time window. This is a property of the data itself and enables the baseline model of OBI4wan to perform well.

5.3 Machine Learning

The aim is to use the features of I(T, n) for each topic T and time window n in order to predict the popularity of I(T, n+1). The task falls in the category of short term prediction and consists of several steps.

For the classification purposes of this project, a variety of machine learning models were used: Logistic Regression, Decision Tree, Random Forest, two Support Vector Machines (with Linear and RBF kernels) and an Artificial Neural Network (ANN). A grid search was set up for each model in order to determine the best hyperparameters for this case and ensure optimal tuning. The hyperparameters tested for each method are analyzed further in each model's section. The final choice of hyperparameters for each model was based on the F1 score of the model's performance on the test data.

As mentioned before, 10-fold cross-validation was implemented for all models, serving the dual purpose of evaluating the performance of each hyperparameter combination and ensuring that each model generalizes well during training time, i.e. does not overfit.

Last but not least, all models were tested with three variations of data normalization: unnormalized data, standardized scaling (zero mean, unit variance), and min-max scaling (0-1 scaling).
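As an illustration of this setup, the sketch below wires one of the classifiers (Logistic Regression, whose only major hyperparameter C is discussed next) into a scikit-learn GridSearchCV with the three normalization variants; the parameter grid and pipeline layout are assumptions, not the exact experiment code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One pipeline per normalization variant; "passthrough" keeps the data unnormalized
for scaler in ("passthrough", StandardScaler(), MinMaxScaler()):
    pipe = Pipeline([("scale", scaler), ("clf", LogisticRegression(max_iter=1000))])
    grid = GridSearchCV(
        pipe,
        param_grid={"clf__C": [0.01, 0.1, 1, 10, 100]},
        scoring="f1",
        cv=cv,  # the 10-fold stratified splitter defined earlier
        n_jobs=-1,
    )
    grid.fit(X_train, y_train)
    print(scaler, grid.best_params_, round(grid.best_score_, 3))
```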

5.3.1 Logistic Regression. Logistic Regression is a common yet simple approach to binary classification problems, whose output is the probability that a given input point belongs to a certain class. It only includes one major hyperparameter, which means it trains and fine-tunes very fast. Exploring this hyperparameter produced only slight differences in the results, which is natural, as Logistic Regression cannot change the classification model much if the data is not linearly separable. The hyperparameter in question is:

• C: Inverse of regularization strength

The optimal value of C for the Logistic Regression classifier proved to be C=100 for the standardized and MinMax-transformed dataset (approximately equal performance).

5.3.2 Decision Tree. Decision Trees are another popular option, providing an effective classification that is transparent, in the sense that we can see how each decision/classification is made, thus offering interpretability of the model, should the task require it. Explored hyperparameters include:

• min_samples_split: The minimum number of samples required to split an internal node

• max_depth: The maximum depth of the tree

• min_samples_leaf: The minimum number of samples required to be at a leaf node

• criterion: The function to measure the quality of a split

Performance did not vary with different normalization techniques on this classifier, as expected due to the classifier's nature. The best performing hyperparameters were min_samples_split=2, min_samples_leaf=100, max_depth=20 and criterion='entropy', when the data is MinMax normalized.

5.3.3 Random Forest. Random Forest, literally being an ensemble of Decision Trees, is most commonly used to combat the overfitting that Decision Trees are prone to, depending on the task at hand. In our case, we did not have any convincing signs of overfitting, but the potential of a well-tuned Random Forest classifier is considered solid, especially given the time and effort it needs to train and tune in comparison to more complex machine learning approaches, such as neural networks.

A list of explored hyperparameters includes:

• n_estimators: The number of trees of the Random Forest Classifier

• max_features: The number of features to consider when looking for the best split

• max_depth: The maximum depth of each tree

• min_samples_leaf: The minimum number of samples required to be at a leaf node

• criterion: The function to measure the quality of a split

The optimal values for the Random Forest classifier are n_estimators=50, max_features=None, max_depth=5, min_samples_leaf=50 and criterion='gini', when the data was MinMax normalized.

5.3.4 SVM (Linear kernel). Support Vector Machines are based on the idea of finding a hyperplane that best divides a dataset into classes and are widely used in classification tasks. On the other hand, they are not suited to large datasets, since their training time can be too high considering our 1.2M rows of training data. Normally, SVMs are trained with 10-20K rows of data at most, due to their quadratic complexity [29].

Luckily, there is an optimized Scikit Learn implementation of a linear SVM classifier, based on liblinear, which scales better with large datasets, enabling us to use a bigger portion of our data to train and tune it. A new subset of 100K rows was created in a 50-50 manner (to address the imbalance issue, as an alternative to applying class weights, as supported by [38] and [5]), by downsampling both classes. Other ratios and sizes were also evaluated and proved to perform worse. In terms of hyperparameters, SVMs with Linear kernels only have one:

• C: Penalty parameter C of the error term

The best performing value was C=128, for the MinMax normalized data.
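A sketch of the balanced downsampling and the liblinear-based classifier follows, with the sample sizes taken from the description above and everything else assumed.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Balanced ~100K-row subset: downsample both classes to 50K rows each
train = pd.concat([X_train, y_train.rename("label")], axis=1)
balanced = (
    train.groupby("label", group_keys=False)
         .apply(lambda g: g.sample(min(50_000, len(g)), random_state=42))
)
X_sub, y_sub = balanced.drop(columns="label"), balanced["label"]

# liblinear-based linear SVM scales far better than the kernelized SVC
scaler = MinMaxScaler().fit(X_sub)
svm = LinearSVC(C=128, max_iter=10_000)
svm.fit(scaler.transform(X_sub), y_sub)
```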

5.3.5 SVM (RBF kernel). The downsampling had to be more aggressive during the training of the RBF kernel SVM, due to its quadratic complexity with regard to the data size, as mentioned earlier. For this reason, a new subset of 40K rows was created, again in a 50-50 manner, by downsampling both classes. Using the RBF kernel thus requires a much smaller training set, though it allows exploring slightly more hyperparameters than the Linear SVM. Particularly:

• C: Penalty parameter C of the error term

• gamma: Kernel coefficient for ‘rbf’

The best performing values were C=32 and gamma=1 for the MinMax normalized data.

5.3.6 ANN. Artificial Neural Networks are considered to be superior in terms of potential when compared to the other classification methods we tested, due to the fact that they introduce nonlinearities through their activations, which can be used to fit more complex patterns in the data. This does not guarantee that they perform better on relatively "simple" tasks, though, and it is also important to note that ANNs suffer from zero interpretability, which is not a problem for this task.

Several experiments were conducted in terms of architectures in order to decide which one is the most appropriate for the current task; the best performing one is shown in figure 10 in the Appendix.

The ANN’s hidden layers consist exclusively of Dense Layers with ReLU activations. ANNs are known to be highly flexible in terms of hyperparameter tuning, as they include multiple tweakable parameters to experiment with. In our experiments, we explored the following:

• batch_size: Number of samples per gradient update

• epochs: Number of passes of the whole training set through the ANN

• init: The Dense layers' initialization method

• optimizer: Algorithm that handles the learning rate of SGD during training

• class_weight: Used for weighting the loss function in favor of the minority class, in cases of great class imbalance

As for the initialization parameter, the potential of glorot_uniform was stressed in [41], [18], while differences between ’uniform’ and ’gaussian’ initialization proved to be minimal, as suggested in [35]. That motivated the initialization hyperparameter exploration in this project. As for the class weights, although they make sense for such imbalanced datasets, they did not improve performance; on the contrary, they sometimes reduced it significantly. Notably, the ANN was the only classifier that was not tested with unnormalized data, as it is believed to either fail to converge or to converge too slowly in that case [30], [36], which means that performance with normalized data is most likely better.

The best hyperparameters for the ANN turned out to be batch_size=96, epochs=24, init=normal and optimizer=rmsprop, scoring equally in both normalization approaches.
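A minimal Keras sketch of such a Dense/ReLU network follows; the layer sizes are illustrative (Figure 10 in the Appendix shows the actual architecture), and the scaling step, the initializer alias and the training matrix names are assumptions.

```python
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

X_train_scaled = StandardScaler().fit_transform(X_train)

# Dense/ReLU hidden layers with a sigmoid output for the binary popularity label
model = Sequential([
    Input(shape=(X_train_scaled.shape[1],)),
    Dense(64, activation="relu", kernel_initializer="random_normal"),
    Dense(32, activation="relu", kernel_initializer="random_normal"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

# Best-performing settings reported above: batch_size=96, epochs=24
model.fit(X_train_scaled, y_train, batch_size=96, epochs=24, validation_split=0.1)
```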

5.4 Results & Conclusion

Most classifiers performed equally well on the majority class, but performance on the minority class varies, according to the experiments conducted, mainly depending on the combination of classifier and normalization method. For example, SVMs seem to perform significantly better when the data was scaled in a MinMax manner in comparison to data normalized in other ways, while hyperparameter tuning proved to yield tangible improvements only on the ANN classifier.

Fortunately, performance on the test dataset was on par with (if not slightly higher than) the train/validation performance in all models, which indicates that the models generalized well, thus there was no overfitting, which eliminated the need for regularization techniques.

An exhaustive table of results is provided in the Appendix, both for the popular class (table 4) and the unpopular class (table 5), in an F1 (Precision/Recall) format.

As can be seen from table 4 in the Appendix, the best performing classifiers tie for the best performance of 0.75 in terms of F1 score. Among those, the most balanced classifier seems to be the Linear SVM when trained with MinMax normalized data. The best Precision is achieved by Logistic Regression when trained with unnormalized data, which reaches a score of 0.92, though it scores poorly in terms of Recall, only 0.42. As for the best Recall achieved, a clear winner is the SVM with the RBF kernel when trained on the standardized dataset. On the other hand, it performs very poorly in terms of Precision, scoring just 0.49.

In section 5.1.6, we showed several strong correlations within our feature set, which in other tasks would constitute a problem. For example, in tasks where interpretability of features is important, strongly correlated features make it difficult to determine which features are more significant, and as a result, some of them would have to be discarded. In this case, though, the ultimate goal is maximizing classification performance. Through trial and error it became clear that dropping any feature reduced performance by 3-17%, which makes it clear that it is in the best interest of this project to preserve them all.

Although several models, normalization techniques and model-specific hyperparameters were examined in order to aim for significant performance gains, the results of the research were on par with the baseline model's performance. This comes down to a combination of two reasons. First, as mentioned before, the high correlation between a topic's popularity in consecutive time windows is an inherent property of the data, which enables a naive approach that considers only recent popularity to score very well. Second, the popularity of an issue is solely defined by the number of tweets. Should popularity be a more complex concept in terms of definition, the gap between the baseline model's and the tested machine learning models' performance could be wider.

6 DISCUSSION

Nevertheless, given the specific task, there are always more experiments to try, which were not covered within the limited time scope of the project but would be particularly interesting as future work.

First of all, the addition of more features could be the most promising approach towards a significant performance boost. For example, as far as content features are concerned, word embeddings derived from the text of the tweets (possibly weighted by tf-idf) are a promising approach which has shown serious potential in past papers [16], [20], [19]. Unfortunately, within the timespan of this project, this was not possible due to issues with data availability. Secondly, it should also be noted that, because labeling is solely defined by tweet volume, the true potential of complex/nonlinear classifiers cannot show, which, in the experiments conducted, is confirmed by the similar performance of the ANN compared to the linear classifiers. Should popularity be co-defined by other parameters, the performance of the nonlinear classifiers could indeed be superior.

In contrast, since not all topics are discussed online to the same extent, a fair approach would be to find a way to compare against previous topic volumes, based on historical data. For instance, mean values of topic volumes over a period of several days can be stored in the data, and the deviation from the mean can be used as a predictive feature. Additionally, individual thresholds can be set for each topic in terms of tweet volume (so that popularity is defined differently for each topic), and one can make predictions on topics discussed more often than average. Should this information be available in the data, the predictions of each classifier would be more valuable to real-life applications.

Throughout the training of some specific models, such as the SVM (with both kernels), which do not scale well to large volumes of data, random stratified undersampling of the majority class was used as a way to reduce the data size while preserving the qualities of the existing classes as much as possible. Another technique that serves the same purpose is oversampling of the minority class, and particularly synthetic oversampling (such as SMOTE) [10], [11]. It also makes sense to try it for the training of the models that make use of the whole training set (Decision Tree, Logistic Regression, Random Forest, ANN), as a means to fight class imbalance, and determine whether it improves performance. Despite the task-specific nature of model selection, it is generally believed that ANNs provide the state-of-the-art approach in terms of machine learning, making the time spent on them worthwhile. Although multiple ANN architectures were explored and evaluated, dedicating additional time to experiments could still provide better results.
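As an illustration of the SMOTE suggestion above, a hedged sketch using the imbalanced-learn library (which was not part of this thesis) might look as follows; oversampling is placed inside a pipeline so that synthetic samples never leak into the validation folds.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),           # oversample the minority (popular) class
    ("clf", LogisticRegression(max_iter=1000)),  # any of the full-training-set models
])
scores = cross_val_score(pipe, X_train, y_train, scoring="f1", cv=10)
print(scores.mean())
```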

Other potential approaches include different data transformations (e.g. log, Box-Cox) and possibly some additional classifiers, such as Gradient Boosted Trees or classifier ensembles. On the other hand, due to the high correlations between issue features demonstrated in this paper, algorithms such as Naive Bayes (which assume independence among features) should not be used. Another idea would be to try dimensionality reduction such as Principal Component Analysis (especially since the features were so correlated), or t-SNE.

Last but not least, the features used in the scope of this project can be used for social media platforms other than Twitter. In future work, the described methods could be applied to Facebook data (and data from other social media platforms) and the results compared.

REFERENCES

[1] Sitaram Asur and Bernardo A Huberman. 2010. Predicting the future with social media. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01. IEEE Computer Society, 492–499.

[2] Sitaram Asur, Bernardo A Huberman, Gabor Szabo, and Chunyan Wang. 2011. Trends in social media: persistence and decay. In ICWSM.

[3] Jonah Berger and Katherine L Milkman. 2012. What makes online content viral? Journal of Marketing Research 49, 2 (2012), 192–205.

[4] Jonah Berger and Katherine L Milkman. 2012. What makes online content viral? Journal of Marketing Research 49, 2 (2012), 192–205.

[5] Chris Carter and Jason Catlett. 1987. Assessing credit card applications using machine learning. IEEE Expert 2, 3 (1987), 71–79.

[6] Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and P Krishna Gummadi. 2010. Measuring user influence in Twitter: The million follower fallacy. ICWSM 10, 10-17 (2010), 30.

[7] Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, and Sue Moon. 2007. I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. ACM, 1–14.

[8] Matt Chapman. 2017. A Meta-Analysis of Metrics for Change Point Detection Algorithms. (2017). http://www.buzzcapture.com

[9] Coursera. [n. d.]. Machine Learning for Data Analysis - Validation vs Cross Validation. https://www.coursera.org/learn/machine-learning-data-analysis/lecture/lEcpf/validation-and-cross-validation

[10] Liliya Demidova and Irina Klyueva. 2017. Improving the Classification Quality of the SVM Classifier for the Imbalanced Datasets on the Base of Ideas the SMOTE Algorithm. In ITM Web of Conferences, Vol. 10. EDP Sciences, 02002.

[11] T Elhassan, M Aljurf, F Al-Mohanna, and M Shoukri. 2016. Classification of Imbalance Data using Tomek Link (T-Link) Combined with Random Undersampling (RUS) as a Data Reduction Method. Journal of Informatics and Data Mining 1, 2 (2016).

[12] Flavio Figueiredo. 2013. On the prediction of popularity of trends and hits for user generated videos. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM, 741–746.

[13] Fabio Franch. 2013. (Wisdom of the Crowds)2: 2010 UK election prediction with social media. Journal of Information Technology & Politics 10, 1 (2013), 57–71.

[14] Francesco Gelli, Tiberio Uricchio, Marco Bertini, Alberto Del Bimbo, and Shih-Fu Chang. 2015. Image popularity prediction in social media using sentiment and context features. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 907–910.

[15] Prashant Gupta. [n. d.]. Cross-Validation in Machine Learning. https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f

[16] Liangjie Hong, Ovidiu Dan, and Brian D Davison. 2011. Predicting popular messages in Twitter. In Proceedings of the 20th International Conference Companion on World Wide Web. ACM, 57–58.

[17] Salman Jamali and Huzefa Rangwala. 2009. Digging Digg: Comment mining, popularity prediction, and social network analysis. In Web Information Systems and Mining (WISM 2009). IEEE, 32–38.

[18] Andy L Jones. [n. d.]. An Explanation of Xavier Initialization. http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

[19] Aditya Khosla, Atish Das Sarma, and Raffay Hamid. 2014. What makes an image popular?. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 867–876.

[20] Himabindu Lakkaraju and Jitendra Ajmera. 2011. Attention prediction on social media brand pages. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2157–2160.

[21] Niels Buus Lassen, Lisbeth la Cour, and Ravi Vatrapu. 2017. Predictive analytics with social media data. The SAGE Handbook of Social Media Research Methods (2017), 328.

[22] Scott Lenser and Manuela Veloso. 2005. Non-parametric time series classification. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005). IEEE, 3918–3923.

[23] Kristina Lerman and Tad Hogg. 2010. Using a model of social dynamics to predict popularity of news. In Proceedings of the 19th International Conference on World Wide Web. ACM, 621–630.

[24] Cheng-Te Li, Man-Kwan Shan, Shih-Hong Jheng, and Kuan-Ching Chou. 2016. Exploiting concept drift to predict popularity of social multimedia in microblogs. Information Sciences 339 (2016), 310–331.

[25] Zongyang Ma, Aixin Sun, and Gao Cong. 2012. Will This #Hashtag Be Popular Tomorrow? (2012), 1173–1174. https://doi.org/10.1145/2348283.2348525

[26] Philip J. McParlane, Yashar Moshfeghi, and Joemon M. Jose. 2014. "Nobody Comes Here Anymore, It's Too Crowded"; Predicting Image Popularity on Flickr. In Proceedings of the International Conference on Multimedia Retrieval (ICMR '14). ACM, New York, NY, USA, Article 385, 7 pages. https://doi.org/10.1145/2578726.2578776

[27] NLTK. [n. d.]. Natural Language Toolkit. https://www.nltk.org/

[28] pattern.en. [n. d.]. Sentiment. https://www.clips.uantwerpen.be/pages/pattern-en#sentiment

[29] Quora. [n. d.]. What is the computational complexity of an SVM? https://www.quora.com/What-is-the-computational-complexity-of-an-SVM

[30] ResearchGate. [n. d.]. Do Neural Networks work with Non-normalized Data? https://www.researchgate.net/post/Do_Neural_Networks_work_with_Non-normalized_Data

[31] Matthew J Salganik, Peter Sheridan Dodds, and Duncan J Watts. 2006. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311, 5762 (2006), 854–856.

[32] Scikit-Learn. [n. d.]. Cross-validation: evaluating estimator performance. http://scikit-learn.org/stable/modules/cross_validation.html

[33] Seaborn. [n. d.]. Seaborn Python Plotting Library. https://seaborn.pydata.org/

[34] Hashim Sharif, Saad Ismail, Shehroze Farooqi, Mohammad Taha Khan, Muhammad Ali Gulzar, Hasnain Lakhani, Fareed Zaffar, and Ahmed Abbasi. 2015. A Classification Based Framework to Predict Viral Threads. In PACIS. 134.

[35] StackExchange. [n. d.]. When to use (He or Glorot) normal initialization over uniform init? And what are its effects with Batch Normalization? https://datascience.stackexchange.com/questions/13061/when-to-use-he-or-glorot-normal-initialization-over-uniform-init-and-what-are

[36] StackOverflow. [n. d.]. Why do we have to normalize the input for an artificial neural network? https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network

[37] Gabor Szabo and Bernardo A. Huberman. 2010. Predicting the Popularity of Online Content. Commun. ACM 53, 8 (Aug. 2010), 80–88. https://doi.org/10.1145/1787234.1787254

[38] Analytics Vidhya. [n. d.]. How to handle Imbalanced Classification Problems in machine learning? https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/

[39] Lilian Weng, Filippo Menczer, and Yong-Yeol Ahn. 2013. Virality prediction and community structure in social networks. Scientific Reports 3 (2013), 2522.

[40] Wikipedia. [n. d.]. Pareto Principle. https://en.wikipedia.org/wiki/Pareto_principle

[41] Mehrdad Yazdani. [n. d.]. What is an intuitive explanation of the Xavier Initialization for Deep Neural Networks? https://www.quora.com/What-is-an-intuitive-explanation-of-the-Xavier-Initialization-for-Deep-Neural-Networks


7 APPENDIX

Normalization | Logistic Regression | Decision Tree | Random Forest | SVM (Linear kernel) | SVM (RBF kernel) | ANN
Unnormalized Data | 0.71 (0.87/0.59) | 0.71 (0.87/0.59) | 0.74 (0.81/0.68) | 0.74 (0.82/0.67) | 0.39 (0.25/0.94) | -
Standardized Data | 0.70 (0.88/0.58) | 0.70 (0.88/0.58) | 0.74 (0.81/0.67) | 0.74 (0.83/0.67) | 0.61 (0.38/0.83) | 0.75 (0.84/0.67)
MinMax Normalization | 0.73 (0.84/0.64) | 0.74 (0.87/0.71) | 0.75 (0.78/0.72) | 0.75 (0.75/0.74) | 0.61 (0.47/0.89) | 0.74 (0.83/0.66)

Table 4: Table of Results for the Popular Class, in F1 (Precision/Recall) format

Normalization | Logistic Regression | Decision Tree | Random Forest | SVM (Linear kernel) | SVM (RBF kernel) | ANN
Unnormalized Data | 0.97 (0.96/0.99) | 0.97 (0.98/0.97) | 0.98 (0.98/0.98) | 0.97 (0.96/0.99) | 0.83 (0.99/0.71) | -
Standardized Data | 0.97 (0.96/0.99) | 0.97 (0.98/0.97) | 0.98 (0.97/0.99) | 0.97 (0.96/0.99) | 0.94 (0.98/0.91) | 0.98 (0.97/0.99)
MinMax Normalization | 0.97 (0.96/0.99) | 0.97 (0.98/0.97) | 0.97 (0.98/0.97) | 0.97 (0.96/0.99) | 0.94 (0.99/0.89) | 0.98 (0.96/0.99)

Table 5: Table of Results for the Unpopular Class, in F1 (Precision/Recall) format

Figure 10: Neural Network Architecture
