Forecasting the success of game-apps based on reviews

(1)

University Of Amsterdam

Master Thesis

Forecasting The Success Of

Game-Apps Based On Reviews

Author:

Vuk Gliˇsovi´

c

Supervisor:

Prof. Dr. Marcel

Worring

July 28, 2016

(2)

Verklaring eigen werk

Hierbij verklaar ik, Vuk Glisovic, dat ik deze scriptie zelf geschreven heb en dat ik de volledige verantwoordelijkheid op me neem voor de inhoud ervan. Ik bevestig dat de tekst en het werk dat in deze scriptie gepresenteerd wordt origineel is en dat ik geen gebruik heb gemaakt van andere bronnen dan die welke in de tekst en in de referenties worden genoemd.

De Faculteit Economie en Bedrijfskunde is alleen verantwoordelijk voor de begeleiding tot het inleveren van de scriptie, niet voor de inhoud.

(3)

1 Introduction

Nowadays, a huge amount of apps are available for download in the app store on your mobile phone. Some of these apps become very popular and are used a lot, whereas other apps hardly get any attention and don’t become popular at all. Our objective is to predict whether an app will become popular soon after it’s released in the app stores or not. Manual assessment of the popularity of a single app may be extremely difficult if not impossible. A certain level of expertise and statistics would be needed for correct manual prediction of popularity. Moreover, the number of apps at the moment is huge (and this number is still growing) which makes manual assessment of all these apps infeasible. Therefore, the need for a model that is able to predict the popularity of an app grows.

An app on the app store comes with various kinds of data; for example the image of the app, the description of an app, reviews that are given, the number of installs, the average rating and more. The form of data that will be used in this thesis, is textual data with average ratings. More specifically, the focus will lie on the reviews that are given by customers. Simply said, we would like to predict the success of an app based on its early reviews. Certainly a set of automatic tools are needed to analyze the reviews. Various techniques are already available in processing language and predicting classes. In particular, LDA [D. M. Blei et al (2003)] and deep learning methodologies [C. N. dos Santos et al (2014)] and [T. Mikolov et al. (2013)] appear most promising. Another relatively new method, is the global vector representa-tion for words (GloVe) method [J. Pennington et al (2014)], which will also be used to extract features. Another important tool, will be sentiment anal-ysis [D. Tang et al (2014)] and [J. Pavlopoulos et al (2014)]. It is obviously important to extract the intention of words that are used in the reviews of an app. In short: contextual linguistics will be important. Classification based on contextual linguistics is an area in machine learning that has made enormous progress over the past years.

After the feature extraction is done, a (supervised) classification method will be applied. For instance a neural network, a support vector machine or a logistic regression. The goal of this thesis is automatic prediction of the pop-ularity (star ratings) of apps. In order to approach the problem accurately, the following research question is formed.

(6)

“How do LDA and deep learning based models for the analysis of reviews perform, when learning to predict the star rating of a game application?”

Based on the predicted star ratings (ranging from 1 to 5), we assess the accuracy/correctness of a model. Several other questions arise while thinking about this matter, as it is obviously possible that some models perform better under different circumstances. Concrete questions are:

• “How many early reviews should be taken into consideration for pre-diction?”

• “Should we look at a time range for the reviews that should be taken into account or should we look at a particular number of early reviews for prediction?”

The structure of this paper in short is as follows. First relevant literature is summarized in the next section. Afterwards all the methodologies are elaborated and finally experiments are performed and discussed.

2 Literature Review

Research in the field of app popularity is limited. Which is actually quite strange, since it is a market that is expanding quickly and knowledge of how popular an app will be in the future can be extremely useful. This makes it a very interesting subject to examine in great detail.

Previous research is done in the field of app popularity; in [M. Chen et al (2011)] they use a classification and regression tree (CART) approach on data from the iTunes App Store. They attempt to answer two questions: ”1) What makes an application popular? 2) Can we predict the popularity of new and existing applications?”. Even though our main research question in this paper is more technical than the two above, the main goals we both depict are the same; we would like to predict the popularity of an app. [M. Chen et al (2011)] describes a conceptual framework; no experiments with results are given in the paper.

Further research in the field of predicting game popularity doesn’t appear to be present. However quite some research is done in the (more general) field of predicting popularity.

A famous dataset that is used to test whether a model could work or not, is the IMDB dataset. This dataset contains positive and negative reviews

(7)

and the goal is to classify every review correctly as positive or negative. For example, [A. L. Maas et al (2011)] use a mix of unsupervised and supervised learning techniques to classify these reviews. Two other papers describing their approach to this problem, are [J. Hong et al (2015)] and [Q. Le et al (2014)]. Both use a similar approach in the sense that they both attempt to create paragraph vector representations with deep learning techniques. Other papers covering the prediction of popularity, are for instance [D. Demir et al (2012)]. They attempted to predict IMDB movie ratings using Google trends. Each movie is characterized by a set of features where after sup-port vector machines and neural networks are used to classify and predict movie ratings. Another approach to predicting the popularity of a movie, is described in [M. Mesty´an et al (2013)]. They show that the popularity of a movie can be predicted before its release. They measure and analyze the activity level of editors and viewers of a specific movie in Wikipedia. It is stated in the paper, that with a minimalistic predictive model they are able to predict the popularity of a movie. The prediction of popularity can of course also be done on other channels. [G. Szabo et al (2010)] collected data from Digg (http://digg.com) and YouTube (http://youtube.com) and attempted to predict the popularity of the content on these portals. Using a linear model with various time varying features they conducted a regression to predict popularity.

Various challenges on the prediction of popularity are also present. One of these challenges, is the RecSys Challenge in 2014: User Engagement as Evaluation. The goal of the challange was to predict the level of engagemen-t/popularity of tweets generated automatically from IMDB. [L. Peska et al (2014)] describe their approach to this challenge. Their main approach, was using a k-NN (k nearest neighbour) algorithm in order to classify the level of engagement of the IMDB tweets. Another challenge, is the IEEE VAST 2013 Mini Challenge. The goal was to predict the popularity of new movies in terms of viewer ratings and ticket sales. In [M. el Assady et al (2013)] they explain how they tackled this challenge. The data they were allowed to use, was data from IMDB and a predefined set of Twitter microblog messages. They applied machine learning techniques, with a focus on neural networks, to solve this problem.

Before we start digging into all the theory, the structure of this paper will briefly be discussed. First the review feature extraction methods (unsuper-vised learning methods) will be covered: LDA, word2vec and GloVe and

(8)

reviews are preprocessed in two different ways; one preprocessing approach is used for LDA and the other one for word2vec and GloVe. The theory of the word2vec and GloVe methods are quite similar, therefore the same pre-processing approach is used. After the unsupervised learning techniques are covered, feature fusion will be discussed. Then, before we start discussing the experiments, a synthetic sampling method (named ADASYN) will be elaborated on. Then, when all techniques are covered, the experiments are discussed. First the methods covered in the previous sections will be used on a baseline dataset where after these methods are applied on a new game apps dataset. Figure (1) gives an illustrative overview of how all techniques are applied. Note: GloVe is not used on the game apps dataset and ADASYN is not used on the baseline dataset. Why these methods are not used on the specific datasets will be clear later on in this thesis.

Figure 1: Illustrative overview of the different techniques that are used in this thesis.

(9)

3 LDA Feature Extraction

Latent Dirichlet Allocation (LDA) is a generative approach in predicting the success of games. It is a three level hierarchical Bayesian model. In the first subsection, the LDA model will be discussed. After which the implementa-tion of this model in forecasting the success of games will be discussed.

3.1 LDA Explained

3.1.1 Notation

First of all, before the LDA model is explained, we define the following: • A word is a discrete unit from a vocabulary indexed by {1, ..., V }.

Using unit vectors w of length V , the vth _{word in the vocabulary is}

represented such that wv = 1 (the vth entry equals 1) and wu = 0 for u 6= v.

• A document, d, is a sequence of N words with d = {w1, ..., wN}. In

this report, a review will be seen as a document. Keeping this in mind, a sequence of words, d, will be referred to as a review.

• A corpus is a collection of M reviews: D = {d1, ..., dM}.

This notation is useful, to understand the next subsection more easily.

3.1.2 The LDA Model

Before the corpus level objective function is stated, the review and word level probabilities will be elaborated. Basically reviews are represented as random mixtures over latent topics. Each topic in turn is characterised by a distribution over words. For each review d ∈ D, LDA assumes the following generative process:

1. The number of words per review N is fixed (in some cases a random approach is used where N ∼ Poisson(ξ) for each review).

2. A k-dimensional dirichlet distribution for the topic mixture θ: θ ∼ Dir(α).

(10)

4. A multinomial distribution for each word wn given their latent topic zn

and a matrix η: p(wn|zn, η).

First of all, note the subscript n in zn, which may lead to some confusion,

since there are k topics instead of N . Remember: there indeed are k topics, however, for every word a topic has to be picked, which results in N topics. A second thing to note, is that η is a k × V matrix which parameterizes the word probabilities as follows: p(wj _{= 1|z}i _{= 1) = η}

ij. In words: the jth word

in the vocabulary has a probability of ηij to appear in latent topic i. With

the defined properties above, the joint distribution of a topic mixture θ, a set of N topics (z) and a set of N words (d = {w1, ..., wN}) can be formulated

as: p(θ, z, d|α, η) = p(θ|α) | {z } ∼ Dir(α)    N Y n=1 ∼ Multinomial(θ) z }| { p(zn|θ) p(wn|zn, η) | {z } ∼ Multinomial(ηn₎    (1) where ηn _{= ηw}

n (the column in η corresponding to word wn) and where the

curly braces with explanations are included to elaborate every part of the equation (basically to summarize). Now, summing over the topics zn and

integrating over θ yields the marginal distribution of a review:

p(d|α, η) = Z p(θ|α) N Y n=1 k X i=1 p(z_ni|θ)p(wn|zni, η)dθ (2) where p(zi

n|θ) is the probability of topic i being chosen for word wn given θ.

This equals θi_{; the i}th _{component of θ since it’s a multinomial distribution}

with only one element appearing once. In essence, we could write equation (2) as follows: p(d|α, η) = Z p(θ|α) N Y n=1 k X i=1 θiηijndθ (3)

where jn represents the index of word wn in the vocabulary (i.e. wnjn = 1

and wu

n= 0 for u 6= jn). In other words: ηijn = p(wn|z

i

n, η) is the probability

of the jnth word given the ith latent topic. At last, taking the product of

the marginal probabilities of all reviews d1, ..., dM in the corpus, yields the

probability of a corpus: p(D|α, η) = M Y m=1 Z p(θm|α) Nm Y n=1 k X i=1 p(z_mni |θm)p(wmn|zmni , η)dθm (4)

(11)

where wmn represents word n from review m. Obviously this is the objective

function that has to be maximized. The three level hierarchical Bayesian model can now be summarized as follows. First of all, parameters α and η are corpus level parameters. Second, variables θ1, ..., θM are review level

parameters. Third, all the zi

mn and wmn are word level variables. Equation

(4) is illustrated in a probabilistic graphical model in figure (2).

Figure 2: Graphical Representation LDA Model [M. Blei et al (2003)].

3.2 LDA Priors

As can be seen from the previous section, two priors are picked before train-ing the model; a Dir(α) for the topic distribution and a Multinomial(θ) dis-tribution for the words (given the latent topics). Usually, both priors are uniformly distributed. This would mean that it (more or less) is assumed that every topic appears equally likely and every word within each topic also appears equally likely. Of course this need not be the case and most probably is not the case. It would be more natural to assume, that in a set of reviews, certain topics appear more than others.

According to [H. M. Wallach et al (2009)], choosing a non-uniform prior for Dir(α), often improves performance. However, they also state that choosing a non-uniform prior for the Multinomial(θ) distribution is not known to im-prove performance. Intuitively this makes sense: it’s easier for the model to learn the distribution of the words given a latent topic than to initiate a non-uniform prior for the Multinomial(θ) when it is not known which words in which latent topics appear with greater probabilities. When a non-uniform prior would be picked for the word distributions, chances are high these dis-tributions give too high probabilities to certain words in certain topics.

(12)

3.3 Preprocessing the Data

As can be deduced from section 3.1, the LDA approach is basically a bag of words approach. Therefore the order of the words is irrelevant; only the individual words and their meanings are important. In that line of thinking, stopwords will be removed, as they have little (if none) meaning. In addition, words will be stemmed with the Porter stemmer in order for LDA to be able to recognize that words like ’work’ and ’works’ are the same. Finally, emojis (like :-) or :-( and more), question marks and exclamation marks are preserved since they could have additional meaning.

3.4 Generating Features from the LDA Model

The basic idea is, that after training, a certain number of topics is created where every topic has a distribution over the words. The most important parameters that can be set beforehand, are the parameters discussed in the previous sections: the number of topics and the priors α and η. The features per review are then created by concatenating the probabilities of each topic appearing in each review into a vector.

4 Word2vec Feature Extraction

Word2vec is a deep learning method for creating vector representations of words. There are basically two word2vec approaches: continuous skip-gram modeling (SG) or continuous bag of words (CBOW). Both create word vector representations, but with different approaches. The basic idea for skip-gram modeling, is given a word wi, predict the context wi−2, wi−1, wi+1, wi+2 of

this word. Whereas for the continuous bag of words approach, this process is reversed; given the context wi−2, wi−1, wi+1, wi+2, predict word wi. The

skip-gram method usually is more effective, but this depends entirely on the problem. Therefore both methods will be implemented.

4.1 Word2vec Explained

A brief explanation of how the model produces these word vector represen-tations is given in this subsection. Note: this subsection is purely meant to obtain a good intuition in the word2vec model, it doesn’t always elaborate in full detail.

(13)

4.1.1 The Neural Network

The word2vec neural network can most easily be summarized in the following figure. Note that only a single context word as input is used (thus no skip-gram or CBOW yet).

Figure 3: Basic Word2vec Model. Figure adapted from [X. Rong (2014)].

Define V to be the vocabulary size and N to be the hidden layer size (which will be the length of the word vector representations). The network is very simple; with an input layer, just one hidden layer and an output layer. All input vectors are unit vectors; xj = 1 and xj0 = 0 for ∀j0 6= j. In other

words, for a given context/input word, only one of x1, ..., xV equals 1 and the

others 0. Then, since the activation function from the input to the hidden layer is linear, we have AT_{x = h where A is a V × N weight matrix and h}

is a N × 1 vector. Now already the final word vector representations can be seen. Figure (4) illustrates this nicely.

(14)

Figure 4: Obtaining the Word Vector Representations. Adapted from 1_.

So say the word vector representation of the jth _{word in the vocabulary is}

needed; then the jth row of weight matrix A is needed. Basically, only weight matrix A is interesting; the output layer with weight matrix B is discarded after training. However, even though the weight matrix B is discarded, it’s important for obtaining the final weight matrix A. B is an N × V matrix and is used in the softmax activation function from the hidden layer to the output layer. Using the columns of B, a score uj = bwj

T_{h can be created}

for each word in the vocabulary. bwj

T

is the jth column of B. Now we can ’activate’ this value uj with the softmax function:

p(wj|wI) = yj =

exp(uj)

PV

j0₌₁exp(uj0)

where yj is the output of the jth unit in the output layer. In words, this

probability means: given the context/input word wI, what is the probability

of word j in the vocabulary?

Updating of the weight matrices, is done through backpropagation. The update equations are explained thoroughly in [X. Rong (2014)]. One can

(15)

imagine that the computation of this model can be extremely time consum-ing. Solutions exist to speed up the process; for instance hierarchical softmax (also explained in [X. Rong (2014)]) and negative sampling [Y. Goldberg et al. (2014)] are useful methods to reduce computation time. In addition, sub-sampling of frequent words [T. Mikolov et al. (2013)] can be used to speed up the process even more. The two methods and the subsampling of frequent words will briefly be described in the following subsections.

In order to expand the model described above, SG and CBOW are imple-mented as follows. Instead of using just one word as the context of the neural network, a context of size C is used. SG and CBOW use the context in a different way. As already stated before; SG tries to predict the context given a word and CBOW tries to predict a word given the context. This is summarized in figure (5).

(a) Skip Gram (b) CBOW

Figure 5: Both approaches for Word2vec. Figure adapted from [X. Rong (2014)].

(16)

4.1.2 Hierarchical Softmax

The hierarchical softmax yields a substantial decrease in computation time compared to the full softmax. The idea behind the hierarchical softmax, is to create a binary tree to represent all words in the vocabulary. Since it’s a tree, there exists a unique path from every node (a node represents a word) to every other node. More formally; let n(w, j) be the j-th node on the path from the root to node w and let L(w) be the length of this path (thus n(w, 1) = root and n(w, L(w)) = w). In addition, define for any node (not the root) ch(n) to be an arbitrary fixed child of n (for simplicity, it can be assumed ch(n) is the left child node) and define δ(x) to equal 1 if x is true and -1 otherwise. Now, the hierarchical softmax defines the probability of a word w being the output word given the context word wI as:

p(w|wI) = L(w)−1 Y j=1 σ( δ( n(w, j + 1) = ch(n(w, j)) ) · v0n(w,j) T vwI )

where σ(x) = _1+exp(−x)1 is the sigmoid function. The function δ may seem a bit confusing, however, this only makes sure that the probability of going left and the probability of going right at a specific node add up to one. Remember: 1 − σ(x) = σ(−x) or put differently σ(x) + σ(−x) = 1.

In the simple full softmax method, the computation time grows linearly with the size of the vocabulary: O(V ). However, in the hierarchical softmax, with the creation of a binary tree this reduces to O(log(V )) since the computation time is proportional to L(w) (which on average is no greater than log(V )).

4.1.3 Negative Sampling

The objective function in negative sampling is different from the one de-scribed before. The main idea is: considering a word context pair (w, c), did this pair come from the data? Therefore p(D = 1|w, c, θ) is defined as the probability that (w, c) indeed came from the data and p(D = 0|w, c, θ) = 1 − p(D = 1|w, c, θ) (the probability that (w, c) did not come from the data) with θ controlling the distribution. First of all, a set D is created from the corpus, which contains all word context pairs (w, c). Here the goal would be to find θ such that the probability of the observations (words w and the contexts c) coming from the data is maximized:

θmax= arg max θ

Y

(w,c)∈D

(17)

with p(D = 1|w, c; θ) = _1+exp(−v1

c·vw) (the sigmoid function) where vc, vw are

vector representations of respectively the context and the word. However, θ can easily be chosen such that p(D = 1|w, c; θ) ≈ 1 for all (w, c) ∈ D by taking θ such that vc= vw and vc·vw is large. However, to prevent all vectors

from having the same value, secondly, a set D0 is constructed from randomly sampled (w, c) pairs, where it is assumed these randomly generated pairs are incorrect language constructs (for this reason, this method is also called the ’negative sampling’ method). The objective function becomes:

arg max θ Y (w,c)∈D p(D = 1|w, c; θ) Y (w,c)∈D0 p(D = 0|w, c; θ)

which is the objective function for the entire corpus. The speed up comes from the fact that, instead of having to update all output vectors in the model every iteration, only a sample is being updated per iteration. A sample is the positive sample (an existing word context pair from the data) with its K negative samples (usually K is a value between 5 and 20).

4.1.4 Subsampling Of Frequent Words

Words that appear often in a corpus are usually not very informative. Take for example the words ’a’, ’house’ and ’garden’. In SG the context ’garden’ is way more informative for the word ’house’ than the context ’a’, since the word ’a’ appears in the context of almost every word. Put differently, rare words are far more informative than frequent words. In order to adjust for this imbalance of rare and frequent words, a word wi is discarded from the

vocabulary with probability:

p(wi) = 1 −

s t f (wi)

where f (wi) is the number of times a word appears in the corpus (the

fre-quency). t is a sort of threshold value which, often depending on the size of the corpus, is usually chosen between 10−3 and 10−5. The subsampling not only accelerates training, it also improves vector representations of the learned words as there is no noise anymore from the extremely frequent words.

(18)

4.2 Preprocessing the Data

Various preprocessing techniques are used; but the main idea for preprocess-ing here, is to first split the reviews into sentences and then create a wordlist of every sentence. When creating a wordlist, we do not remove stopwords; the subsampling of frequent words method described in section 4.1.4 takes care of frequent words, which means we don’t have to manually select the words we want to filter out. Moreover, we don’t stem the words, since word2vec could be able to capture small subtleties in words. Finally, like in LDA, emojis (like :-) or :-( and more), question marks and exclamation marks are preserved since they could have additional meaning.

4.3 Generating Features from the Word2vec Model

There are various parameters that can be played with to create vector rep-resentations with word2vec. The most important one, is of course whether SG or CBOW is used. Together with the hidden layer size and the context size, they determine the exact architecture of the neural network. Also im-portant, even though it doesn’t affect the neural net architecture, is whether hierarchical softmax or negative sampling is used. These two different speed up methods, may result in different word vectors.

After training, as stated before, the word vectors will have the same length as the size of the hidden layer.

5 GloVe Explained

Another interesting approach in this matter, is the global vector represen-tation for words algorithm (in short: GloVe). The idea is similar to the word2vec method; learn vector representations for words by minimizing a loss function. In this section, the GloVe model will be elaborated.

5.1 The GloVe Model

As in word2vec, the statistics of word (co-)occurrences is the primary infor-mation source that is used in the GloVe algorithm. Basically, the algorithm has two phases:

(19)

1. Create a matrix of word-word co-occurrence counts by passing the text data once

2. Minimize a specific loss function with a gradient descent algorithm

Both phases will be covered now.

5.1.1 Matrix of Word Co-occurrences

This subsection mainly covers some definitions. Define X to be the matrix that contains the word-word co-occurrence counts. Let entry Xij denote the

number of times word j occurs in the context of word i and define Xi =

PV

k=1Xik where V is the vocabulary size. Now the probability of word j

appearing in the context of word i can be defined as Pij = P (j|i) = Xij/Xi

which is nothing more than the number of times word j occurred in the context of word i divided by the number of words that appeared in the context of word i. The context discussed in this section, is exactly the same context as in the word2vec model.

5.1.2 The Loss Function

The starting point for the loss function, are the ratios of the probabilities: Pik/Pjk. In [J. Pennington et al (2014)] they explain why this is the starting

point for this algorithm. Every ratio depends on three words: i, j and k. With word vectors wi, wj andwek, the general form of the model is represented as follows:

F (wi, wj,wek) = Pik

Pjk

(5)

with function F a yet to be defined function. Note that the vectors w and w_e are two different sets of vectors (just like in word2vec where there were two matrices A and B with word vectors). This means every word is represented by two vectors.

Various functions can be chosen for F . To keep it short ([J. Pennington et al (2014)] contains the derivation), a function that can be chosen for F , is the exponential function. After a drastic simplification, this results in:

wiTwek+ bi+ ebk = log(Xik) (6) where bi is a bias term for word i and ebk a bias term for word k (again

(20)

even dropped out of the equation. Nevertheless this function can be used, to construct a loss function. The loss function can be casted into a weighted least squares problem using equation (6) and a weight function f (Xij):

L = V X i=1 V X k=1 f (Xik) ∗ (wiTwek+ bi+ ebk− log(Xik)) 2 ₍₇₎

The weighting function should be a non-decreasing function which goes to zero as the number of word-word co-occurrences goes to zero. The function that is used in [J. Pennington et al (2014)] is:

f (x) = (

(x/xmax)α if x ≤ xmax

1 otherwise (8)

where the idea is to cut of extremely frequent co-occurrences by xmax to not

overweight these co-occurrences.

5.2 Preprocessing the Data

Since the main statistic is counting word co-occurrences while looking at a certain context window, as is basically done in word2vec, the preprocessing can be done completely analogically as in the word2vec case (see section 4.2).

5.3 Generating Features from the GloVe Model

The features that are created in the GloVe model, are the sets of word vectors w and w. Therefore every word basically has two vectors. [J. Pennington_e et al (2014)] argue that in order to reduce overfitting and noise, the sum of these vectors should be taken. In other words, the feature vector for word i will be wi+wei.

6 Fusion Of Features

This section is inspired by [C. G.M. Snoek et al (2005)]. The basic concept is simple: combine different sorts of features in order to predict with these combined features. Two types of fusions methods are discussed in [C. G.M. Snoek et al (2005)]; early and late fusion. Both approaches are summarized

(21)

Figure 6: Early Fusion

Figure 7: Late Fusion

visually in figures (6) and (7) (specifically for the problem adressed in this paper). Only LDA and word2vec are used in the fusion methods.

Figure (6) illustrates how first the unsupervised learning methods are used to obtain features. The obtained features from word2vec and LDA are then concatenated which happens in the (early) fusion stage. Finally a supervised learning method is used to classify these combined features.

Late fusion, in figure (7), basically includes another supervised learning stage after the word2vec and LDA feature extraction for both of the extracted features separately. After this supervised learning stage for both word2vec and LDA, new features can be created from the predictions of the models in both supervised learning stages. Specifically for the prediction of star ratings of game apps, which is basically a ranking of the number of stars for an app, these rankings for the star ratings of apps (or the probabilities for the app star ratings) can be used as new features.

(22)

7 Synthetic Sampling

In many practical applications, the data that is used, is unbalanced. This means that some classes don’t contain a lot of data, which makes it hard for the models to recognize these classes. In order to correct for these classes that are under-represented, adaptive synthetic sampling for imbalanced learning (ADASYN) is used [H. He et al (2008)]. In the paper, they discuss the problem of unbalanced data for binary classification. However, as is also stated in their paper, this can easily be generalized to more classes. In this section, the ADASYN algorithm for binary classification is covered. The steps in the algorithm can most easily be summarized as follows.

1. From the labelled data {xi, yi}, determine the minority class, denoted

by ms, and the majority class, denoted by ml.

2. For each xi ∈ ms, determine the K nearest neighbours based on the

Euclidean distance. With this information, calculate ri = ξi/K where

ξi is the number of examples in the K nearest neighbours of xi that

belong to the majority class ml.

3. Normalize ri in the most straightforward way: rbi = ri/ P|ms|

i=1 ri where

|ms| denotes the number of elements in ms. Now, the number of

syn-thetic data points that will be generated for each xi, equals gi =bri∗ G where G is the total number of synthetic data points that will be gen-erated.

Simply said, the more majority class data points are in the neighbour-hood of xi, the more synthetic data will be generated for xi.

4. At this point, it is known how many data points will be generated for each xi ∈ ms. The only question that remains, is how to generate these

data points. First, define Li to be the set of K nearest neighbours of

xi.

Now, for each xi ∈ ms generate gi data points. For each of these gi (to

be generated) data points, randomly pick one point from LiT ms and

call this point xki (in words: randomly pick a data point, xki, which is

a nearest neighbour of xi and is in the minority set). Then generate:

si = xi+ (xki − xi) ∗ λ where λ is a random number from the interval

(23)

The parameters that can be set in this algorithm, are first of all the number of nearest neighbours K and second the number of to be generated data examples G.

To extend this algorithm to be fit for multi-class imbalanced data sets, there is not a lot that has to be changed in the algorithm described above. In step 1, instead of determining one minority class, multiple minority classes can be chosen to synthesize data for. The majority class remains the class with the largest number of examples. Then, for each class go through steps 2 till 4. The only thing that needs to be adjusted slightly, is the definition of ξi. Instead of it being the number of examples from the majority class, it’s

the number of examples of all the other classes but the minority class where currently points are synthesized for.

Synthesizing data for unbalanced data sets is known to improve performance of classification algorithms. In [H. He et al (2008)] they applied this algorithm with success. The classification accuracy increased with 10%.

8 Experiments

First, the described methods are tested on a baseline dataset in order to validate whether these methods work at all. The baseline dataset that will be used for this validation, is the IMDB movie review dataset. Afterwards, these methods will be applied to the game apps dataset in section 8.2.

8.1 IMDB Reviews Classification

The IMDB movie review dataset is a very popular dataset. Many have used this dataset to train and test their model as for instance [J. Hong et al. (2015)] and [Q. Le et al (2014)] have done. They obtained optimal accuracies of respectively 94.5% and 92.6% while they both used vector representations for paragraphs in a word2vec set-up. For the purpose of this paper both LDA features and word2vec features have been used to train various classifiers.

8.1.1 Data Description

The IMDB data set is an easy to work with data set considering movie reviews. It contains a training set with 25.000 reviews; from which 12.500 are labeled as positive reviews and the other 12.500 are labeled as negative

(24)

reviews. The labeling is done manually. The test set also contains 25.000 reviews, where again half of the reviews are labeled positive and half of the reviews negative. In addition, an extra 50.000 unlabeled reviews are included, which is useful for the unsupervised training.

8.1.2 Results

For the LDA features, a 150 topic model was used with a non-uniform α and a uniform η. Afterwards, the LDA features were classified using neural net-works (NN), support vector machines (SVM) and logistic regressions (LR). The results are summarized below. RBF is a radial basis function kernel which is able to capture nonlinearities. The metric that is used:

• Accuracy Percentage: the percentage of positive and negative test re-views that are correctly classified

Table 1: Classification with LDA Features.

Accuracy Percentage

NN Linear SVM RBF SVM LR 82.3 83.1 80.1 83.1

For the word2vec features CBOW and skip-gram (SG) were both used with negative sampling (NS) (where the number of negative samples was set to 5) and hierarchical softmax (HS). The size of the hidden layer was set to 300 and a context size of 10 was used. In order to create a feature vector per review, the feature vectors of the words in a review are averaged. The various results are listed below. The nonlinear SVM is left out of the results as it was always outperformed by the other classifiers. Perhaps RBF SVM is overfitting the data.

(25)

Table 2: Classification with Word2vec Features. Accuracy Percentage NN Linear SVM LR SG & NS 84.6 87.6 86.0 SG & HS 84.8 87.2 85.8 CBOW & NS 85.7 86.6 86.5 CBOW & HS 85.6 87.2 86.0

In this part of the analysis, both an early and a late fusion of the LDA and word2vec features are applied. As indicated in section 6, for the early fusion the word2vec and LDA features were simply concatenated. For the late fusion, the probability of a positive review from both the word2vec and LDA model is used. Note that the probability of a negative review is completely determined by the probability of a positive review for both models. It is therefore unnecessary to include the probability of a negative review from both models. Since late fusion didn’t improve performance, it was left out of the results. Early fusion on the other hand did improve performance slightly.

Table 3: Classification with Early Fusion of LDA and Word2vec Features.

Accuracy Percentage NN Linear SVM LR LDA + SG & NS 87.0 88.1 86.9 LDA + SG & HS 86.9 88.1 86.5 LDA + CBOW & NS 86.7 87.7 87.6 LDA + CBOW & HS 86.5 87.6 87.7

Studying the affect of the length of the feature vectors, the number of features created by word2vec (SG & NS) and LDA were respectively increased to 500 and 300. An early fusion of these features with an SVM classifier resulted in an accuracy of 88.4%. Increasing the number of features even more is not improving the accuracy. Using the early fusion model with 150 topics and a hidden layer size of 300, is a nice compromise between computational burden and performance.

(26)

following parameter values were chosen: xmax = 100, α = 0.75, the size of

the word vectors are 500, a context window of 20 was used and training was done with a 100 iterations. The prediction accuracy on the test set was 52%, which is basically randomly guessing positive or negative. Noticeable though, was that the prediction accuracy of the training set, was 84.3%, which clearly failed to generalize to the test set.

8.1.3 Conclusions

It is clear the LDA features are outperformed by the word2vec features in every classification approach. However, combining the LDA and word2vec features in an early fusion approach, improves accuracy. This results in an accuracy of 88.1%. This accuracy can be boosted to 88.4% by increasing the number of features for both word2vec and LDA.

All three classifiers (so no RBF SVM) perform quite well with their top performances being NN: 87.0%, linear SVM: 88.1% and LR: 87.7%. Note how a linear SVM always outperforms a neural network for this task. Comparing the results with the results of [J. Hong et al.] and [Q. Le et al (2014)], the early fusion method obtains accuracies close to their accura-cies. However, fact is that the early fusion method is outperformed by the paragraph vector representations for this task.

A note has to be made for the GloVe algorithm approach. On the github repository (on python-glove2_{) they clearly state that the algorithm could}

contain bugs, which could obviously bungle performance. Comparing for instance the most similar words to a word within GloVe and word2vec sepa-rately, it is no surprise that word2vec performs better. Take for example the word ’good’: using the cosine distance between the vector representation of the word ’good’ and all other word vectors, the top 5 most similar (greatest values for the cosine distance) words are shown below for both GloVe and word2vec.

(27)

Most similar words to ’good’ GloVe word2vec ’surprisingly’ ’great’ ’idea’ ’okish’ ’measure’ ’decent’ ’guys’ ’first’ ’deal’ ’actualy’

It is clear that the most similar words in the word2vec model make more sense than the most similar words in the GloVe model.

8.2 Game Review Classification

In this section the results for the prediction of the ratings of game apps are posed and discussed.

8.2.1 Data Description

The data consists of 17.343 game apps labelled with an average rating be-tween 1 and 5 (precision is to one decimal point). These ratings are rounded off to the closest integer such that 5 classes remain (rating x will from now on be referred to as class x). Every app contains at least one review with a maximum of 54 reviews. Unfortunately, the reviews do not contain time stamps, so it is unknown when the reviews were given. In the classification task, simply every review will therefore be used. Moreover, the data does not contain the ratings that individual users included with their reviews. Therefore, only the average ratings are used while training.

Another point to note, as can be seen from table 4, is that the ratings are centred around class 4, which could skew the performance of the classification algorithms. To adjust for this problem, ADASYN (section 7) was used on classes 1, 2, 3 and 5 to sample synthetic features for both LDA and word2vec. The results are structured as follows:

1. Results 1: First, ADASYN will be applied and then the data is split into an 80% trainig set and 20% test set. This means the test set also contains synthetic data.

(28)

2. Results 2: First, the data is split into an 80% training and 20% test set and then we apply ADASYN on the training set. Now the test set contains real data only.

Obviously it is to be expected that the first results will be better since not only does the training data contain synthetic data that is created from the test data, on top of that, more data is synthesized which makes learning easier. It does give us insight in how well the models can absorb the extra information created by ADASYN. It can moreover confirm if the ADASYN method for synthesizing data indeed generates sensible data (and is not gen-erating data that only creates more noise).

Table 4: Number of Data Points per Class.

Class

1 2 3 4 5

Count 41 129 1435 11776 3962

8.2.2 Results 1

As in the IMDB data case, a 150 topic model was used within LDA with a non-uniform alpha and a uniform eta. In order to give every review per app equal weight, the feature vectors of the reviews per app were summed and then divided by the number of reviews for that specific app. Afterwards, the features were classified using neural networks (NN), linear support vector machines (SVM) and logistic regressions (LR). The results are summarized below. The RBF kernel is left out of the results again. The metrics that are used are:

• Accuracy Percentage: the percentage of correctly classified apps • One Class Off: the percentage of apps that are classified at most one

(29)

NN Linear SVM LR Accuracy Percentage 90.8 72.4 70.3 One Class Off 99.5 90.6 89.7

For the word2vec features CBOW and skip-gram (SG) were both used with negative sampling (NS) (where the number of negative samples was set to 5) and hierarchical softmax (HS). The size of the hidden layer was set to 300 and a context size of 10 was used. In order to create a feature vector per app, every review was weighted equally as in the LDA case. Thus the feature vectors of the reviews are averaged per app. Again, as in the IMDB case, the feature vectors per review are the average of the word vectors of the words in that specific review. The various results are listed below.

Table 6: Classification with Word2vec Features.

Accuracy Percentage (One Class Off) NN Linear SVM LR SG & NS 87.9 (98.8) 61.3 (79.0) 59.5 (78.6) SG & HS 90.2 (99.1) 74.4 (85.7) 73.9 (85.0) CBOW & NS 86.5 (98.8) 58.0 (78.9) 58.6 (79.4) CBOW & HS 86.7 (98.6) 66.8 (85.2) 67.7 (85.3)

In the final part of the first analysis, early fusion of the LDA and word2vec features is applied; the word2vec and LDA features were simply concate-nated.

Accuracy Percentage (One Class Off) NN Linear SVM LR LDA + SG & NS 90.6 (98.7) 90.0 (98.4) 89.2 (97.7) LDA + SG & HS 90.9 (99.5) 82.3 (96.6) 81.8 (96.0) LDA + CBOW & NS 88.2 (98.7) 67.3 (90.9) 67.5 (91.6) LDA + CBOW & HS 88.5 (98.9) 66.9 (90.4) 67.2 (90.9)

(30)

for both LDA and word2vec separately, where different data points will be used to generate data. In other words, there is no guarantee the train and test set across LDA and word2vec will not overlap. Therefore, late fusion is left out of the analysis.

8.2.3 Conclusions 1

Contrary to the IMDB case, the LDA features in combination with a neu-ral network classifier outperform all word2vec features with an accuracy of 90.8%. Also noteworthy; early fusion of the LDA and the skip-gram word2vec features, boosts the accuracies of the linear SVM and LR classifiers by quite some percentage points. It also makes the accuracy of the neural net clas-sifier just slightly higher than the plain LDA features till an accuracy of 90.9% (this increase is marginal however). The CBOW word2vec features are always outperformed by the skip-gram features within NS and HS. The results appear very promising. However, keep in mind that the reviews did not contain a time stamp and about 40.000 synthetic data points were generated before splitting into a training and test set. All in all, it appears the neural network classifier is able to use the extra information generated by ADASYN extremely well. It also confirms the synthetic data generated by ADASYN is useful data (and not simply noise).

8.2.4 Elaborating on the LDA and Word2vec Features 1

An interesting part now, is to check what the features look like. For the LDA case, the topic distributions are illustrated in figure (8). The bars represent the average probability of that topic appearing in a class. In addition, the top three (stemmed) words per class per top topic are given. It’s very interesting to investigate the figure and the top words.

(31)

Figure 8: LDA Topic Distributions with ADASYN Data; top 5 topics per class are green.

(32)

Table 8: Top 5 Topics per Class with ADASYN Data. Top Topic 1 2 43 68 69 78 86 92 98 130 136 142 Class 1 X X X X X 2 X X X X X 3 X X X X X 4 X X X X X 5 X X X X X

9 Discussion & Conclusions

The results obtained for the prediction of the the star rating of game apps look very promising. However, reviews lacked the information of time stamps and the data was skewed. Moreover, it may be that not all reviews are in-cluded in the data. Because of these flaws, questions stated in the intro-duction, like ”How many early reviews should be taken into consideration for prediction?” and ”Should we look at a time range for the reviews that should be taken into account or should we look at a particular number of early reviews?” remain unanswered. When at some point this data is avail-able, especially an increase in observations for classes 1 and 2 is needed, this data should absolutely be utilized to create a model which is able to predict the star ratings of game apps based on early reviews.

For the word2vec model, as seen in section 8.2.4, a very odd thing resulted from the model; the words ’good’ and ’bad’ are very similar words in the word2vec model. The most probable explanation for this, is that the words ’good’ and ’bad’ usually appear in the same context. Take for example the sentence: ”This app is a very ... game.” Both ’good’ and ’bad’ can appear on the dots. For this task, where recognizing sentiment is extremely important, the similarity between the words ’good’ and ’bad’ is unwanted. These words

(39)

should be very dissimilar in order for the classification models to recognize the difference between good and bad apps more easily.

The LDA model on the other hand appeared to work very well for this task. Not only did the created topics make sense, moreover, the distribution over the classes varied which made classification possible.

As for now, answering the question of ”How do LDA and deep learning based models for the analysis of reviews perform on the prediction of the star rating of a game application?”. Based on the current results it is safe to say: LDA works well on the prediction of the star rating of a game application, however, deep learning methodologies seem to need some improvement. When datasets would grow orders of magnitudes larger, this is likely to change. We do feel, however, that the strengths of the two methods are complementary; so optimal solutions might combine both using early fusion.

(40)

10 References

M. Chen, X. Liu. Predicting popularity of online distributed applications: iTunes app store case analysis (2011) 661-663.

A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts. Learn-ing Word Vectors for Sentiment Analysis. The 49th Annual MeetLearn-ing of the Association for Computational Linguistics (ACL 2011).

D. Demir, O. Kapralova, H. Lai. Predicting IMDB movie ratings using Google Trends (2012).

G. Szabo and B. A. Huberman. Predicting the Popularity of Online Content (2010) 80-88.

M. Mesty´an, T. Yasseri, J. Kert´esz. Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data (2013).

L. Peska, P. Vojtas. Hybrid Biased k-NN to Predict Movie Tweets Popularity (2014).

M. el Assady, C. Rohrdantz, D. Hafner, F. Fischer, M. Hund, S. Simon, A. J¨ager, T. Schreck, W. Jentner, D. A. Keim. Visual Analytics for the Prediction of Movie Rating and Box Office Performance (2013).

D. M. Blei, A. Y. Ng, M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003) 993-1022.

H. M. Wallach, D. Mimno, A. McCallum. Rethinking LDA: Why Priors Matter (2009).

W. Xiaogang, E. Grimson. Spatial Latent Dirichlet Allocation. Advances in Neural Information Processing Systems 20 (2008). 1577-1584

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean. Distributed Repre-sentations ofWords and Phrases and their Compositionality (2013)

X. Rong. Word2vec Parameter Learning Explained. (2014)

Y. Goldberg , O. Levy. Word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method (2014)

C. N. dos Santos, M. Gatti. Deep Convolutional Neural Networks for Sen-timent Analysis of Short Texts. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69–78

(41)

S. Poria, E. Cambria, A. Gelbukh. Deep Convolutional Neural Network Tex-tual Features and Multiple Kernel Learning for Utterance-Level Multimodal Sentiment Analysis. Proceedings of the 2015 Conference on Empirical Meth-ods in Natural Language Processing, pages 2539–2544

J. Pavlopoulos, I. Androutsopoulos. Aspect Term Extraction for Sentiment Analysis: New Datasets, New Evaluation Measures and an Improved Unsu-pervised Method. Proceedings of the 5thWorkshop on Language Analysis for Social Media (LASM) (2014), 44-52

D. Tang, F. Wei , N. Yang, M. Zhou, T. Liu, B. Qin. Learning Sentiment-SpecificWord Embedding for Twitter Sentiment Classification. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1555–1565, (2014)

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean. Distributed Repre-sentations of Words and Phrases and their Compositionality. (2013)

J. Hong, M. Fang. Sentiment Analysis with Deeply Learned Distributed Representations of Variable Length Texts. (2015)

Q. Le, T. Mikolov. Distributed Representations of Sentences and Documents (2014).

G. Attardi. 2015. DeepNL: a Deep Learning NLP pipeline. Workshop on Vector Space Modeling for NLP, NAACL 2015, Denver, Colorado (June 5, 2015)

D. Guthrie, B. Allison, W. Liu, L. Guthrie, Y. Wilks. A Closer Look at Skip-gram Modelling. NLP Research Group, Department of Computer Science, University of Sheffield. (2006)

C. G.M. Snoek, M. Worring, A. W.M. Smeulders. Early versus Late Fusion in Semantic Video Analysis. Proceedings of the 13th annual ACM international conference on Multimedia. Pages 399-402 (2005).

J. Pennington, R. Socher, C. D. Manning. GloVe: Global Vectors for Word Representation. Computer Science Department, Stanford University, Stan-ford, CA 94305 (2014).

H. He, Y. Bai, E. A. Garcia, S. Li. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. 2008 International Joint Conference on Neural Networks (IJCNN 2008), pages 1322-1328.

Forecasting the success of game-apps based on reviews

University Of Amsterdam

Master Thesis