Forecasting the Popularity of Applications : an analysis of textual and graphical properties

(1)

Forecasting the Popularity of Applications

An analysis of textual and graphical properties

Harro van der Kroft Master Thesis for

Econometrics - Big Data Track Faculty of Economics and Business Section Econometrics

(2)

(3)

Abstract

This thesis contributes to the scarce literature pertaining to App Store content popularity prediction. By scraping data from the Apple App Store, we form feature sets pertaining to the textual and graphical domain. The methodology employed allows for the use of other data, from other online content sources, and fuses these feature sets by means of late fusion. This thesis researches the predictive power of Neural Networks and Support Vector Machines in parallel, and by layering diﬀerent feature sets it ascertains that there is an added benefit in combining diﬀerent feature sets. We reveal that there is predictive power in using the methodology outlined in this thesis.

(4)

II

Acknowledgments

I would like to sincerely thank my supervisor Prof. Dr. M. Worring for his supervision, patience, and enthusiasm. The passion entertained by Marcel has furthered my interests in the field of AI more than I could have hoped for. Furthermore, I would like to thank Leo Huberts, Diederik van Krieken, Frederique Arntz, and Dominique van der Vlist for their input and constructive criticism.

(5)

Statement of Originality

This document is written by Student Harro van der Kroft who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

(6)

1 |

Introduction

With the introduction of the Apple iPhone in 2007, smart phones have become a fixture in the online consumption of media. There are over 3.9 billion active mobile data subscriptions worldwide, with estimates for 2022 being 6.9 bilion (Ericsson, 2017, p. 2). Furthermore, the data transfer over a monthly period associated with these active subscriptions is over 2.1 GiB (about 3 Compact Discs), in 2014. The fact that people spend an ever increasing amount of time on their phones (Meeker, 2014), means that online content consumption is a large and increasing part of people’s lives. Companies such as Google, Netflix, Amazon, Hulu, Apple, Microsoft, and many more try to captivate users with applications, movies and online content related to their respective fields and businesses.

The mobile app development market is a large market. On August 16th, AppShopper.com reported that there were 1.6 million application available in the Apple App Store (AppShopper, 2017). A recent article by Forbes.com showed that for the 2016 calendar year the total money spent in the App Store was $30 billion, with developers receiving over $20 billion Forbes (2017). All in all these numbers show that there is a lot of revenue to be made in the online content business, with the App Store being a prime example of a medium serving online content.

Online content however, is very diverse. The content ranges from images on Flickr to Microsoft Excel in the Android App Store. There is a lot of variety, and the attention span of people is intrinsically biased (or: short) (Szabo and Huberman, 2010a, p. 88). Therefore, the added value for each item has to be clearly communicated to the consumer. When doing so, one must consider the diﬀerent feature sets pertaining to an item. Firstly, the graphical domain: thumbnails, videos, layouts, and presentation. Secondly, the textual domain: descriptions, titles, reviews. Lastly, more meta attributes may be considered: awards, mentions in other online content, and for movies: actors.

However, this diversity means nothing without having a common denominator to pin the added value per consumer on. This diversity means nothing without having a common denominator to pin the added value per consumer on. A clear example example is the rating of an item. These ratings allow the consumer to show sentiment, and allows the content provider to have a proxy for their actual needed statistic: popularity. Popularity is a vague construct. We therefore have the need to quantify it. One of the ways this can be achieved is by the number of views (henceforth known as views more simply). The views show a good chunk

(10)

2

of popularity, but there is a fatal flaw in using this statistic to produce a proxy for popularity: it does not show the sentiment for a particular item. An item may have a large amount of views because of marketing but still fall short of consumer expectations of the particular content. Now as stated before, many content providers allow for the rating of an application. An example would be the rating of an application in the App Store of Apple: a 1-5 rating to show sentiment.

Companies such as the aforementioned giants need to anticipate the eﬀect of their next move. The biggest problem for most companies is: how will my future content evolve? Will HBO produce another season of their latest TV Show or will Netflix produce a new series in its entirety? An approximation the success online can garner more security for these companies.

This paper answers the following question by developing a tool set/algorithm: Is it possible to predict the average rating of App Store content? With the following sub-questions:

1. How do Support Vector Machines (SVM), and Neural Networks (NN) perform? 2. How does performance depend on the exploitation of diﬀerent feature sets?

The tools used within this paper are based in the realm of Machine Learning: NN, SVM, and Latent Dirichlet Allocation (LDA), Support Vector Machines, and Ensemble Learning. The use of NN and SVM is because they allow for a classification problem to be solved, which is why they were chosen.

The expectation of the results of this paper is that it is possible to ascertain a decent approximation of the average rating of online content, but with some probable caveats: Firstly, meta data that would be relevant tot he popularity classification that is not readily available is most probably omitted (e.g. marketing budget for an application, the popularity of an actor at the time of release). Secondly, companies are not too keen on disclosing all information regarding their online content. Lastly, the used algorithms have their benefits, but also disadvantages which will be discussed.

This paper will first focus on the relevant literature pertaining to textual and graphical analysis of content in chapter 2. Afterwards, we will first introduce theoretical constructs in chapter 3, where the basic foundation will be laid for the reader to understand the methodology as outlined in the next chapter, chapter 4. After this an experiment will be performed on App Store data in chapter 5. Finally, a conclusion will be drawn in chapter 6.

(11)

2 |

Literature Review

The main focuses of this chapter are: the textual analysis of online content, the analysis of graphical online content, and online content analysis in general. The general developments in the field of machine learning pertaining to classifying images will also be discussed.

2.1 Internet Movie Database

Recent research with regards to online content popularity has mainly focused on popularity of movies by using the Internet Movie Database (IMDb): papers such as those by Eren and Sert (2017) and Pramod and Joshi (2017). The former focuses on a binary classification: flop or success; the latter focuses on predicting a rating. The work by Eren and Sert (2017) is of importance for this thesis as they combine mixed data types. The use of mixed data types is interesting for this thesis. Other work includes that by Latif and Afzal (2016), which couples econometric regressions and machine learning to attain a rating.

In the paper by Hsu et al. (2014) the authors tackle a similar problem as this thesis: predicting popularity. The authors analyze 32,968 movies, with a focus on the graphical part. By using neural networks, the authors achieve a high accuracy (a prediction absolute error of 0.82). They use 31,406 movies as a training set, using a 95/5% split for training/testing. The authors use key image components to identify features contained in images. By using color histograms, gradient histograms, texture, and objects the popularity of an image is predicted by means of Support Vector Regression. The work by Oghina et al. (2012) uses YouTube commentary sentiment for prediction. Others focus on only looking at the box oﬃce revenue Mestyán et al. (2013). In short: the IMDb data-set is a well-structured data set which has been analyzed thoroughly.

2.2 Online content analysis

IMDb data is well prepared and thoroughly analyzed, we therefore perform a focus shift: other online categories with lesser quality data are of interest. This section analyzes relevant papers in this subject.

The paper by Khosla et al. (2014) focuses on the popularity of online images. Their paper uses a data set consisting of 2.3 million images from Flickr, an image sharing site. They use meta-data consisting of views and social cues. For them, a social cue is the number of friends of the photo’s uploader. The paper does not, however, concern itself with the rating of the online content, but with the amount of views. This

(12)

2.3. POPULARITY PREDICTION 4

is because Flickr does not allow for an average rating on a nominal scale. They do allow for thumbs up or down, which provides less information than an integer scale from e.g. 1 to 10. The authors used views as a proxy for popularity, and this use of a proxy for popularity will be used in this thesis.

Other branches of online content that have been extensively researched are Youtube and Twitter, with research such as the works done by Bae and Lee (2012), and Szabo and Huberman (2010b). The former work focuses on investigating the factors that drive the popularity of messages on Twitter mostly based on the sentiment. The latter work focuses on the popularity of videos on YouTube by looking at the popular (at the time) website Digg.com. They associate the popularity of YouTube videos by the amount of likes on Digg. The paper by Szabo and Huberman (2010a) also concerns itself with Digg.com by using the information gathered on the site to model the popularity of applications. Despite the novel approach, it uses data outside the applications themselves, which we see as a fault with the paper. We will use the data more closely related (meta data) to the online content as a base.

2.3 Popularity Prediction

Harvey et al. (2011) predicts the rating given by a user. An interesting point to make is that by predicting the rating of a user, one can essentially predict the overall popularity of an item using this algorithm. The paper by Malmi (2014) investigates the connection between the usage of an application and the popularity associated with it. The data set used in the paper describes the usage of the application and the user’s phone. It tries to quantify the correlation between popularity and usage data. They find a surprisingly small correlation between the popularity of an application and usage.

In the paper by Mazloom et al. (2016), an analysis is done on the diﬀerent driving forces behind the popularity of brand related online content posts on social media. It combines features from the visual and textual properties of online content. One of the foremost results of the paper is that the visual and textual properties complement each other. However, the diﬀerence in this thesis is that the properties are not distilled into engagement properties as that would be of no use in this thesis. We are not interested in engagement or sentiment, but using visual and textual properties is interesting for this thesis.

2.4 Deep Learning and Image classification

This thesis uses deep learning for feature extraction from images, and the following section will review the important contributions made in image classification.

In 2009 the cleanly-labelled ImageNet (Deng et al., 2009) data set was introduced. It consists of ap-proximately 50 million sorted and labelled images. Because a perfectly labelled imageset is a gold mine within the field of machine learning, it initiated an arms race in the field of picture labeling and object recognition. The data of ImageNet is the basis for the competition Large Scale Visual Recognition Challenge

(13)

2.5. MODALITY IN FEATURES 5

(ILSVRC). This competition, and the rules it specified, allowed for algorithms to compete in the field of image recognition and object classification. The two trophy accuracy tests were top-5 and top-1 error rates. The first is defined as the percent of classifications whereby the correct label is not part of the top-5 labels predicted by the model. The latter is defined in the same matter. The competition first took place in 2009 and the research community concerning image classification and object recognition was boosted significantly with more sophisticated and better error rates as a result.

Among these methodologies is the work by Krizhevsky et al. (2012). The network described in their paper is called AlexNet, a convolutional neural network consisting of only 8 layers: the first 5 are convolutional layers followed by a 5 fully connected layers with dropout layers in between. AlexNet was also noteworthy in that it used intra-gpu connections to train. By doing so it reached an error rate of 15.3% in the top-5 error rate, a full 10.8 percentage points ahead of the runner up. The concepts of these layers are explained in section 3.3. AlexNet is discussed here as it a widely used and well-documented applicant to the ILSCRC, there are however better teams. The teams at the top of LSVRC2016 are: CUImage, HIKVision, Nuist ILSVRC (2016)

2.5 Modality in features

The previous section concentrated on the gathering of data and therefore features. When combining these data points it is necessary to talk about the fusion of these features. The paper by Snoek et al. (2005) concerns itself with the fusion of multiple types of features. The paper illustrates this by using two types of ’fusion’: early and late fusion. The former uses the modalities in feature space while the later fuses the modalities in feature space. For example: adding price data and colour data in one data set to forecast the sales of ice cream. In late fusion the feature sets are trained individually to from an outcome. This outcome, and the probabilities for each class that arise, are trained upon afterwards again. This can be applied to regression, SVM and NN.

The main take-away from this analysis of recent research is that in their respective fields a lot of work has been done. However, the research is either very concentrated on a certain field of online content (IMDb, Flickr, Youtube, Twitter) or with the use case (Fraud detection, revenue prediction, usage prediction). To ovecome this disadvantage, this paper posits a more general model with an experiment on App Store data as an example.

(14)

3 |

Theory

This chapter lays the the theoretical groundwork for the methodology outlined in chapter 4. Firstly section 3.1 covers some basic statistics pertaining to the field of machine learning. Secondly, section 3.2.2 covers the textual analysis part. Thirdly, section 3.3 covers Neural Networks, and section 3.4 covers the theoretical side of SVM. Finally, section 3.5 covers data sampling.

3.1 Statistics

The section includes some primer information on Bayesian statistics. Within the field of machine learning, Bayesian statistics is widely used. This form of statistics uses a so-called posterior probability of an event as being the conditional probability that is assigned before relevant data is taken into account. Conversely, the posterior probability is the probability distribution of an unknown random variable, after relevant data has been taken into account. "Posterior" here means taking into account relevant evidence from an experiment. The posterior probability can be written in textual form as:

Posterior probability∝ Likelihood × Prior probability

We shall illustrate with an example; given an experiment (data) d, and parameters theta, we may define the above equation as follows:

P (θ|d) = P (d|θ)P (θ) P (d)

Now if the left-hand side posterior (P (θ|d)) is of the same probability as the prior probability (P [θ]), then the prior and posterior are called conjugate distributions, where the prior is called the conjugate prior Pratt et al. (1995).

The Dirichlet distribution, in this tehsis denoted by Dirichlet(α), is family of continuous multivariate probability distributions parametrized by a vector α constituting of real, positive numbers. The Dirichlet distribution is the multivariate generalization of the Beta distribution. Dirichlet distribution are often used in Bayesian statistics, as the Dirichlet distribution is the conjugate prior of the Multinomial distribution.

(15)

3.2. NATURAL LANGUAGE PROCESSING 7

3.2 Natural Language Processing

The field of Natural Language Processing (or: NLP) is a field of artificial intelligence that is concerned with the processing of human language in a way that computers are able to process it. In particular, it is concerned with processing large corpora of texts. The challenges available in natural language processing involve speech recognition, dialog interaction systems, generating natural language, amongst others. We define some notation to be used in this section:

Token This is a unit from the vocabulary indexed by {1, . . . , T }. A token can be seen as a lowercase version of a word without punctuation. Using unit vectors we distinguish between certain tokens used. If token i is used, a vector ei is used to represent this word. ei is a vector with all 0’s, except for the

i-th spot. If a token is not used in a document, the resulting vector is a 0 vector: a vector with all zeroes.

Document is a sequence containing units of tokens present in a document. A document is defined as the set of tokens that are contained within it

d ={t1, t2, . . . , tW−1, tW}, (3.1)

With W being the amount of words present in the document after processing.

Corpus The corpus is the set of all documents. It is defined as C = {d1, d2, . . . , dN−1, dN}, with N the

number of documents in the corpus.

To save space and computational troubles, all 0 vectors can be omitted.

3.2.1 TF-IDF

Within NLP the need arises to rank words for their significance. Either the word does not contain information relevant for the test (and, or, the) or the word is too rare within the Corpus. There needs to be a balance between rarity and information. Before tackling the concept of TF-IDF we introduce some notation. For any set defined by S, with the example being {1, 5, 50, 512} we introduce the notion of elements inside of the set: |S| = |{1, 5, 50, 512}| = 4

Now suppose that we have a corpus of text documents and we wish to rank which document are most relevant to our query ‘a tidy room’. A simple way of querying this data set is by eliminating all documents that do not contain the words ‘a’, ‘tidy’, and ‘room’. This however creates a two-fold problem: we are left with a lot of documents, and the documents that are left have no ranking in an order of their relevance. To distinguish between the leftover documents we count the frequency of the terms in each document; this is aptly named the Term Frequency (TF). The first form of this insight is made by Luhn (1957). This thesis

(16)

uses the following definition for TF:

TF(t, d) = ft,d=|{t ∈ d}|, (3.2)

with t indicating the token as discussed in section 3.2, d the document being analyzed, and ft,d specifies

the raw count of the token in that particular document.

This insight of TF however, does not control for the amount of documents available in total. We shall take the word ‘a’ as an example: it is a common word, therefore the term frequency as defined by equation 3.2 will wrongly assert a high importance to documents containing the word ‘a’. We would assume that the words ‘tidy’, and ‘room’ carry more weight in defining the importance of a text. We therefore employ the concept of Inverse Document Frequency (IDF):

IDF(t,C) = log ( N |{d ∈ C : t ∈ d}| ) ,

where t again constitutes the token being researched, C the corpus, N a count of the number of documents in the corpus,|{d ∈ C : t ∈ d}| leads to the number of documents where the term t is present. The assumption here is that only tokens are researched that are actually in the corpus. The work by Sparck Jones (1972) created the basis for Inverse Document Frequency.

By combining Term Frequency and Inverse Document Frequency, we obtain a ranking function that is a trade-oﬀ between term frequency on the document level, and the frequency the token appears in the corpus. We calculate it as follows: TF-IDF(t, d,C) = tf(t, d) × idf(t, C) (3.3) =|{t ∈ d}| × log ( N |{d∗_{∈ C : t ∈ d}∗_}| ) (3.4) This ranking function can be used to make topic modelling computationally more eﬃcient. Not all tokens present in the original texts are still in the documents after parsing.

3.2.2 LDA

Within the realm of machine learning and NLP, a topic model is a statistical tool for extracting the abstract ‘topics’ that are assumed to be hidden in a collection of documents. For text mining purposes a topic model can lay bare the hidden semantic structures of the given text. This subsection covers a specific type of topic modelling: the Latent Dirichlet Allocation (LDA) model. First described by Blei et al. (2003), this type of topic model allows for the diﬀerences in sets to be explained by latent factors described by the presence of certain topics within documents.

(17)

3.2.2.1 Model

LDA is a generative model, which means that it learns the joint probabilities of word, document, and topic distributions. The word generative comes from the fact that it starts with a set of priors, which it then changes upon learning new information. This relates to the ‘prior’ and ‘posterior’ probability part of equation 3.1. Here the text and word level probabilities will be elaborated on. Documents are represented as random mixtures over latent topics. Each topic chosen from{1, . . . , K} is characterized by a distribution over the words. Basically, documents are represented as random mixtures over latent topics. Each topic in turn is characterized by a distribution over the words. For each document d∈ C, LDA assumes the following generative process, as described by Blei et al. (2003):

1. We have a corpusC which consists of W documents each with Ni words.

2. Choose a W -dimensional θ∼ Dirichlet(α). With α being a prior. 3. Choose a K-dimensional φ∼ Dirichlet(β). With β being a prior.

4. For each of the document, word combinations i, j, with i∈ {1, . . . , W } and j ∈ {1, . . . , Ni}:

a) Choose a topic zi,j ∼ Multinomial(θj).

b) Choose a word wi,j ∼ Multinomial(φzi,j)

In other words, we choose a set of priors (α, β) and build our model around this, assigning a topic and word distribution for each combination. The paper by Hong and Davison (2010) shows us that the method of choosing priors is of importance, where topic models typically assume symmetric Dirichlet priors, where α and β are chosen so that each topic, word, and document probability is the same. The paper by Wallach et al. (2009) suggests that an asymmetric α and symmetric β allow for better performance than uniformly distributed priors. Intuitively, as explained in the paper by Andrzejewski et al. (2009, p. 1) this makes sense: in general a word or document will have a preference towards a certain topic and this information should be incorporated in the priors, if known.

3.2.3 Pre-processing & stemming

Before the model can analyze a text, some parsing needs to occur: by removing all punctuation, the words can be converted to tokens. Because texts are grammatically sound for a human, it means that "work", "working", and "worked" are seen as separate tokens. Within natural language there are families of related words that have the same semantic meaning. By correcting for these diﬀerences we garner more information than if we did not apply stemming (Lovins, 1968).

3.2.4 Topic Number Estimation

In the field of topic modelling it is critical to have a correct intuition about the topics: they have to be modelled to actually represent the text being analyzed. By creating such a criterion, we are able to deduce

(18)

3.3. ARTIFICIAL NEURAL NETWORKS 10

the amount of topics K.

One of the ways we are able to do so is via perplexity. Perplexity is used in information theory as a measurement of how well a probability distribution or model predicts a given test sample. It can be used, in our case for K, to determine the distribution which fits the sample the best (Blei et al., 2003, p. 1008).

There are, however, problems with the use of perplexity for estimating k. One of the most notable is that perplexity does not correlate strongly to the judgment of humans, as outlined in the paper by Chang et al. (2009): they tested multiple measures of model likelihood and correlated it to human judgment in large-scale user studies. They conclude that a higher adherence to a levels of model likelihood leads to lower number of semantically meaningful topics.

An alternative method for evaluating the optimal number of topics in LDA is based on so called topic coherence as suggested in Chang et al. (2009). Topic coherence (or in the paper: Cv) is a measure of how

interpretable the topics are to humans. Coherence starts by picking the top-N words, sorted by term weight within the topic. It then calculates how similar the words are to each other. There are multiple methods for doing so, of which almost all are outlined in the paper by Röder et al. (2015). The authors performed an analysis of the various methods and correlated them to human judgment. The method which is called Cv was found to be the most highly correlated of all. The method makes multiple passes over the corpusC,

accumulating both term occurence and co-occurence count (how many times a word is used in conjuction with another word), and does for the top-N words in each topic.

3.3 Artificial Neural Networks

(Artificial) Neural Networks, or NN, are systems inspired by the biological neural systems of computing that constitute brains. These systems are thought to learn progressively, without being programmed to do a specific task at hand.

A Neural Network is a collection of units (outputs) and weights in a mesh called a net which are analogous to the synapses and axons in a human brain. Each connection (synapse) between neurons can transmit a signal to another neuron. The receiving neuron (post-synaptic) can process the signal and apply a signal to downstream neurons connected to it. Neurons have a state which is presented by a real number typically ranging between 0 and 1. Neurons and the connections associated with it have a weight that may vary during learning, and this weight can increase and decrease the strength of the signal sent downstream. The strength of a signal is learned from feedback given a loss function associated with the output. Furthermore, the weights may have a threshold such that the aggregate signal from connected neurons may not be above of below a certain level. An example for Neural Networks would be the binary classification between "a building" and "not a building". The Neural Network would train on a data set which contains features that constitute a "building" and "not a building". From here the Neural Networks would train to classify new

(19)

samples.

3.3.1 Feed-forward network

The basic form of a (1 hidden layer) Neural Network consists of I linear combinations of the input variables (or: features){xi}Ii in the form:

aj = I

∑

i=1

w_ji(1)xi+ b(1) j ∈ {1, . . . , H}

With H the number of nodes in the hidden layer. The variable b(1) is a bias with respect to this particular layer. The quantities aj are called integrations. These integrations are transformed by the use of a

(non-)linear activation function, which computes the new activation (zj), which is be defined as:

zj = h(aj).

The function h(·) is often times chosen with the aim of keeping the end result within certain boundaries for next layers to work with (more examples will be given in subsection 3.3.2). The next layers use the resulting zj: yk = h  ∑H j=1 w(2)_kjzj+ b(2)  

Where the variables w_ji(k)are the weights associated with the output level yk, and where k∈ {1, . . . , K} with

K being the total number of outputs. This transformation of zj → yk constitutes going from the hidden

layer to the output layer. We again introduce biases for this level. Graphically, it can be demonstrated as follows: x1 Input 1 x2 Input 2 z1 z2 z3 y1 y2 Hidden layer Input layer Output layer b(2) b(1)

Figure 3.3.1: A simple example of a neural network

The arrows in figure 3.3.1 going to and from the input, hidden, and output layers have associated weights w_ji(y), with y ∈ {1, 2}. Combining all stages and the output layer, we ascertain:

yk(x, w) = h  ∑H j=1 w(2)_kj × h ( _I ∑ i=1 w(1)_ji xi+ b(1) ) + b(2)   (3.5)

(20)

We call the act of evaluating 3.5 as forward propagation of information throughout the network (Bishop, 2006, p. 229). The network as defined in this section may be easily expanded on by introducing new layers with their own biases, weights, and transformations (see subsection 3.3.5) .

3.3.2 Activation layers

As mentioned in the previous subsection, the outputs between layers use activation functions. There are a number of activation layers with desirable properties that will be discussed here: ReLu, Sigmoid, tanh, and softmax.

1. aReLU(x) = max(0, x) ( Nair and Hinton (2010))

2. asigmoid(x) = _1+e1−x

3. atanh(x) = _1+e2−2x − 1 = 2asigmoid(2x)− 1

4. asoftmax(x, j) = e xj ∑

k=1exk

aReLU is less computationally expensive than atanh and asigmoid because ‘ReLU’ involves simpler

mathemat-ical operations and creates a less analog output (half of the output is 0). The use of asoftmax is restricted in

that the values of x need to be in the range [0, 1] and add up to 1.

3.3.3 Network Training

Given a training set comprising of input vector{xn}Nn=1, and an associated target vector{tn}Nn, we minimize

the error function,taking the squared sum of errors an example:

E(w) = 1 2 N ∑ n=1 ∥y(xn, w)− tn∥2 (3.6)

To derive the optimal weights and biases in the network, the gradient of the error function must be found. We evaluate the gradient for each weight individually:

δEn δwji = δE δyi δyi δzj δzj δwji

and find the local error signal involved in change a weight. The Neural Network is optimized by observing and acting upon the change in error function as defined above. Every instance of training the network is called an epoch.

3.3.4 Loss function

For many applications the objective is probably more complex than minimizing the amount of mis-classifications. An example can be seen in medicine. For a doctor it is more important to correctly identify a healthy person than treat a healthy person. In other words: a patient with illness should not be turned away (type-I error), but an ill person could receive treatment that is unneeded (type-II error).

(21)

We can formalize this issues by identifying a loss function, alternatively called a (negative) cost function, which provides a single overall measure of loss incurred in taking the actions and decisions as so defined by the Neural Network. The goal of the Neural Network is to minimize this function. The optimal solution is the one which minimizes the loss function.

One of the potential (non-linear) loss functions is the Cross Entropy Loss function which is useful when training a classification problem with n classes. The loss in class i, using the data vector d

loss(d, i) =−log(e dj₎ ∑ jedj (3.7) =−di+ log  ∑ j exp(dj)   (3.8)

Where d is a vector containing the data. The Cross Entropy Loss function may be generalized for imbalanced data sets. When handling imbalanced data set one can include weights wi, with

∑ i=1wi= 1: loss(d, i) = wi  −di+ log  ∑ j exp(dj)     .

Other loss functions that may be considered are L1 (∑|y − t|) and L2 (∑|y − t|2_{) regularization.}

3.3.5 Other layers

The hidden layer described earlier uses full-connected layers. A fully-connected layer is one where all nodes are connected to all previous layer’s nodes and the next layer’s nodes. There are a multitude of layers, but the ones worth noting for this thesis are: convolutional layers, pooling layers, and dropout layers. They are all layers used by AlexNet (Krizhevsky et al., 2012).

The problem with normal Neural Networks is that they do not scale well with the inclusion of images. In AlexNet the input image size is 224 x 224 x 3 (3 because of the three main colour channels: Red (R), Green (G), and Blue (B)), a vector of size 150,528. Although this would be manageable, we would certainly want multiple layers, therefore full connectivity would lead to either over fitting or an untrainable model. As when dealing with high-dimensional queries one may come across the "dimensionality curse" where the number of possible parameters is larger than the number of training samples. Within the context of Neural Networks, pruning or lowering the amount of parameters has shown to be useful Bengio and Bengio (2000, p. 1).

Convolutional layers use the inputs of an image in a more geometric sense: the neurons are arranged in 3 dimensions: width, height, and depth. After this arrangement a filter of (for example) 5x5x3 slides over the volume defined earlier. We then take the dot product, with the filter being w:

(22)

3.4. SUPPORT VECTOR MACHINE 14

which in our case would be 75-dimensional dot product with the end result being a scalar. This operation is called a convolution. The end result of this would be a 28x28x1 volume. The 28 comes from the fact that it is the number of unique positions possible from the 32x32 field. The convolutional layer is built up from the amount of filters placed on the original filters, this could be more than 1, allowing for diﬀerent types of information. Because there are multiple passes over the same pixels, the spatial information is preserved.

AlexNet contains so-called Pooling Layers (Krizhevsky et al., 2012, p. 4). These layers aim to progres-sively reduce the size of the spatial information by summarizing the content of a previous nodes, therefore reducing the dimensionality of the net. By reducing the amount of parameters a control is exerted on the net to prevent over fitting (i.e. dimensionality curse). A way of thinking of a pooling layer is down sampling: reducing the size of a 500x500x3 image to 100x100x3. This preserves the spatial information but reduces the parameters needing to be optimized.

Dropout layers, as introduced by Srivastava et al. (2014), are very specific in their function. Like the Pooling Layer they allow for a control with over-fitting, but they do not conserve spatial information. The idea of a dropout layer is to randomly "drop out" certain activations in a layer by setting them equal to zero. This forces the network to create redundant path for the same answer.

3.3.6 Normalization

Within the field of statistics normalizing a random variable is a way of forcing the random variable to a certain distribution, after which more easily an analysis can be made with regard to the source material. The field of image processing is similar, we want the input parameters (pixels in our case) to be similarly distributed. This will make convergence faster when training the network (Ioﬀe and Szegedy, 2015, p. 8). To accomplish normalization, we first define X to the training set, and Y the test set.

1. For each c∈ {R, G, B} pertaining to the colour channels calculate the mean (µc) and standard

devia-tion (σc) using the information in training set xi ∈ X

2. For both xi ∈ X and yj ∈ Y , and for each channel apply the following formula, with c being the

relevant channel:

c∗ = c− µc σc

.

If we did not scale our input training vectors, the ranges of our distributions of feature values would likely be diﬀerent per feature, and therefore the learning rate would cause corrections in each dimension that would diﬀer from one another. In other words: the inputs should be distributed similarly.

3.4 Support Vector Machine

Support Vector Machines, or SVM, is a machine learning algorithm which is used for both classification and regression problems. However, it is mostly used in classification problems. In this algorithm each data

(23)

3.4. SUPPORT VECTOR MACHINE 15

item is in an n-dimensional space with the value of each feature being the value of the coordinate. We can perform classification by finding a hyper plane that diﬀerentiates k-classes.

To understand the concept of Support Vector Machines, it is of importance that one knows the definition and possible applications of hyper-planes. The definition used by Curtis (1968) is used, accompanied with a geometric interpretation: We may consider a hyper-plane to be a subspace, but of one dimension less that the space it is residing in (also called ambient space). Therefore, as an example, if a space is 4-dimensional, its hyper-plane would 3-dimensional. In general: if the space is n-dimensional, its hyper-plane is (n-1)-dimensional. This notion of hyper-planes can be used in generality in spaces where the notion of sub-spaces is defined.

A Support Vector machine creates a hyper-plane (or a multitude of hyper-planes) in a n dimensional space. With n being of a high order, this space can be used for regression analysis, and binary classification. The method creates a separation by classes by creating hyper-planes that have the greatest distance to the nearest training data point of any class, the so called margin.

There are two methods when dealing with class classification problems: implementing the multi-class multi-classification problem by creating a single optimization problem (Crammer and Singer, 2001). Or by using the version as studied to be the best by Duan and Keerthi (2005) by reducing the single multi-class problem into multiple subsets of binary classification problems. In this paper the latter will be used, as the run-time is significantly reduced (Pedregosa et al., 2011).

3.4.1 Ensemble Learning

Support Vector Machines, is computationally intensive Chapelle (2007). With the computational complexity described as O(max{n, d}, min{n, d}2) with n the number of points and d the number of dimensions. To counter this, ensemble methods may be employed.

In statistics, and machine learning in particular, ensemble methods obtain a classifier that is a combina-tion of multiple classifiers. Multiple methods of obtaining a classifier using subsets are available, to reduce the computational complexity. Breiman (1999) creates a classifier from using random subsets of samples. In the paper by Ho (1998) randomly sample the subsets of features, and call this Random Subspaces . When using a sampler on both the samples and features, we call this Random Patches Louppe and Geurts (2012). Finally, when drawing samples with replacement one calls this Bagging Breiman (1996).

3.4.2 Kernel

A kernel with regards to Support Vector Machines is a way of computing the similarity of two vectors x and

y in some high-dimensional feature space (Bishop, 2006, p. 292). Suppose we have function φ that maps our

vectors as: Rn→ Rm. We may define the dot product in this space to be ϕ(x)Tϕ(y). A kernel k(·, ·) is a function that corresponds to this dot product (in that space). In our example we use k(x, y) = ϕ(x)Tϕ(y).

(24)

3.5. SYNTHETIC SAMPLING 16

This kernel is also called the linear kernel (Bishop, 2006, p. 292). Another example is the Radial Basis Function (RBF), this is an often used kernel in training non-linear SVM problems (Chang et al., 2010, p. 1475). It is defined as:

k(x, y) = exp(−γ∥x − y∥2₎

The parameter associated with RBF is γ, which will be discussed in the next subsection.

3.4.3 Parameters

The parameters associated with SVM depend on the kernel used. Here we will discuss the parameters associated with SVM and the RBF kernel. The first parameter of interest for SVM itself is C, this parameter is a positive integer that controls the cost of misclassification.

The γ coeﬃcient is a parameter for the RBF kernel and allows tuning in the case of over-fitting: a higher γ is associated with over fitting. A smaller γ would imply a distribution of Gaussian shape, with a large variance. This would imply that the influence of ’surrounding’ (i.e. close by in higher dimensional space) is greater. A large γ would imply the opposite: no wide-spread influence therefore the choice of γ entertains the classic bias-variance trade-oﬀ: a large γ leads to a high bias and low variance and vice-versa.

3.5 Synthetic Sampling

In the field of Machine Learning, a great source of bias and misclassification, with multi-class classification problems, is an unbalanced data set. If your data set is in total 1000 objects and 980 of them belong to the class "dog" and the rest of the 20 belong to the class "cat" it gets hard to predict the class of unseen data that belongs to "cat". The classifier will tend towards the more available class to maximize accuracy. This problem of imbalanced data may occur with a multitude of causes: the data is skewed because of the latent nature of the variable at hand, there may be a sample selection bias, or there are other problems at play. To better analyze a dataset, one must identify, and if possible correct for, biases in the data set.

To counteract this bias of unbalanced data sets one can apply sampling of classes. There are three cases to discuss: the oversampling of the minority class ("cat" in the example), under sampling the majority class ("dog" in the example), and a combination of over-and under sampling.

The SMOTE Chawla et al. (2002) is a method for over-sampling a data-set to counter classification problems posed by imbalanced dependent variable distribution. SMOTE generates synthetic cases when a rare target value is required. Chawla et al. (2002) uses an interpolation strategy when creating these examples artificially. The smote algorithm introduced by Chawla et al. (2002) uses an over-sampling strategy consisting of generating synthetic cases ˆxi when a rare target value yi is requested. They proposed a strategy whereby

the strategy is to select one of the target value yi’s k-nearest neighbours and use these new observations and

(25)

In the paper by He et al. (2008), the ADAptive SYNthetic sampling approach (or ADASYN) is in-troduced. The authors build upon the methodology of SMOTE (Chawla et al., 2002) by focusing on the minority classes which are diﬃcult to learn. ADASYN generated synthetic data for the minority classes, but increases the amount generated for the minority classes which are harder to learn. The ADASYN algorithm works as follows:

We start with a training set D with in total n samples of labelled and paired data {xi, yi}ni=0. With

xi ∈ X and yi ∈ Y = {−1, 1} with yi identifying the class associated with the corresponding xi. We now

define two subsets of D: D₋(n₋) and D+(n+) as the minority and majority classes and their respective

counts. We therefore have n = n₋+ n+, n−< n+, and D−∪ D+= D.

1. One calculates the number of samples we wish to generate for the minority class:

G = (n+− n−)× β (3.9)

with β∈ [0, 1] as a parameter to specify the balance level after generation of the synthetic data. β = 0 would indicate no samples being trained and β = 1 would indicate a fully balanced data set after synthetic data generation.

2. for each item xi ∈ D− find the K nearest neighbours based on the euclidean distance. Using this

information calculate the ratio

ri= ∆i/K, i = 1, . . . , n−

where ∆i is to be defined as the number of examples in the K nearest neighbours of xi that belong to

D₋, we therefore achieve ri ∈ [0, 1].

3. we now normalize ri as follows:

ˆ ri =

ri

∑n−

i=1ri

so that ˆri has the following attribute:

∑

irˆi= 1

4. we now calculate the number of synthetic data points that are to be generated for every minority example xi:

gi= ˆri× G (3.10)

with G being the total number of synthetic data examples that are to be generated, in accordance with equation 3.9

5. We know now how many data points are to be generated for each xi ∈ D−. For each of these gi we

loop over them and randomly pick one point from the K nearest neighbours and call this point xki:

we then generate a point by the following formula:

si = xi+ (xki− xi)λ

(26)

This algorithm can be generalized for multi-class imbalanced data sets by handling one minority class at a time. The introduction of synthesized data is known to improve the performance of classification algorithms, in the original paper of He et al. (2008) the increase of classification accuracy was better than SMOTE in all but one case presented in the paper.

(27)

4 |

Methodology

The main focus of this chapter is methodology used to answer the research question. First, the pre-processing of the data is discussed in section 4.1. Secondly, in section 4.2 feature extraction will be discussed. Thirdly, section 4.3 will discuss the classification and prediction part. Fourthly, section 4.4 will discuss the fusion of feature sets. Lastly, section 4.5 will summarize the methodology.

Figure 4.0.1: The structure of the methodology at a glance

The methodology described in this chapter in graphically explained in figure 4.0.1. Section 4.1 is linked to part 1 in blue, section 4.2 is part 2, section 4.3 will describe part 3, with finally part 4 being described by 4.4.

4.1 Pre-processing

The pre-processing (part 1 of figure 4.0.1) is the part where an 80%/20% train/test set is created. We do not use a 95/5 split as in the literature review, as the authors employed a 20-fold cross-validation. The choice for 80/20 is done out of computational constraints. Some restrictions are put on the input data of this method: the target variable (rating) needs to consist of at least 1 rating to allow for the average target rating to be between the min and max values of the average rating.

(28)

4.2. FEATURE EXTRACTION 20

4.2 Feature Extraction

For the analysis of the online content given, we define the feature sets to be used. Four feature sets will be discussed here, namely:

• Image of the item (4.2.1)

• Textual feature sets (title & description) (4.2.2) • Genre information (4.2.3)

4.2.1 Image

To generate a feature set from images, the pre-trained Convolutional Neural Network named AlexNet is used. By using data from ImageNet we are able to ascertain spatial and object data from the image associated with an item. The normalization for images used in AlexNet uses the following normalization parameters:

(µR, µG, µB) = (0.485, 0.456, 0.406) (σR, σG, σB) = (0.229, 0.224, 0.225)

These values are because of the use of a pre-trained model (PyTorch, 2017). The output of AlexNet is a 1000-fold feature vector.

4.2.2 LDA

As explained in the theory chapter, Latent Dirichlet Allocation is used to extract textual features from a text. In this specific case, the the textual features contained within the title and description of an item are used. The parameters of interest in this feature set is k and number of iterations over the data set. For computational convenience we will set the number of iterations to be 20. The end result of this analysis is that there will be two feature sets (title and description) containing both ki features. The variant according

to the paper by Hoﬀman et al. (2010) is used: an online, therefore sequentially run, variant of LDA. 4.2.2.1 Parameters

As discussed in the theory chapter, the optimal number of topics is to be determined. As stated, there are multiple methods for determining the optimal number of topics k. We shall use the coherence measure Cv to

determine the optimal number of k by choosing the lowest coherence measure with k∈ {25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}, with 5 passes over the corpus.

4.2.3 Genres

The assumption of this thesis is that an online content item has informational of categorical nature. This may be interpreted as the genre of the item. This will be added as a feature set with the number of items being pertaining to the number of genres/categories in dummy form.

(29)

4.3. PREDICTION GOAL 21

4.3 Prediction Goal

In defining the goal of an algorithm, the need for defining accuracy arises. The way this is handled in this thesis is by using a binned scale. A binned scale is defined as a scale which is cut up into a pre-defined amount of pieces. We define the scale by defining the width of the scale ε. The subsequent definition of accuracy (a(·)) is:

a(ε) = 1 N N ∑ i=1 b(ε, actuali, expectedi)

with b(·, ·, ·) defined as:

b(ε, x, y) = Iy∈[x−(x mod ε),x−(x mod ε)+ε)∩D D = [min{ratings}, max{ratings}]

Note the boundaries: [·, ·), the boundary value max D is included in the last set. With the assumption that the rating is defined as starting on an integer or multitude of ε. The N is defined as the number of items in the relevant set of items. The goal of all algorithms is to be able to predict the rating of an item as accurately as possible. To compare this work with other work, we will facilitate two accuracy definitions:

1. Spot-on is where ε = 0.5. 2. Next-to is where ε = 1.0.

The most interesting definition is spot-on. This definition allows for a more thorough learning process with regards to Neural Networks and SVM, as it allows for the diﬀerence between e.g. a rating of an item of 3.2 and 3.6, if that is the most prevalent rating. This information is lost when viewing the rating of an item through the lens of the next-to definition.

4.3.1 Continuous

The choice of binning the end variable brings up the loss of information going from a continuous variable to a discrete variable. This is a valid point and the reasoning behind it is that we would like to compare this algorithm with previous work (Glisovic, 2016; Wieser, 2016). Furthermore the discretisation of the variable allows for faster run-times with regards to Neural Networks and SVM (which would have to be change to Support Vector Regression).

4.4 Prediction

This section describes the final two steps: generating the feature sets and the fusion of those feature spaces.

4.4.1 Sampling

Before any classification algorithm is applied to the data set, the data set is run through ADASYN to generate a balanced data set.

(30)

4.4. PREDICTION 22

4.4.2 Support Vector Machine

Before any generation can take place the original unbalanced data set has to be over sampled as described in section 3.5. Before any generation can take place, the training set is supplemented with ADASYN to ensure a balanced data set. The target value is the rating of online content split by use of the spot-on definition. To ascertain the parameters for this intermediate step a grid search is performed on the following variables:

• kernel, one of – linear: ⟨x, x′⟩

– radial basis function: exp(−γ∥x − x′∥2) • γ ∈ {0.01, 0.1, 0.5, 0.9, 0.99}

• C∈ {1, 10, 100, 1000}

Where γ will be only be adjusted upon in the case of the Radial Basis Function (rbf). This grid search will be done on a random sampling (10%) on an ADASYN sampled training set. As for the method of SVM multi-classification used, the method as proposed in Duan and Keerthi (2005) will be used, to reduce the multi-class optimization problem to a multitude of binary classification problems. The SVM classifier will be trained with the spot-on definition as defined in section 4.3. I

The outputs used from SVM are the so called probabilities: in the binary case, the paper by Platt et al. (1999) may be used to ascertain per-class probabilities. It uses a logistic regression on the SVM scores, fit by an additional cross-validation on the data. However, because we are in a multi-class case we use the extension as described by Wu et al. (2004). Furthermore as SVM is computationally intensive, the Bagging ensemble method with overlapping subsets is used.

4.4.3 Neural Network

A neural network with two hidden layers with both 400 nodes is trained to ascertain a 1×q vector of outputs to use for classification, which depends on the ε chosen (8 for spot-on and 4 for next-to). The activation function used is either aRelu for ease of computation or asoftmax if the first does not converge successfully.

The number of epochs used is 20.000. To counteract over fitting a dropout layer is included between the first and second layer. The loss function used is the Cross Entropy Loss function as defined in subsection 3.3.4. The output of the Neural Network is normalized according to the same procedure of image normalization by using the column vector for each feature (y), and applying the following function:

normalize(y) = σ−1_y y− ιµy.

With ι = [1, . . . , 1]′. Using the normalized concatenated input feature sets the same Neural Network as before is trained.

(31)

4.5. SUMMARY 23

4.5 Summary

To grant the reader a more structured view of this chapter we summarize the methodology: Preprocessing Split the data set into 80% training and 20% test set.

Feature Extraction Extract Genre, Image, description and title textual information. Balance with ADASYN.

Feature preparation Prepare the features for fusion by applying Support Vector Machines and Neural Networks, training on spot-on definition.

Fusion Use Support Vector Machine and Neural Network to train a classifier using the spot-on definition. Prediction Predict on the test set and enumerate the spot-on and next-to accuracies per rating bin.

(32)

5 |

Experiment

The results from an experiment using scraped application data is outlined here. Firstly, the data is described in section 5.1. Secondly the feature sets and their generation is outlined in sections 5.2 until section 5.5. Thirdly, the fusion step is described in section 5.6. Finally, the results are summarized in section 5.7

5.1 Origin & Explanation

The data used for this chapter is App Store data. The application store in question is the iOS App Store by Apple. The data is provided by a cooperation between the University of Amsterdam and AppTweak (http://www.apptweak.com). The tools of AppTweak scraped the top rated applications per genre collecting data such as title, description, thumbnail link, gallery videos, images, price, and reviews. A subselection has been made with regards to the data used, only apps that have the categorization of ’Game’ have been included.

5.1.1 Statistics

The general statistics of relevant variables is displayed in table 5.1.1. Values

Test Train Total mean 4.05 3.97 3.99 std. dev. 0.72 0.74 0.74 skew -1.62 -1.30 -1.36 Kurtosis 3.20 1.84 2.07 #(1.0 - 1.5) 86 293 379 #(1.5 - 2.0) 87 335 422 #(2.0 - 2.5) 118 739 857 #(2.5 - 3.0) 268 1,507 1,775 #(3.0 - 3.5) 717 3,646 4,363 #(3.5 - 4.0) 1,250 5,093 6,343 #(4.0 - 4.5) 2,552 9,477 12,029 #(4.5 - 5.0) 1,828 6,530 8,358 all 6,906 27,620 34,526

(33)

5.1. ORIGIN & EXPLANATION 25

Figure 5.1.1: Train (left) and test (right) rating distribution (spot-on)

The table 5.1.1, combined with the information shown in figure 5.1.1, shows that the average rating is skewed towards the higher ratings, this can be explained from the way AppTweak has scraped the data: only the top rated applications per genre are collected. Furthermore, the mean is fairly similar (no more than 10% diﬀerence) and the standard deviation is also in line with the total set, same with the skew. The kurtosis is diﬀerent, which can be seen in figure 5.1.1. From figure 5.1.1, and using the information contained within 5.1.1 we see that the two subsets are evenly distributed.

The results are displayed in a table format, where when the next-to columns are concerned the values of the ratings within [1.0-1.5] are checked to coincide with values of [1.0-2.0] to allow for a comparison with previous work. As an example if a rating is predicted to be 1.0-1.5 and the actual value is 1.7, the prediction (when using the next-to definition) is considered correct. We now focus our attention to the feature sets.

(34)

5.2. GENRES FEATURE SET 26

5.2 Genres Feature Set

This section describes the results for the Genres feature set. To grant the user some insight in the data we describe the statistics and distributions of genres in subsection 5.2.1. We then focus on the results garnered in subsection 5.2.2.

5.2.1 Statistics

The following table shows the number of apps, and average rating of apps per genre, with the added note that an app can pertain to more than one genre.

Genre Name # Apps Average Rating Genre Title # Apps Average Rating

Action 12,825 3.68 Music 2,840 3.71

Adventure 11,582 3.77 Puzzle 14,239 3.80

Arcade 12,046 3.68 Racing 5,250 3.63

Board 5,617 3.56 Role Playing 9,165 3.80

Card 4,653 3.59 Simulation 10,600 3.53 Casino 4,619 3.63 Sports 4,655 3.41 Dice 2,779 3.58 Strategy 8,593 3.66 Educational 8,492 3.53 Trivia 6,057 3.64 Family 13,702 3.67 Word 4,803 3.86 Kids 0

-Table 5.2.1: Genre information

As can be seen from table 5.2.1, the genre kids has no applications associated with it, it will therefore be dropped. We now apply oversampling on all non-majority classes to achieve the new distribution of the ratings:

Figure 5.2.1: Before (left) and after (right) applying ADASYN

On the balanced data set we apply the methods outlined in the methodology chapter: Support Vector Machines and Neural Network.

(35)

5.2.2 Results

This section will describe the results for Support Vector Machine applied to the applied. We begin by applying a grid-search to find γ and C in a random 10% sample of the balanced data set:

C γ Accuracies Test Train 1 0.01 21 17 1 0.1 34 25 1 0.5 34 30 1 0.9 35 33 1 0.99 34 33 10 0.01 18 20 10 0.1 35 29 10 0.5 34 38 10 0.9 35 43 10 0.99 36 44 100 0.01 33 26 100 0.1 34 34 100 0.5 36 47 100 0.9 36 50 100 0.99 36 51 1000 0.01 36 30 1000 0.1 36 41 1000 0.5 36 51 1000 0.9 37 52 1000 0.99 36 52

Table 5.2.2: Grid search results for SVM (10% sample)

Rating spot-on next-to Test Train Test Train 1.0 - 1.5 23.26 31.76 25.58 46.82 1.5 - 2.0 5.75 24.80 17.24 65.44 2.0 - 2.5 3.39 39.55 8.47 58.57 2.5 - 3.0 0.75 12.51 8.21 46.24 3.0 - 3.5 4.60 10.65 22.45 25.04 3.5 - 4.0 16.80 11.49 61.28 44.34 4.0 - 4.5 41.89 40.20 85.50 85.69 4.5 - 5.0 39.93 34.37 75.55 57.84 total 30.02 25.91 66.02 53.87

Table 5.2.3: Accuracies associated with

opti-mal SVM parameters (10% sample)

C Accuracies Test Train 1 21 20 10 21 20 100 21 20 1000 21 20

Table 5.2.4: Results with a linear kernel

From the grid search in table 5.2.2 we see that with C = 1000, γ = 0.99, and therefore a RBF kernel is needed. From this we ascertain an accuracy of 25.91% on the training set. Figures 5.2.2 and 5.2.3 show the predicted and actual ratings for this sample.

Figure 5.2.2: Prediction values(10% sample) Figure 5.2.3: Actual values (10% sample)

(36)

the full data set. We use the values C = 1000, γ = 0.99 on the full dataset. From this we get the following results and distributions:

Figure 5.2.4: Prediction ratings Figure 5.2.5: Actual values of the test ratings

The following tables show the results for both applying the Neural Network methodology and Support Vector Machines with the acquired parameters.

Rating spot-on next-to Test Train Test Train 1.0 - 1.5 10.47 67.87 11.63 76.52 1.5 - 2.0 0.00 65.96 2.30 81.88 2.0 - 2.5 1.69 58.21 5.08 72.24 2.5 - 3.0 1.87 47.93 7.09 61.89 3.0 - 3.5 3.63 28.52 14.50 42.03 3.5 - 4.0 10.24 19.96 82.08 64.31 4.0 - 4.5 69.91 69.71 92.71 93.55 4.5 - 5.0 23.19 32.91 89.00 74.52 all 34.43 48.77 74.72 71.00

Table 5.2.5: Results for full training set (SVM)

Table 5.2.6: Results for full training set (NN)

We attain a result of 34.43% using the spot-on definition, with 74.72% with the next-to definition when using SVM. Of note here is the very low accuracy for ‘1.5’-‘3.5-4.0’ for the Neural Network, it seems that even with ADASYN sampling the Neural Network prefers the higher valued ratings. Both SVM and Neural Network perform badly on the 1-5-2.0 range, indicating a lack of information contained within the data set with regards to diﬀerentiating a 1.5-2.0 rating. Seeing as most genres have a fairly even distribution with respect to their average rating, a low accuracy is not an unexpected result, when used in isolation.

(37)

5.3. IMAGE FEATURE SET 29

5.3 Image Feature Set

This section describes the intermediate results associated with the Images feature set. Before putting the images through AlexNet the images have to be adjusted. The images from the app scraping process ranged from 1024x1024 to 75x75. To consolidate this, and be able to use AlexNet, the input had to be converted RGB values (as some pictures were stored in black and white) and re-sized to 227x227. The images also have to be normalized. The balanced data set is comparable in distribution compared to figure 5.2.1. Figure 5.3.1 shows a random sampling of images, to illustrate the source material.

Figure 5.3.1: Random selection of images

5.3.1 Results

Below are listed the grid search results for parameters 5.3.1 and the accuracies associated with those optimal SVM parameters (table 5.3.2). C γ Accuracies Test Train 1.0 0.01 22 49 1.0 0.1 30 98 1.0 0.5 18 99 1.0 0.9 18 100 1.0 0.99 18 100 10.0 0.01 23 92 10.0 0.1 30 99 10.0 0.5 18 100 10.0 0.9 18 100 10.0 0.99 18 100 100.0 0.01 23 99 100.0 0.1 30 100 100.0 0.5 18 100 100.0 0.9 18 100 100.0 0.99 18 100 1000.0 0.01 24 100 1000.0 0.1 31 100 1000.0 0.5 18 100 1000.0 0.9 18 100 1000.0 0.99 18 100

Table 5.3.1: Grid search results for SVM (10% sample)

Table 5.3.2: Accuracies associated with

opti-mal SVM parameters (10% sample)

C Accuracies Test Train 1.0 11 48 10.0 10 48 100.0 11 47 1000.0 11 47

Table 5.3.3: Results using a linear kernel

(38)

5.3. IMAGE FEATURE SET 30

associated with the optimal C and γ. As the SVM estimator does not only guess 1 rating we continue. Table 5.3.3 and 5.3.1 illustrate that the best choice within this random sample is the Radial Basis Function as a kernel accompanied with C = 10, γ = 0.1. Using these settings we train the entire data set. We see also that C does not matter a lot.

Table 5.3.4: Results for full training set (SVM)

Table 5.3.5: Results for full training set (NN)

We can see a disparity between the two results, where the neural network is good > 25% in categorizing the higher end of the ratings spectrum, the SVM classifier is biased (in accuracy) towards the lower end of the spectrum with less skew. We also see that the SVM classifier does not have a high accuracy on the higher end of the spectrum, whilst the Neural Network performances more evenly across the board.

(39)

5.4. DESCRIPTION FEATURE SET 31

5.4 Description Feature Set

This section describes the results attained for the description feature set. First the parameters are searched for the LDA model in subsection 5.4.1, followed by a discussion of the results in 5.4.2.

5.4.1 Parameters

We first calculate the TF-IDF values for each individual word after stemming, where after observing the TF-IDF values, a cutoff is chosen. From visual inspection of figure 5.5.1, we set the cutoff to be between [2.5, 6.5]. We now list the difference pre-and after cull in figures 5.5.1, 5.5.2, 5.5.3, and 5.5.4.

Figure 5.4.1: TF-IDF graph

Word TF-IDF Word TF-IDF up 0.925657 thi 0.5981 new 0.909615 it 0.587034 have 0.888618 play 0.510886 by 0.879416 on 0.460175 will 0.872721 is 0.329279 as 0.846381 for 0.290002 be 0.843011 with 0.286958 that 0.818909 in 0.257535 more 0.760925 your 0.224049 from 0.759918 game 0.189414 are 0.736809 you 0.186967 all 0.699321 of 0.174215 or 0.686651 to 0.0905064 featur 0.636971 and 0.0740746

Table 5.4.1: Lowest TF-IDF words pre-cull (description)

Word TF-IDF Word TF-IDF

behindthescen 10.2263 kidz 10.2263 brog 10.2263 pogu 10.2263 superpong 10.2263 jonesin 10.2263 spiﬀywar 10.2263 chubukov 10.2263 waﬄeturtl 10.2263 jirbo 10.2263 sorel 10.2263 pixio 10.2263 doublet 10.2263 ijezzbal 10.2263 provision 10.2263 stephenflem 10.2263 easthaven 10.2263 galley 10.2263 fanni 10.2263 glovercom 10.2263 agnew 10.2263 aki 10.2263 askew 10.2263 xerc 10.2263 saratoga 10.2263 cupertino 10.2263 tournement 10.2263 sextupl 10.2263

Forecasting the Popularity of Applications : an analysis of textual and graphical properties