Forecasting the success of Apps based on their visual appearance

MASTER’S THESIS

Forecasting the Success of Apps Based on their Visual Appearance

Author: Sebastian Wieser, MSc

Supervisor: Prof. Dr. M. (Marcel) Worring

Second reader: Dr. N.P.A. (Noud) van Giersbergen

A thesis submitted in fulfillment of the requirements for the degree of Master of Science

at the

Faculty of Economics and Business Section Econometrics & Statistics

ABSTRACT

Despite the vast proliferation of apps over the last years, the existing literature on the analysis of apps is scarce. With this thesis we contribute to this literature. In particular, we study to what extent the success of an app can be predicted based on its visual appearance. For this purpose, we make extensive use of machine learning techniques and adopt ideas from the related field of predicting the popularity of social media content. We hypothesize that an app’s rating is influenced by certain parameters. Therefore, we investigate the influence of color, visual sentiment, 15k concepts and a parameter that indicates whether text is present within an icon. By conducting experiments on data from the Google Play store and the iTunes App store we reveal that the visual appearance has predictive power for the success of an app. We also find that the implemented parameters complement each other. In addition, as we make use of explicit parameters, we are able to reveal the factors within an icon that are associated with success.

DEDICATION AND ACKNOWLEDGEMENTS

I would like to express my deepest gratitude and appreciation to Prof. Dr. Marcel Worring for his kind supervision and his valuable input. During the process of writing this thesis, I recognized the appeal of this research area. Furthermore, I want to express my deepest thanks to Dr. Masoud Mazloom for his selfless help. Most parts of this thesis would not have been possible without his kind support.

TABLE OF CONTENTS

1 Introduction
2 Related Works
    2.1 Analysis of Apps
    2.2 Predicting Popularity
    2.3 Image Classification using Deep-Learning
3 The proposed method
    3.1 Apps
    3.2 Extracting the features
        3.2.1 Color
        3.2.2 Low Level Features
        3.2.3 Visual Sentiment
        3.2.4 15k Concepts
        3.2.5 Presence of Numbers and/or Text
    3.3 SMOTE - Synthetic Minority Over-sampling TEchnique
    3.4 Predicting Success
4 Data
5 Results
    5.1 Color Analysis
        5.1.1 HSV and Intensity Analysis
        5.1.2 The 50 Distinct Color Approach
    5.2 Visual Sentiment Analysis
    5.3 15k Concepts
    5.4 Text or Number within an Icon
    5.5 Prediction
6 Conclusion
Bibliography

CHAPTER 1. INTRODUCTION

Hand in hand with the emergence of smartphones, a vast proliferation of apps has occurred over the last years. Nowadays, the Google Play Store is the leading provider of apps and offers approximately 2.2 million distinct apps. Overall, the leading app stores combined provide approximately 5.7 million apps. Today, apps constitute a well established software repository with billions of users all over the world. Some typical functions of apps are entertainment, organization of daily life or simply pastime. The usage of apps has therefore become part of many people’s daily life.

Especially for companies, apps have become an invaluable tool over the last years. This is because apps facilitate several processes, such as providing information to customers and gathering information about customers. Probably most important, apps help to improve the firm-customer relationship. In addition, apps provide a new platform for advertisements. All these facts highlight the role of apps in today’s society.

App stores such as the Google Play Store or the iTunes App store also provide an interesting source of information. Besides technical details like the version or the size, information provided by the users is available. More precisely, a customer can write a review about the app and rate it. This allows subsequent potential customers to retrieve useful information and to some extent determines the total number of downloads of an app. This makes app stores a unique source of information and an interesting basis for research. Surprisingly, despite the importance of apps and the fact that app stores constitute an excellent source of information, the existing related literature is quite scarce. The work of Harman et al. [9], as one notable example, focuses on rather basic statistics concerning apps.

This thesis contributes to this literature by investigating the effect of the visual appearance of an app (in short, the icon) on its rating. Notably, despite the extensive information available from app stores, we limit our study to the effect of the app’s visual appearance on its success. More precisely, we aim at predicting the rating (which serves as a proxy for success) of an app exclusively based on the visual appearance of the icon. No previous work has been dedicated to the task of determining the factors that make an app successful. A fortiori, the relevant features concerning the icon of an app are as yet undiscovered.

No previous work aimed at predicting success of apps solely based on their icons. Therefore, the applied approach in this thesis is motivated by works in related research areas. In particular, the four works [16], [4], [21] and [20] seek to determine the features that make an image popular and therefore are related to the task of this thesis. More precisely, they investigate content on social media sites. Despite the fact that the aim of these works is quite similar, their approaches differ substantially in many ways.

In this thesis we combine several methodologies proposed in these works. In particular, we conduct a color-analysis inspired by [16]. Hereby, we investigate, besides the standard HSV and intensity measures, also the importance of particular colors. Moreover, the 15k concepts and visual sentiment approach is employed as proposed by Mazloom et al. [20]. Additionally, in order to make use of low level features, a deep neural network similar to the one proposed by [30] was employed. Moreover, we hypothesize that the presence of text or numbers within an icon plays a non-negligible role and therefore also incorporate this feature into our analysis. This is a quite unique approach, since previous works focused on images on social media platforms and those usually do not include text or numbers within the image.

The bulk of the utilized parameters is explicit, which has an additional advantage: these features allow us to reveal which factors are important for the icon of an app. Thus, this approach highlights those features that are related to successful apps, but also those that are related to unsuccessful apps.

Additionally, we investigate the predictive power of the features separately and combined. This approach has the further advantage that it reveals whether the parameters complement each other.

The remainder of this thesis is structured as follows. Chapter 2 gives a summary of the most important works related to this thesis. In chapter 3 we describe the applied methodology in detail, and chapter 4 describes the employed data set. Chapter 5 then presents the results, covering both the statistics concerning the parameters and their predictive performance. Finally, chapter 6 concludes.

CHAPTER 2. RELATED WORKS

The works related to this thesis can roughly be separated into three major research areas. First, the analysis of app stores as described in section 2.1. Second, section 2.2 discusses the prediction of success of online content, in particular images on social media platforms like Flickr. Finally, section 2.3 briefly summarizes the most recent and important contributions in the field of image classification.

2.1 Analysis of Apps

Mining of software repositories (MSR) is a well established field in research (cf. [11] or [33]). However, the mining of app stores has not yet received much attention from the research community, even though app stores provide an excellent source of information. In particular, it is possible to obtain information about the customer as well as technical details of various apps. This is discussed in some detail in the work of Harman et al. [9], which is one of the few contributions in the field of app (app store) analysis.

In their work, the authors scrape information on more than 32,000 apps from the BlackBerry app store. All of the investigated apps have a non-zero price. This approach made it possible to analyze the relationship between the price, the rating and the number of downloads of apps. The authors find the surprising result that there is neither a significant relation between price and downloads, nor between price and rating. However, they find a strong relationship between the rating and the number of downloads of apps.


2.2 Predicting Popularity

As the aim of this thesis is so far unique, we orient ourselves toward methodologies from related research areas. In particular, popularity prediction of content on social media platforms has received a lot of attention from the research community. This is because a tremendous proliferation of online content has occurred over the last few years, which in turn resulted from the steadily increasing popularity of social networks. Every minute, thousands of pictures, videos and other content are uploaded and shared. This development inspired the research community to investigate the semantics of success of online content.

In particular, textual based analysis received a lot of attention from the research community, with contributions such as [1], [14] and [29]. Whereas the former two works focus on detection of factors that drive popularity of tweets on Twitter, the latter work is concerned with the popularity of videos on YouTube.

Remarkable contributions that also incorporate the visual appearance of online content are the works by Khosla et al. [16], Capallo et al. [4], McParlane et al. [21] and Mazloom et al. [20]. All of these works aim at detecting the factors affecting popularity of online content. More precisely, they investigate images from social media platforms like Flickr.

By making use of image cues, such as color, gist, texture, gradient and deep learning features, the authors in [16] predict the popularity of images with high accuracy. The authors use images from Flickr and investigate two particular components of success, namely image content and social context. By combining these two components, the authors are able to predict the number of views with a Spearman’s rank correlation coefficient (cf. [28]) of ρ = 0.81.
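Since Spearman’s rank correlation recurs as the evaluation measure in the works discussed here, the following minimal sketch shows how it can be computed. It assumes that neither input contains ties, so the closed form ρ = 1 − 6 Σ d² / (n (n² − 1)) applies; the function name and the toy numbers are illustrative only and are not taken from any of the cited works.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation for two samples without ties."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # argsort of argsort yields 0-based ranks when all values are distinct.
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    d = rank_x - rank_y
    n = len(x)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Made-up example: predicted popularity scores versus observed view counts.
predicted = [0.2, 0.9, 0.4, 0.7, 0.1]
observed = [120, 5000, 900, 300, 40]
rho = spearman_rho(predicted, observed)
```

Only the ordering of the values matters, so the different scales of the two lists do not affect the result.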

A different approach was used in [21]. In their paper, McParlane et al. focus on popularity prediction of images in a cold start scenario. This means that there is no or limited textual information given. In order to translate the prediction problem into a binary classification problem, the authors divide the data into popular and unpopular content. By making use of image, user and tag related information, the authors achieve an accuracy of 76%.

Yet another approach was proposed by [4]. Whereas the most common approach in the existing literature is to combine textual and image content features, Capallo et al. explicitly focus on the image features. In particular, the authors aim at extracting both the features that make an image popular and those that make an image unpopular. For instance, they found that images containing landscapes are rather prone to become unpopular, whereas images containing faces are likely to receive more attention. Furthermore, the authors find a Spearman’s rank correlation coefficient of ρ = 0.345 for the visual sentiment ontology [3] dataset. In comparison, Khosla et al. [16] find a coefficient of 0.315, considering only their image analysis.

Finally, the work of Mazloom et al. [20] focuses on prediction in the context of brand-related social media content. Here, the authors use explicit engagement parameters in addition to low level features. More precisely, the authors make use of cues related to factual information, sentiment, vividness and entertainment parameters. With their approach, the authors achieve superior prediction accuracy compared to approaches that are based on direct textual and visual features. Additionally, their approach enables the authors to reveal the factors that drive popularity within a brand-related post.

2.3 Image Classification using Deep-Learning

Due to the fact that this thesis makes use of deep nets for feature extraction, in the following we also review the most important contributions in that area.

In the year 2009, the ImageNet data set was introduced, which consists of approximately 50 million cleanly sorted images, cf. [6]. This initiated a tremendous progress in the fields of picture classification and object recognition. More precisely, based on this data set (see [25]), the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was launched. Within this competition, algorithms concerning both image classification and object recognition are evaluated. The two relevant measures are the top-5 and top-1 test error rates. The former is defined as the fraction of classifications, where the correct label is not part of the top 5 labels predicted by the model. It was in 2010 when the first ILSVRC took place, and since then the research area concerned with image classification and object recognition flourished. Within the six years since the competition was initiated, steadily more sophisticated methodologies have been developed and employed. This results in continuously declining classification error rates.
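The top-k error defined above can be illustrated with a short sketch. The score matrix and labels below are made up for illustration, and the function name is an assumption rather than part of any ILSVRC tooling.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order within the top k is irrelevant).
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Toy example: 3 samples, 6 classes, one row of class scores per sample.
scores = np.array([
    [0.10, 0.50, 0.05, 0.20, 0.11, 0.04],
    [0.30, 0.10, 0.25, 0.15, 0.12, 0.08],
    [0.04, 0.06, 0.10, 0.20, 0.25, 0.35],
])
labels = np.array([1, 5, 4])
top1 = top_k_error(scores, labels, k=1)
top5 = top_k_error(scores, labels, k=5)
```

In this toy case only the first sample is classified correctly at rank one, while the third sample’s label appears within the five best guesses, so the top-5 error is lower than the top-1 error.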

Among these methodologies is the pioneering work by Krizhevsky et al. [17]. The authors use a deep convolutional neural network, and their overall methodology contains several noteworthy details. In order to decrease computation time, the authors use rectified linear units (ReLU, as denoted in [23]), i.e. f(x) = max{0, x}, as a neuron’s output and additionally use a highly optimized GPU implementation of 2D convolutions. In order to reduce overfitting, the authors first enlarge the dataset. This is achieved by (label-preserving) image transformations, as proposed by [26]. As a second measure to reduce overfitting, the authors include dropout layers, cf. [13]. The overall architecture of their network consists of five convolutional layers and three fully connected layers. Their methodology won the ILSVRC in 2012 with a top-5 test error rate of 15.3%.

Another contribution to the field of image classification and object recognition that emerged within the context of the ILSVRC was the work by Szegedy et al. [30], which won the competition in 2014. Their methodology consists of a more complex architecture (known as GoogLeNet) compared to [17], resulting in a top-5 error of 6.67%. This percentage highlights the progress that has been accomplished over the years. Within two years the top-5 error rate more than halved (from 15.3% to 6.67%).

CHAPTER 3. THE PROPOSED METHOD

This chapter describes our approach to predicting the success of apps by combining several methodologies. First, we provide a basic description of apps in section 3.1. Then in section 3.2 we describe those parameters that might be important in this context and how to obtain them. In section 3.3, we explain how to deal with unbalanced data. Section 3.4 finally explains how to use the parameters to predict the success of apps.

Figure 3.1: Prediction of the rating of an app in steps. First, select the considered part of the app (the icon, out of name, price, version, reviews and others). Second, extract the features (color, low level, visual sentiment, 15k concepts, text/number). Third, re-balance the data set (SMOTE). Fourth, predict the rating of an app based on the extracted features using multivariate SVM. Note, the actions are presented with a red background.

3.1 Apps

Nowadays, due to the fact that apps are widely spread, the ideas of what defines an app may be a little vague. That is why we aim, for the sake of clarity, to provide a proper definition.

An app, like the Facebook app or WhatsApp, to name famous examples, consists of several components. Notably, some information is provided by the developers and some by the customers. First, every app has a name that allows apps to be distinguished from each other. Second, most apps have an icon, by which most developers try to create a certain appeal towards the app. Furthermore, an app, or to be more specific an app website, includes reviews provided by customers. In addition to these reviews, customers also rate the apps. Those ratings range from one to five stars, where five stars indicate that customers are highly satisfied with the service provided by the app and one star indicates that they are not at all satisfied. Moreover, the developers provide a description of the app, the version, the size in bytes and the price of the app.

Summarizing, we define an app in the following way:

Definition 1. An app is the tuple {name, technical content, icon, price, size, version, description, ratings, reviews}, where the first seven components are provided by the developer and the last two are provided by customers.

Since this thesis aims at detecting the role of the visual appearance of apps, the components of interest are the icon of the app and the corresponding rating. Figure 3.2 shows 16 example icons. The apps are ordered such that the first (top left) has a rating of five stars and the last (bottom right) a rating of one star.

Figure 3.2: Examples for icons of apps, ranging from 5-stars to 1-star

3.2 Extracting the features

The main aim is to develop a model that predicts the success of apps based on their icons (see figure 3.1). Therefore, we seek to incorporate all the parameters that might play a role in this context. As indicated earlier, the analysis of apps is a relatively unexplored research area. Therefore, the approaches employed in this thesis are to a large part motivated by works that seek to identify the factors that drive the popularity of online content.

In particular, we are inspired by the works of [16], [21], [4] and [20]. All of these works aim at predicting the popularity of online content on social media platforms. Interestingly, they propose approaches that are quite different from each other. These works are described in some detail in section 2.2.

3.2.1 Color

Motivated by the work of Khosla et al. [16], the first investigated parameter is color. More precisely, the first investigated color measures are hue, saturation and value. Approximations for these measures are obtained as follows.

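First, we normalize the RGB color space such that R', G', B' ∈ [0, 1], i.e.

\[
R' = \frac{R}{255}, \qquad G' = \frac{G}{255}, \qquad B' = \frac{B}{255}.
\]

Then the hue (H) is computed by

\[
H :=
\begin{cases}
0 & \text{if } \mathit{MAX} = \mathit{MIN} \\[4pt]
60^\circ \cdot \dfrac{G' - B'}{\mathit{MAX} - \mathit{MIN}} & \text{if } \mathit{MAX} = R' \\[8pt]
60^\circ \cdot \left( 2 + \dfrac{B' - R'}{\mathit{MAX} - \mathit{MIN}} \right) & \text{if } \mathit{MAX} = G' \\[8pt]
60^\circ \cdot \left( 4 + \dfrac{R' - G'}{\mathit{MAX} - \mathit{MIN}} \right) & \text{if } \mathit{MAX} = B'
\end{cases}
\tag{3.1}
\]

where MAX = max{R', G', B'} and MIN = min{R', G', B'}. The saturation (S) is computed by

\[
S :=
\begin{cases}
0 & \text{if } \mathit{MAX} = 0 \\[4pt]
\dfrac{\mathit{MAX} - \mathit{MIN}}{\mathit{MAX}} & \text{else}
\end{cases}
\tag{3.2}
\]

Finally, the value (V) is simply computed by

\[
V := \mathit{MAX}
\tag{3.3}
\]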
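Equations 3.1 to 3.3 can be sketched for a single pixel as follows. This is a minimal illustration rather than the implementation used in the thesis. The function name is an assumption, and so is the final wrap of negative hues into [0°, 360°), which equation 3.1 does not state explicitly.

```python
def rgb_to_hsv(r, g, b):
    """Convert 8-bit RGB values to (hue in degrees, saturation, value)."""
    # Normalize to [0, 1] as in the text.
    r_n, g_n, b_n = r / 255.0, g / 255.0, b / 255.0
    mx, mn = max(r_n, g_n, b_n), min(r_n, g_n, b_n)
    delta = mx - mn
    # Hue, equation 3.1.
    if delta == 0:
        hue = 0.0
    elif mx == r_n:
        hue = 60.0 * ((g_n - b_n) / delta)
    elif mx == g_n:
        hue = 60.0 * (2.0 + (b_n - r_n) / delta)
    else:
        hue = 60.0 * (4.0 + (r_n - g_n) / delta)
    # Assumption beyond equation 3.1: map negative hues into [0, 360).
    if hue < 0:
        hue += 360.0
    # Saturation, equation 3.2, and value, equation 3.3.
    sat = 0.0 if mx == 0 else delta / mx
    return hue, sat, mx
```

For example, pure green (0, 255, 0) falls into the third case of equation 3.1 and yields a hue of 120°.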

In addition to these parameters, we also investigate the role of the color intensity of the icons. To obtain the intensity measure, we first transform the images into gray-scale by taking the mean over the R, G, B values of each pixel. This allows us to represent each image as a (length × width)-dimensional intensity vector. From these vectors we compute the intensity mean, intensity variance and intensity skewness.
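The intensity measures described above can be sketched as follows. The sketch assumes the image is given as a height × width × 3 array and uses the population variance and the standardized third moment as skewness; the thesis does not specify these conventions, so they are assumptions, as is the function name.

```python
import numpy as np

def intensity_stats(image):
    """Return mean, variance and skewness of the gray-scale intensities of an RGB image."""
    image = np.asarray(image, dtype=float)
    gray = image.mean(axis=2)        # average of R, G, B per pixel
    values = gray.ravel()            # one intensity value per pixel
    mean = values.mean()
    var = values.var()               # population variance (assumption)
    std = np.sqrt(var)
    # Standardized third central moment; defined as 0 for a constant image.
    skew = 0.0 if std == 0 else float(np.mean(((values - mean) / std) ** 3))
    return float(mean), float(var), skew

# Toy 2 x 2 image: three black pixels and one pixel with mean intensity 4.
toy_image = np.zeros((2, 2, 3))
toy_image[1, 1] = [3, 4, 5]
toy_mean, toy_var, toy_skew = intensity_stats(toy_image)
```

A mostly dark icon with a few bright pixels, as in the toy image, produces a positive skewness.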

In addition to the HSV and intensity measures, we seek to identify the importance of different colors. As a first step, we investigate which colors are related to highly rated apps and which are related to low-rated apps. As a second step, we analyze whether the variegation, i.e. the number of different colors in an icon, has an effect on ratings.

Therefore, we divide the entire RGB color space, which consists of 256 × 256 × 256 ≈ 1.68 × 10^7 [...] al. [15], i.e. to separate the color space meaningfully, a universal color discriminator with 50 colors was employed. Those 50 universal colors were learned by applying their proposed algorithm to several data sets, with the aim of covering a broad scope of images. In particular, the authors trained their model on the Flower102, Bird200 and PASCAL 2007 data sets. The respective file is publicly available¹.

As a next step, we analyze the effect of those 50 colors on the average rating of apps. Therefore, every sample icon is transformed in such a way, that every pixel of the respective image is assigned to one of the 50 color clusters.

This procedure allows us to represent every image in the data set as a 50-dimensional feature vector, which is necessary for the subsequent steps that aim at learning the importance of colors. More precisely, the entries in the vector indicate the number of pixels assigned to each particular color. For example, the vector [90,000, 0, 0, ..., 0] would represent an image of 90,000 pixels that is entirely black.

Next, we perform an L2-normalization. This transformation is necessary to conduct a support vector regression (SVR, cf. [27]), by which the importance of the 50 distinct colors is learned. Finally, in the context of color analysis, we investigate the role of variegation. To this end, we simply count the number of different colors present in an icon. Here we again refer to the 50-color representation of the icon instead of the original image. This is because an entirely blue image could consist of hundreds of different color values in the original RGB color space and would therefore be wrongly classified as relatively colorful. By making use of the 50 universal color representations this problem does not occur.
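The histogram, normalization and variegation steps can be sketched as follows. Note the assumptions: the sketch assigns each pixel to the nearest palette color in RGB space, which is only a stand-in for the published lookup of [15], and the three-color palette in the example is hypothetical (the thesis uses 50 colors).

```python
import numpy as np

def color_features(image, palette):
    """Return the L2-normalized color histogram and the number of distinct palette colors."""
    pixels = np.asarray(image, dtype=float).reshape(-1, 3)
    palette = np.asarray(palette, dtype=float)
    # Squared Euclidean distance from every pixel to every palette color.
    dists = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)
    # Number of pixels assigned to each palette color.
    counts = np.bincount(assignment, minlength=len(palette)).astype(float)
    # Variegation: how many palette colors occur at least once.
    variegation = int(np.count_nonzero(counts))
    norm = np.linalg.norm(counts)
    histogram = counts / norm if norm > 0 else counts
    return histogram, variegation

# Hypothetical 3-color palette (black, red, blue) and a 2 x 2 toy icon.
toy_palette = [[0, 0, 0], [255, 0, 0], [0, 0, 255]]
toy_icon = np.array([[[0, 0, 0], [0, 0, 0]],
                     [[250, 10, 5], [0, 0, 255]]])
toy_hist, toy_variegation = color_features(toy_icon, toy_palette)
```

Counting colors on the quantized representation means that the slightly off-red pixel (250, 10, 5) is treated as red instead of as an additional color.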

3.2.2 Low Level Features

In addition to color features, we make use of deep convolutional neural networks (cf. [18], [2]) in order to obtain low level visual features. Convolutional networks, or more precisely convolutional layers, are widely applied in the analysis of images using deep learning. This results from the fact that these layers take into account the spatial structure of the data. In particular, convolutional layers incorporate the fact that pixels that lie close to each other are more strongly correlated than pixels that are far apart. This is achieved by the following three mechanisms, which are illustrated in figure 3.3. First, a convolutional layer is organized in feature maps, which are planes of units, where each unit takes a small region of the input image as its input. Second, the units within a feature map share the same weight values, which reduces the number of free parameters. Third, a convolutional layer is followed by a subsampling layer. Again the units of this layer are organized in planes, where each unit takes as its inputs small fields from the corresponding feature maps. These units then conduct the subsampling.
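The three mechanisms (local receptive fields, shared weights, subsampling) can be illustrated with a minimal sketch for a single feature map. The function names are assumptions, and max pooling is used as one common choice of subsampling; the text does not prescribe a specific subsampling operation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over a 2D image (cross-correlation, no padding)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each unit sees only a small local region and uses the same weights.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Subsample by taking the maximum over non-overlapping size x size blocks."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size   # drop rows/columns that do not fill a block
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
feature_map = conv2d_valid(image, kernel)   # shape (4, 4)
pooled = max_pool(feature_map)              # shape (2, 2)
```

The kernel has only four weights regardless of the image size, which is the practical effect of weight sharing.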

Figure 3.3: A simple illustration of convolutional neural networks. The figure was taken from [2]. Note, for many applications, several pairs and/or variations can be used.

For this thesis we trained a network that makes extensive use of convolutional layers. In particular, our network is almost identical to the architecture proposed by Szegedy et al. [30]. Their architecture consists of 22 layers (27 layers, if the pooling layers are also counted) as illustrated in figure 3.5.

¹ http://www.cat.uab.cat/~joost/ColorDescriptors.html

A remarkable property of this architecture is the incorporation of nine inception modules. An example for the inception architecture is illustrated in figure 3.4. The fundamental idea is to consider how readily available dense components can approximate an optimal local sparse structure of a convolutional network. By making use of these techniques, Szegedy et al. set a milestone in image classification and object recognition.

Due to this remarkable performance, we incorporate this neural network in this thesis. In particular, to obtain a feature vector, the output of the final pooling layer (CNN-Pool5) of the network is utilized. This results in a 1024-dimensional low level feature vector.

Figure 3.4: Inception module with dimensionality reduction. Figure taken from [30].

3.2.3 Visual Sentiment

To reveal the sentiment within an icon, we use the visual sentiment ontology as proposed by Borth et al. [3]. In their work, the authors make use of Plutchik’s Wheel of Emotions [24] to derive search keywords. Using these, the authors gathered data from YouTube and Flickr. For instance, images with the tags ‘joy’ or ‘beautiful’ were retrieved. As a next step, these tags were investigated in order to assign sentiment values and to specify nouns, verbs and adjectives. From this set, the authors then created Adjective Noun Pairs (ANP). Examples are ‘colorful butterfly’, ‘misty night’ or ‘crying baby’.

Figure 3.5: GoogLeNet architecture, figure taken from [30].

Then they train individual detectors using images with the ANPs as tags on Flickr. From this procedure, the SentiBank library was created, whereas only those ANPs were included for which the detection showed a sufficiently good performance. All in all this resulted in a library including 1,200 ANPs. In their paper, Borth et al. demonstrate the capability of this library to capture the sentiment within an image.

At this point it is important to highlight that this library results from social media content analysis and was not designed specifically to detect the sentiment within an icon. Nevertheless, we hypothesize that the trained classifiers (using SentiBank) also achieve reasonable performance in predicting the sentiment within an icon.

Besides our hypothesis that the revealed sentiment within an icon improves the prediction accuracy of success, the approach has another advantage. In particular, it allows us to represent each icon as a 1,200-dimensional vector, where every entry represents the probability of the respective adjective noun pair being present in the icon. This in turn allows us to investigate which sentiments are related to a high rating and which to a low rating. To do so, we train a support vector regression.

3.2.4 15k Concepts

Within the ImageNet [6] data set, described in section 2.3, there are 15,293 concept categories for which at least 200 images are available. To incorporate these concepts, we extract the convolutional neural network features such that each icon can be represented as the 15,293-dimensional output of the softmax layer of the network. In other words, this 15k concept methodology enables us to represent each icon as a vector in which every entry indicates the probability that the respective concept is present within the icon. Examples for these 15k concepts are ‘goldfish’, ‘water snake’ or ‘electric guitar’.
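How a softmax layer turns raw network scores into such a probability vector can be sketched as follows. The three concept names are taken from the examples above, while the raw scores and helper names are hypothetical; a real network would supply 15,293 scores per icon.

```python
import numpy as np

def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to one."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()   # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def top_concepts(probabilities, names, k=2):
    """Return the k most probable concepts with their probabilities."""
    order = np.argsort(-probabilities)[:k]
    return [(names[i], float(probabilities[i])) for i in order]

concept_names = ["goldfish", "water snake", "electric guitar"]
raw_scores = [2.0, 0.5, -1.0]          # hypothetical network outputs for one icon
concept_probs = softmax(raw_scores)
best = top_concepts(concept_probs, concept_names)
```

Because each entry corresponds to a named concept, the resulting vector stays interpretable when it is later related to ratings.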

This methodology also has the advantage that it allows us to detect those concepts that are related to high ratings as well as those that are associated with low ratings.

3.2.5 Presence of Numbers and/or Text

The above described parameters have been employed in various previous works and have been proven to work well. In particular, the work of Khosla et al. [16] shows the importance of incorporating color parameters. The work of Mazloom et al. [20] illustrates the benefits of employing low level features, 15k concepts and visual sentiment.

[...] already indicates this difference. In particular, when creating icons, it is apparently quite common to make use of either text or numbers within the image. The presence of text or numbers within an image, which determines our final parameter, has not been investigated in earlier works. This results from the fact that ordinary images, uploaded on Flickr for instance, rarely if ever include text within the images. Hence, for the purpose of this thesis it is interesting to investigate a binary indicator that states whether text or a number is visible within the icon.

3.3 SMOTE - Synthetic Minority Over-sampling TEchnique

In various prediction tasks, the data at hand is unbalanced. This problem can either occur due to deficient data gathering or due to the fact that the data distribution is indeed skewed, which means that there exist more observations for certain categories. Whereas the latter case is less problematic, the former results in biased predictions. This occurs as the trained model falsely incorporates a skewed category distribution. This issue gives rise to the need of approaches that deal with unbalanced data.

Over the last years the research community has devoted a lot of attention to the problem of unbalanced data, and various approaches have been developed for this task. In general there are two classes of approaches. The first idea is to under-sample the majority class. Popular examples are ‘Extraction of majority-minority Tomek links’ [31], ‘NearMiss-(1 & 2 & 3)’ [19] and ‘Condensed Nearest Neighbour’ [10]. Table 4.1 shows that there are only 102 examples of 1-star apps. Under-sampling would therefore discard most of the data and is not ideal for re-balancing our data set.

The second approach to re-balance a data set is to over-sample the minority class. Popular examples for this idea are ‘SMOTE - Synthetic Minority Over-sampling Technique’ [5], ‘bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2’ [8] and ‘ADASYN - Adaptive synthetic sampling approach for imbalanced learning’ [12]. Whereas each of these approaches has its advantages, the SMOTE procedure outperforms the other approaches in terms of computation speed. Due to the fact that we deal with relatively large feature vectors in this thesis, this method is applied. The corresponding algorithm works as illustrated in Algorithm 1. Noteworthy, Algorithm 1 is a simplified version of the algorithm illustrated in [5]. In particular, it is reduced to what is relevant for this thesis.

The idea of the algorithm can be summarized as follows. For every sample i in the minority class, determine the k nearest neighbors. Then, among those, neighbors are chosen at random; how many are chosen depends on the amount of synthetic data needed to balance the data set. Finally, the synthetic data is generated as random points on the line segments between the sample i and the randomly chosen neighbors.

The following simple example illustrates how the algorithm works:

Consider a sample (2,5) and its randomly chosen neighbor (4,8), which is in the set of the k nearest neighbors. The difference between the two is (4,8) - (2,5) = (2,3). A synthetic data point $f' = (f'_1, f'_2)$ is then generated by $f' = (2, 5) + \lambda \cdot (2, 3)$, where $\lambda$ is a randomly chosen value between 0 and 1.

Algorithm 1 SMOTE(T, N, k)

Input: Number of minority class samples T; amount of synthetic data N as % of T; number of nearest neighbors k

Output: (N/100) * T synthetic minority class samples

procedure SMOTE(T, N, k)
    N = (int)(N/100)
    k = number of nearest neighbors
    numattrs = number of attributes
    Sample[][]: array for original minority class samples
    newindex: counts the number of synthetic samples generated, initialized to 0
    Synthetic[][]: array for synthetic samples
    (* Compute k nearest neighbors for each minority class sample only *)
    for i ← 1, T do
        Compute k nearest neighbors for i and save the indices in nnarray
        Populate(N, i, nnarray)
    end for
    return
end procedure

procedure POPULATE(N, i, nnarray)
    while N ≠ 0 do
        Choose a random number nn ∈ {1, ..., k}. This selects one of the k nearest neighbors of i
        for attr ← 1, numattrs do
            Compute: dif = Sample[nnarray[nn]][attr] − Sample[i][attr]
            Compute: gap = random number between 0 and 1
            Synthetic[newindex][attr] = Sample[i][attr] + gap ∗ dif
        end for
        newindex++
        N = N − 1
    end while
    return
end procedure
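The over-sampling idea and the worked example can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used for the experiments: the function names `k_nearest` and `smote` are hypothetical, the distance is assumed to be Euclidean, and a single random value is drawn per synthetic point so that the point lies on the line segment described in the text (Algorithm 1 as printed draws the gap inside the attribute loop).

```python
import math
import random


def k_nearest(samples, i, k):
    """Indices of the k samples closest to samples[i] (Euclidean), excluding i."""
    others = [j for j in range(len(samples)) if j != i]
    others.sort(key=lambda j: math.dist(samples[i], samples[j]))
    return others[:k]


def smote(samples, n_percent, k, rng=None):
    """Create n_percent // 100 synthetic points for every minority sample."""
    rng = rng or random.Random()
    per_sample = n_percent // 100
    synthetic = []
    for i, sample in enumerate(samples):
        neighbours = k_nearest(samples, i, k)
        for _ in range(per_sample):
            neighbour = samples[rng.choice(neighbours)]
            gap = rng.random()  # assumption: one gap per point, as in the worked example
            synthetic.append(
                tuple(a + gap * (b - a) for a, b in zip(sample, neighbour))
            )
    return synthetic


# The example from the text: sample (2, 5) with neighbour (4, 8).
points = smote([(2.0, 5.0), (4.0, 8.0)], n_percent=200, k=1, rng=random.Random(0))
```

Every generated point has the form (2, 5) + λ (2, 3) or (4, 8) − λ (2, 3) with λ between 0 and 1, so it lies on the segment between the two original samples.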

3.4 Predicting Success

In the previous section we explained which icon-related parameters might be relevant for the success of apps. Now, the task is to use these parameters for prediction. In particular, the task is to investigate the predictive power of the combined parameters, as well as to figure out the contribution of each parameter separately.

In earlier works, a common approach was to treat the popularity prediction of posts as a ranking problem (cf. [7], [16] and [20]). This allows the prediction performance to be reported in terms of Spearman’s rank correlation. However, because the measure of success in this thesis is the average rating, this approach is not suitable. Instead, we report the prediction performance as the percentage of correctly classified apps, where the ratings 1-5 define the classes.

To predict the success of apps we use multivariate support vector machines. To this end, a model was trained on a training set consisting of 70% of the data. The entire approach that leads to the predictions is summarized and illustrated in figure 3.1.

CHAPTER 4

DATA

Similar to the work of Harman et al. [9], the investigated data originates from app stores. In contrast to their work, however, the data used in this thesis is not limited to one specific app store. The company IQU provided an API scraper that made it possible to retrieve data from the Google Play Store as well as the iTunes App Store. As a result, this thesis makes use of a database consisting of 19,988 apps in total.

More precisely, the provided scraper retrieves technical details such as developer ID, app version, app URL and icon URL. Additionally, one can retrieve customer-specific information such as reviews and average ratings. This setup provides an ideal source of information for the aim of this thesis.

This work, as indicated in the title and pointed out in section 3.4, is exclusively concerned with the visual appearance of the investigated apps. Therefore, the analysis in this thesis focuses solely on the effect of the apps’ icons on the apps’ success. This highlights the contribution of this thesis, as no previous work has been dedicated to that research question.

A remaining issue is how to measure success. One possible approach would be to use popularity as a measure of success. In particular, the bulk of related works (cf. [21], [4] or [16]) use popularity as their dependent variable. This approach, however, is only partially feasible for this thesis, because the number of views for each app is not available and the number of downloads is only given as a range, e.g. $\#\text{downloads} \in [10^7, 5 \cdot 10^7]$. Hence, the employed proxy for success in this thesis is the average rating of an app. Because [9] found a strong significant relationship between the average ratings and the number of downloads, one measure serves as a proxy for the other.

Independent of the app store from which the information was retrieved, all icons are images with a size of 300 × 300 pixels. This is quite convenient, as it allows working with the data without having to re-scale the images beforehand.

Table 4.1: Summary statistics of the ratings of the apps

          mean   std.dev   skew    #(∼5)   #(∼4)    #(∼3)   #(∼2)   #(∼1)   #(0)    #apps
ratings   3.82   1.21      -2.22   4,011   12,688   1,536   223     102     1,428   19,988

In various previous works (cf. [17] or [30]), a substantial part of the analysis was dedicated to image pre-processing. However, because the aim of this thesis is to predict success based on the icon (i.e. the appearance of the app), image transformations do not add any value. The ratings of the investigated apps range between one and five. Summary statistics of the ratings are given in table 4.1.

This table shows that the employed data is quite left-skewed and the majority of the apps received a rating of approximately 4 stars, which is indicated by #(∼4). In total, there are 1,428 apps that did not receive a rating (indicated by #(0)) by the time the data was scraped. Due to the lack of information concerning why these apps did not receive a rating we exclude them in the remainder of this thesis.

CHAPTER 5

RESULTS

In this chapter we describe the results of applying the methods proposed in chapter 3 to the data set described in chapter 4. First, the role of color is discussed in section 5.1. Then, in section 5.2 we describe the results concerning the visual sentiment analysis and in section 5.3 the results of the 15k concept analysis. In section 5.4 we present the results of our proposed parameter, which indicates whether text/numbers are included in the icon. Finally, section 5.5 discusses the prediction performance of the parameters.

5.1 Color Analysis

5.1.1 HSV and Intensity Analysis

In this section, the methods described in section 3.2.1 are applied to the data set. In a first step we analyze whether the color parameters have a specific effect on the ratings. To this end, the ratings have to be interpreted as an ordinal measure, which is legitimate as we excluded the zero-rated apps.

Based on this subset, the Hue, the Saturation, the Value (HSV) and the color intensity measures have been obtained for each image. In order to understand the relationship between these color features and the ratings, we conduct a simple OLS regression. The output is presented in table 5.1. Notably, the (adjusted) R-squared equals 0.003. This means that the 6 employed regressors explain almost none of the variation in the ratings. To illustrate the effects of the color features on the rating, figure 5.1 plots the HSV components against the ratings and figure 5.2 the color intensity measures against the ratings.

Table 5.1 shows that there is no significant effect of color hue on the rating. Furthermore, neither the saturation nor the value affects the rating significantly at the 5% level. Therefore, these features will be neglected for prediction based on the icon of an app.

Table 5.1: OLS Regression Results. Dependent variable is ratings and the regressors are the color features.

             coef     std err   t         P>|t|   [95.0% Conf. Int.]
Intercept    3.957    0.035     114.267   0.000    3.890   4.025
Hue          0.000    0.000       1.032   0.302    0.000   0.000
Satur        0.087    0.052       1.685   0.092   -0.014   0.189
Value        0.239    0.145       1.654   0.098   -0.044   0.524
Intensmean  -0.000    0.001      -0.365   0.715   -0.002   0.001
Intensvar    0.000    0.000       1.956   0.051    0.000   0.000
Intensskew   0.026    0.008       3.148   0.002    0.010   0.043

Dep. Variable: Ratings     R-squared: 0.003
Model: OLS                 Adj. R-squared: 0.003
Method: Least Squares      F-statistic: 9.486

[Figure 5.1 panels: rating plotted against mean Hue (coef = 0.000), mean saturation (coef = 0.087) and mean value (coef = 0.239)]

Figure 5.1: HSV analysis and correlation of the components (Hue, Saturation, Value) with ratings

Moreover, table 5.1 shows that the color intensity mean also has no significant effect on the rating. Surprisingly, however, the intensity skewness shows a significant effect (p = 0.002), and the intensity variance is borderline significant (p = 0.051). These results are illustrated in figure 5.2. Because the coefficients of these features are quite close to zero, the intensity features will also be neglected for prediction.

5.1.2 The 50 Distinct Color Approach

To investigate the role of different colors, we applied the approach described in section 3.2.1. To this end, every icon was transformed into a 50 color representation as illustrated in figure 5.3. Based on the 50-color feature vectors we then conducted a support vector regression. The result is illustrated in figure 5.4.

A quite interesting and equally surprising pattern is recognizable. It seems as if the reddish colors have a negative effect on ratings, whereas the blueish and greenish colors have a positive effect. This means that colors with a high hue (blue and green) appear to be more appealing to potential customers than colors with a low hue (red).

[Figure 5.2 panels: rating plotted against mean intensity (coef = -0.000), intensity variance (coef = 0.000) and intensity skewness (coef = 0.026)]

Figure 5.2: Intensity analysis and correlation of the components (mean, variance and skewness) with ratings

Figure 5.3: The left image shows the original icon and the right the respective 50 color representation.

Notably, Khosla et al. [16] investigated the factors that make images popular and found the opposite relationship between colors and popularity compared to the analysis in this thesis. In particular, they found that reddish images tend to become popular, whereas blueish and greenish images are prone to reside in oblivion.

From a psychological point of view, Valdez & Mehrabian [32] provide some valuable insights. In their study, they investigate the effects of color on emotion. More precisely, they use Mehrabian’s [22] pleasure-displeasure, arousal-nonarousal and dominance-submissiveness scales to evaluate the emotional reactions to color.

In experiments with 250 undergraduates, the authors find that green-yellow, blue-green and green were the most arousing, whereas purple-blue and yellow-red turned out to be the least arousing.

[Figure 5.4 axes: colors 0 to 49 (horizontal), color importance (vertical)]

Figure 5.4: Importance of the 50 distinct colors, ordered from low to high.

Compared to figure 5.4, a certain resemblance is recognizable. However, investigating the psychological background of the relationship between colors and ratings lies beyond the scope of this thesis and is left for further research.

As a final point related to color analysis we investigated the role of variegation. To determine the effect of variegation on the rating of an app we conducted a simple OLS regression. The resulting coefficient equals $5 \cdot 10^{-4}$ and is not significant.

5.2 Visual Sentiment Analysis

In this section we present the results of the visual sentiment approach as described in section 3.2.3. This means that every icon in the data set is represented as a 1,200-dimensional feature vector. For instance, the adjective noun pairs with the highest probabilities for the icon illustrated in figure 5.5(a) are ‘classic cars’, ‘powerful cars’, ‘hot car’, ‘famous car’ and ‘super cars’. These adjective noun pairs appear to be quite accurate. On the other hand, the icon illustrated in figure 5.5(b) fails to be correctly classified. For this image the ANPs with the highest probabilities are ‘favorite team’, ‘elegant rose’, ‘safe driver’, ‘super food’ and ‘hot cup’.

It is important to highlight at this stage that the SentiBank library originates from the task of image classification. In particular, the library was not designed to identify the sentiment within an icon, which explains why the assigned ANPs are not accurate for some icons.

After classifying each icon within the data set, we conduct a support vector regression. This is done with the aim to reveal which sentiments are related to high ratings and which are related to low ratings.

The results are presented in table 5.2. The left column lists those adjective noun pairs that are associated with high ratings, whereas the terms in the right column are related to low ratings. In particular, ‘hot cup’ is the ‘best’ adjective noun pair in this context. On the other hand, ‘lovely church’ is the ‘worst’ ANP with respect to the apps’ ratings.

Figure 5.5: Two illustrations where the ANP classification works accurately (a) and not accurately (b)

Notably, the table contains some ANPs for which one could intuitively argue that their association with high or low ratings is logical. For instance, one would assume that ‘cute girls’ or ‘classic cars’ are associated with high ratings. Also, it is quite intuitive that ‘dirty streets’ or ‘cloudy sunrise’ are rather negative terms. In general, however, no particular pattern is recognizable.

Table 5.2: Results of the SVR based on visual sentiment

10 best adjective noun pairs    10 worst adjective noun pairs
hot cup                         lovely church
favorite team                   classic architecture
inspirational poster            nice autumn
classic cars                    graceful bird
colorful flowers                dirty streets
safe driver                     cloudy sunrise
outdoor market                  hot legs
classic cocktail                wild party
cute girls                      dry landscape


5.3 15k Concepts

Prior to the analysis of the 15k concepts as described in section 3.2.4, adjustments of the feature vectors were required. Because every icon in the data set is represented by a 15k concept feature vector, a severe memory problem occurred. To tackle this issue, we applied a heuristic.

In particular, the goal was to reduce the dimensionality of the feature vectors. A straightforward way to achieve this is to keep, among the 15k concepts, only those that are relevant. This idea, however, requires a definition of what is ‘relevant’. Our approach was the following: for every feature vector $f$ in the data set $D$ we only kept those concepts that were among the 10 most probable, i.e. the concepts with the ten highest probabilities. For convenience we denote those vectors, with the 10 most probable categories as entries, by $\tilde{f}$. This step already reduced the dimensionality from 15k to 8k. In other words, every category among the 8k concepts was present as a top-10 concept for at least one feature vector $f$. From the set $\tilde{D}$ consisting of the vectors $\tilde{f}$ we then selected those categories that were present at least 18 times. This reduced the dimensionality sufficiently: the 15k concepts were reduced to 1,691 concepts.
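The two-step heuristic can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, feature vectors are plain lists of probabilities, and the reduced vectors are assumed to keep the original probabilities of the retained concepts (the text does not specify how the remaining entries were treated).

```python
from collections import Counter


def top_k_concepts(probabilities, k=10):
    """Indices of the k most probable concepts of one feature vector."""
    order = sorted(
        range(len(probabilities)), key=lambda c: probabilities[c], reverse=True
    )
    return set(order[:k])


def select_concepts(feature_vectors, k=10, min_count=18):
    """Concepts that are among the top k for at least min_count vectors."""
    counts = Counter()
    for vector in feature_vectors:
        counts.update(top_k_concepts(vector, k))
    return sorted(c for c, n in counts.items() if n >= min_count)


def reduce_vectors(feature_vectors, kept):
    """Project every vector onto the kept concept indices (assumed behavior)."""
    return [[vector[c] for c in kept] for vector in feature_vectors]


# Toy data with 3 concepts; k and min_count are scaled down for readability.
toy = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.2, 0.7]]
kept = select_concepts(toy, k=1, min_count=2)
reduced = reduce_vectors(toy, kept)
```

In the toy data only concept 0 is the most probable concept for at least two vectors, so it is the only dimension that survives. With k = 10 and min_count = 18 the same two steps correspond to the reduction from 15k to 8k and then to 1,691 concepts described above.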

Once the adjustments were performed, we first investigated the accuracy of the 15k concept classification. Figure 5.6 illustrates two examples. The five most probable concepts for figure 5.6(a) are ‘roadster’, ‘sports car’, ‘racer’, ‘coupe’ and ‘pace car’. Here, the model correctly recognizes the content of the icon. However, the five most probable concepts for figure 5.6(b) are ‘saw’, ‘star’, ‘measure’, ‘handsaw’ and ‘tape’, which is not accurate at all. This highlights what was already pointed out in section 3.2.4, namely that the 15k concepts classifier was trained on the ImageNet data set and not on icons. It is therefore understandable that the classification fails for some examples. Nevertheless, for the bulk of icons in our data set the classification is quite accurate.

Figure 5.6: Two illustrations where the 15k concept classification works accurately (a) and not accurately (b)

Next, we trained a support vector regression in order to learn the importance of the different concepts. An important caveat in this analysis is that we only consider a subset of the original 15k concepts. The results are presented in table 5.3.

Although no clear pattern is recognizable in general, some results are worth highlighting. First, the concept ‘roulette ball’ turns out to be the ‘best’. Since gambling apps in particular make use of the concept ‘roulette ball’, this result might indicate that gambling apps are in general highly rated. Moreover, there are several geometrical concepts present in both the best and the worst categories. Particularly surprising is the fact that the term ‘polyhedron’ appears in both the list of the best and the list of the worst concepts.

Table 5.3: Results of the SVR based on 15k concepts

10 best concepts             10 worst concepts
roulette ball                computer screen
convex polyhedron            menorah
tetrahedron                  domestic cat
hypotenuse                   regular polyhedron
cog                          pulsar
aircraft                     circuitry
horizontal stabilizer        display
air bubble                   adding machine
armored personnel carrier    airplane
speed skate                  clavier

5.4 Text or Number within an Icon

In the previous sections, we made use of models that automatically assign probabilities to the 1,200 adjective noun pairs and to the 15k concepts for each icon. No such model is available for detecting whether text or a number is included within an icon. Therefore, it was necessary to label the entire data set by hand. In particular, we assigned a one to an icon that includes either text or numbers and a zero in the case where neither was visible. Hence, the parameter is simply a binary indicator.

To see the effect of the parameter on the ratings, we regressed the ratings on the binary indicator. The results of the regression are presented in table 5.4.

Table 5.4: OLS-regression output for the binary indicator on ratings

            coef    std err   t         P>|t|   [95.0% Conf. Int.]
Intercept   4.114   0.005     799.817   0.000   4.104   4.124

Apparently, the proposed parameter is not significantly related to the ratings. This means that neither text nor numbers within the icon induce a particular appeal to potential customers. Thus, this parameter will also be neglected for the purpose of predicting success based on the visual appearance of apps.
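To make the regression on a binary indicator more concrete, the following sketch computes a simple OLS fit by hand. With a 0/1 regressor the intercept equals the mean rating of the icons without text and the slope equals the difference between the two group means. The function name `simple_ols` and the toy ratings are hypothetical and are not taken from the data set.

```python
def simple_ols(x, y):
    """Intercept and slope of the least squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the regressor must take at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: 1 = icon contains text or numbers, 0 = it does not.
has_text = [0, 0, 1, 1, 1]
ratings = [4.0, 4.2, 4.5, 3.9, 4.2]
intercept, slope = simple_ols(has_text, ratings)
```

For the toy data the icons without text have a mean rating of 4.1 and the icons with text a mean rating of 4.2, so the fitted intercept is 4.1 and the slope is 0.1 (up to floating point error). Whether such a slope is significant additionally depends on its standard error, which this sketch does not compute.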

5.5 Prediction

Finally, in this section the results concerning the prediction of success of apps based on their visual appearance are reported. As explained in section 3.4 we used the ratings of an app as the proxy for success.

Due to the insignificance of the parameters HSV, color intensity, variegation and the text/number indicator, they will not be taken into consideration. All in all, we incorporated four parameters for the purpose of prediction. The first one is the 50-color feature vector, described in section 3.2.1. The second is the visual sentiment feature, described in section 3.2.3. The third is the 15k concept feature, described in section 3.2.4. Finally, the fourth is the 1k concept feature vector, which results from a deep net similar to the one proposed by [30]. This parameter is discussed in some detail in section 3.2.2.

To investigate the prediction performance of the parameters we employed multivariate support vector machines. To this end, we applied an L2 normalization to the feature vectors, i.e. the feature vectors were transformed into $\hat{f} = f / \lVert f \rVert$. Then, we split the entire data set into a training set consisting of 70% of the original data and a test set containing the remaining 30%. Next, we trained a multivariate SVM on the training set, using a linear kernel. The penalty term was determined by cross validation, where the values $C \in \{100, 10, 1, 0.1, 0.01\}$ were considered. The best accuracy scores were achieved with C = 10. Moreover, we rounded the ratings to the closest integer, so that the ratings form 5 distinct categories, ranging from 1 to 5 (as explained earlier in this thesis, apps with a rating of 0 were excluded).
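The pre-processing steps before the SVM is trained can be sketched as follows. The sketch covers only the normalization, the 70/30 split and the rounding of the ratings; training the SVM itself is omitted. All function names are hypothetical, the split is assumed to be a simple random split, and rounding halves upwards is an assumption, since the text only states that ratings were rounded to the closest integer.

```python
import math
import random


def l2_normalize(vector):
    """Scale a feature vector to unit Euclidean length (zero vectors are kept)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return list(vector) if norm == 0 else [v / norm for v in vector]


def train_test_split(items, train_fraction=0.7, seed=0):
    """Random split into a training part and a test part (assumed procedure)."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(items)))
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test


def round_rating(rating):
    """Map an average rating to a star class; halves are rounded up (assumption)."""
    return int(math.floor(rating + 0.5))


# Hypothetical toy data: (feature vector, average rating) pairs.
apps = [([3.0, 4.0], 4.4), ([1.0, 0.0], 3.5), ([0.0, 2.0], 1.2), ([5.0, 12.0], 4.6)]
prepared = [(l2_normalize(f), round_rating(r)) for f, r in apps]
train, test = train_test_split(prepared, train_fraction=0.7, seed=0)
```

After normalization every non-zero feature vector has length one, so features with very different scales (for example the 50 colors and the 15k concepts) contribute comparably when they are concatenated.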

Table 5.5 shows the distribution of the 5 categories in the test set. Notably, the distribution of this data is highly skewed. Consequently, a naïve classifier that predicts a rating of 4 stars for every app would achieve an accuracy of 69.7%.

Table 5.6 shows the accuracy results. In particular, table 5.6(a) shows the percentage of correctly predicted ratings. The color feature exhibits the best overall performance. This is mainly driven by the fact that, when only the color feature is used for prediction, almost every app in the test set is classified as a 4-star app. Similar results are visible in table 5.6(b). For this table we also counted a prediction as correct in cases where it was ‘one off’. This means that a 3-star app that was predicted to be a 4-star (or a 2-star) app was counted as ‘correct’. Again the prediction based on the color features achieves the best overall accuracy, in this case as high as 98.3%. However, a naïve prediction (classify all apps as 4-star apps) would achieve the same accuracy in this setup. A further notable result that holds for both table 5.6(a) and table 5.6(b) is that the features do not complement each other. This is evident from the fact that the combined features do not show superior prediction performance.
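The two accuracy measures and the naïve baseline can be written down in a few lines. This is a minimal sketch with hypothetical function names and hypothetical toy labels; it only illustrates how ‘exact’ and ‘one off’ accuracy differ and why a majority-class prediction scores well on a skewed test set.

```python
from collections import Counter


def exact_accuracy(actual, predicted):
    """Share of apps whose predicted star class equals the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def one_off_accuracy(actual, predicted):
    """Share of apps whose prediction deviates by at most one star."""
    return sum(abs(a - p) <= 1 for a, p in zip(actual, predicted)) / len(actual)


def majority_baseline(actual):
    """Naive classifier: predict the most frequent class for every app."""
    label, _ = Counter(actual).most_common(1)[0]
    return [label] * len(actual)


# Hypothetical toy test set in which 4-star apps dominate.
actual = [4, 4, 4, 5, 3, 1]
baseline = majority_baseline(actual)
exact = exact_accuracy(actual, baseline)
one_off = one_off_accuracy(actual, baseline)
```

In the toy example the baseline is exactly right for half of the apps and at most one star off for five of the six apps; only the 1-star app is missed under both measures.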

These results are, however, to some extent biased, because the skewed distribution was incorporated when the SVM was trained. To tackle this problem we applied SMOTE, as described in section 3.3, to the training data.

The results for this more objective setup are presented in table 5.7. For this setup, the prediction based on color features shows the worst overall performance. However, it is capable of correctly classifying 35.7% of the 1-star apps. This is a quite remarkable result, as the 1-star apps, with 0.5% of the test set, are by far the smallest class. Another interesting result is that the prediction based on low level features shows superior accuracy compared to the other individual features. This result is in line with other works (cf. [16]). Furthermore, once the training set is balanced, the features complement each other. This is evident from the fact that prediction based on the combined features shows superior accuracy. Making use of the combined features, we achieve an accuracy of 46%. This is quite remarkable, as the SVM was trained on an evenly distributed training set. The same implications that hold for table 5.7(a) are valid for table 5.7(b). By allowing the prediction to deviate by 1 star, we achieve an accuracy of 91%. These results reveal that the icons of apps show significant predictive power.

Figures 5.7 - 5.11 illustrate some examples concerning the accuracy of the predictions. These figures are based on the prediction with combined features, and only exact predictions were accepted as accurate. The figures show prediction examples for 1-star to 5-star apps. To facilitate the explanation we refer in the following to figure 5.7. The first quadrant in the figure shows an example of a correct positive. Note that no icon is provided, because no app was correctly classified as a 1-star app. The second quadrant shows an example of a correct negative, i.e. the app has a rating different from one and was predicted correctly. Note that the correct negatives are irrelevant for the investigated category and were only included for completeness. Quadrant three shows an example of a false negative: the app’s rating was one, but its predicted rating was not equal to one. Finally, the fourth quadrant shows an example of a false positive: the prediction for the app was one, whereas the actual rating was not equal to one. The interpretation of figures 5.8 - 5.11 is analogous.
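The four quadrants correspond to a one-vs-rest view of each star class. The following sketch assigns a single prediction to its quadrant; the function name `quadrant` is hypothetical, and the ‘correct negative’ case follows the usual one-vs-rest convention (neither the actual nor the predicted rating equals the investigated class), which is an assumption about how the figures were assembled.

```python
def quadrant(actual, predicted, star):
    """Quadrant of one prediction with respect to the investigated star class."""
    if predicted == star:
        return "correct positive" if actual == star else "false positive"
    return "false negative" if actual == star else "correct negative"


# Hypothetical predictions, evaluated for the 1-star class as in figure 5.7.
examples = [(1, 1), (1, 4), (4, 1), (4, 4)]
labels = [quadrant(actual, predicted, star=1) for actual, predicted in examples]
```

For the 1-star class and the combined features, the first case never occurs in the test set, which is why the corresponding quadrant of figure 5.7 is empty.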

Table 5.5: Number of apps present in each category in the test set

1 stars 2 stars 3 stars 4 stars 5 stars overall


Table 5.6: Prediction Accuracy for Ratings using Multivariate Support Vector Machines on the unbalanced data set.

(a) Unbalanced Data - exact

               1 star   2 stars   3 stars   4 stars   5 stars   overall
Color          0.0      0.0       0.0       0.997     0.005     0.697
Deep Net       0.0      0.0       0.023     0.938     0.080     0.673
ANP            0.0      0.0       0.009     0.966     0.045     0.683
15k concepts   0.0      0.0       0.043     0.894     0.127     0.653
Combined       0.0      0.0       0.055     0.843     0.217     0.637

(b) Unbalanced Data - 1-star off

               1 star   2 stars   3 stars   4 stars   5 stars   overall
Color          0.0      0.0       0.994     1.000     1.000     0.983
Deep Net       0.0      0.0       0.928     0.998     0.992     0.975
ANP            0.0      0.0       0.974     0.999     0.994     0.979
15k concepts   0.0      0.079     0.898     0.996     0.982     0.970
Combined       0.0      0.048     0.866     0.998     0.979     0.967

Table 5.7: Prediction Accuracy for Ratings using Multivariate Support Vector Machines on the balanced data set.

(a) Balanced Data - exact

               1 star   2 stars   3 stars   4 stars   5 stars   overall
Color          0.357    0.270     0.160     0.210     0.222     0.210
Deep Net       0.036    0.048     0.306     0.420     0.398     0.400
ANP            0.107    0.143     0.243     0.301     0.299     0.293
15k concepts   0.0      0.095     0.281     0.412     0.366     0.386
Combined       0.0      0.016     0.249     0.523     0.368     0.460

(b) Balanced Data - 1-star off

               1 star   2 stars   3 stars   4 stars   5 stars   overall
Color          0.536    0.794     0.472     0.498     0.384     0.477
Deep Net       0.143    0.444     0.660     0.926     0.699     0.848
ANP            0.357    0.492     0.645     0.752     0.545     0.696
15k concepts   0.107    0.397     0.698     0.917     0.731     0.851
Combined       0.107    0.206     0.728     0.974     0.840     0.913

Figure 5.7: False positives, false negatives, correct positives, correct negatives for 1 star apps. Prediction was conducted using combined features. [Panel layout: columns positive / negative, rows correct / false.]

[Figures 5.8 - 5.11: same panel layout for the 2-star to 5-star apps.]

CHAPTER 6

CONCLUSION

This thesis contributes to the scarce existing literature concerned with app analysis. From the set of components that define an app, we limit the analysis to the visual appearance of the icons. Thereby our contributions are twofold. First, we reveal the factors that are important within an icon and second, we investigate the prediction performance of icons on ratings.

Inspired by the work of Khosla et al. [16], we first investigate the effect of standard color parameters like HSV, intensity mean, variance and skewness. The results reveal that their influence is to a large part negligible. Additionally, we investigate the influence of individual colors. To this end, we down-scale the entire RGB color space into 50 colors and train a support vector regression in order to learn the importance of these colors. This reveals that the greenish and blueish colors have on average a positive and the reddish colors a negative effect on an app’s success.

In addition, we apply approaches proposed by Mazloom et al. [20]. More precisely, we conduct a visual sentiment analysis in order to reveal the sentiment within an icon. To this end, we make use of the SentiBank library initiated by [3], which consists of 1,200 adjective noun pairs (ANP). By SVR we then learn which ANPs are related to high ratings and which to low ratings. The top three ANPs are ‘hot cup’, ‘favorite team’ and ‘inspirational poster’. The three worst ANPs are ‘lovely church’, ‘classic architecture’ and ‘nice autumn’.

Similar to the visual sentiment analysis, we conduct an analysis of 15k concepts, which is a library resulting from the ImageNet data set. Again we learn by SVR which concepts are related to high ratings and which are associated with low ratings. The top three concepts are ‘roulette ball’, ‘convex polyhedron’ and ‘tetrahedron’. The three worst concepts are ‘computer screen’, ‘menorah’ and ‘domestic cat’. It is important to highlight that both the SentiBank and the 15k concepts libraries originate from the task of image classification. In particular, they were designed for images on social media platforms and not for icons. Therefore, for some icons, the detected ANPs or concepts are not accurate.

Moreover, we propose a new parameter in this thesis. The idea of the parameter highlights a difference between icons and ordinary images. In particular, we investigate whether the presence of text or numbers within an icon has an effect on the rating of the app. However, the effect is insignificant and therefore the parameter is also neglected for the prediction of ratings.

In addition to the described parameters we incorporate low level features for predicting the ratings of apps. These features result from a convolutional neural network that is based on the GoogLeNet proposed by Szegedy et al. [30].

Then, we train a multivariate support vector machine. To this end, we split the data set into a training set and a test set, on which the SVM’s performance is evaluated. Notably, the data is highly unbalanced. Therefore, we report results for both an unbalanced and a balanced training set. The training set was balanced by applying SMOTE [5]. When trained on the unbalanced training set, the resulting SVM predicts a rating of 4 stars for almost the entire test data. This is because the distribution of the ratings is incorporated in the SVM, which also makes the results harder to interpret. For the SVM that is trained on the balanced training set we get reasonable results. The low level features show the best individual prediction performance with 40% accuracy. When we combine the features, we achieve an accuracy of 46%. This result also reveals that the features complement each other.

For further research we recommend conducting the analysis on a larger data set. This would make the deep net methodology less prone to over-fitting. Furthermore, it would be interesting to see whether the results can be replicated when the number of downloads instead of the ratings is used as a proxy for success. As a last remark, we recommend combining features based on the icon with features from other aspects of apps. In particular, concatenating visual aspects (as proposed in this thesis) with textual features (which could result from reviews) would be an interesting idea for future research.

BIBLIOGRAPHY

[1] Y. BAE AND H. LEE, Sentiment analysis of twitter audiences: Measuring the positive or negative influence of popular twitterers, Journal of the American Society for Information Science and Technology, 63 (2012), pp. 2521–2535.

[2] C. BISHOP, Pattern recognition and machine learning (information science and statistics), 1st edn. 2006. corr. 2nd printing edn, 2007.

[3] D. BORTH, R. JI, T. CHEN, T. BREUEL, AND S.-F. CHANG, Large-scale visual sentiment ontology and detectors using adjective noun pairs, in Proceedings of the 21st ACM international conference on Multimedia, ACM, 2013, pp. 223–232.

[4] S. CAPPALLO, T. MENSINK, AND C. G. SNOEK, Latent factors of visual popularity prediction, in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, ICMR ’15, New York, NY, USA, 2015, ACM, pp. 195–202.

[5] N. V. CHAWLA, K. W. BOWYER, L. O. HALL, AND W. P. KEGELMEYER, Smote: synthetic minority over-sampling technique, Journal of artificial intelligence research, 16 (2002), pp. 321–357.

[6] J. DENG, W. DONG, R. SOCHER, L.-J. LI, K. LI, AND L. FEI-FEI, Imagenet: A large-scale hierarchical image database, in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 248–255.

[7] F. GELLI, T. URICCHIO, M. BERTINI, A. DEL BIMBO, AND S.-F. CHANG, Image popularity prediction in social media using sentiment and context features, in Proceedings of the 23rd ACM international conference on Multimedia, ACM, 2015, pp. 907–910.

[8] H. HAN, W.-Y. WANG, AND B.-H. MAO, Borderline-smote: a new over-sampling method in imbalanced data sets learning, in International Conference on Intelligent Computing, Springer, 2005, pp. 878–887.

[9] M. HARMAN, Y. JIA, AND Y. ZHANG, App store mining and analysis: Msr for app stores, in Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR ’12, Piscataway, NJ, USA, 2012, IEEE Press, pp. 108–111.

[10] P. HART, The condensed nearest neighbor rule (corresp.), IEEE Transactions on Information Theory, 14 (1968), pp. 515–516.

[11] A. HASSAN, Mining software repositories to assist developers and support managers, (2004).

[12] H. HE, Y. BAI, E. A. GARCIA, AND S. LI, Adasyn: Adaptive synthetic sampling approach for imbalanced learning, in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008, pp. 1322–1328.

[13] G. E. HINTON, N. SRIVASTAVA, A. KRIZHEVSKY, I. SUTSKEVER, AND R. R. SALAKHUTDINOV, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580, (2012).

[14] L. HONG, O. DAN, AND B. D. DAVISON, Predicting popular messages in twitter, in Proceedings of the 20th international conference companion on World wide web, ACM, 2011, pp. 57–58.

[15] R. KHAN, J. WEIJER, F. KHAN, D. MUSELET, C. DUCOTTET, AND C. BARAT, Discriminative color descriptors, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2866–2873.

[16] A. KHOSLA, A. D. SARMA, AND R. HAMID, What makes an image popular?, in International World Wide Web Conference (WWW), Seoul, Korea, April 2014.

[17] A. KRIZHEVSKY, I. SUTSKEVER, AND G. E. HINTON, Imagenet classification with deep convolutional neural networks, in Advances in neural information processing systems, 2012, pp. 1097–1105.

[18] Y. LECUN, B. BOSER, J. S. DENKER, D. HENDERSON, R. E. HOWARD, W. HUBBARD, AND L. D. JACKEL, Backpropagation applied to handwritten zip code recognition, Neural computation, 1 (1989), pp. 541–551.

[19] I. MANI AND I. ZHANG, knn approach to unbalanced data distributions: a case study involving information extraction, in Proceedings of workshop on learning from imbalanced datasets, 2003.

[20] M. MAZLOOM, R. RIETVELD, S. RUDINAC, M. WORRING, AND W. VAN DOLEN, Multimodal popularity prediction of brand-related social media posts, 2016.

[21] P. J. MCPARLANE, Y. MOSHFEGHI, AND J. M. JOSE, Nobody comes here anymore, it’s

too crowded; predicting image popularity on flickr, in Proceedings of International Conference on Multimedia Retrieval, ACM, 2014, p. 385.

[22] A. MEHRABIAN, Measures of individual differences in temperament , Educational and

Psy-chological Measurement, 38 (1978), pp. 1105–1117. 36

(43)

BIBLIOGRAPHY

[23] V. NAIR ANDG. E. HINTON, Rectified linear units improve restricted boltzmann machines,

in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.

[24] R. PLUTCHIK, Emotion: A psychoevolutionary synthesis, Harpercollins College Division,

1980.

[25] O. RUSSAKOVSKY, J. DENG, H. SU, J. KRAUSE, S. SATHEESH, S. MA, Z. HUANG, A. KARPA

-THY, A. KHOSLA, M. BERNSTEIN, A. C. BERG,ANDL. FEI-FEI, ImageNet Large Scale

Visual Recognition Challenge, International Journal of Computer Vision (IJCV), 115 (2015), pp. 211–252.

[26] P. Y. SIMARD, D. STEINKRAUS, ANDJ. C. PLATT, Best practices for convolutional neural

networks applied to visual document analysis, in null, IEEE, 2003, p. 958.

[27] A. SMOLA ANDV. VAPNIK, Support vector regression machines, Advances in neural

informa-tion processing systems, 9 (1997), pp. 155–161.

[28] C. SPEARMAN, The proof and measurement of association between two things, The American

journal of psychology, 15 (1904), pp. 72–101.

[29] G. SZABO AND B. A. HUBERMAN, Predicting the popularity of online content,

Communica-tions of the ACM, 53 (2010), pp. 80–88.

[30] C. SZEGEDY, W. LIU, Y. JIA, P. SERMANET, S. REED, D. ANGUELOV, D. ERHAN, V. VAN

-HOUCKE, ANDA. RABINOVICH, Going deeper with convolutions, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[31] I. TOMEK, Two modifications of cnn, IEEE Trans. Systems, Man and Cybernetics, 6 (1976),

pp. 769–772.

[32] P. VALDEZ AND A. MEHRABIAN, Effects of color on emotions., Journal of experimental

psychology: General, 123 (1994), p. 394.

[33] A. ZAIDMAN, B. VANROMPAEY, S. DEMEYER, ANDA. VANDEURSEN, Mining software

repositories to study co-evolution of production & test code, in Software Testing, Verifica-tion, and ValidaVerifica-tion, 2008 1st International Conference on, IEEE, 2008, pp. 220–229.