
Prediction of product life cycles using Twitter data

Thesis Master Information Science – HCM

Student:

Robert Jan Prick

Brouwersgracht 103

1015 GD Amsterdam

tel: +31 655 800 600

e-mail:

robertjan.prick@student.uva.nl

Supervisors: Frank Nack (UvA), Francine Chen (FXPAL)

FXPAL, Palo Alto, CA, USA

UvA – Graduate School of Informatics, Amsterdam, Netherlands

20/6/2014

Signature Supervisor: _________________

Signature Supervisor: _________________

Signature Student: _________________


Prediction of product life cycles using Twitter data

Robert Jan Prick (10313737)

Graduate School of Informatics, IS HCM, University of Amsterdam Amsterdam, The Netherlands

E-Mail: robertjan.prick@student.uva.nl

ABSTRACT

This paper describes a system that analyzes Twitter data and makes predictions about the future success of mobile applications, based on tweets and their sentiment. Such a system could reduce the cost of qualitative research and possibly improve its results. We describe our approach and the challenges we faced, and report why, according to our research, the number of tweets and their content do not provide sufficient information to make predictions in this domain.

KEYWORDS

Social media, microblogs, Twitter, information retrieval, machine learning, sentiment analysis

1. INTRODUCTION

We first give a brief introduction to the domain of our research (product life cycles, mobile apps and Twitter) followed by the problem formulation.

1.1. Product Life Cycles

The Boston Matrix (Figure 1) was developed around the 1970s. It was originally aimed at large consumer-goods companies such as Nestlé and Unilever, but also at car manufacturers. In essence, it is a simplification of the ideal portfolio for a company with a broad range of products. If such a company wanted to be successful in the future, it had to take the future into account. The logic is the following (Figure 1): one first needs "cash cows" to generate profit, or cash. The money from these cash cows is first invested in "question marks" (new ideas/products). These new ideas will either be unsuccessful, in which case one needs cash to divest ("dogs"), or they may be successful, in which case one needs cash to support these "stars" with marketing and sales budget.¹

In the meantime the world has changed. Development cycles have shortened and development costs per product have decreased. Especially in software and application development it is relatively cheap to develop new products: there is no hardware or prototype development, only software; no extensive physical consumer testing, only online testing with a small group of users; and distribution is digital instead of physical. This way one can offer a broad range of products² and, because the market is the whole world, a bit of success with one product out of many is often sufficient to run a profitable business.

¹ "And so money makes the world turn around."
² C. Anderson, The Long Tail

We think the matrix is still valid in its basic form. There is only one important difference compared to the past. In earlier days, companies made sure they had a portfolio of products evenly divided over the four quadrants, with more successful companies having more stars than dogs, i.e., these companies were better equipped for the future. We now see companies developing new products at a high rate and accepting that a high percentage of their products will fail (i.e., be divested). In terms of the Boston Matrix, the number of "question marks" has increased.

Figure 1: Outline of the Boston Consulting Matrix³

1.2. Mobile Apps

Mobile applications are typical examples of such contemporary products, because they are relatively cheap to develop and, if successful, show a steep growth curve. We also chose this domain because we assumed that a) we could get reliable data on the success of apps, such as the number of downloads and ratings via Google Play, and b) the life cycle of apps would be relatively short, so we would be able to find distinct download-rate (and possibly rating-development) patterns within the two months of data gathering.

1.3. Twitter

Twitter is a well-established social media platform that is still growing in popularity. Users create status messages called "tweets". These tweets sometimes express opinions about different topics. Because of the 140-character limit, a single tweet contains only a small amount of information. Although the body of a tweet contains at most 140 characters, it ships with a large amount of metadata, such as creation date, language and geo-location. This combination of text and metadata makes it interesting to investigate further, as a large number of tweets potentially holds more information than previously thought possible. Every day, over 500 million tweets are posted⁴. If this large sum of tweets does indeed hold valuable information, one of the challenges is to separate useful tweets from garbage and to extract the right information.

1.4. Problem formulation

Companies want to understand what people think of their new products in the early stages of development, e.g. to make improvements and/or to predict which products will be successful. They use, for example, qualitative research for this purpose. The increase in the number of question marks during the product development phase means an increase in costs for qualitative research. As social media data is seen as a rich source of information and is "freely" available, we want to investigate whether we can use this data instead, to decrease these research costs and possibly to improve on the results.

The idea was to examine whether there is a correlation between the quantity and quality of the tweets about specific apps and the apps' commercial development. We defined commercial development as the download rate over time. For this purpose, data from Google Play was to be used as a kind of ground truth. The aim was to understand which early detection mechanisms, or features, in the Twitter data could be used to predict this commercial development of apps. If we could build such an application, we would be able to predict, based on Twitter data alone, whether a question mark would develop into a star.

For this, we wanted to use the following parameters: from Google Play the number of downloads (as equivalent of market share), the download development or download rate (as equivalent of market growth) and the rating (as equivalent of sentiment), and from Twitter the number of tweets and the sentiment of the tweets as the features for prediction. The hypotheses were the following: a high number of tweets and a positive sentiment will lead to stardom, and a low number of tweets and a negative sentiment to the divestment (dog) phase, depicted in Figure 2 bottom-left and top-right respectively. A high number of tweets with a negative sentiment, and vice versa, are the "interesting" combinations; the hypothesized download rates for these two are also depicted in Figure 2, bottom-right and top-left respectively. If we could find such a correlation based on automatically generated quantitative research, it would be of interest to see whether we could extrapolate the outcome to other apps with a similar combination of number of downloads, growth rate, number of tweets and/or tweet sentiment.

⁴ https://about.Twitter.com/company (10 June 2014)
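The four hypothesized combinations can be sketched as a small decision function. This is purely illustrative; the volume and sentiment thresholds are assumptions of ours, not values from the data.

```python
# Sketch of the four hypothesized combinations of tweet volume and
# sentiment from Figure 2. The thresholds are illustrative assumptions.
def classify_quadrant(tweets_per_week, avg_sentiment,
                      volume_threshold=1000, sentiment_threshold=0.0):
    """Map (volume, sentiment) to a hypothesized life-cycle outcome."""
    high_volume = tweets_per_week >= volume_threshold
    positive = avg_sentiment >= sentiment_threshold
    if high_volume and positive:
        return "star (growing download rate)"
    if not high_volume and not positive:
        return "dog (divestment)"
    if high_volume and not positive:
        return "interesting: much talk, negative sentiment"
    return "interesting: little talk, positive sentiment"
```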

Figure 2: Hypothesized download rate development as a result of sentiment and number of tweets.

In the following section we describe the literature that forms the basis for our idea. Thereafter we describe the method and experimental set-up, followed by our findings and the challenges we encountered. We end with conclusions, lessons learned and our view on options for future research.

2. RELATED WORK

Making predictions based on (online) social media is a relatively new and very active research topic. This is caused by the exponential increase in the use of social media during the last decade. The US, as a frontrunner, is exemplary of this worldwide trend (Figure 3).

Figure 3: social media trends US 2005 – 2013

We first discuss the characteristics of Twitter users and the content they distribute. We then describe efforts to analyze data gathered via social media and the applications that have been built so far, including applications for prediction. This looked very promising, hence the idea for the project. We finish with work that describes challenges to overcome when working with social media data, and specifically with Twitter data.

2.1. How and why do people use Twitter?

Java et al. [13] did some of the early research into Twitter and investigated why people engage with Twitter and what they want to get out of it. They found that people use Twitter to talk about their daily activities, to share information or to be informed, and they divide users into information-seeking, information-sharing and social-activity users. Naaman et al. [19] also looked at the characterization of message content and found that a majority of users talk about themselves, while another large group, albeit a minority, shares information ("Meformers" 80% and "Informers" 20%). They reported that the latter group has more than twice as many friends and followers. Cha et al. [5] found that the number of followers represents the popularity of a user, while the numbers of retweets and mentions represent, respectively, the value of the content and of the user name. Huberman et al. [11] found that usage (a user's posting activity) within Twitter is mostly correlated with the number of friends, not with the number of followers or followees (Figures 4 and 5).

Figure 4: Number of posts eventually saturates as a function of the number of followees [11, page 4]

Figure 5: Number of posts increases as a function of the number of friends without saturation [11, page 4].

Although these papers suggest some differences between various groups of users, each of these groups contributes to the meaning of tweets with respect to number of tweets and their sentiment. We therefore saw no good reason to look at these groups separately.

Jansen et al. [12] examined the use of consumer brands in social media and found that Twitter is also used as a kind of word of mouth. They claim 19 percent of tweets are related to consumer brands. This outcome is promising for our domain, because it suggests we will find sufficient tweets related to brand names of mobile apps, especially because apps seem closely related to the social media and digital world.

Weerkamp et al. [26] elaborate on the research of Java et al. [13] and find per-language differences in how and why people engage with Twitter. Germans are structured (i.e., use hashtags and links) and relatively seldom use Twitter for personal communication, while Dutch and Spanish users are less structured and use Twitter more often for personal communication. They conclude that German tweets could benefit greatly from hashtag analysis, while Dutch and Spanish tweets are more likely to benefit from social network analysis. Twitter provides the language as metadata with each tweet. To avoid differences between languages, we only look at English tweets.

2.2. Recognition of semantics in noisy text structures

Because of the specific nature of tweets, with their 140-character limit, much research has been done on predicting a variety of events from them.

Some research used the number of tweets, their content and metadata to predict natural disasters and epidemics. Sakaki et al. [22] detect the course of typhoons and earthquakes based on tweets. They predicted earthquakes using Twitter instead of seismic sensors and claimed their system warned people of an earthquake much faster than the warning system of the Japan Meteorological Agency. Sadilek et al. [21] developed a very similar system predicting the spread of diseases. Lee et al. [16] detect various kinds of events based on the number of tweets, the number of tweeting users and the movement of those users within a local area.

We also want to take the sentiment of tweets into account. Researchers have examined the sentiment of tweets to predict overall human behavior. The already mentioned research of Jansen et al. [12] also investigated the sentiment of tweets: of the 19 percent of tweets related to consumer brands, 50% had a positive sentiment and 33% a negative one. Bollen et al. [3] show there is a correlation between the mood (positive or negative sentiment) of tweets about certain stocks and the development of the related stock prices; they created a system that correctly predicted the up and down movements of stocks 87.6% of the time. Tumasjan et al. [24] show that the number of tweets reflects voter preferences during political elections, with results that come close to traditional election polls. Asur and Huberman [1] demonstrate how Twitter content can predict real-world outcomes by predicting box-office revenues of movies based on the number of tweets and their sentiment; they claim to have outperformed other market-based predictions, but only after the release of the movie. Based on these results, we included sentiment as a feature to predict the download rate of mobile apps, using Google Play as an alternative source to validate our findings. In a similar fashion, Oghina et al. [20] predict movie ratings on IMDb based on information from both Twitter and YouTube, and Bothos et al. [4] predict the Oscar winners based on tweets and information from related movie websites.

2.3. Is it all so easy?

Although the general perception is that tweets are a rich source of information, some research suggests there are serious obstacles to overcome.

Combining the location of the user with the content might add semantics to the content. Hecht et al. [9] found that the self-reported location of Twitter users is not reliable, due to the movement of people and purposeful misinformation. They also found that only a very small fraction of users (0.23%) has geo-location activated. A year later, Starbird et al. [23] developed a system that improves the reliability of the claimed geo-location, but they acknowledge that their system still needs improvement. Finally, Leetaru et al. [17] reported that 1.6% of users had geo-location activated in 2013. However, for detailed analysis in our domain, with often only 100-200 (English) tweets per day⁵ per app, we considered this number too low to incorporate in our research.

Massoudi et al. [18] built a model for retrieving tweets that is enhanced with specific microblog indicators (usernames, hashtags, links). One of their findings is that it is difficult to make long-term predictions in this area (information retrieval and prediction based on the retrieved information) because of rapid language evolution and limited re-use of words. As recency is important, one should focus on predictions related to recent, time-sensitive queries. Berendsen et al. [2], researching ranking systems based on social media, come to the same conclusion because of the rapidly evolving content and continuously changing user behavior.

There appear to be some limitations in what one can achieve analyzing tweets (content and meta-data), but there is also sufficient encouraging material to support the original idea of the project.

3. METHOD

Our sentiment analyzer is a recently developed proprietary analyzer from FXPAL, built specifically for analyzing tweets. Furthermore, we aim to use data from Google Play (download rate and rating) as ground truth. This way we can check whether the sentiment analyzer works for our domain: mobile apps, products with a relatively short life cycle.

3.1. Google Play

For this purpose we developed a scraper (in Python) that extracts information from Google Play (and AppBrain.com). In February 2014 we scraped 1,700 app names from AppBrain.com, evenly divided between games and other apps. At the same time, we gathered data about the apps from Google Play (and AppBrain.com), as shown in Figure 6. We planned to scrape the data once a month for a 3-month period (four times in total). This way, we expected to track the download rate over that period and to be able to classify the apps into various groups based on their download rate.

⁵ with few exceptions, like "flappy bird" or "candy crush", but they were not the main focus of our research

Figure 6: AppInfo and AppUpdate.

3.2. Twitter Streaming API

How does the Twitter Streaming API work? One can enter a set of keywords as a filter to track Twitter; according to current limitations, a single filter can track up to 400 keywords⁶. We used 1,700 keywords (app names) in total, so we ran several filters at the same time and stored their results in separate "collections". After retrieving the tweets, we had to filter the data to remove noise. This is because Twitter does not allow retrieving specific combinations of words: if we look for "flappy bird", we retrieve all tweets containing the words "flappy" and "bird", irrespective of whether the two words are next to each other. In addition, one has to accept that there will always be noise in the data, because it is almost impossible to define keywords that yield precisely what we want by simple filtering. At this point, one always needs to ask oneself whether it is necessary to filter out tweets that are not related to the topic, for example by language detection or contextual information retrieval. Using the 1% sample from the Twitter Streaming API, we collected tweets from February 12th to June 4th, with some intermezzos due to technical problems, reboots and/or updates of the filter. In total we downloaded over 100 million tweets in this period, of which about 35% were tagged as English (lang = "en"); this 35% we used for further analysis. We downloaded the tweets, i.e., the text including metadata like language and geo-location, via the Streaming API into MongoDB.
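The 400-keyword limit and our 1,700 app names imply at least five parallel filters. A minimal sketch of the chunking; the app names here are placeholders:

```python
# The Streaming API accepts at most 400 track keywords per filter, so a
# list of 1,700 app names must be split over several parallel filters.
def split_into_filters(keywords, max_per_filter=400):
    """Split a keyword list into chunks of at most max_per_filter."""
    return [keywords[i:i + max_per_filter]
            for i in range(0, len(keywords), max_per_filter)]

app_names = [f"app_{i}" for i in range(1700)]  # placeholder names
filters = split_into_filters(app_names)
# 1,700 keywords -> 5 filters: four of 400 keywords and one of 100
```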

3.3. Sentiment analysis

We chose a proprietary sentiment analyzer from FXPAL because it was specifically developed for tweets. This analyzer requires pre-processing as described by Go et al. [8].

4. FINDINGS

4.1. Google Play

Mobile apps are published for various platforms, such as iOS (App Store), Android (Google Play), Windows and BlackBerry, of which Android and the App Store are the largest⁷. Because Google Play reports various data for its applications, like rating and number of downloads, we chose Android apps for our research. We found the data less easy to use than anticipated. Google Play does not report the exact number of downloads, but uses intervals, e.g. "10", "50", "100", "500", "1000" etc. up to "1,000,000,000" (for the app "maps") as of June 2014. The level "10" represents apps with between 10 and 50 downloads, "50" apps with between 50 and 100 downloads, etc. In Figure 7, the red bars represent the actual percentages per interval in Google Play; the blue line stresses the low percentage of apps that grow substantially, e.g. above 100,000 downloads. Apps with fewer than 100,000 downloads represent 71.5% of the total, and apps below 1 million over 85%.

Figure 7: number of apps per interval for 1,708 apps reported by Google Play (May 7th, 2014).
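Google Play's quantization can be reproduced with a small helper. This is a sketch assuming the interval floors follow the 1·10^k / 5·10^k pattern described above; counts below the lowest level are clamped to it.

```python
import bisect

# Google Play reports download counts quantized to interval floors
# (10, 50, 100, 500, ... up to 1,000,000,000). This helper reproduces
# that quantization for an exact count; an illustrative sketch.
LEVELS = sorted({m * 10 ** e for m in (1, 5) for e in range(1, 9)} | {10 ** 9})

def play_level(downloads):
    """Return the interval floor Google Play would report."""
    idx = bisect.bisect_right(LEVELS, downloads) - 1
    return LEVELS[max(idx, 0)]  # clamp counts below 10 to the lowest level
```

For example, an app with anywhere between 10,000 and 49,999 actual downloads is reported identically, which is exactly the quantization problem discussed in the text.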

We then looked at the download rate, i.e., the growth in the number of downloads over a period of time. This should give insight into which apps are successful. The fact that Google Play uses (more or less) logarithmic levels makes it difficult to get good insight into the download rate over a limited period of time, because the bandwidths of the intervals are too wide: the apps do not grow fast enough to pass several levels, as was the case in our research. Table 1 shows an overview of the number of apps that passed one or more levels during a 4-week period. Two-thirds of the apps remained within their bandwidth during this test period. One-third passed one or more levels ("2x" means from e.g. 50-100 to 100-500, "5x" means from 100-500 to 500-1,000). In this 4-week period, 12% of the total passed two or more levels. However, if an app passes two levels, one does not know whether the growth was, in the extreme, from e.g. 1,001 to 44,999 or from 4,999 to 10,001. One could use an average per level, but this average should be chosen with care, because simply taking the mean of each level seems incorrect: the exponentially decreasing number of apps per level means that in, say, the 10,000-50,000 downloads level, probably more apps have between 10,000 and 30,000 downloads than between 30,000 and 50,000.

⁷ http://en.wikipedia.org/wiki/List_of_mobile_software_distribution_platforms (June 11, 2014)
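One hedged way to pick such an average: assume downloads are log-uniformly distributed within a bucket, which concentrates mass toward the lower end and gives E[X] = (b − a) / ln(b/a) rather than the midpoint. This is an illustrative assumption, not a distribution we verified.

```python
import math

# The plain midpoint of a level (e.g. 30,000 for the 10,000-50,000
# bucket) overstates the typical app when counts are skewed toward the
# lower end. Under a log-uniform assumption within [lo, hi), the
# expected value is (hi - lo) / ln(hi / lo).
def expected_downloads(lo, hi):
    """Expected downloads in [lo, hi) under a log-uniform assumption."""
    return (hi - lo) / math.log(hi / lo)

# For the 10,000-50,000 level this gives roughly 24,850 instead of the
# midpoint 30,000, i.e. the estimate shifts toward the lower end.
```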

Table 1: Changes for 1,708 apps in a 4-week period (May 7th – June 4th)

                 Number   Percent   Rating
Same range         1149    67.3%    4.203
+1 level (2x)       183    10.7%    4.218
+1 level (5x)       161     9.4%    4.152
>=2 levels          205    12.0%    4.126
Out                  10     0.6%    4.084
Total              1708   100.0%    4.190

We hypothesized that the performance of an app, in terms of the number of downloads, would be (partially) dependent on its rating. Table 1 shows that this is not the case for our sample. We compared the 1,700 apps that we scraped from AppBrain.com and looked at their rating at the beginning of the period. A month later we could not find a correlation between an increase in downloads and the rating. Maybe there is none, and this could be caused by different rating cultures per country (e.g. a large country rating lower than average will affect the total numbers), or by the fact that once an app has passed a certain threshold in number of downloads, the rating becomes less important for the developers and they use the app to generate extra profit via e.g. extra advertisements. Facebook, Angry Birds Go! and Papa Pear Saga are notable examples of apps with a high number of downloads (> 10 million) and a lower than average rating⁸. Kovacs et al. found that "prizewinning books tend to attract more readers but that the readers' ratings of award-winning books tend to decline more precipitously relative to books that were named as finalists but did not win" [15, page 1]. For apps, it might be the case that a steep increase in usage (downloads) means an even steeper increase in users who do not belong to the target group and rate the app accordingly (less positively).

However, if we look specifically at relatively new apps, defined as apps with only 50-100 downloads at the start of the period, and at their download rate over this 4-week period, we see that the average rating of the apps that grew over the period was higher than average (Table 2): above the average of all apps (4.190), and well above the average of the apps that remained in the same level during that period. We noticed that some apps were rated much more often than others. Therefore we also calculated the rating weighted by the number of people rating the app (Table 3). Now we see a different picture: the apps that stay in the same bandwidth have the second highest rating, well above average.

Table 2: Increased number of downloads and rating

Downloads   Downloads   Rating
May 7th     June 4th    May 7th
50          5000        4.200
50          1000        4.253
50          500         4.440
50          100         4.381
50          50          4.131
Average all apps        4.190

Table 3: Increased number of downloads and rating, shown as an average weighted by the number of reviews

Downloads   Downloads   Normalized rating
May 7th     June 4th    May 7th
50          5000        4.200
50          1000        4.425
50          500         4.650
50          100         4.862
50          50          4.668
Average all apps        4.377
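The weighting behind Table 3 can be sketched in a few lines; the sample data below is invented for illustration.

```python
# Average rating weighted by the number of people who rated each app,
# as used for Table 3. The sample numbers are made up.
def weighted_rating(apps):
    """apps: list of (rating, number_of_raters) pairs."""
    total_reviews = sum(n for _, n in apps)
    return sum(rating * n for rating, n in apps) / total_reviews

sample = [(4.5, 1000), (3.0, 10)]  # (rating, number of raters)
# The heavily-reviewed app dominates: the result is close to 4.5,
# whereas the unweighted mean of the two ratings would be 3.75.
```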

Google has many servers around the world. Even something as simple as counting overall app downloads becomes hard: each server would have to go through its massive database, count the number of installations for its region and send that information to a central computer. That is probably why Google reports downloads in intervals instead of exact quantities.

If one wants to use the numbers from Google Play, one should take the following into account and use a number that is based on:

- the number of downloads over time, preferably weighted to favor more recent downloads and including the bounce rate (how many users uninstall the app right away or after a single use);
- a distinction between paid and free apps;
- a normalization for downloads and rating per country;
- possibly a normalization for commercially triggered downloads.

This was not feasible within the scope of this research, and therefore we could not use the Google Play numbers as a kind of ground truth, i.e., to find a correlation with the number of tweets and/or their sentiment.

4.2. Twitter Streaming API

If one issues a call to action to use a specific keyword, its volume will probably spike now and then. If such a spike exceeds 1% of the firehose at that moment, one receives all tweets matching that 1%, plus a rate-limit message stating how many tweets were missed because they fell outside the limit. For this research, we did not expect to come close to the 1% rate limit and therefore chose not to log these messages. Twitter states it handles about 500 million tweets per day, and the 1% sample Streaming API allows downloading 1% of that, but measured per second. This means that some spikes for specific filters might be missed. We tracked one filter for 39 hours (2,340 minutes) and found 17 minutes in which we downloaded at an equivalent of more than 3 million tweets per day, as shown in Figure 8. If we assume that a filter for specific keywords spikes more than the average of all tweets, it is feasible that we missed some tweets. However, if the rate was the equivalent of 10 million tweets per day during 30 seconds of such a minute, then the number of tweets lost, at an excess rate of 5 million tweets/day, would be 5,000,000 / (60 * 24) * 0.5 ≈ 1,736 tweets. With two million tweets per day on average, that is less than one-thousandth of the volume; even multiplied by 17 it seems negligible. If one logs the tweet IDs before and after receiving a rate-limit notice, one can use the Search API to try to recover the missed tweets (Twitter calls this the "backfill"). This should be done regularly, as it is limited to a maximum of 6-9 days back⁹. For this research we did not keep such a log, so we cannot verify whether we missed tweets, or how many.
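The back-of-the-envelope loss estimate above can be reproduced in a few lines; the 10-million and 5-million figures are the illustrative rates from the text, not measurements.

```python
# The 1% sample caps intake at ~5 million tweets/day (measured per
# second). If a filter spikes to 10 million tweets/day for 30 seconds,
# the excess 5-million/day rate is lost for half a minute.
def tweets_lost(excess_per_day, minutes):
    """Tweets missed at a given excess daily rate over a duration in minutes."""
    return excess_per_day / (60 * 24) * minutes

lost = tweets_lost(5_000_000, 0.5)   # roughly 1,736 tweets
share = lost / 2_000_000             # vs. ~2 million tweets/day: < 0.1%
```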

Figure 8: downloads per minute (right y-axis) and their equivalent per day (left y-axis).

From the number of tweets downloaded we can make some observations. We found a clear difference between two queries we ran at the same time, as shown in Table 4. Possible explanations could be connection problems that went unnoticed, or a temporary error on Twitter's side. In other time intervals we did not find these differences and retrieved more or less the same number of tweets (Table 5).

We encountered some more problems that one should take into account when using the Twitter Streaming API. For example, we found tweets with a creation date from before we started downloading, which probably means the creation date is not always reliable.

Table 4: Number of tweets per day for "Flappy Bird" from two different queries in the same period¹⁰

2/12/14    1393    8589
2/13/14    1182    8447
2/14/14    1869    4480
2/15/14     924    2361
2/16/14     596    1815
2/17/14     571    1712
2/18/14    1064    1521
2/19/14   10649    1223
2/20/14    1299    1529
2/21/14     876    1539
2/22/14     486    1090
2/23/14     295     626
2/24/14     292     545
2/25/14    2117     399
2/26/14     339     518

Table 5: Number of tweets per day for "Flappy Bird" (or "FlappyBird") from two different queries in the same period

4/9/14     8797    8803
4/10/14    6807    6807
4/11/14    9423    9427
4/12/14    6180    6181
4/13/14    6064    6059
4/14/14    7084    7086
4/15/14    5941    6021
4/16/14    6898    6897
4/17/14    5795    5797
4/18/14    5713    5716
4/19/14    7723    7751
4/20/14    6278    6282
4/21/14    6665    6670
4/22/14    5595    5609

4.3. Sentiment analysis

The goal of the sentiment analysis was first to find a correlation between the rating in Google Play and the sentiment, and second to find a correlation between the sentiment and the number of tweets. As the data from Google Play could not be used, we investigated only the second option.

We cleaned the data as described by Go et al. [8], with one exception: the handling of repeated letters in tweets. Go et al. replace any letter occurring more than two times with two occurrences, e.g. "huuuuungry" becomes "huungry". We replaced them with one occurrence instead, following for example Kouloumpis et al. [14]. That this preprocessing has some impact can be seen in Figure 9 below. Of course this can have an opposite effect: "cheeeeery" now becomes "chery" instead of "cheery". However, most words with double vowels are nouns, and those usually carry no sentiment.

¹⁰ E.g. the 1,393 tweets on 2/12/2014 reflect the number of tweets downloaded from 2/12/2014 0:00:00 am until 2/12/2014 11:59:59 pm. Date and time reflect Greenwich Mean Time (winter time).

Positive (+0.430) [english] sharethis i won a breakfast sandwich maker!!! yeah!!

Neutral (+0.000) [english] sharethis i won a breakfast sandwich maker!!! yeaahh!!

Neutral (+0.000) [english] m ice cream sandwich maker!

Positive (+0.545) [english] mm ice cream sandwich maker!

Figure 9: Example of sentiment change for two different sentences due to the change in pre-processing.

From a sample of 10,000 tweets, we found that 64% had a positive sentiment, 17% negative and 18% neutral (1% error). We compared 1,000 tweets to see whether our slight pre-processing change relative to Go et al. made a difference: 44 percent of the sample tweets changed in sentiment, and 10 percent changed by more than 0.2 (or 10% of the range from -1 to +1). An example can be seen in Figure 9. We found insufficient evidence to favor one approach over the other.
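Both normalization variants amount to a one-line regular expression; a sketch (the function names are ours):

```python
import re

# Two variants for collapsing repeated letters in tweets:
# Go et al. reduce 3+ repeats to two; Kouloumpis et al. (as used
# in this research) reduce them to one.
def collapse_go(text):
    """'huuuuungry' -> 'huungry' (keep two occurrences)."""
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

def collapse_single(text):
    """'huuuuungry' -> 'hungry' (keep one occurrence)."""
    return re.sub(r'(.)\1{2,}', r'\1', text)

# Side effect noted in the text: 'cheeeeery' becomes 'chery',
# not 'cheery', when collapsing to a single letter.
```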

Figure 10: Percentage of positive, negative and neutral tweets (left y-axis) and number of tweets (right y-axis) per week for "pokemon game", from Feb 12 to May 27.

4.4. Positive, negative and neutral tweets

We then investigated whether a correlation could be found between the sentiment of tweets and the number of tweets. We looked at the results for various apps, as shown in Figures 10-14. To clarify, we explain in detail the two graphs for "pokemon game" (Figures 10 and 11). In Figure 10 we see few neutral tweets (green line). We do see big changes in the number of tweets (black line), up to more than 100% compared to the previous week, but the share of positive and negative sentiment does not follow the number of tweets: in the middle of the graph an increase in the number of tweets coincides with an increase in positive sentiment, while in the week ending May 13th an increase in tweets is accompanied by an increase in negative tweets. Figure 11 shows the same data in a different way. This time the dots represent the weeks and the blue line connects them, representing the time line. If the blue line showed a logical path, that would indicate a pattern. This was not the case.

Figure 11: average sentiment (max 1, min -1) per week for “pokemon game” for a 15-week period (from Feb 12 – May 27). The dark red dot reflects the first week, the dark blue the last week. The blue line connecting the dots reflects time.

Figure 12: sentiment and number of tweets for "amazing football". See Figures 10 and 11 for an explanation of the format.

Figure 13: sentiment and number of tweets for "sandwich maker". See Figures 10 and 11 for an explanation of the format.

Figure 14: sentiment and number of tweets for "video poker". See Figures 10 and 11 for an explanation of the format.


For the other apps we found various patterns, but neither a trend nor an explanation (Figures 12-14).
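The weekly shares plotted in Figures 10-14 can be computed from the per-tweet sentiment scores in [-1, 1]; a minimal sketch (the function name is ours):

```python
def weekly_shares(scores):
    """Share of positive/negative/neutral tweets and average sentiment
    for one week's list of sentiment scores in [-1, 1]."""
    n = float(len(scores))
    pos = sum(1 for s in scores if s > 0) / n
    neg = sum(1 for s in scores if s < 0) / n
    neu = 1.0 - pos - neg
    avg = sum(scores) / n
    return pos, neg, neu, avg
```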

In Table 6 we show the correlation between the total number of tweets and the number of positive tweets: when the number of positive tweets grows, the total number grows as well. For the other combinations, such as total tweets and negative tweets, and total tweets and average sentiment, the mixed results we found in the graphs were confirmed. Positive tweets form a large part of the total, so the chance is high that they dictate the direction of the total number of tweets, hence the high correlation. Beyond that, however, we could not find any causal relationship.
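The values in Table 6 are plain Pearson coefficients over the weekly counts; a minimal sketch (the weekly counts below are made up, not our data):

```python
def pearson(x, y):
    # Pearson correlation coefficient of two equal-length series
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# hypothetical weekly counts: total tweets and positive tweets
total    = [120, 180, 150, 300, 260]
positive = [ 80, 130, 100, 210, 175]
r = pearson(total, positive)  # close to 1: positive tweets track the total
```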

5. CONCLUSION

5.1. Google Play

The data is not usable if one can only track it for a limited period of time. One should rather work together with an app developer with a broad range of apps, who can provide reliable real-time data on ratings and number of downloads. In our experimental set-up, we anticipated scraping Google Play every 4 weeks. Due to the count quantization of Google Play and the large size of the intervals, we recommend scraping Google Play at least once a week. To be able to extrapolate the download rate, the time factor should be relatively precise to compensate for the wide interval of the number of downloads. However, one should carefully choose the number of apps and the data to be gathered, because our program's processing time went up considerably after a couple of updates.

Table 6: mixed results when calculating the correlation (Pearson) between the total number of tweets and, respectively, the number of positive tweets, the number of negative tweets, and the average sentiment (per week, Feb 11 – May 27 2014).

                      Total / positive   Total / negative   Total / avg. sentiment
    Sandwich maker         0.651              0.800               0.560
    Amazing football       0.999              0.118               0.528
    Video poker            0.967              0.538               0.188
    Pokemon game           0.963              0.866               0.094

As our application has not been developed yet, it is difficult to predict which apps will be successful (!). Therefore one should consider first gathering training data and afterwards test data to get an understanding of the mechanics. Furthermore, one should check the results regularly and with care: Google might implement changes that affect the way the data should be scraped.
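Extrapolating a download rate from the quantized counter could look like the sketch below. The bucket values and dates are made up, and Google Play only exposes the lower edge of an interval such as "100,000+":

```python
from datetime import datetime

def min_download_rate(lower1, date1, lower2, date2):
    """Lower-bound downloads per day between two scrapes, where lowerN is
    the lower edge of the reported download interval on dateN (dd/mm/yyyy)."""
    d1 = datetime.strptime(date1, "%d/%m/%Y")
    d2 = datetime.strptime(date2, "%d/%m/%Y")
    return (lower2 - lower1) / float((d2 - d1).days)
```

This only yields a useful estimate when the scrape timestamps are precise; with 4-weekly scrapes and wide intervals, the bound is too coarse, which is why we recommend weekly scraping.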

5.2. Twitter

If one filters data via the streaming API, it is wise to run two filters with the same query/keywords at the same time. This way one can track the differences; we could not explain the differences we encountered. Keeping the log seems mandatory, even when one does not expect to reach the 1% limit.
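Comparing the logs of the two parallel filters can be as simple as a set difference over tweet ids (the function name and ids are illustrative):

```python
def stream_diff(ids_a, ids_b):
    # tweet ids seen by only one of two identical streaming filters
    a, b = set(ids_a), set(ids_b)
    return sorted(a - b), sorted(b - a)
```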

Although the percentage of tweets with an incorrect creation date was very low, we recommend making sure the program finds them as they arrive. In our case it took quite some extra effort to separate these afterwards and check in retrospect whether they were anomalies or not.
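A defensive per-tweet parse, for instance, flags tweets with missing or corrupt fields instead of aborting a long run (the field names are only an assumption about the raw JSON):

```python
import json

def safe_parse(raw_line):
    """Return the parsed tweet, or None for corrupt input or missing fields,
    so anomalies can be collected and inspected afterwards."""
    try:
        tweet = json.loads(raw_line)
    except ValueError:
        return None  # corrupted tweet: skip instead of crashing
    if "created_at" not in tweet or "text" not in tweet:
        return None  # e.g. missing creation date: set aside for inspection
    return tweet
```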

Do not expect everything to go smoothly. Some tweets are simply corrupted and will produce errors. Build in error handling to make sure a program that can easily run for half a day will not stop because of such an error.

5.3. Tweets and Sentiment

We could not find any proof that early popularity of apps, measured in the number of tweets and/or positive sentiment, is a predictor of future success. This is in line with recent research from Weng et al. [27]. With regard to sentiment specifically, we could not find results suggesting that either positive or negative sentiment can be used as a predictor of the future success of mobile apps. We conclude that, in the domain of mobile apps, the total number of tweets combined with their sentiment cannot be used as a predictor of the download rate.

6. FUTURE WORK

Because the world changes rapidly in many respects, it might be more difficult than expected to extrapolate findings from one domain, period or country to another. If one wants to investigate the usefulness of Twitter data in general, one should consider tracking numbers for specific countries, in line with the outcome of Weerkamp et al. [26]. However, [9, 17] found that only a small percentage of users have geo-location activated. This makes research in countries that speak a language used in multiple countries, e.g. English- or Spanish-speaking countries, more complex. Combined with the fact that Twitter has a very specific and rapidly changing language model [2, 18], investigating the content of tweets might be a dead end for prediction purposes.

Instead of looking at total tweets and their content, it might be a better idea to find underlying features of the users of Twitter, especially features related to the communication structure. Research from Hodas et al. [10] elaborates on Feld's "Friendship Paradox" [7] and shows why the friends of social media users are in general of more interest than the users themselves. In another field of social science, Christakis et al. [6] used a similar (successful) approach to investigate contagious outbreaks. And recent research from Weng et al. [27] found that features based on community structure and relationships are powerful predictors of hashtags going viral (and, in line with our research, that early popularity is not a good predictor). It might be interesting to investigate whether this is also the case for the success of mobile apps.
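The Friendship Paradox [7] is easy to verify on a toy graph, which hints at why friend-based features carry extra signal (the graph below is made up):

```python
# adjacency list of a small undirected follower graph
graph = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a"],
}

avg_degree = sum(len(v) for v in graph.values()) / float(len(graph))
# for each user, the mean degree of that user's friends; then average over users
avg_friend_degree = sum(
    sum(len(graph[f]) for f in friends) / float(len(friends))
    for friends in graph.values()
) / len(graph)
# avg_friend_degree > avg_degree: on average, your friends have more friends
```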

REFERENCES

[1] S. Asur and B. A. Huberman, "Predicting the Future with Social Media," 2010.

[2] R. Berendsen and M. Tsagkias, "Pseudo Test Collections for Training and Tuning Microblog Rankers," in SIGIR 2013.

[3] J. Bollen and H. Mao, "Twitter Mood as a Stock Market Predictor," Computer (Long Beach, Calif.), vol. 44, pp. 91–94, 2011.

[4] E. Bothos, D. Apostolou, and G. Mentzas, "Using Social Media to Predict Future Events with Agent-Based Markets," IEEE Intell. Syst., vol. 25, 2010.

[5] M. Cha, H. Haddadi, F. Benevenuto, and P. Gummadi, "Measuring User Influence in Twitter: The Million Follower Fallacy," ICWSM, pp. 10–17, 2010.

[6] N. A. Christakis and J. H. Fowler, "Social Network Sensors for Early Detection of Contagious Outbreaks," PLoS One, vol. 5, no. 9, p. e12948, Jan. 2010.

[7] S. L. Feld, "Why Your Friends Have More Friends than You Do," Am. J. Sociol., vol. 96, no. 6, pp. 1464–1477, 1991.

[8] A. Go, R. Bhayani, and L. Huang, "Twitter Sentiment Classification using Distant Supervision," Processing, pp. 1–6, 2009.

[9] B. Hecht, L. Hong, B. Suh, and E. H. Chi, "Tweets from Justin Bieber's Heart: The Dynamics of the 'Location' Field in User Profiles," in Conference on Human Factors in Computing Systems, 2011, pp. 237–246.

[10] N. O. Hodas, F. Kooti, and K. Lerman, "Friendship Paradox Redux: Your Friends Are More Interesting Than You," in ICWSM 2013, 2013.

[11] B. A. Huberman, D. M. Romero, and F. Wu, "Social Networks that Matter: Twitter under the Microscope," pp. 1–9, 2008.

[12] B. J. Jansen and M. Zhang, "Twitter Power: Tweets as Electronic Word of Mouth," vol. 60, no. 11, pp. 2169–2188, 2009.

[13] A. Java, X. Song, T. Finin, and B. Tseng, "Why We Twitter: Understanding Microblogging Usage and Communities," Int. Conf. Knowl. Discov. Data Min., p. 9, 2007.

[14] E. Kouloumpis, T. Wilson, and J. Moore, "Twitter Sentiment Analysis: The Good the Bad and the OMG!," Association for the Advancement of AI (www.aaai.org), 2011.

[15] B. Kovacs and A. J. Sharkey, "The Paradox of Publicity: How Awards Can Negatively Affect the Evaluation of Quality," Nov. 6, 2013, forthcoming in Administrative Science Quarterly. http://ssrn.com/abstract=2350768

[16] C. Lee, H. Kwak, H. Park, and S. Moon, "Finding Influentials Based on the Temporal Order of Information Adoption in Twitter," Proc. 19th Int. Conf. World Wide Web (WWW '10), p. 1137, 2010.

[17] K. H. Leetaru, S. Wang, G. Cao, A. Padmanabhan, and E. Shook, "Mapping the Global Heartbeat: The Geography of Twitter," First Monday, May 2013. http://firstmonday.org/article/view/4366/3654

[18] K. Massoudi, M. Tsagkias, M. de Rijke, and W. Weerkamp, "Incorporating Query Expansion and Quality Indicators in Searching Microblog Posts," 2011.

[19] M. Naaman, J. Boase, and C. Lai, "Is it Really About Me? Message Content in Social Awareness Streams," pp. 189–192, 2010.

[20] A. Oghina, M. Breuss, M. Tsagkias, and M. de Rijke, "Predicting IMDB Movie Ratings Using Social Media," in ECIR 2012: 34th European Conference on Information Retrieval, 2012, pp. 503–507.

[21] A. Sadilek, H. Kautz, and V. Silenzio, "Modeling Spread of Disease from Social Interactions," in Proc. Sixth International AAAI Conference on Weblogs and Social Media, 2012, pp. 322–329.

[22] T. Sakaki, M. Okazaki, and Y. Matsuo, "Earthquake Shakes Twitter Users: Real-Time Event Detection by Social Sensors," Proc. 19th Int. Conf. World Wide Web, pp. 851–860, 2010.

[23] K. Starbird, G. Muzny, and L. Palen, "Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions," in ISCRAM, 2012, pp. 1–10.

[24] A. Tumasjan, T. Sprenger, P. Sandner, and I. Welpe, "Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment," ICWSM, pp. 178–185, 2010.

[25] J. Ugander, B. Karrer, L. Backstrom, and C. Marlow, "The Anatomy of the Facebook Social Graph," pp. 1–17.

[26] W. Weerkamp, S. Carter, and M. Tsagkias, "How People Use Twitter in Different Languages," no. 1.

[27] L. Weng, F. Menczer, and Y.-Y. Ahn, "Predicting Successful Memes using Network and Community Structure," Association for the Advancement of AI (www.aaai.org), 2014.


ANNEX:


'''
Created on Apr 5, 2014

@author: Prick
'''
import urllib2
import pickle
import threading
import sys
import time
from pprint import pprint
import traceback


class AppUpdate:
    date = ''
    lastUpdate = ''
    likes = 0          # Google+ recommendations
    rating = 0
    stars = []         # [5 stars, 4 stars, 3 stars, 2 stars, 1 star]
    downloads = 0
    price = 0
    userratings = 0
    isValid = True

    def __init__(self):
        self.stars = [0] * 5

    def __str__(self):
        return self.date + " (" + self.lastUpdate + ") = " + str(self.isValid) + \
            " [" + str(self.likes) + ", " + str(self.rating) + \
            " (#" + str(self.userratings) + "), " + str(self.downloads) + \
            ", " + str(self.stars) + ", " + str(self.price) + "]"


class AppInfo:
    name = ''
    description = ''
    url = ''
    playstoreurl = ''
    androidpackage = ''
    checks = []

    def __init__(self):
        self.checks = []

    def __str__(self):
        s = "name = " + self.name + "\n"
        # s += "description = " + self.description + "\n"
        for update in self.checks:
            s += '\t' + str(update) + '\n'
        return s


def mineAppInfo(url, appdict, dontAdd=False):
    appinfo = AppInfo()
    appinfo.url = url
    appUpdate = AppUpdate()
    appUpdate.date = time.strftime("%d/%m/%Y %H:%M:%S")

    try:
        # AppBrain only accepts requests that look like they come from a browser
        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}
        req = urllib2.Request(url, headers=hdr)
        page = urllib2.urlopen(req)
        html = page.read()
        page.close()  # close page

        keyword = "This app is unfortunately no longer available on the Android market."
        begin = html.find(keyword)
        if begin >= 0:  # App removed from play store!
            print url, ' App removed from play store!'
            return

        keyword = "<title>"
        begin = html.find(keyword) + len(keyword)
        end = html.find(" |", begin)
        appinfo.name = html[begin:end]
        if appinfo.name == "AppBrain - Android app content warning":
            print url, ' App only for mature audience!'
            return

        # keyword = "<span class=\"app-bluebold\">User ratings</span><br>"
        # keyword2 = "<span>"
        keyword = "title=\"AppBrain score (0-100)\">"
        begin = html.find(keyword) + len(keyword) + 1  # +1 for next line
        end = html.find("\n", begin)
        appUpdate.rating = int(html[begin:end])

        keyword2 = "User ratings</span><br>"  # line lost at a page break; reconstructed from the commented-out keyword above
        keyword3 = "<span>"
        begin = html.find(keyword2) + len(keyword2)
        begin = html.find(keyword3, begin) + len(keyword3)
        end = html.find("<", begin)
        appUpdate.userratings = int(html[begin:end])

        keyword = "<span itemprop=\"numdownloads\">"
        begin = html.find(keyword) + len(keyword)
        end = html.find("+", begin)
        appUpdate.downloads = int(html[begin:end].replace(",", ""))

        keyword = "New Price: <b> $"
        begin = html.find(keyword) + len(keyword)
        if begin > len(keyword):
            end = html.find("<", begin)
            appUpdate.price = float(html[begin:end])
        else:
            appUpdate.price = 0  # FREE

        keyword = "<div class=\"app_descriptiontab\">"
        begin = html.find(keyword) + len(keyword) + 1
        end = html.find("\n", begin)
        appinfo.description = html[begin:end].replace("<br>", "\n")

        # keyword = "googleplay"
        # keyword2 = "<a href=\""
        # begin = html.find(keyword)
        # begin = html.rfind(keyword2,0,begin) + len(keyword2)
        # end = html.find("\"",begin)
        # appinfo.playstoreurl = html[begin:end]

        keyword = "<a href=\"http://play.google.com/store/apps/details?id="
        begin = html.find(keyword) + len(keyword)
        end = html.find("&", begin)
        appinfo.androidpackage = html[begin:end]
        appinfo.playstoreurl = "http://play.google.com/store/apps/details?id=" + appinfo.androidpackage

        # https://apis.google.com/_/+1/fastbutton?usegapi=1&annotation=inline&url=https%3A%2F%2Fmarket.android.com%2Fdetails%3Fid%3Dcom.mobilesrepublic.appy
        page = urllib2.urlopen(appinfo.playstoreurl)
        html = page.read()
        page.close()  # close page

        keyword = "span class=\"bar-number\">"
        begin = 0
        for i in range(5):
            begin = html.find(keyword, begin) + len(keyword)  # line lost at a page break; reconstructed from the pattern used above
            end = html.find("<", begin)
            appUpdate.stars[i] = float(html[begin:end].replace(".", "").replace(",", ""))

        # print appinfo.playstoreurl," == ", appUpdate.stars
        keyword = "itemprop=\"datePublished\">"
        begin = html.find(keyword, begin) + len(keyword)
        end = html.find("<", begin)
        appUpdate.lastUpdate = html[begin:end]

        gplusurl = "https://apis.google.com/_/+1/fastbutton?usegapi=1&annotation=inline&url=https%3A%2F%2Fmarket.android.com%2Fdetails%3Fid%3D" + appinfo.androidpackage
        page = urllib2.urlopen(gplusurl)
        html = page.read()
        page.close()  # close page

        keyword = "<span class=\"A8 RZa\">+"
        begin = html.find(keyword) + len(keyword)
        if begin > len(keyword):
            end = html.find(" ", begin)
            appUpdate.likes = int(html[begin:end])
        else:
            appUpdate.likes = 0
    except:
        appUpdate.isValid = False
        # print sys.exc_info()[0]
        traceback.print_exc(file=sys.stdout)
        try:
            page.close()  # close page at all times
        except:
            print sys.exc_info()[0]  # may fail when the page is already closed or was never opened

    if appinfo.name not in appdict:
        if dontAdd == True:
            print "App changed its name! cannot re-add ", appinfo.name
            return
        appdict[appinfo.name] = appinfo
    appdict[appinfo.name].checks.append(appUpdate)
    print str(appdict[appinfo.name])


def mineParalel(url, appdict):
    # opener = urllib2.build_opener()
    # opener.addheaders.append(('Cookie', 'agentok=1'))
    # f = opener.open(url)
    # html = f.read()
    # new security AppBrain: accepts only requests from browsers
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',  # line lost at a page break; reconstructed from mineAppInfo
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    req = urllib2.Request(url, headers=hdr)
    try:
        page = urllib2.urlopen(req)
        html = page.read()
        page.close()
    except:
        print "error, could not open url: ", url
        return

    allthreaders = []
    index = 0
    keyword = '<a href="/app/'
    while True:
        index = html.find(keyword, index + 1)
        if index == -1:
            break
        end = html.find('\"', index + len(keyword))
        link = "http://www.appbrain.com/app/" + html[(index + len(keyword)):end]
        if link.find('%') == -1:  # filter arabic signs
            print 'link: ', link
            t = threading.Thread(target=mineAppInfo, args=(link, appdict))
            t.start()
            allthreaders.append(t)
            # break
            # mineAppInfo(link,appdict)

    # search next button
    hasnextButton = html.find('<b>Next</b>')
    if hasnextButton != -1:
        index1 = html.rfind('=\'', 0, hasnextButton) + 2
        index2 = html.find('\'', index1 + 1)
        linkadres = "http://www.appbrain.com" + html[index1:index2]
        mineParalel(linkadres, appdict)

    for t in allthreaders:
        t.join()


if __name__ == '__main__':
    url = "http://www.appbrain.com/apps/hot-week/books-and-reference/"
    # f = open('App_hot')
    # appdict = pickle.load(f)
    # f.close()
    appdict = {}

    mineParalel(url, appdict)

    # f = open('App_hot', 'wb')
    # pickle.dump(appdict, f)
    # f.close()
    print 'FINISHED'
