Bachelor Informatica

Advanced analysis of Twitter data: study on the attitude towards physical activity at various locations

Jens Kalshoven

June 8, 2018

Supervisor(s): dr. S. Wang

Informatica, Universiteit van Amsterdam


Abstract

The life expectancy of people in large cities is lower than that of people in less populated areas. One of the reasons for this is the lack of physical activity of these citizens. The Playful Data-Driven Active Urban Living (PAUL) project, from the UvA, wants to create a personalized application that helps people to do more physical activity, for example by making them more interested in it. However, before such an application is created, there needs to be scientific proof that there is indeed a difference in the attitude towards physical activity between large cities and less populated areas. To prove this, a dataset containing people's opinions about physical activity is needed, which can then be analyzed. Such a dataset is not available yet, which is where this project comes in. The project is separated into three steps. The first step is to create a program which collects this data. The source of the data will be Tweets from Twitter that contain words about physical activity. These Tweets are processed by a program created using Apache Storm, which has two key components. The first component links the coordinates of a Tweet to the closest city. The second is the sentiment analysis, which converts the text of a Tweet to a polarity that shows how positive or negative the sentiment of the Tweet is. The second step is to collect the data using the created program, which takes three weeks. The third step is to analyze the collected dataset. This analysis shows that there is indeed a negative correlation between the mean polarity and the population size of a location. This provides proof that people in large cities are less positive about physical activity than people in less populated areas, and thus that the application from the PAUL project could be helpful.


Contents

1 Introduction

2 Theoretical Background
  2.1 Twitter API
    2.1.1 Tweet Object
  2.2 Location Conversion

3 The Topology for Data Collection
  3.1 Keywords
  3.2 Data Collection
    3.2.1 Data Processing
    3.2.2 Apache Storm
  3.3 Topology
    3.3.1 Incoming Data
    3.3.2 Location Conversion
    3.3.3 Sentiment Analysis
    3.3.4 Grouping
    3.3.5 Output

4 Data Analysis
  4.1 Data Collection
  4.2 Netherlands
    4.2.1 Correlation on State Level
    4.2.2 Correlation on County Level
  4.3 United States
    4.3.1 Correlation on State Level
    4.3.2 Correlation on County Level
  4.4 Further Research

5 Conclusion

6 Appendix
  6.1 Keywords
    6.1.1 Dutch
    6.1.2 English
  6.2 Output
    6.2.1 Statistics


CHAPTER 1

Introduction

The health of people in large cities is worse than that of people in less populated areas; as a result, their life expectancy is also lower. One of the reasons for this is the lack of physical activity of these citizens. The Playful Data-Driven Active Urban Living (PAUL) project[10], from the UvA, wants to investigate whether their physical activity can be increased by the use of personalized applications. At this moment, however, there is a lack of scientific proof for these kinds of applications. So to be able to provide proof, a dataset containing people's opinions on physical activity will have to be analyzed. An issue is that this dataset doesn't exist yet, so it will have to be collected first. Once it has been collected, the dataset can be used to investigate the correlation between the attitude of people towards physical activity and their location. To gather this data, a program has to be created and a data source has to be chosen. For this project the data source will be Tweets from Twitter, which will be collected using the Twitter API[19]. To link the Tweets to a location, the coordinates associated with each Tweet will be used. These coordinates are however only provided if the Twitter user has enabled this option. Once the Tweets come in, they have to be processed. In this project they will be processed by a program created in Apache Storm[1]. It will start by filtering these Tweets on physical activity related keywords like "workout". Next, the location of each Tweet will be determined by finding the city closest to the coordinates linked to the Tweet. Then the text of the Tweet will be converted to a sentiment value using sentiment analysis. The values coming from this process will be grouped per location and can then be used as a dataset to do analysis on. Concluding, the three main contributions of this project will be:

1. A program that will be able to collect Tweets from Twitter and process them. Hereby it is important that the program is clear and thus easy to understand. The main reason for this is that it might be used by others in the PAUL project[10] later, so others will need to understand it too.

2. A dataset containing Tweets with people's opinions on physical activity. This dataset will be collected for the Netherlands and the United States by the program created for this.

3. Analysis on the dataset to investigate the correlation between the attitude of people towards physical activity and their location.


CHAPTER 2

Theoretical Background

This chapter will discuss some of the basic components of the program for the data collection. First it will give some background on the Twitter API[19] used to collect the Tweets from Twitter. Next, an important component within the program for data collection will be discussed, namely location conversion. For this, an explanation of the algorithm used to convert coordinates to a location will be given.

2.1 Twitter API

During this project the dataset will be collected from Twitter, a social networking platform.

On this platform people can post short messages, better known as Tweets. One useful property of these Tweets is that they have a maximum of 280 characters, which makes sure that people can't talk about too many different subjects. Thus if one sentence is about physical activity, the rest is probably about physical activity too.

To collect these Tweets from Twitter it won't be necessary to create a self-made web scraper, as Twitter has already created an API for this. For this project the Twitter API[19] will be used to collect Tweets for analysis, but that isn't the only reason someone could use it. One other main reason would be advertisement, as the API also provides tools for campaigns and targeting.

2.1.1 Tweet Object

The Twitter API converts each Tweet and its associated information to a Tweet object[18], also known as a Status. This object contains all the information about a Tweet and some other objects, which contain information themselves. These objects are all encoded using JavaScript Object Notation (JSON). The main Tweet object contains information about the Tweet itself, for instance the text, the language, the coordinates and whether it is a reply, quote or retweet. Next, there are three other sub-objects contained in the main object, namely:

• The user object, containing information about the user that wrote the Tweet, for example the person's name, location and language, but also the number of followers, friends, tweets and favourites.

• The entities object, that contains information about specific parts of the Tweet. For instance hashtags, urls, videos, photos, user mentions and polls.

• The place object, which contains information about the location the Tweet was sent from and the location of the user. Next to coordinates it also has information about the location, like its name, the country it is in, the type of place it is and its bounding box. The bounding box consists of four coordinates, which are the corners of the box surrounding the place (Figure 2.1).
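To make this structure concrete, below is a heavily trimmed sketch of a Tweet object written as a Python dict. The field names follow the Twitter API data dictionary; all values are made up for illustration.

```python
# A heavily trimmed Tweet object as a Python dict. Field names follow
# the Twitter API data dictionary; the values are made-up examples.
tweet = {
    "id": 1234567890,
    "text": "Great workout this morning! #fit",
    "lang": "en",
    "coordinates": {"type": "Point", "coordinates": [4.88969, 52.37403]},
    "user": {"name": "Jane Doe", "location": "Amsterdam", "followers_count": 42},
    "entities": {"hashtags": [{"text": "fit"}], "urls": [], "user_mentions": []},
    "place": {
        "name": "Amsterdam",
        "country_code": "NL",
        "place_type": "city",
        # Four corner coordinates of the box surrounding the place.
        "bounding_box": {"type": "Polygon", "coordinates": [[
            [4.73, 52.28], [5.07, 52.28], [5.07, 52.43], [4.73, 52.43],
        ]]},
    },
}
```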

Figure 2.1: Bounding box of the Netherlands

2.2 Location Conversion

An important part of this project is converting the coordinates linked to a Tweet to a location. One way of doing this is by finding the location closest to the coordinates, for example the closest city. To find this city the project makes use of the nearest neighbour search in a k-d tree, which is an abbreviation for k-dimensional tree. This is a form of binary space partitioning tree, which can be used for multidimensional searches.

The first step in this search is building up the k-d tree. In this case the space of the k-d tree is the entire world, so latitudes from -90 to 90 and longitudes from -180 to 180. In this space there are points, which are the coordinates of the cities. These points will now be used to split the space into planes by doing the following:

1. Choose a plane; the first time this will be the entire space. For the following steps only the points in the plane will be used.

2. Sort the points on one axis, in this case longitude or latitude, alternating between them per level.

3. Take the median coordinate.

4. Split the plane on the chosen axis of the median coordinate.

5. All points lower than the median coordinate will be gathered in the left child of the node; all points higher will be gathered in the right child.

6. Repeat from step 1 for each of the two new planes.

This process will end up with a space like the one in figure 2.2 and a tree like the one in figure 2.3. Building up the tree takes some time, but this is only necessary once at the beginning, and all following nearest neighbour queries can use this tree.

Figure 2.2: k-d tree decomposition for the point set (2,3), (5,4), (9,6), (4,7), (8,1), (7,2)

Figure 2.3: The resulting k-d tree.

Now the nearest neighbour searches can start. Say you are looking for the point closest to A, which has location (B, C). The search starts by finding the leaf in which A would be located. This is done with the following process, starting at the root node:

1. Take the value of the axis on which the current node was split. Let's call it F.

2. If B or C is smaller than F, depending on the axis, go to the left child. Otherwise go to the right child.

3. Repeat till a leaf has been reached.

Once at the leaf node, find the point closest to A using the Pythagorean theorem and call the distance between A and that closest point W. Now the rest of the tree can be searched as follows:

1. Go to the parent of the node.

2. If it has an unchecked child, go to that child. Otherwise go back to step 1.

3. Every plane has a bounding box containing its outermost points. If W is smaller than the distance between A and that bounding box, go back to step 1.

4. Check if one of the points in the current node is closer than W; if so, that distance replaces W.

5. Repeat till all nodes have been checked.

During this process it is important to know that if the bounding box of a non-leaf node is too far away, none of its children have to be checked either. This makes the nearest neighbour search in a k-d tree an O(log n) operation in the case of randomly distributed points.
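As an illustration of this lookup, the sketch below builds a k-d tree over a handful of city coordinates with SciPy's cKDTree (a library implementation of the structure described above, not the one used in the project) and queries the nearest city. Plain Euclidean distance on latitude/longitude is an approximation, matching the Pythagorean distance used in the search above.

```python
import numpy as np
from scipy.spatial import cKDTree

# A few example cities with (latitude, longitude) coordinates.
cities = ["Amsterdam", "Utrecht", "Rotterdam"]
coords = np.array([
    [52.37403, 4.88969],
    [52.09083, 5.12222],
    [51.92250, 4.47917],
])

# Build the tree once; every following query reuses it.
tree = cKDTree(coords)

def closest_city(lat, lon):
    """Return the nearest city and its (Euclidean) distance."""
    distance, index = tree.query([lat, lon])
    return cities[index], distance

print(closest_city(52.35, 4.91))  # -> ('Amsterdam', ...)
```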


CHAPTER 3

The Topology for Data Collection

This chapter will provide more in-depth information on the topology created for collecting and processing the data. First the keywords associated with physical activity will be discussed. Next it will introduce the computation system used for the data collection, Apache Storm[1], and some of the other options. Finally the actual topology for the data collection will be discussed in five parts. First it will show how the incoming data is handled. Second it will show the location conversion. Then it will introduce the sentiment analysis algorithm used for this project. Next it will discuss the grouping of the data per city, county and state. Finally it will give some more information on the output of the program, for example its visualization types.

3.1 Keywords

The whole point of the processor is gathering data with people's opinions on physical activity. So one of the first things that needed to be done was creating a list of words associated with physical activity. Using this word list it is then possible to accept or reject Tweets based on the criterion that at least one of these words has to be in them. Since the sentiment analysis library accepts both English and Dutch, this word list needs a version for both languages. The words have been chosen in three categories:

1. Verbs. These are verbs with their conjugations that are completely related to physical activity. For example hike, hiked and hiking.

2. Sports. This is a list of the most popular sports in America[8]. For example football, baseball and soccer.

3. Other. These are words that couldn’t really be categorized as a separate sport and are thus added in this category. For instance bicycle, exercise and workout.

These lists were then tested by letting the data processor collect Tweets with these words. These Tweets were manually checked, and words that turned out to be not completely related to physical activity were removed. An example of this is the verb "to run", as "run" is used for a lot of things other than just the physical activity, for example "The project was run by Alex" or "I am running late today".

After removing words like these, the lists were ready to be used in the program. These lists can be seen in section 6.1, where 6.1.1 contains the Dutch version and 6.1.2 the English version. A sketch of the resulting keyword check is shown below.
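This minimal sketch uses a small excerpt of the English list from section 6.1.2 and whole-word matching, so that for instance "training" does not match inside an unrelated longer word.

```python
import re

# A small excerpt of the English keyword list from section 6.1.2.
KEYWORDS = {"hike", "hiked", "hiking", "workout", "soccer", "baseball"}

def contains_keyword(text):
    """Accept a text only if at least one keyword occurs as a whole word."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(word in KEYWORDS for word in words)

print(contains_keyword("Great workout this morning!"))  # True
print(contains_keyword("The project was run by Alex"))  # False
```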


3.2 Data Collection

3.2.1 Data Processing

Now the data source has been chosen, it is time to choose how the incoming data will be processed. For this there are two main options:

1. Batch processing. This is a way of processing where a larger set of the data gets processed at once. For this project, an option would be to open the filter stream and write all the data to a file. Then every hour the file could be processed by the data processor. An option here would be to use Hadoop MapReduce. This is a programming model used to do a set of calculations on a large dataset in a short time by running it in parallel on a cluster.

2. Stream processing, also known as realtime processing. With this approach each data element will be processed individually as it comes in. So in the case of this project, every Tweet will be processed separately. Just like batch processing, stream processing can also be run in parallel on a cluster. This is because it uses flows to process the data and each part of this flow can be run in parallel. A system to do this with is Apache Storm[1], which has been used for this project.

3.2.2 Apache Storm

During this project the system used to collect and process the data is Apache Storm[1]. This is a realtime computation system that uses a topology to set up the program. This topology is very easy to use and it makes everything clear for others too. A general Storm program consists of so-called spouts and bolts; these are components of the data processor and in combination they form the topology of the program. A topology starts at the spouts, which are the components where the data comes in. The data will be emitted as Tuples, which are key-value based and can also be read as JSON objects. Once a Tweet has been emitted into the topology it ends up at the first bolt in the flow. Once data arrives at a bolt, it will generally be processed in one of a few main ways:

• The data will not be sent any further as it doesn't meet a requirement.

• The data will be changed and sent further down the topology.

• The data will be converted to a form which will become output of the program. In this case the data has generally arrived at an 'end bolt', so it would probably not be sent any further down the topology.

The links between the spouts and bolts are set up in a separate file, and all of the bolts, spouts and links form the topology (Figure 3.1). Because of the way the topology is built up, it is possible to run bolts on different nodes and still process the data. It is even possible to run multiple instances of the same bolt, which comes in handy if one bolt is the bottleneck of the topology. This way problems like these are greatly reduced. The scalability has also been benchmarked by Apache Storm, and it turned out that it was able to process one million 100-byte messages per second per node on two Intel E5645@2.4Ghz processors with 24 GB of memory[13].

Advantages

For this project the scalability hasn’t been a huge advantage since the data stream wasn’t very big. There were however some other advantages of Storm that were very useful.

First of all, Apache Storm[1] has a multi-lang protocol. This means that it is possible to write the bolts and spouts in different languages. Apache Storm itself is written in Java, but because of the way the data gets emitted to other bolts it isn't a problem to use other languages. As long as the output uses JSON, a component can communicate with the other bolts. This feature came in very handy for this project, as the sentiment analysis had to be done in Python.

Figure 3.1: Example of an Apache Storm topology (source: https://dzone.com/articles/apache-storm-architecture, accessed on 12 April 2018)

Another important factor of this project was that it might be used for the PAUL project[10] later. So it needed to be easy to understand what happens in the program. Next to this, it might need to be used in a different way, so it should be easy to change certain things without breaking the rest of the program. This is where the topology comes in. Because everything is set up in such a modular way, it is easy to do things like adding more bolts or changing something in a bolt.

3.3 Topology

As was explained in section 3.2.2, a topology in Apache Storm consists of spouts and bolts, where spouts emit the data into the topology and bolts do transformations on the data. The topology of the data processor for this project can be seen in figure 3.2. It consists of 1 spout and 13 bolts.

Figure 3.2: Apache Storm topology for this project

Of these 13 bolts, 4 would have to be written in Python and the rest in Java. This however became rather messy, and since the program might be used for the PAUL project[10], it should stay as clear as possible. Thus other options were considered, and one of these options was to do everything in Python. This would however mean that the regular Apache Storm couldn't be used anymore, as that requires Java. The solution for this was to use StreamParse[16], a Python library used to run Python code on an Apache Storm cluster. StreamParse bundles all the bolts, spouts and other pieces of code in a Java Archive (JAR). Next, the project needs a Python file which declares the topology that will be used. This file is then converted to a Thrift[2] topology, which is passed on to the Storm cluster. From there the code will be run. Apache Thrift[2] is very important in this process, as it allows scalable cross-language service development. This works for many languages, for instance C++, Java, Python, PHP, Ruby and C. A sketch of such a topology file is shown below.
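This minimal sketch shows what a StreamParse topology declaration could look like. The module and class names (spouts.tweets, bolts.filter, and so on) are illustrative assumptions, not the actual names used in the project, and only the first part of the flow is shown.

```python
from streamparse import Topology

# Hypothetical module layout for the project's components.
from spouts.tweets import TweetSpout
from bolts.filter import FilterBolt
from bolts.parser import ParserBolt
from bolts.location import LocationBolt

class TweetTopology(Topology):
    # Spout emitting raw Tweet objects into the topology.
    tweets = TweetSpout.spec()
    # Bolts, each consuming the output of the previous component.
    filtered = FilterBolt.spec(inputs=[tweets])
    parsed = ParserBolt.spec(inputs=[filtered])
    located = LocationBolt.spec(inputs=[parsed])
```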

Now the spouts and bolts of the topology in figure 3.2 will be discussed in more detail.

3.3.1 Incoming Data

For this project the first step is to collect a dataset. This dataset needs to consist of texts from people, but each text needs to meet some criteria:

1. The text needs to be about physical activity. This means that it needs to contain at least one word about this, for example "workout" or "soccer".

2. A GPS location needs to come with the text, which is necessary for two reasons. First of all during this project only the United States and the Netherlands are of interest. Second of all it wouldn’t be possible to analyze the correlation between the attitude towards physical activity and the location if this information wasn’t available.

3. The text needs to be either in Dutch or English, since the sentiment analysis library used only supports these languages.

To easily collect these Tweets, the Twitter API[19] will be used. As was already briefly discussed in section 2.1, the Twitter API can be used to gather Tweets in the form of JSON objects, otherwise known as Tweet objects[18].

Options

To collect these Tweet objects, the Twitter API provides various options. There is however a lot of difference between the premium and the standard version of the API. For this project it was only possible to use the standard version, which provides two options.

The first option is to use the search API[14]. This API lets you search Tweets based on a set of parameters, for example the coordinates from which a Tweet was sent, its language, and a time window during which it was sent. This seems like a good option, as you can get a lot of Tweets that are interesting for the project in one query. However, the standard version has some important limits. For instance, it only returns a sample with a maximum of 100 Tweets per query, and it only returns Tweets from the past 7 days. Next to this, the query would become too long if the keywords associated with physical activity were added, so only a small percentage of the returned Tweets would be about physical activity. This could all have been solved by using the premium version, as it allows you to search all Tweets instead of just those from the past 7 days. It also increases the limit on the number of Tweets per query to 500, and the keywords can be added to the query. There is however one more issue: since it takes a sample, it is possible that some Tweets are in the result of multiple queries, so the id of each Tweet would need to be checked too before it gets added to the dataset.

The second option is to use the filter realtime Tweets API[5]. This API can be used to open a stream from which Tweets come in in realtime. There are two types of streams:

• A sample stream, which returns a small random sample of all public Tweets.

• A filter stream, which will take a sample of all Tweets that are being tweeted and meet certain criteria.

For this project the filter stream will be used, as we are only interested in certain Tweets. For this the track, locations and language filters come in. The track filter allows you to filter on Tweets containing certain words, in this case words associated with physical activity. Secondly, the locations filter allows you to set bounding boxes from which the Tweets need to be sent. Finally, the language filter allows you to take only Tweets in certain languages. The best option would be to set all three filters and thus only receive Tweets meeting all three requirements. This is however not possible, since the filters would then accept Tweets that contain a keyword or are in the language or are from a certain location. This would mean that even more Tweets need to be processed, which is unnecessary.

Location Data

Another problem during this project is the location data. A Twitter user can choose to provide their location along with their Tweets in the privacy settings. This option is however only enabled by a small percentage of the users, so a lot of Tweets can't be used because of this. For this problem the premium API[19] would again be a solution, since it also provides a location of the user based on their profile. This function is part of the Twitter Firehose API, an API that outputs all Tweets meeting the requirements set by the filters instead of a sample. This would mean that a lot more Tweets can be used for the dataset.

Spout

The topology starts with a spout. Here the Twitter streaming API[5] gets connected and Tweets are emitted into the topology. To set up this stream, the project makes use of Tweepy[17]. This is a Python library created as easy access to the Twitter API[19]. It can be used to create a StreamListener, which allows you to download incoming Tweets in realtime. When setting up this stream, you need to provide a filter. In the case of this project the choice is between keywords, location and language. Since combining the three filters would result in Tweets that are in the right language or in the right location or contain one of the keywords, only one filter is used. This should be the filter with the least output. After some testing this turned out to be the location filter, so only that filter is used. Each incoming Tweet object[18] is then put in a queue. This queue gets read by the spout, and each time the spout is ready to process a Tweet object it takes the oldest one from the queue and emits it. A sketch of such a listener is shown below.
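This sketch shows how the stream could be wired to such a queue with Tweepy (version 3.x, current at the time of writing). The credentials are placeholders, and the bounding boxes are rough approximations of the Netherlands and the contiguous United States, not the project's actual values.

```python
import queue
import tweepy

tweet_queue = queue.Queue()

class QueueingListener(tweepy.StreamListener):
    """Puts every incoming Tweet object (as JSON) on the queue,
    from which the spout emits them one by one."""
    def on_status(self, status):
        tweet_queue.put(status._json)

    def on_error(self, status_code):
        return False  # stop streaming on errors such as rate limiting

# Placeholder credentials.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

stream = tweepy.Stream(auth, QueueingListener())
# One rough bounding box per country: [west, south, east, north].
stream.filter(locations=[3.3, 50.7, 7.2, 53.6,
                         -125.0, 24.0, -66.9, 49.5])
```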

Filtering and Parsing

Once emitted into the topology, the Tweet object[18] reaches the filter bolt. This is necessary since not all filters could be applied when opening the stream. This bolt checks the Tweet object on four criteria:

1. Does it have location info associated to it? For this project it is necessary to link each Tweet to a location, so Tweets without one will be rejected.

2. Is it from one of the countries that are being researched? In this case it checks whether the Tweet is from the United States or the Netherlands.

3. Is it in an accepted language? This is important since the program and sentiment analysis library used in this project only accept Dutch and English.

4. Does the Tweet contain one of the physical activity related keywords? During this project only Tweets about physical activity are of interest, so the others can be rejected.

Next is the parser bolt. This bolt extracts all the interesting information from the Tweet object and creates a Tuple with this data. Which data this is can be found in table 6.7.
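A sketch of the filter bolt's four checks as a StreamParse bolt follows. The Tweet JSON field names (coordinates, place.country_code, lang, text) come from the Twitter API; the module holding contains_keyword (the helper sketched in section 3.1) is a hypothetical name.

```python
from streamparse import Bolt

from bolts.keywords import contains_keyword  # hypothetical helper module

ACCEPTED_COUNTRIES = {"NL", "US"}
ACCEPTED_LANGUAGES = {"nl", "en"}

class FilterBolt(Bolt):
    """Applies the four criteria described above and only emits
    Tweets that pass all of them."""
    def process(self, tup):
        tweet = tup.values[0]
        place = tweet.get("place") or {}
        if tweet.get("coordinates") is None:
            return  # 1: no GPS location attached
        if place.get("country_code") not in ACCEPTED_COUNTRIES:
            return  # 2: outside the researched countries
        if tweet.get("lang") not in ACCEPTED_LANGUAGES:
            return  # 3: unsupported language
        if not contains_keyword(tweet.get("text", "")):
            return  # 4: not about physical activity
        self.emit([tweet])
```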


3.3.2 Location Conversion

Because of a filter in the topology, only Tweets with an associated geolocation are emitted. These geolocations are determined by the device the Tweet is sent from and have the form [longitude, latitude]. For this project we are however more interested in the city, county and state the Tweet was sent from. This conversion is done in the third bolt in the topology, the location bolt.

One way of doing this is by taking a list of cities with their coordinates and finding the city closest to the given coordinates. Such a list has already been created by GeoNames[6], a geographical database. It contains over 10 million geographical locations with info on the locations, like the county a location is in and the population size. For this project there is one particular file that is very interesting: the cities1000 file, which can be downloaded from http://download.geonames.org/export/dump/ (accessed on 12 April 2018). This file contains the coordinates and info of all cities with more than 1000 residents.

To find the closest city to the coordinates, Reverse Geocoder[12] is used. This is a Python library for offline reverse geocoding, developed by Ajay Thampi. It uses a nearest neighbour k-d tree to find the closest city to the coordinates, as explained in section 2.2. It also supports multi-threading during the search; this is however only interesting when checking multiple coordinates in one query. So if batch processing was used, it would help a lot, as can be seen in the benchmark of figure 3.3. Here mode 1 is single-threaded and mode 2 is multi-threaded. It can be seen that mode 2 is approximately twice as fast on a dataset of 10 million coordinates.

Figure 3.3: The performance of Reverse Geocoder for mode 1 (single-threaded) and mode 2 (multi-threaded) plotted for various input sizes.[12]

When this bolt is set up, it also sets up Reverse Geocoder[12], which generates the needed k-d tree once so it can be used for all future nearest neighbour searches. The bolt then takes the GPS location that was extracted from the Tweet object[18] and queries it in Reverse Geocoder, which returns the closest city with its other info. This info is extracted from the cities1000 file and includes the country, state and county the city is located in; the data has a form like the one in table 3.1. Here name is the name of the city, admin1 is the state, admin2 is the county and cc is the country code according to ISO-3166. The result is then added to the Tuple and emitted to the next bolt.



lat       lon        name         admin1         admin2              cc
44.97997  -93.26384  Minneapolis  Minnesota      Hennepin County     US
52.37403  4.88969    Amsterdam    North Holland  Gemeente Amsterdam  NL

Table 3.1: Example of the data in the cities1000 file

3.3.3 Sentiment Analysis

This section will go into the sentiment analysis part of the project and the bolts that have this task. For this the Pattern[9] library is used.

Pattern

Pattern[9] is a Python library used for several purposes. It was created by Tom De Smedt and Walter Daelemans from the University of Antwerp. The library can for example be used for web mining, machine learning, network analysis and, in this case, natural language processing. Because it covers many domains, these can also easily be combined. For example, it can use the web mining packages to gather data and then process the texts with the natural language processing packages. Next to all these domains, the library also supports many different languages, which is why both Dutch and English Tweets can be collected for the dataset. English sentiment analysis has existed for a long time, but good Dutch sentiment analysis packages are rare. The Dutch package was created by De Smedt and Daelemans as a case study in 2012[15]. Using Pattern they created a so-called subjectivity lexicon to be able to do sentiment analysis in Dutch, which they later bundled into Pattern.

Subjectivity Lexicon

The subjectivity lexicon created by De Smedt and Daelemans[4] consists of 1100 adjectives that occur frequently in online product reviews. Each of these adjectives has been manually annotated with three values:

1. Polarity. The polarity is a value between -1 and 1. This value describes how positive or negative a text is, where -1 is very negative and 1 is very positive.

2. Subjectivity. A text can generally be classified in one of two categories: either it is an objective fact or it is a subjective opinion. To describe how subjective a certain text is, a value between 0 and 1 is used, where 0 means it is a fact and 1 means it is an opinion.

3. Intensity. This value describes the influence of an adjective on a noun. For example, when comparing "incredibly good" and "pretty good", the intensity of "pretty" on "good" is not as high as that of "incredibly" on "good". Thus "incredibly" gets a higher intensity in the lexicon than "pretty".

To collect these adjectives they made use of the Pattern web mining package to collect approximately 14000 online Dutch book reviews from bol.com. From these reviews about 4200 adjectives were collected using the natural language package of Pattern. Of these 4200 adjectives, the 1100 most frequent ones have been annotated; these were all adjectives that occurred more than four times. The adjectives were then given to seven human annotators in a random order, who were asked to classify each adjective on polarity, subjectivity and intensity. Where needed, adjectives were also classified when used in different senses of the word. For example, in the sentences "That man is crazy" and "I am crazy in love" the word "crazy" has two completely different senses. These annotations were done using a triangle representation like the one in figure 3.4. During the annotation some adjectives were removed due to things like spelling errors, bringing the total lexicon to 1044 adjectives and 1526 word senses[4].

To improve the created lexicon they made use of distributional extraction. In this process, semantic relatedness between words is extracted using distributional information. This method also requires nouns, as the adjectives will be vectors and the nouns will be the features in a vector space model. In this model the value of a feature is determined by the frequency with which an adjective precedes a noun. Using this information, adjectives that precede mostly the same nouns can be described similarly. To calculate this similarity it takes the cosine of the angle between two vectors, also known as cosine similarity. After this they use dimensionality reduction and clustering by cosine distance to create groups of words that are semantically related. The nouns used for this were chosen by collecting 3 million words of text and taking the 2500 most frequent nouns. Next the features were determined, and for each adjective the top 20 most similar adjectives were determined. For example, for "fantastisch" (fantastic) the top 3 was "geweldig" (great) with 70% similarity, "prachtig" (beautiful) with 51% and "uitstekend" (excellent) with 50%. This process resulted in a lexicon with 3121 adjectives and 3713 senses.

Figure 3.4: Triangle representation with polarity and subjectivity axes.[4]

To test the lexicons, a set of 2000 Dutch book reviews was used. Here the polarity and subjectivity of a review are determined by using the lexicon on the words it knows and then taking the average values of these words in the review. So for example, if a sentence includes the words "fantastic" (polarity 1) and "good" (polarity 0.5), the polarity of the sentence becomes 0.75. It is important to note that these book reviews were not used to extract the initial lexicon. Also, they were evenly distributed between negative (1-2 star rating) and positive (4-5 star rating) opinions. During these tests a polarity greater than or equal to 0.1 was counted as positive and a polarity less than or equal to -0.1 as negative. To determine the performance of each lexicon, two values were calculated:

• The precision: True Positives / (True Positives + False Positives)

• The recall: True Positives / (True Positives + False Negatives)

During the first run the intensity was not taken into account; this resulted in a precision of 0.72 and a recall of 0.78 for the manually annotated lexicon. During the second run the intensity was used, which increased the recall to 0.82. The same tests were done for the automatically annotated lexicon, yet this did not increase the precision and recall, as 90% of the most frequent adjectives were already covered by the manually annotated one. The tests were also done on Dutch music CD reviews to test the cross-domain applicability. Here the precision was 0.70 and the recall 0.77. Thus the sentiment analysis package can also be used on different domains, for example Tweets in this case.

In the topology the process of sentiment analysis happens in two bolts. First there is the refactor bolt, which refactors the text of the Tweet to prepare it for the sentiment analysis. It mainly changes four elements of a Tweet. First of all, it removes mentions and URLs, as these can't be read by the sentiment analysis library. Next, it removes the "#" from the hashtags, as the sentiment analysis library can't read those either, but it is possible that it can read the word after the "#", so hashtags can't be removed completely. Finally, it translates emoticons in the text. This is done using a translation dictionary where each emoticon is translated to a word with a similar sentiment. So for example a smiling emoticon will be translated to "happy" and a crying emoticon will be translated to "sad".

Once the text has been refactored, it arrives at the sentiment analysis bolt. When this bolt is set up, it also sets up the sentiment analysis packages from Pattern[9] in Dutch and English. Once a Tuple comes in, the bolt checks the language of the Tweet, which was extracted from the Tweet object[18] in the parser. It then gives the text to the right sentiment analysis package, which returns a polarity and a subjectivity. The subjectivity is then compared to a threshold of 0.34: only if the subjectivity of the text is higher than 0.34 will the Tuple be emitted to the next bolts. This threshold was determined by the creators of the sentiment analysis package[4].
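The core of this bolt then reduces to a few lines with Pattern; this sketch assumes the language code is 'nl' or 'en', as guaranteed by the earlier filtering.

```python
from pattern.en import sentiment as sentiment_en
from pattern.nl import sentiment as sentiment_nl

SUBJECTIVITY_THRESHOLD = 0.34

def analyze(text, lang):
    """Return (polarity, subjectivity), or None if the text is too objective."""
    analyzer = sentiment_nl if lang == "nl" else sentiment_en
    polarity, subjectivity = analyzer(text)
    if subjectivity > SUBJECTIVITY_THRESHOLD:
        return polarity, subjectivity
    return None
```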

3.3.4 Grouping

Once the sentiment values have been determined, the data goes to four different bolts. Three of these bolts are responsible for grouping the data and will be discussed in this section. The fourth bolt outputs the Tweets and will be discussed in section 3.3.5.

The grouping bolts each have a dictionary. In this dictionary every key is a different city, county or state. For every key there is a list in which the polarities are saved. These lists can later be used for the analysis and output, for example in the form of box plots and statistics.
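A sketch of such a grouping dictionary, here for the state level only:

```python
from collections import defaultdict

# Key: state name; value: list of polarities seen for that state.
polarities_per_state = defaultdict(list)

def group(state, polarity):
    polarities_per_state[state].append(polarity)

group("Utrecht", 0.4)
group("Utrecht", -0.1)
print(polarities_per_state["Utrecht"])  # [0.4, -0.1]
```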

3.3.5 Output

The final step of the process is the output of the data. This is done in two main ways, either as raw data in CSV files or as visualizations. The techniques used for this will be discussed in the following two sections.

Raw Data

The first way of outputting the data is as raw data. The first place this happens is the CSV output bolt, which appends every new Tuple to a CSV file. All the information that is collected here can be found in table 6.7.

The second place is right after the grouping of the data, in the statistics bolt. This bolt uses the list of polarities per location to calculate the mean, the standard deviation and the number of Tweets.
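A sketch of these per-location statistics, computed with NumPy from the grouping dictionaries of section 3.3.4:

```python
import numpy as np

def statistics(polarities_per_location):
    """Mean, standard deviation and Tweet count per location."""
    return {
        location: (np.mean(values), np.std(values), len(values))
        for location, values in polarities_per_location.items()
    }

print(statistics({"Utrecht": [0.4, -0.1]}))
# -> {'Utrecht': (0.15000000000000002, 0.25, 2)}
```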

Visualization

The other form of output is visualization. This happens in the heat map and box plot bolts, which both come right after the grouping bolts. The box plots are created using PyPlot[11] from the MatPlotLib library[7]. These are only generated for the states, as there are too many counties and cities to fit in one box plot.

Finally, heat maps are created for the polarities. To be able to do this the program makes use of the BaseMap Toolkit[3] from the MatPlotLib library[7]. BaseMap can be used to plot 2D data on maps in Python. It does however not plot the data itself; it transforms coordinates to one of 25 different map projections using the PROJ.4 C library. MatPlotLib is then used to plot the lines and points with the transformed coordinates. To draw the lines, BaseMap makes use of shapefiles, which contain all the boundary lines of for example a country. The shapefiles for the Netherlands were collected from Imergis; for the United States they were collected from the United States Census Bureau (https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html, accessed on 20 April 2018). The polarity lists were then used to determine the color needed to represent the sentiment of each location. These visualizations are only done for the states and counties.
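A rough sketch of the map drawing with the BaseMap Toolkit is shown below; the shapefile path is a placeholder, and the coloring of individual regions from the polarity lists is left out.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# Mercator projection roughly covering the Netherlands.
m = Basemap(projection="merc",
            llcrnrlat=50.6, urcrnrlat=53.7,
            llcrnrlon=3.2, urcrnrlon=7.3, resolution="i")
# Placeholder path; the project used shapefiles from Imergis and
# the United States Census Bureau.
m.readshapefile("shapefiles/provinces", "provinces", drawbounds=True)
plt.show()
```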



CHAPTER 4

Data Analysis

This chapter is separated into two parts. First it gives some information on the data collection process, for example the machines used for this and some statistics. Second it provides the data analysis, which is again split into two parts: the Netherlands and the United States. The findings will be discussed per country, with the focus on the states and counties. A similar correlation between the attitude towards physical activity and the population size of the location is found in both countries.

4.1 Data Collection

Now the topology for the collection of the data is finished, it is time to run the program. To determine how long the program has to run to collect enough Tweets, some testing was done. This test ran from 13:00 to 18:00 CEST and counted the number of Tweets that met all of the following requirements:

1. It is either in Dutch or English.

2. It has coordinates linked to it.

3. It is from the Netherlands.

4. It contains a keyword associated with physical activity.

5. It has a subjectivity of at least 0.38.

This test showed that only about 1 Tweet every 8 minutes met all of these requirements for the Netherlands. It is also important to note that the test was run during the day; during the night it would be even less. This meant that collecting about 2000 Tweets would require 11 days of data collection, and to take the nights into account this had to be approximately doubled. Thus about three weeks of data collection were required. To be able to compare the results in the Netherlands with another country, the data will also be collected for the United States. If a similar correlation can be seen in both countries, this provides extra validation. To estimate the size of this dataset, the same test was done for the United States. This resulted in about 14 Tweets per minute meeting the requirements, so here three weeks of data collection would be plenty.

The data collection process started on May 7th 2018 at 9:00 on a Raspberry Pi 3B+ (https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/, accessed on 25 May 2018). The Raspberry Pi was able to run the program, but it took a lot of its resources. So to relieve it of some of its workload, the visualization bolts were disabled. This however turned out not to be enough, since after a week of data collection the program started to crash about twice a day. The following days the time between those crashes decreased rapidly, until eventually the program crashed about every 30 minutes. At this point the data collection process was paused until the cause of the crashes or a better option was found.

After two days of testing the reason for the crashes couldn't be determined, so the next best option was to hire a server to run the program on. The server used for this became a DigitalOcean droplet with 4 cores and 8GB RAM running Ubuntu 16.04.4. Once everything was set up on this server, the program ran without any more trouble until May 31st 2018 17:00. During this process it became clear how important backups were. Every thirty seconds the dictionaries in the grouping bolts discussed in section 3.3.4 were backed up by the backup bolt. This turned out to be very handy, as these dictionaries can now also be used to, for instance, create box plots comparing only a couple of cities. Next to this, these backups could be loaded when starting the grouping bolts, to resume as fast as possible. Another thing that was created to keep everything running is a bash script that detects whether the program is still running and restarts it if it crashed. This way no time is lost if the program crashes for an unknown reason.
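The thesis used a bash script for this; a Python equivalent sketch, with a placeholder start command, could look like:

```python
import subprocess
import time

# Restart the topology whenever the process exits. The command
# "sparse run" is a placeholder for however the program is started.
while True:
    process = subprocess.Popen(["sparse", "run"])
    process.wait()  # blocks until the program crashes or exits
    time.sleep(30)  # short pause before restarting
```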

After three weeks of collecting data on the Netherlands and the United States, the dataset was complete. The results of the analysis on this dataset will be discussed in the following two sections: first the Netherlands, then the United States.

4.2 Netherlands

In the Netherlands 1671 Tweets were collected. This is approximately 300 fewer Tweets than hoped for, and thus some of the locations have only a few Tweets to analyze. For example, when looking at the number of Tweets per state in table 4.1, it can be seen that the number varies a lot per state. A set of 21 Tweets from Flevoland, for instance, is a really small dataset to draw conclusions from.

Table 4.1: Statistics per state of the Netherlands, ordered by the number of Tweets.

State           Mean Polarity      Standard Deviation Polarity  Number of Tweets
Zuid-Holland     0.177902529669    0.447582543322               405
Noord-Holland    0.228488128397    0.446905901494               301
Noord-Brabant    0.25881083212     0.428812766333               229
Gelderland       0.236634728275    0.459174369797               222
Utrecht          0.181047554943    0.407257527644               174
Overijssel       0.30413997114     0.433332737429                77
Friesland        0.303438113777    0.385625794947                72
Groningen        0.355080128205    0.380036928245                52
Zeeland          0.182724358974    0.384248267635                39
Limburg          0.209800347222    0.413286690135                38
Drenthe         -0.00433773216031  0.500887695796                31
Flevoland        0.193650793651    0.260568984177                21

4.2.1 Correlation on State Level

As a visualization of these statistics, the heat map in figure 4.1 has been created. Here it can be seen that the states with larger cities, like Noord-Holland, Zuid-Holland and Utrecht, are less positive about physical activity than states with smaller cities, like Friesland, Groningen and Overijssel. Another point of interest in this figure is Drenthe, the state that is the most negative about physical activity according to this dataset. Most likely this is due to the lack of Tweets from this state, only 31 Tweets. Thus it is not really possible to include this state in the overall conclusions.

Another visualization is the box plot, which can be seen in figure 4.2. Here the states with fewer than 50 Tweets are not included. One of the notable things in this box plot is the lower quartile: it can again be seen that the lower quartile of the states with larger cities is significantly lower than that of the states with smaller cities.


To get a better idea of the influence of large cities on the mean polarity, some other data was combined with the dataset: the population size and population density per state, gathered from CBS StatLine (http://statline.cbs.nl/StatWeb/selection/?DM=SLNL&PA=70072NED&VW=T, accessed on 3 June 2018). Using this data, the correlations between the mean polarity and both variables have been checked, with the results shown in figures 4.3 and 4.4. A significant correlation can be seen in both figures. It shows that the more people there are in a state, and thus the bigger the cities are, the more negative the attitude towards physical activity. It is important to note that not all states have been included in these calculations; again, only the states with more than 50 Tweets were used, thus only 8 of the 12 states. It would be more interesting if this were also possible for the counties, but since there aren't enough Tweets for this, it isn't an option. The Pearson correlation has also been calculated (table 4.2); it shows that both the population size and the population density are correlated with the mean polarity and, most importantly, that both correlations are significant.

Figure 4.3: Linear regression of the mean polarity on the population size per state in the Netherlands.

Figure 4.4: Linear regression of the mean polarity on the population density (per square kilometer) per state in the Netherlands.

Table 4.2: Pearson correlation between mean polarity and population size / population density in the states of the Netherlands. The significance of each correlation is given within parentheses.

                     mean polarity
population size      -0.713 (0.024)
population density   -0.857 (0.003)
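Such a Pearson correlation with its significance can be computed with SciPy. The numbers below are made-up toy values purely to show the call; the real inputs are the per-state mean polarities of table 4.1 and the CBS StatLine figures.

```python
from scipy.stats import pearsonr

# Toy values, not the real data.
mean_polarity = [0.18, 0.23, 0.26, 0.30, 0.36]
population_density = [1300, 1100, 500, 220, 100]

r, p = pearsonr(population_density, mean_polarity)
print(r, p)  # r < 0 indicates a negative correlation, p its significance
```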

4.2.2 Correlation on County Level

Going deeper into the data, the counties can be compared. However, due to the lack of Tweets, only a small selection of the counties can be used for the comparison: the 10 counties with the largest number of Tweets, which can be seen in table 4.3. Of these counties, Zuidhorn, Heerde, Zundert and Haaksbergen are counties in states without big cities. The mean polarity of these counties is 0.2504, while the mean polarity of the other counties is 0.1565, which is a significant difference.



Table 4.3: Statistics of the top 10 counties of the Netherlands ordered by the number of Tweets, where [1] means the county is in a state with big cities and [2] means it is not.

County                   Mean Polarity    Standard Deviation Polarity  Number of Tweets
Aalsmeer [1]             0.171497895143   0.461531680461               94
Goeree-Overflakkee [1]   0.0829846753121  0.431708414731               72
Westland [1]             0.243629960317   0.381025231529               70
Montfoort [1]            0.229003453202   0.420732192036               53
Zuidhorn [2]             0.334636752137   0.408922580135               39
Heerde [2]               0.21007127193    0.369122437388               38
Voorschoten [1]          0.166968390805   0.444764737422               29
Zundert [2]              0.248952586207   0.4200310504                 29
Haaksbergen [2]          0.189293154762   0.468815945557               28
Utrechtse Heuvelrug [1]  0.14324280754    0.389243860074               28

4.3 United States

In the United States 144537 Tweets have been collected, approximately 86 times more than in the Netherlands. This is unexpected, since the United States has a population of approximately 327.9 million (https://www.census.gov/popclock/, accessed on 1 June 2018) and the Netherlands has a population of about 17.1 million (http://www.worldometers.info/world-population/netherlands-population/, accessed on 1 June 2018), thus only 19 times more residents. This shows that Twitter is far less popular in the Netherlands.

An important thing to note is that during the collection of the data, the maximum size of the polarity list per location was set to 1000. This limit was not reached in the Netherlands, but it has been reached by most of the states in the United States. This should however still be a large enough dataset to calculate the statistics. These statistics can be seen in table 6.8; only 15 of the 50 states have not reached 1000 Tweets after these three weeks.

4.3.1 Correlation on State Level

The states have been visualized in a heat map, which can be seen in figure 4.5. In this figure it is much less clear that there is a correlation between the mean polarity and the population density. For example, California and Texas are very positive, while also having a high population density. To test whether this is really the case, the population size and population density of each state have been gathered from IPL2. This data has then been used to calculate a linear regression between the mean polarity and the population size and density in order to find correlations. The resulting graphs can be seen in figures 4.6 and 4.7. Here it becomes clear that there is no correlation between the variables like the one in the Netherlands; this time the polarity even increases with the population size. A possible explanation for this is that the states in the United States are much bigger than those in the Netherlands, so the population density explains less about the polarity. The small decrease when regressing the mean polarity on the population density is insignificant, which is also shown by the Pearson correlation in table 4.4.



Table 4.4: Pearson correlation between mean polarity and population size / population density in the states of the United States. The significance of each correlation is given within parentheses.

                     mean polarity
population size      0.434 (0.001)
population density   -0.054 (0.354)

Figure 4.5: Heat map showing the mean polarity per state in the United States

Figure 4.6: Linear regression of the mean polarity on the population size per state in the United States.

Figure 4.7: Linear regression of the mean polarity on the population density (per square mile) per state in the United States.


4.3.2 Correlation on County Level

To get more specific results, a similar analysis has been done on the counties. For the counties the population size and density have again been gathered, this time from American FactFinder. A lot of the counties had fewer than 50 Tweets, so these counties have been excluded from the following results. First of all, there are the linear regressions of the mean polarity on the population size and density, which can be seen in figures 4.8 and 4.9. Here a negative relation between the polarity and the population size and density can be seen. Thus the positive relation seen at the state level is possibly indeed due to the size of the states. It is also important to note that the correlation between the polarity and the population size is significant, while this is not the case for the population density. This is according to the Pearson correlation, which can be seen in table 4.5.

Table 4.5: Pearson correlation between mean polarity and population size / population density in the counties of the United States. The significance of each correlation is given within parentheses.

                     mean polarity
population size      -0.070 (0.092)
population density   -0.025 (0.317)

Figure 4.8: Linear regression of the mean polarity on the population size per county in the United States.

Figure 4.9: Linear regression of the mean polarity on the population density (per square mile) per county in the United States.

4.4 Further Research

To get an even better understanding of the correlation, it would be possible to do the same regressions on a city level; however, for this more Tweets will be needed. This could simply be done by running the program for another few weeks, but another option would be to invest in the premium Twitter APIs[19]. Then there are two options. First of all, the Twitter Firehose API could be used, which provides all Tweets meeting the requirements instead of only a sample. Next to this, that API provides more location data and thus more Tweets that can be used in this project. The second option is to use the premium Twitter search API, which allows you to search for Tweets in the complete database and returns a set of Tweets meeting the set requirements. In that case it would also be an option to do batch processing instead of stream processing, as the Tweets would then come in batches instead of as a realtime stream.


It is also important to remember that these regressions only include a small set of variables, so other factors could also have a large influence. It is for example possible that people in less populated locations are generally happier, and that this also has an influence on the sentiment of the Tweets. To test this, another dataset would have to be collected in which all Tweets are accepted instead of only those about physical activity.


CHAPTER 5

Conclusion

The first step in this project was to create a clear and easy to understand program for data collection and data processing. The framework behind this program is Apache Storm[1], which is easy to understand due to its topology. The topology also makes it easy to change something in the program without breaking the other parts. After three weeks of gathering data, it can be concluded that the program works well. It is however necessary to run it on a relatively good server, as the Raspberry Pi 3B+ couldn't handle it anymore once the dataset got too big.

The second step was to collect a dataset containing Tweets with people's opinions on physical activity. This required the program to run for three weeks and collect a dataset for both the United States and the Netherlands. In the Netherlands this totalled 1671 Tweets, and in the United States 144537 Tweets. That is 86 times more Tweets than in the Netherlands, while the United States only has 19 times more inhabitants, which again shows how much more popular Twitter is in the United States.

The third and final step of this project was to analyze the dataset and search for a correlation between the location of people and their attitude towards physical activity. The analysis of the collected data shows that there is indeed a negative correlation between the attitude towards physical activity and how highly populated an area is. This effect can be seen in both the Netherlands and the United States; it is however more significant in the Netherlands.

For further research there are a few things to do. First of all, it would be interesting to collect more data. This can be done using the current program by running it for a few more weeks; the other option is to use one of the premium Twitter APIs. Another point of interest might be to investigate the correlation between the overall sentiment of people and their location. People in less populated areas could for example be more positive in general than people in populated areas. For this the same program can be used, but this time all Tweets would be accepted instead of only those about physical activity. Such research could be used as a check on people's mental health in certain areas, although that would become a completely new research topic. For this project it could be used to rule out one of the factors possibly influencing the sentiment.


Bibliography

[1] Apache Storm. url: http://storm.apache.org/ (visited on 04/12/2018).
[2] Apache Thrift. url: https://thrift.apache.org/ (visited on 04/20/2018).
[3] BaseMap Toolkit. url: https://matplotlib.org/basemap/ (visited on 04/12/2018).
[4] Tom De Smedt and Walter Daelemans. ““Vreselijk mooi!” (terribly beautiful): A Subjectivity Lexicon for Dutch Adjectives”. In: LREC. 2012, pp. 3568–3572.
[5] Filter realtime Tweets API. url: https://developer.twitter.com/en/docs/tweets/filter-realtime/overview (visited on 04/12/2018).
[6] GeoNames. url: http://www.geonames.org/ (visited on 04/12/2018).
[7] MatPlotLib. url: https://matplotlib.org/index.html (visited on 04/12/2018).
[8] Most Popular Sports in America. url: https://www.ranker.com/crowdranked-list/most-popular-american-sports (visited on 04/12/2018).
[9] Pattern Library. url: https://www.clips.uantwerpen.be/pattern (visited on 04/12/2018).
[10] PAUL Project. url: http://www.hva.nl/kc-bsv/gedeelde-content/projecten/projecten-kracht-van-sport/playful-data-driven-active-urban-living-paul.html (visited on 04/12/2018).
[11] PyPlot. url: https://matplotlib.org/api/pyplot_api.html (visited on 04/12/2018).
[12] Reverse Geocoder. url: https://github.com/thampiman/reverse-geocoder (visited on 04/12/2018).
[13] Scalability benchmarks of Apache Storm. url: http://storm.apache.org/about/scalable.html (visited on 04/12/2018).
[14] Search API. url: https://developer.twitter.com/en/docs/tweets/search/overview (visited on 04/12/2018).
[15] Tom De Smedt and Walter Daelemans. “Pattern for Python”. In: Journal of Machine Learning Research 13.Jun (2012), pp. 2063–2067.
[16] StreamParse. url: https://streamparse.readthedocs.io/en/stable/index.html (visited on 04/20/2018).
[17] Tweepy Library. url: http://www.tweepy.org/ (visited on 04/20/2018).
[18] Tweet Object. url: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json (visited on 04/12/2018).
[19] Twitter API. url: https://developer.twitter.com/en/docs (visited on 04/12/2018).


CHAPTER 6

Appendix

6.1 Keywords

These tables list the physical activity related keywords that were used to filter the Tweets, separated into a Dutch and an English section. A minimal sketch of how such a keyword filter can be applied follows the English tables.

6.1.1 Dutch

Table 6.1: Verbs

dansen    dans    danst    gedanst
fietsen   fiets   fietst   gefietst
rennen    ren     rent     gerend
sporten   sport   gesport
trainen   train   traint   getraind
wandelen  wandel  wandelt  gewandeld

Table 6.2: Sports

atletiek    darten    roeien    turnen
badminton   fitness   rugby     voetbal
basketbal   golf      softbal   volleybal
boksen      gym       tennis    worstelen
bowlen      honkbal   training  zeilen

Table 6.3: Other

bal  marathon  oefening  spierpijn  uithoudingsvermogen  workout


6.1.2 English

Table 6.4: Verbs

dance  danced   dancing
climb  climbed  climbing
cycle  cycled   cycling
hike   hiked    hiking
train  trained  training

Table 6.5: Sports

athletics   cycling   hockey   sports
badminton   darts     rowing   tennis
baseball    fitness   rugby    volleyball
basketball  football  sailing  wrestling
bowling     golf      soccer
boxing      gym       sport

Table 6.6: Other

bicycle  bike  exercise  marathon  stamina  workout
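The sketch below shows how such a keyword filter could be applied to the text of a Tweet. It is a minimal sketch, assuming case-insensitive whole-word matching; the keyword set is only an excerpt of Tables 6.1 to 6.6, and the matching strategy is not necessarily the exact one used in the topology.

import re

KEYWORDS = {
    # Dutch excerpt from Tables 6.1-6.3
    "dansen", "fietsen", "sporten", "voetbal", "spierpijn",
    # English excerpt from Tables 6.4-6.6
    "dance", "cycling", "gym", "marathon", "workout",
}

def is_activity_tweet(text):
    # A Tweet passes the filter when any of its lower-cased tokens
    # appears in the keyword set.
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in KEYWORDS for token in tokens)

print(is_activity_tweet("Just finished my morning workout!"))  # True
print(is_activity_tweet("Lovely weather in Amsterdam today"))  # False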


6.2 Output

Table 6.7: Extracted data fields from the Tweet object[18]

tweet id
text
refactored text
created at
post location GPS
post location Country
post location Province
post location County
post location City
user location
user id
user verified
user statuses count
user favourites count
user followers count
entities hashtags num
entities urls num
entities user mentions
in reply to status id
in reply to user id
quoted status id
quoted status user id
sentiment results
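As an illustration, the sketch below extracts a subset of these fields from the raw Tweet JSON described in the Twitter data dictionary[18]. The derived fields (refactored text, post location Country/Province/County/City and sentiment results) are produced later in the topology, so they are not part of this sketch.

def extract_fields(tweet):
    # 'tweet' is the parsed Tweet JSON object; absent fields stay None.
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    coordinates = tweet.get("coordinates") or {}
    return {
        "tweet_id": tweet.get("id"),
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        "post_location_GPS": coordinates.get("coordinates"),  # [longitude, latitude]
        "user_location": user.get("location"),
        "user_id": user.get("id"),
        "user_verified": user.get("verified"),
        "user_statuses_count": user.get("statuses_count"),
        "user_favourites_count": user.get("favourites_count"),
        "user_followers_count": user.get("followers_count"),
        "entities_hashtags_num": len(entities.get("hashtags", [])),
        "entities_urls_num": len(entities.get("urls", [])),
        "entities_user_mentions": len(entities.get("user_mentions", [])),
        "in_reply_to_status_id": tweet.get("in_reply_to_status_id"),
        "in_reply_to_user_id": tweet.get("in_reply_to_user_id"),
        "quoted_status_id": tweet.get("quoted_status_id"),
    }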


Table 6.8: Statistics per state of the United States, ordered by the number of Tweets.

State            Mean Polarity    Std. Dev. Polarity  Number of Tweets
Ohio             0.245731592667   0.432235753293      1000
Virginia         0.166731597414   0.429439318882      1000
Oregon           0.187101317567   0.401247335781      1000
Mississippi      0.208966281641   0.42042075949       1000
California       0.294497089383   0.438800786245      1000
Kentucky         0.311054632874   0.427766512377      1000
Tennessee        0.201657852313   0.449476182454      1000
West Virginia    0.185810843233   0.423167213084      1000
Kansas           0.21846176616    0.408236697055      1000
Missouri         0.262512113045   0.406902669756      1000
Nevada           0.156895573788   0.403339400099      1000
Illinois         0.215165766105   0.436395470575      1000
Louisiana        0.155506308475   0.418847044092      1000
Washington       0.186547947461   0.40231109109       1000
Indiana          0.218956574069   0.423655559204      1000
Georgia          0.21225969402    0.433168018874      1000
South Carolina   0.195560254309   0.42965607725       1000
Michigan         0.211211755777   0.440725999281      1000
Pennsylvania     0.193138503721   0.428748656066      1000
Massachusetts    0.178330435074   0.42337735784       1000
Wisconsin        0.246747125271   0.405267113569      1000
New York         0.212991125059   0.416708407855      1000
Nebraska         0.238987536882   0.407569143082      1000
Oklahoma         0.219545987161   0.418549453676      1000
Iowa             0.204598394962   0.416333707456      1000
Connecticut      0.227599010682   0.419875363188      1000
Florida          0.251473700622   0.44492750006       1000
Colorado         0.240266995081   0.399690707386      1000
Arizona          0.181952962438   0.42730639541       1000
Minnesota        0.227431762332   0.401552256853      1000
Texas            0.283615115843   0.430346997822      1000
Alabama          0.186741759651   0.403797891243      1000
New Jersey       0.218896858593   0.428094156512      1000
Maryland         0.17154575893    0.431862909743      1000
North Carolina   0.184512488221   0.405941636301      1000
Utah             0.216685305131   0.38968829649       940
New Mexico       0.194606243092   0.417261601096      863
South Dakota     0.162931459658   0.457888781695      643
Hawaii           0.219934136901   0.383176066792      627
Arkansas         0.241675301392   0.399262506005      601
Rhode Island     0.181362376455   0.423638860423      446
Idaho            0.216829115233   0.392761375698      431
Delaware         0.190106249775   0.431661534239      405
New Hampshire    0.244546881932   0.423633311708      348
Maine            0.146014978068   0.384402108814      256
North Dakota     0.227951807328   0.414660304963      243
Alaska           0.182246465613   0.393645591863      199
Montana          0.23837697796    0.404787839558      195
Wyoming          0.226901690039   0.36346667738       126
Vermont          0.125902251825   0.407650765885      118
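As an illustration, a table like 6.8 can be produced with a short pandas aggregation over the collected Tweets. This is a minimal sketch; the input file name and column names are hypothetical.

import pandas as pd

tweets = pd.read_csv("us_tweets.csv")  # hypothetical columns: state, polarity

# Group the polarities by state and compute the three statistics shown above.
stats = (
    tweets.groupby("state")["polarity"]
    .agg(mean_polarity="mean", std_polarity="std", num_tweets="count")
    .sort_values("num_tweets", ascending=False)
)
print(stats.to_string())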
