Sensing perceived air quality from social media on a neighbourhood level


SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

SEBASTIAN MEHLDAU

11282517

MASTER INFORMATION STUDIES

HUMAN-CENTERED MULTIMEDIA

FACULTY OF SCIENCE

UNIVERSITY OF AMSTERDAM

July 17, 2017

Supervisor: Dr. Stevan Rudinac

2nd Reader: Dr. Ilya Markov


Sensing perceived air quality from social media on a neighborhood level

Sebastian E. Mehldau

University of Amsterdam

Amsterdam, Netherlands

sebastian.mehldau@gmail.com

ABSTRACT

Urban computing is the process of acquiring, integrating, and analyzing city-wide data to solve common problems associated with urban living. This work proposes a data driven approach for using social media to sense citizen satisfaction in relation to air quality. Data comprises a collection of georeferenced Twitter messages from Mexico City, as well as ambient concentrations of air pollutants and a topic model extracted from news articles related to Mexico City. The approach consists of establishing a relationship between an air quality index and the frequency of social media posts about air pollution. We report results of correlation analysis between the two variables at the city and neighborhood levels. Furthermore, results are presented in a prototype interface where users can explore air quality signals discovered through our analysis.

Author Keywords

Urban computing; social media; data mining; topic modelling.

ACM Classification Keywords

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval---information filtering.

INTRODUCTION

Today, more people live in urban areas than in the countryside. Cities provide infrastructure, social and leisure activities, and a wide range of services, which make people’s lives easier and happier. Urban living is associated with greater access to resources such as health, education, and jobs. Thus, people move to cities in search of opportunities, wellbeing, and other benefits. In fact, the United Nations (2014) estimates that 2.5 billion people will be added to the world's urban population by 2050. This growth is likely to put pressure on cities. At present, rapidly growing cities already face challenges that threaten livability, such as increasing traffic, crime, and pollution.

Urban computing is a scientific field that aims at solving some of these problems by harnessing, integrating and analyzing data generated from sensors, citizens, buildings and vehicles, among other sources. The objective of urban computing is improving people’s lives, city operations and the environment in an unobtrusive and continuous manner (Zheng, Capra, Wolfson, & Yang, 2014). However, this process and the related methods are not always straightforward and involve several aspects that are challenging for researchers, city planners, city officials, and other stakeholders. Collecting citywide data, for example, can be expensive and difficult. Moreover, some factors that influence city livability are hard if not impossible to measure directly (Okulicz-Kozaryn, 2013).

In this paper, we propose a method for tapping into social media data streams to explore how residents in big urban areas perceive city livability in relation to air quality. Concretely, we want to know if there is a relationship between changes in air pollution levels and the volume of Twitter messages about air quality in greater Mexico City. To test the validity of our approach, we conduct a correlation analysis between filtered Twitter messages about air quality and ambient concentrations of critical pollutants obtained from 34 air monitors (ground truth). Furthermore, we do this analysis on a neighborhood level. The following research question leads our work:

- How can we measure and capture air quality related citizen satisfaction unobtrusively and on a large scale through social media?

The underlying hypothesis is that there is a positive correlation between the dynamics of the air quality index (AQI) and the number of Twitter messages about air pollution, i.e. variation in the first variable corresponds to variation in the second. Moreover, we expect to find a stronger relationship in neighborhoods with higher levels of pollution. We expect that the findings will help to better understand how residents in Mexico City experience the city in relation to air pollution. As such, the main contributions are:

- The use of geo-tagged social data for hyper-local analysis of city livability.

- A novel implementation of LDA topic modelling for filtering relevant user posts.

- Integration of multiple data sources, i.e. user-generated social media, news, and open data into a single analysis pipeline.


RELATED WORK

Our approach is made possible thanks to the increasing availability of sensor-rich smartphones and ever-present social media such as Facebook, Instagram, and Twitter. The combination of both ingredients as well as faster and widely available cellular networks have led to an explosion of georeferenced social media messages (Silva, Melo, & Almeida, 2014). People share valuable data such as their opinions and locations in real-time, which can be used to find patterns and to better understand processes in a city (Rudinac, Zahálka, & Worring, 2017). Consequently, users become human sensors (Zheng et al., 2014) that enable participatory sensing networks (PSN) (Silva et al., 2014). PSNs may shed light upon people’s behavior and city dynamics where conventional methods from the social sciences (e.g. surveys) fall short due to time and cost constraints.

Traditional disease surveillance, for instance, relies on clinical data such as patient records from hospital admissions and physicians, laboratory tests, and the like. Major flaws of these data sources are privacy concerns and time lags; on average, it takes one or two weeks for them to become available (Achrekar, Lazarus, & Park, 2010). Therefore, health researchers have started to look for non-clinical, indirect signals from data that is easier to retrieve. A few examples of these data sources include internet news, search queries, online documents, and user-generated content posted on social media (Hwang, Wang, Cao, Padmanabhan, & Zhang, 2013).

Multiple studies suggest that social media is useful in monitoring diseases such as seasonal influenza and detecting changes in behaviors and dynamics that could lead to the prevention of major spreads and unwanted public health consequences (Achrekar et al., 2010; Aslam et al., 2014; Paul, Dredze, Broniatowski, & Generous, 2015). However, social media is not limited to disease surveillance; other studies describe efforts where this data source is used to predict election outcomes (Tumasjan, Sprenger, Sandner, & Welpe, 2010), social and community dynamics (Castro, Zhang, & Telecom, 2013) and many other real-world events (Asur & Huberman, 2010).

Using geo-tagged data from social media has several advantages. Hwang et al. (2013) list five of them:

1. Georeferenced social media data include information about social topics and about how humans interact and communicate over time and space.

2. Social media data provide real-time information on social phenomena.

3. Geo-tagged data carry individual observations with spatiotemporal references.

4. Social media data are accessible via public APIs and free to a certain scale.

5. Users post on social media voluntarily and with fewer privacy restrictions.

These advantages are also well-known in the urban computing community. Several studies have used social media messages as a data source for measuring different aspects that influence city livability. For example, Jiang, Wang, Tsou, & Fu (2015) introduce an analytic method for monitoring the dynamics of the air quality index (AQI) and its perception in Beijing using messages from Sina Weibo (the equivalent of Twitter in China). The authors show that there is a strong correlation between topic-based filtered messages and the air quality index.

While Jiang et al. (2015) only target messages that originated from a specific place (Beijing), the study does not use point-locations to reference the social media posts to a particular neighborhood or zone in the city. Instead, the authors rely on the more general user-defined place of registration. To the best of our knowledge, no previous study has proposed to research the dynamics between perceived air pollution and daily AQI records on a neighborhood level, using geo-tagged social media.

APPROACH

This section gives an overview of our approach. It consists of four major steps: (1) data acquisition, (2) data integration, (3) data analysis, and (4) visualization of results. Figure 1 illustrates a model of the main process and its key components: the sensors, the topic model, the data storage, and a user-friendly visualization.


Data acquisition

The data pipeline relies on three different data sources: (1) social media (Twitter), (2) air monitors, and (3) news articles. The following paragraphs provide details on the data collection process.

Twitter

The first data type consists of Twitter messages posted between April 1st, 2016 and April 30th, 2017 in Mexico City. The messages are collected from Twitter’s data stream through the Gnip PowerTrack API (“PowerTrack API,” 2017). Gnip is Twitter’s enterprise solution that allows real-time and historical access to the social media platform; in contrast to Twitter’s public APIs, Gnip has fewer restrictions on the quantity and quality of accessible data. We set up a filtering rule so that we only capture messages originating from GPS-enabled devices within the greater Mexico City area. The resulting collection contains a total of 11,301,129 documents, each representing a single ‘tweet’ in JSON format. Every document contains 15 fields and several sub-fields describing the content and metadata of the messages.

Open data

The second data source is open data consisting of air quality records from 34 different air monitors across greater Mexico City. This data is published by the Ministry of Environment through their open data portal http://www.aire.cdmx.gob.mx. The data is structured in CSV format and each file contains hourly measurements for one of 9 different air pollutants. Historical data is available since the 1970s; however, we focus only on the time frame between April 2016 and April 2017. Moreover, we only consider the five criteria pollutants which compose the air quality index (AQI): (1) ozone [O3], (2) sulphur dioxide [SO2], (3) nitrogen dioxide [NO2], (4) carbon monoxide [CO], and (5) particles smaller than 10 micrometers [PM10] (“Criteria Air Pollutants,” 2017).

News articles

The third and final source of data is a set of news articles about Mexico City, published online between April 2016 and December 2016. In order to obtain the articles, we use the web crawler platform Webhose.io (“Webhose.io,” 2017). We scrape and store a total of 11,250 articles matching the following query:

{language: spanish AND thread.title: ciudad AND thread.title: mexico AND site_type:news}

The purpose of this last data type is to form a text corpus that will be used to model the topics about Mexico City reported in news media. This topic model is instrumental for filtering and retrieving relevant Twitter messages, i.e. those discussing or representing the perception of air pollution, from the original Twitter collection. Further details on the topic model are presented in the “Data analysis” section.

Feature extraction

Before performing the desired analysis, we prepare and transform the data such that it is better suited for learning. Each data type (social media, air quality records, and news articles) undergoes different pre-processing steps that are described below.

Twitter

Social data retrieved from Twitter’s API is already structured. Every tweet is represented in JSON notation with 15 standard fields. In order to reduce the size of the data, only relevant attributes are kept, among them actor, body, location, and postedTime.

When tweets do not carry an exact location (x, y)1, we approximate a point location by calculating the centroid of the geo polygon present in the place field, and add the centroid coordinates as the geolocation. While this is only an estimate of the precise location where the tweet was created, it is a compromise we accept so that we can assign every document an AQI value from the closest air monitor.
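The centroid approximation can be sketched as follows. This is a minimal illustration, assuming the place polygon is available as a list of [longitude, latitude] vertices; the function name and the sample coordinates are hypothetical and not taken from the dataset.

```python
def polygon_centroid(vertices):
    """Approximate a polygon's centre as the mean of its vertices.

    For a rectangular bounding box the vertex mean coincides with the
    geometric centroid; for irregular polygons it is only an approximation.
    """
    if not vertices:
        raise ValueError("polygon has no vertices")
    lon = sum(v[0] for v in vertices) / len(vertices)
    lat = sum(v[1] for v in vertices) / len(vertices)
    return lon, lat


# Hypothetical bounding box given as [longitude, latitude] corners.
box = [[-99.20, 19.35], [-99.20, 19.41], [-99.12, 19.41], [-99.12, 19.35]]
centroid = polygon_centroid(box)
```

Averaging the vertices is the simplest choice; an area-weighted centroid would be more accurate for non-rectangular polygons.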

In addition to the point attribute, two other fields are added to every document: (1) a station_id, representing the closest air monitor, and (2) an AQI value obtained from the corresponding monitor. The closest of the 34 air monitors is determined for every tweet using a k-nearest-neighbor (KNN) search with k=1; Figure 2 shows the locations of the air monitors and the resulting Voronoi regions.
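A nearest-monitor assignment of this kind might look like the sketch below. It uses a brute-force 1-nearest-neighbor search with great-circle (haversine) distance, which is an assumption, since the text does not state the distance metric. The station identifiers appear in this paper, but the coordinates are placeholders rather than the monitors' real positions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Return the id of the closest station (1-NN by brute force).

    `stations` maps station_id -> (lat, lon).
    """
    return min(stations, key=lambda sid: haversine_km(lat, lon, *stations[sid]))


# Placeholder coordinates, for illustration only.
stations = {"BJU": (19.37, -99.16), "SFE": (19.36, -99.26), "COY": (19.35, -99.16)}
tweet = {"lat": 19.372, "lon": -99.17}
tweet["station_id"] = nearest_station(tweet["lat"], tweet["lon"], stations)
```

With only 34 monitors a linear scan per tweet is adequate; a spatial index would matter only for a much denser sensor network.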

Air quality index

The air contaminant data requires pre-processing as well. The CSV files contain hourly raw measurements of air pollutant concentrations, as obtained from 34 air monitors across the Mexico City Metropolitan Area (see Figure 2). We compute an air quality index (AQI) according to Mexico’s air quality standards. The AQI simplifies the communication of air quality to citizens, government officials and the general public; it is useful because it abstracts the concentration of five chemicals into a single number.

1 More information about the Twitter data format: http://support.gnip.com/apis/powertrack2.0/rules.html#Operators


Figure 2. Greater Mexico City map with markers for the 34 air monitors.

To calculate the AQI from ambient concentrations, we use the following equation provided by the Mexican Ministry of Environment (SEDEMA, n.d.):

$$I_p = \frac{I_{high} - I_{low}}{BP_{high} - BP_{low}} \left( C_p - BP_{low} \right) + I_{low}$$

Where:

$I_p$ = index value for pollutant $p$

$C_p$ = concentration of pollutant $p$

$BP_{high}$ = breakpoint that is greater than or equal to $C_p$

$BP_{low}$ = breakpoint that is less than or equal to $C_p$

$I_{high}$ = AQI value corresponding to $BP_{high}$

$I_{low}$ = AQI value corresponding to $BP_{low}$

More information on the breakpoints can be found in the appendix. Relevant here is that the higher the AQI, on a scale from 0 to 500, the higher the health risks for the general population. The AQI for a given day is the highest of the index values computed for the five criteria pollutants (O3, SO2, CO, PM10, and NO2). Thus, the pollutant responsible for the largest value is the one reported to the public as the “critical” pollutant. As such, it is enough for one of the pollutants to cross the threshold value of 151 to label the air quality as “unhealthy”.
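As a minimal sketch of the interpolation and of the selection of the critical pollutant, the code below uses the PM10 breakpoints listed in the appendix table. The function names and the O3 and NO2 index values in the example are hypothetical.

```python
# (BP_low, BP_high, I_low, I_high) for PM10, copied from the appendix table.
# The open-ended "Severe" row is left out of this sketch.
PM10_BREAKPOINTS = [
    (0, 50, 0, 50),
    (51, 100, 51, 100),
    (101, 250, 101, 200),
    (251, 350, 201, 300),
    (351, 430, 301, 400),
]


def sub_index(concentration, breakpoints):
    """Linearly interpolate the index value for one pollutant."""
    for bp_low, bp_high, i_low, i_high in breakpoints:
        if bp_low <= concentration <= bp_high:
            slope = (i_high - i_low) / (bp_high - bp_low)
            return slope * (concentration - bp_low) + i_low
    # Values between rows (e.g. 50.5) or beyond the table are not handled here.
    raise ValueError("concentration outside the tabulated breakpoints")


def daily_aqi(sub_indices):
    """Return (critical pollutant, AQI): the largest per-pollutant index."""
    critical = max(sub_indices, key=sub_indices.get)
    return critical, sub_indices[critical]


pm10_index = sub_index(175, PM10_BREAKPOINTS)
# The O3 and NO2 index values below are made up for illustration.
critical, aqi = daily_aqi({"PM10": pm10_index, "O3": 92.0, "NO2": 40.0})
```

In this example PM10 yields the largest index value, so it would be reported as the critical pollutant for the day.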

News articles

As mentioned in the “Data acquisition” section, we use a corpus of news articles to model the topics about Mexico City discussed in the news. The Webhose.io API returns the documents (online news articles from different sites) in JSON format, with several attributes such as title, text, author, published, site, url, country, id, and more. For this research, we only need three of these features, namely title, text and id. The main advantage of using the Webhose.io service instead of building a custom scraper is that it allows us to target very specific content through Boolean keyword search and other match operations. Furthermore, the fact that the retrieved documents are structured means that they are machine-readable. Nevertheless, we have to transform the sequences of characters in the “text” field into useful features. As such, (a) we tokenize the string content with scikit-learn’s (Pedregosa, Weiss, & Brucher, 2011) build_tokenizer method from the feature extraction module; (b) we remove stop-words using NLTK’s (Bird, Loper, & Klein, 2009) built-in stopword list for the Spanish language; and (c) we discard all digits as well as tokens that occur only once in the entire collection.
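Steps (a) to (c) can be illustrated with the standard library alone. The sketch below is not the pipeline used in this work: the regular expression mirrors scikit-learn's default token pattern, the stop list is a tiny hypothetical stand-in for NLTK's Spanish list, and lowercasing and dropping any token that contains a digit are assumptions.

```python
import re
from collections import Counter

# Same pattern as scikit-learn's default token_pattern: words of 2+ characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

# Tiny illustrative stop list (stand-in for NLTK's Spanish stopword list).
STOPWORDS = {"de", "la", "el", "en", "que", "los", "las", "del", "por", "un", "una"}


def tokenize(text):
    """Lowercase, tokenize, and drop stop-words and tokens containing digits."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS and not any(c.isdigit() for c in t)]


def preprocess(corpus):
    """Tokenize every document and remove tokens that occur only once overall."""
    docs = [tokenize(text) for text in corpus]
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] > 1] for doc in docs]


corpus = [
    "La contaminación del aire en la ciudad aumentó en 2016",
    "Alerta por contaminación: el aire de la ciudad empeora",
]
docs = preprocess(corpus)
```

On this toy corpus only the terms shared by both sentences survive, which shows how aggressively the once-only filter prunes a small collection; on 11,250 articles the effect is far milder.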

Data storage

Once the data has the expected structure, all three sources are integrated into a single document-based database for efficient querying and retrieval during analysis time. The data store consists of a MongoDB (n.d.) instance with a single database and three collections. Implementing a database is not only recommended for integration purposes, but also for efficiency reasons: processing over 11,000,000 documents (around 40GB) in memory does not scale. Two of MongoDB’s main advantages are that it does not impose a strict schema and that it stores data in a JSON-like notation. We create a text index on the “body” field for faster processing. MongoDB text indexes can be created on any field containing string content and enable Boolean search as well as scoring and sorting operations.

Data analysis

This section presents the methodology for filtering relevant social media messages, and comparing the volume of such messages with pollution levels. The main steps leading the data analysis are: (1) LDA topic modelling and (2) Pearson’s r method for correlation analysis.

Topic model

A topic model is an unsupervised machine learning approach for extracting meaning from a collection of documents. This approach is commonly used for textual data, although it is also applicable for images and other data types. In topic modelling, ‘meaning’ is defined as a set of topics that describe or summarize the content of documents. One of the more popular models is LDA, short for ‘latent Dirichlet allocation’. With this algorithm, documents are represented as distributions over topics, and topics as distributions over words. According to Blei et al. (2003), LDA is useful in performing many tasks, among them (1) classification, (2) novelty detection, (3) summarization, and (4) similarity and relevance judgements.

This paper implements LDA mainly for query expansion. What we want to achieve with this approach is to find synonyms and related words of our initial query term “smog”, since it is probable that different social media users refer to the same problem, air quality, in different ways. If we only retrieve documents containing the keyword “smog”, the likelihood of missing relevant tweets is high. Thus, by expanding our seed query we aim at increasing the retrieval performance in terms of recall.

In practice, the process works as follows: (1) we implement an LDA model with the Gensim package (Řehůřek & Sojka, 2010) for Python; (2) we feed the model with the news article data discussed before; (3) we tune the free parameters until the discovered topics seem reasonable to a human evaluator; (4) we identify the topic that is about air pollution and extract the 30 most relevant words associated with the selected topic; (5) we use the extracted words as query terms to search our Twitter database.

Since all articles are about Mexico City, we expect that the algorithm will discover topics that reflect the interests and concerns of Mexico City’s public, among them air quality. Indeed, after several rounds of training (100 iterations), we found a model that satisfies our needs. Topic 15 is the one best representing the “air pollution” topic and is labeled as such; its 30 most relevant terms are shown in Figure 3.
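The step from a selected topic to a list of query terms can be sketched as below. The weights are invented for illustration, and ranking by raw topic-word weight is an assumption; the notion of relevance used to pick the 30 terms may differ. In Gensim, such (term, weight) pairs can be read from a fitted model, for instance with its show_topic method.

```python
def top_terms(topic_weights, n=30):
    """Return the n highest-weighted terms of one topic, best first."""
    ranked = sorted(topic_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Hypothetical term weights for the air-pollution topic.
topic = {"contaminación": 0.031, "ozono": 0.024, "aire": 0.040, "contingencia": 0.018}
query_terms = top_terms(topic, n=3)

# Space-separated terms form the search string used to query the tweets.
search_string = " ".join(query_terms)
```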

Figure 3. Top-30 most relevant terms for topic “air quality”.

Topic-based filtering of Twitter messages

Once we have our topic model, we use the words associated with “air quality” to query the database on the “body” field. Only documents matching at least one of the query terms (logical OR) are retrieved. Of course, not all documents containing a query term are immediately relevant. We therefore add a textScore projection to the query, so that documents containing more than one query term receive a higher score. Below is an example of the query structure:

example_query = db.collection.find({"$text": {"$search": "term_1 term_2 … term_n"}}, {"score": {"$meta": "textScore"}})

The resulting filtered dataset consists of 258,286 documents. We then rank the results by their assigned relevance score in descending order and keep only the top ~25% of documents in order to dismiss potentially trivial messages. Consequently, we end up with a dataset of 70,000 relevant (air quality related) documents for the analysis. In subsequent sections, we will refer to the entire dataset of 258,286 air quality related tweets as “tw_filtered” and to the ranked dataset of size 70,000 as “tw_ranked”.
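The retrieval and ranking steps can be sketched without a running database by building the query documents and applying the cut-off to already scored results. The helper names, the query terms other than “smog”, and the scores are hypothetical.

```python
def build_text_query(terms):
    """Return (filter, projection) for a MongoDB $text search.

    Space-separated terms in $search are combined with a logical OR,
    and the projection exposes each document's textScore as "score".
    """
    flt = {"$text": {"$search": " ".join(terms)}}
    projection = {"score": {"$meta": "textScore"}}
    return flt, projection


def keep_top_fraction(docs, fraction=0.25):
    """Sort by score (highest first) and keep the leading fraction."""
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


flt, projection = build_text_query(["smog", "ozono", "contaminación"])
# With pymongo these would be passed as collection.find(flt, projection).

# Hypothetical scored results standing in for the query output.
scores = [0.5, 2.0, 1.1, 0.7, 1.6, 0.9, 3.2, 0.6]
results = [{"_id": i, "score": s} for i, s in enumerate(scores)]
tw_ranked = keep_top_fraction(results, 0.25)
```

A fixed fraction is a blunt relevance filter; a score threshold or a supervised classifier would be alternatives, at the cost of extra tuning or labeled data.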


Relationship between tweet-frequency and AQI

The hypothesis we want to test in this research is that Mexico City’s air pollution influences how citizens perceive and experience the city. While we can quantify the concentration of pollutants in the air via air monitors and the AQI, we cannot measure satisfaction directly. As argued before, traditional social research instruments, such as surveys, are not well suited for collecting data at a large scale. Instead, we propose to approximate city satisfaction by the number of mentions on social media. Our aim, therefore, is to establish a relationship between the number of Twitter messages related to the topic “air quality” within a given time frame, and the actual AQI during that same period.

In other words, we are looking for evidence that the variation of one variable (AQI) corresponds with the variation of a second variable (frequency of Twitter messages related to air quality). To do so, we use the Pearson’s r method for finding relationships between interval variables (Bryman, 2016). A positive correlation coefficient close to 1 signifies that the two variables are strongly related, while a coefficient close to 0 indicates that there is a weak or no correlation at all. A moderate r value, on the other hand, suggests that there is a connection but that other variables might have some influence, too.
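For reference, the standard sample form of the coefficient is given below. The notation is introduced here for illustration: $x_i$ is the AQI value and $y_i$ the number of air quality related tweets in period $i$, with $\bar{x}$ and $\bar{y}$ their means over the $n$ periods.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```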

In this paper, we implement the correlation analysis as follows. First, we calculate the Pearson correlation coefficient between the weekly frequency of Twitter messages about air quality and the weekly maximum AQI. Then, we do the same operation between the frequency of all Twitter posts and the AQI. The assumption here is that the correlation is stronger for tweets that are about air pollution than about other topics. Thus, a higher r value is expected for the first test.
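The weekly aggregation and the coefficient can be sketched as follows. The dates, AQI values and tweet counts are invented, and grouping by ISO calendar week is an assumption, since the week boundaries are not specified in the text.

```python
import math
from collections import defaultdict
from datetime import date


def iso_week(d):
    """Key a date by (ISO year, ISO week number)."""
    year, week, _ = d.isocalendar()
    return year, week


def weekly_series(tweet_dates, daily_aqi):
    """Return aligned lists: weekly maximum AQI and weekly tweet counts."""
    counts = defaultdict(int)
    for d in tweet_dates:
        counts[iso_week(d)] += 1
    peaks = {}
    for d, aqi in daily_aqi.items():
        key = iso_week(d)
        peaks[key] = max(aqi, peaks.get(key, aqi))
    weeks = sorted(peaks)
    return [peaks[w] for w in weeks], [counts.get(w, 0) for w in weeks]


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented daily AQI readings spanning three consecutive weeks.
daily_aqi = {
    date(2016, 4, 4): 80, date(2016, 4, 6): 120,
    date(2016, 4, 11): 60, date(2016, 4, 13): 70,
    date(2016, 4, 18): 150, date(2016, 4, 20): 140,
}
# Invented posting dates of air quality related tweets.
tweet_dates = [date(2016, 4, 5)] * 3 + [date(2016, 4, 12)] + [date(2016, 4, 19)] * 4

xs, ys = weekly_series(tweet_dates, daily_aqi)
r = pearson_r(xs, ys)
```

Running the same two functions on the tweets assigned to a single monitor, paired with that monitor's AQI, gives the per-neighborhood variant of the test.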

Additionally, we want to test the correlation between Twitter messages about air quality and AQI on a more granular level. This step is designed to answer the question whether we can sense the perception of air pollution at a fine-grained level and, ultimately, whether air quality impacts city livability to different extents across various neighborhoods. If it does, we can expect significant differences between the r values computed for different neighborhoods.

RESULTS

The first correlation analysis is performed on the entire filtered dataset “tw_filtered”. The results indicate a moderate, positive relationship between the two variables, r = .41, p < .001. While the r coefficient is not strong, it does suggest that when the AQI increases, the frequency of tweets tends to increase as well. It is important to note that the dataset “tw_filtered” consists of all documents that contain at least one of the query terms. Thus, it is probable that irrelevant messages skew the results. For example, a message about “air conditioning” has the same weight as one about “air quality”, although the former is clearly off topic.

Next, we perform the same correlation analysis, but this time on the second list of Twitter messages, “tw_ranked”. This dataset consists of documents matching terms for the same query; however, only the 70,000 tweets with the highest text scores are included. With this simple ranking step, the correlation coefficient rises from .41 to .49. Again, the results of the test reveal a moderate positive link between weekly AQI and the number of tweets mentioning air pollution, r = .49, p < .001. This provides some evidence that by improving the retrieval model, we might discriminate better between relevant and irrelevant Twitter messages. Figure 4 shows the correlation between weekly maximum AQI values and weekly volume of tweets related to air quality from this dataset.

The fact that there is a moderate linear relationship between the variables does not mean that the variation in the frequency of air quality related Twitter messages is caused by air pollution levels (even strong correlation coefficients do not imply causation). Even so, it is reasonable to believe that social media is a useful source for sensing opinions about air pollution and has the potential to signal elevated pollution levels. As such, an increase in messages containing terms related to the topic might indicate high AQI values. This, in turn, could have several real-world implications and applications that are worth exploring further.

Air quality index per zones

The above results indicate a moderate link between AQI and Twitter messages about air quality and are consistent with the findings from other research (Jiang et al., 2015). However, as posited before, we are also interested in exploring the relationship at a local level. Figure 5 compares the proportions of tweets about air quality (relevant tweets) with the proportions of all tweets (relevant and irrelevant). In most cases, the two proportions are similar. A noticeable pattern is that 80% of all air quality related tweets occur in the vicinity of 35% of the covered areas in Mexico City; moreover, 60% of all tweets are attributed to only five air monitors. These five monitoring stations are Benito Juárez (BJU), Hospital General de México (HGM), Miguel Hidalgo (MGH), Coyoacán (COY), and Santa Fe (SFE). All five zones are known for high socio-economic levels and, with the exception of Santa Fe, all are located in the center of greater Mexico City.

We examine these five areas and the correlations between tweets about air quality and the specific AQI values for each of the air monitors. The results are presented in Table 1. We observe stronger correlation coefficients for these five areas than in the previous experiments, with r values ranging from .54 to .73. Apparently, people in the proximity of these areas are more sensitive to air quality fluctuations and tend to discuss them more frequently on Twitter.

For example, almost a quarter (24%) of all tweets about air quality in our dataset occurred in Benito Juárez, while the total share of geo-tagged Twitter messages (relevant and irrelevant) for the same location is only 16%. Clearly, people in the proximity of Benito Juárez are disproportionately prone to discussing air quality. The question arises whether this is related to the actual air quality in this area. Indeed, our analysis reveals that Benito Juárez had the second worst average AQI (89) during the time frame we studied, surpassed only by Ajusco Medio (90).

Coyoacán (COY), Hospital General de México (HGM) and Santa Fe (SFE) rank among the ten most polluted zones in Mexico City, too. As such, it is not a complete surprise that people in these neighborhoods are very active on Twitter. However, it is hard to overlook that the richer center of the city dominates the conversation to such a degree. Some densely populated areas on the periphery of the city, such as Nezahualcóyotl, Ecatepec, or Chalco, are underrepresented in the current analysis. There are two immediate options for reducing this bias: (1) including sociodemographic data and other data provided by the city administration in the analysis, and (2) expanding the number of data points through additional, similar data sources. While these measures would enrich the results and strengthen the current approach, they are beyond the scope of the present work.

Zone    Avg. weekly AQI    Pearson’s r coefficient
BJU     89                 .58
HGM     86                 .62
MGH     78                 .73
COY     88                 .54
SFE     82                 .57

Table 1. Correlation coefficients for the stations with the highest volume of Twitter messages.

CONCLUSION AND FUTURE WORK

This research explores the value of geo-tagged social media for monitoring air quality. We demonstrate a relationship between the air quality index and the frequency of Twitter messages related to air pollution in Mexico City. Additionally, we show that it is possible to zoom in to the neighborhood level to understand how the environment influences residents’ experience of the city. While the paper focuses on Mexico City as a test-bed, we think that the approach is scalable and transferable to other relevant applications in the domain of urban computing and city livability.

Despite the promising results, the current approach has limitations and biases. For one, the correlation coefficients are modest. This means that the relationship between the variables we study is not very strong and that other factors might also explain the variations in tweet frequency.


This is related to a second shortcoming: the implemented text scoring algorithm is an out-of-the-box solution from MongoDB and is not the strongest or most flexible option available. It is possible that irrelevant documents skew the results. To give an example, with the current retrieval model we are not discarding messages that are not authentic expressions of individual opinions about air quality. The filtered dataset is noisy in the sense that it contains tweets from news organizations and other entities, such as bots, that do not reflect “the wisdom of the crowd”. Moreover, in its current state, the system only works with historical data. To become useful for the general public, the different components should be connected to real-time data streams. This way, users might get a better picture of how air quality impacts city livability. Other directions for future work include connecting to more social media data streams (not only Twitter), improving the text classification with state-of-the-art machine learning techniques, and analyzing other media, such as images.

Finally, it is important to note that we do not argue in favor of replacing reliable air monitoring methods with the presented approach. Instead, we look for complementary data sources that might shed light on the impact of air pollution on city livability. This is especially relevant given the proliferation of mega cities in poor countries. A system that relies on social media is robust and cheap, and therefore a potential complement to networks of air monitors.

ACKNOWLEDGMENTS

This work would not have been possible without the mentoring and supervision from Dr. Stevan Rudinac from the University of Amsterdam. Additional, invaluable support came from Max Mergenthaler and the Tasty Data organization from Mexico City, who provided the Twitter data and other useful resources.

REFERENCES

Achrekar, H., Lazarus, R., & Park, W. C. (2010). Predicting Flu Trends using Twitter Data. In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology (pp. 492–499). Washington, DC: IEEE.

Aslam, A. A., Tsou, M., Spitzberg, B. H., An, L., Gawron, J. M., Gupta, D. K., … Hall, S. (2014). The Reliability of Tweets as a Supplementary Method of Seasonal Influenza Surveillance. Journal of Medical Internet Research, 16(11). http://doi.org/10.2196/jmir.3532

Asur, S., & Huberman, B. A. (2010). Predicting the Future With Social Media. In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology (pp. 492–499). Washington, DC: IEEE. http://doi.org/10.1109/WI-IAT.2010.63

Bird, S., Loper, E., & Klein, E. (2009). Natural Language Processing with Python. O’Reilly Media.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.


Bryman, A. (2016). Social Research Methods. Oxford: Oxford University.

Castro, P. S., Zhang, D., & Telecom, I. M. (2013). From Taxi GPS Traces to Social and Community Dynamics: A Survey. ACM, 46(2).

Criteria Air Pollutants. (2017). Retrieved July 1, 2017, from https://www.epa.gov/criteria-air-pollutants

Hwang, M., Wang, S., Cao, G., Padmanabhan, A., & Zhang, Z. (2013). Spatiotemporal Transformation of Social Media Geostreams: A Case Study of Twitter for Flu Risk Analysis. In Proceedings of the 4th ACM SIGSPATIAL International Workshop on GeoStreaming (pp. 12–21). New York: ACM.

Jiang, W., Wang, Y., Tsou, M., & Fu, X. (2015). Using Social Media to Detect Outdoor Air Pollution and Monitor Air Quality Index (AQI): A Geo-Targeted Spatiotemporal Analysis Framework with Sina Weibo (Chinese Twitter). PLOS ONE, 10(10), 1–18. http://doi.org/10.1371/journal.pone.0141185

MongoDB. (n.d.). MongoDB. Retrieved July 5, 2017, from https://docs.mongodb.com/manual/reference/

Okulicz-Kozaryn, A. (2013). City Life: Rankings (Livability) Versus Perceptions (Satisfaction). Social Indicators Research, 110(2), 433–451. http://www.jstor.org/stable/24718714

Paul, M. J., Dredze, M., Broniatowski, D. A., & Generous, N. (2015). Worldwide Influenza Surveillance through Twitter. Association for the Advancement of Artificial Intelligence, 6–11.

Pedregosa, F., Weiss, R., & Brucher, M. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

PowerTrack API. (2017). Retrieved June 20, 2017, from http://support.gnip.com/apis/powertrack2.0/

Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks (pp. 46–50). Valletta, Malta: University of Malta.

Rudinac, S., Zahálka, J., & Worring, M. (2017). Discovering Geographic Regions in the City. International Conference on Multimedia Modeling, 148–159. http://doi.org/10.1007/978-3-319-51814-5

SEDEMA. (n.d.). ¿Cómo se calcula el Índice de Calidad del Aire? Retrieved July 1, 2017, from http://www.aire.cdmx.gob.mx/

Silva, T. H., Melo, P., & Almeida, J. M. (2014). Revealing the City That We Cannot See. ACM Transactions on Internet Technology (TOIT), 14(4).

Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment. In Fourth International AAAI Conference on Weblogs and Social Media (pp. 178–185). AAAI Publications.

United Nations. (2014). Urbanization Prospects. United Nations.

Webhose.io. (2017). Retrieved June 1, 2017, from https://webhose.io/

Zheng, Y., Capra, L., Wolfson, O., & Yang, H. (2014). Urban Computing: Concepts, Methodologies, and Applications. ACM Transactions on Intelligent Systems and Technology (TIST), 5(3).

APPENDIX

Figure 6. Zoomed-in view showing clusters of tweets.

AQI Category (Range)             PM10       NO2        O3         CO         SO2
Good (0-50)                      0-50       0-40       0-50       0-1.0      0-40
Satisfactory (51-100)            51-100     41-80      51-100     1.1-2.0    41-80
Moderately polluted (101-200)    101-250    81-180     101-168    2.1-10     81-380
Poor (201-300)                   251-350    181-280    169-208    10-17      381-800
Very poor (301-400)              351-430    281-400    209-748    17-34      801-1600
Severe                           430+       400+       748+       34+        1600+