Developing a reproducible workflow for batch
geoprocessing social media in a cloud environment
RICARDO MORALES TROSINO March 2019
SUPERVISORS:
Dr. F.O. Ostermann
Dr. O. Kounadi
Thesis submitted to the Faculty of Geo-Information Science and Earth
Observation of the University of Twente in partial fulfilment of the requirements for the degree of Master of Science in Geo-information Science and Earth Observation.
Specialization: Geoinformatics
THESIS ASSESSMENT BOARD:
Dr. M.J. Kraak (Chair)
Dr. E. Tjong Kim Sang (External Examiner, Netherlands eScience Center, Amsterdam)
Enschede, The Netherlands, March, 2019
DISCLAIMER
This document describes work undertaken as part of a programme of study at the Faculty of Geo-Information Science and Earth
Observation of the University of Twente. All views and opinions expressed therein remain the sole responsibility of the author, and do
not necessarily represent those of the Faculty.
ABSTRACT
The main objective of this research is to deliver workflow scenarios that can process and geoprocess batch social media data. The research focused on defining useful tasks and sub-tasks to explore and analyze batch social media data, and on delivering a prototype able to reproduce the workflow. Two architectural scenarios were identified: one designed for newcomers on a local machine, and another for more advanced users in a cloud environment. The local machine scenario was developed to explore a sample of a stored data set, and the more complex scenario to explore the complete data set in the cloud with a big data framework such as Spark. A prototype was designed to test the workflow and to achieve reproducibility. To test the prototype, a data set was provided with the intention of searching for tick bite events in the Netherlands. The results showed that, following the workflow, the example data set contains some noisy words, and that the processing in the cloud environment was relatively cheap and efficient.
ACKNOWLEDGEMENTS
I want to thank the ITC teachers and staff for sharing all their knowledge with me, and my fellow students who worked with me side by side.
To my supervisors Frank and Rania: without your guidance this work would not have been possible. Thank you for your advice, but mostly for sharing your time and knowledge with me. To Luis Calixto and Raúl Zurita, who shared with me some of their experience with Hadoop. To Rosa and Mowa, who helped me improve my proposal.
To Massyel, for being with me in my brightest days and darkest nights, thank you gatita. To my mom, sister, and father, who gave me remote spiritual support.
I want to acknowledge my sponsors, CONACyT and Alianza FIIDEM, for giving me the opportunity to join their international scholarship program, and for providing all the resources necessary to get to the Netherlands and the chance to grow professionally.
And to my friends of GFM: sorry guys for interrupting every class with my silly questions and comments. I just wanted to break my own bubble to understand and learn more about our world.
TABLE OF CONTENTS
1. Introduction
1.1. Motivation and Problem Statement
1.2. Research Identification
1.3. Research Objectives and Questions
1.4. Thesis Outline
2. Related Work
2.1. Cloud Environment
2.2. Big Data Framework
3. Conceptual Design
3.1. General Workflow
3.2. Implemented Workflow
3.3. Scenarios Introduction
3.3.1. Sampled Geotagged Scenario
3.3.2. Complete Data Set Scenario
4. Prototype Description and Case Study Characteristics
4.1. Case Study Description and Data Set Characteristics
4.2. Sample Scenario in a Local Machine
4.3. Complete Data Set Scenario in a Cloud Environment
4.3.1. AWS Introduction
4.3.2. Prototype Application
5. Implementation Results
5.1. Local Machine Scenario with Sample Data Set
5.2. Cloud Scenario with Complete Data Set
6. Discussion
7. Conclusions
7.1. Research Questions Answered
7.2. Further Work
List of References
LIST OF FIGURES
Figure 1: Brief description of chapter contents
Figure 2: General Workflow
Figure 3: Implemented Workflow
Figure 4: Example of the interactive visualization created by the prototype
Figure 5: Map of geotagged tweets of the data set
Figure 6: Language classification
Figure 7: Initial searched word mentions in Dutch
Figure 8: Map of geotagged tweets in the Netherlands 2015 - 2018
Figure 9: Complete data set most mentioned terms
Figure 10: Filter with the most mentioned words with 'camping'
Figure 11: Filter with the most mentioned words with 'tekenbeten' and 'tekenbeet'
Figure 12: Geocoded records by township and provinces
Figure 13: Geocoded records of term 'tekenbeten' by township and provinces
Figure 14: Comparing the results of 'tekenbeten' map and Tekenradar
LIST OF TABLES
Table 1: Tweet processing example of NLP sub-tasks
Table 2: Twitter Object Attributes
Table 3: Set of words extracted from Twitter (initial searched words)
Table 4: Example of language classification Python tools
Table 5: Cleaning sub-task tweet example
Table 6: Tweet example with elements split but with the track of the original tweet
Table 7: Example of a unigram gazetteer data
Table 8: Example of tokens with matched elements
Table 9: Example of initial words counter
Table 10: Twitter initial searched words co-occurrence matrix
Table 11: Complete data set co-occurrence matrix
LIST OF ABBREVIATIONS
API Application Programming Interface
AWS Amazon Web Services
CLI AWS Command Line Interface
DAG Directed Acyclic Graph
DBMS Database Management System
DNS Domain Name System
EC2 Amazon Elastic Compute Cloud
EMR Elastic MapReduce
GPU Graphics Processing Unit
HDD Hard Disk Drive
IaaS Infrastructure as a Service
IAM Identity and Access Management
NLP Natural Language Processing
PaaS Platform as a Service
RAM Random Access Memory
RDD Resilient Distributed Dataset
S3 Amazon Simple Storage Service
SaaS Software as a Service
SSH Secure Shell
VM Virtual Machine
VPN Virtual Private Network
1. INTRODUCTION
1.1. Motivation and Problem Statement
Social media (SM) has become a channel to share information between users and has experienced exponential growth in recent years. Researchers have been interested in the study of this type of information due to its complexity, its immediate generation of knowledge, its spatial and temporal characteristics, and the volume that social media can generate. For a spatial scientist, it is relevant to study how to extract information from this type of source, which spatial procedures can be applied and on which platforms, and what valuable spatial information can be extracted from this type of data source. It is important to generate usable workflows to make SM information reproducible and more accessible to people with minimal experience in spatial analysis or computer science. Over the last decade, the information generated by social media has evolved, with changes triggered by several factors like mobile phone accessibility, global positioning services or WiFi/4G connectivity. Nowadays, users share immediate data which may be personal, informative or even a geolocated event. Some of these data can provide valuable information or services that can be used as a product. Social networks like Foursquare, Flickr or Twitter can be used as powerful tools to analyze trends in behavior, disasters, events, or health outbreaks within a geographical context (Mazhar Rathore, Ahmad, Paul, Hong, & Seo, 2017).
Social media and cloud computing conceptually seem related, but they are quite distinct. The cloud is described as a model that offers several computing resources like storage, applications, networks or services. By definition, the cloud has five essential characteristics, three service models and four deployment models (Mell & Grance, 2011). The five characteristics are on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service; the four deployment models are the private, community, public and hybrid cloud; and the three service models are Cloud Software as a Service (SaaS), Cloud Platform as a Service (PaaS) and Cloud Infrastructure as a Service (IaaS) (Rountree & Castrillo, 2014). Social media is defined by the various platforms that exchange information; these platforms offer several channels of interaction, like blogging, networking or multimedia content, and they share the same goal: to provide services for exchanging information between users. Emerging computing paradigms like cloud computing show high penetration rates in social and geosocial developments due to their structure, costs, and flexibility (Ramos, Mary, Kery, Rosenthal, & Dey, 2017). Big Data as a Service (BDaaS) is a term conceived by cloud providers to give access to common big data-related services such as processing or analytics. These types of services are based on the user's needs, such as infrastructure, platform or software, but with the difference that the framework, tools, and environment are designed to process massive amounts of information (Neves, Schmerl, Camara, & Bernardino, 2016).
Organizations and researchers are looking for new models that unify real-time and easily accessible information. Providers like Amazon, Azure or Google can offer tools to implement different types of cloud models depending on the needs of the users; these can become elastic and scalable, with promises of increased productivity and reduced costs. The large amount of information that the cloud can store and process is one of its most significant benefits (Chris Holmes, 2018), but the cloud also has some drawbacks, such as security concerns, format inflexibility or the unexpected downtime that the cloud might have (Larkin Andrew, 2018).
In recent years, the words Volume, Variety, Velocity, Veracity, and Value have been described as the five V's. These words are associated with the concept of Big Data, which relates to the generation, processing, and storage of vast amounts of information (Neves et al., 2016). At some point the generated data can surpass the capacities of a local machine or even a computing cluster; social media streams usually create this amount of data. This enormous amount of information has raised several research questions in recent years: how to process, store and extract valuable information from it, how to make the processing more efficient, and how to make this information more accessible. The social media data produced by the mixed social networks has been termed Social Media Big Data (Stieglitz, Mirbabaie, Ross, & Neuberger, 2018).
Social media data can be georeferenced by the service provider or even by the user. This leads to a new generation of data sets with spatial information generated by users or providers, called geosocial media. When the information contains a spatial reference, a spatial researcher can analyze it with a geoprocessing procedure, which can be defined as a framework of tasks and tools that process and automate information from a Geographical Information System (ESRI, n.d.). Previous studies utilized geoprocessing tasks in their research (Ostermann, García-Chapeton, Kraak, & Zurita-Milla, 2018; Yue, Zhang, Zhang, Zhai, & Jiang, 2015); they suggest advisable geospatial analysis tasks for points, such as pattern analysis and clustering, methods of spatial association, and pattern recognition or classification, to mention a few. This information may be a starting point to explore the geoprocessing possibilities.
Geosocial media streams are relatively new in studies of big data in the cloud, and combined with geoprocessing they have opened unexplored research paradigms. Due to their characteristics, these types of data have sparked research questions such as: what are the possible tools to process social media in a spatial context, and how can this information be made more accessible and reproducible to a wider audience of researchers? Ostermann and Granell (2017) examined replicability and reproducibility in 58 papers from 2008-2009 and mention that 58 percent of the papers declare null or limited reproducibility and replicability. The challenge then consists in defining the appropriate tasks and procedures to ensure replicability.
This analysis raises questions such as: which infrastructures are best suited to start processing social media data, which geoprocessing methods are available in the cloud, how to process large-scale data sets of stored data, and how to make this research usable for other researchers. To answer these questions, it is necessary to study cloud environments, workflow design and big data spatial analysis, and to analyze the data set on a local machine and in the cloud. Therefore, research on a workflow design and how it could be implemented in different scenarios will allow a first scouting of important or relevant data; in further research stages, this can provide proven, automated platforms that facilitate the analysis of the information.
1.2. Research Identification
The main objective of this research is to analyze and design workflow scenarios capable of processing and geoprocessing stored geosocial media data. This research identifies the lack of information about the technicalities and limitations of geoprocessing social media: what the options are to process this type of information, and how to geoprocess it on a local machine or in the cloud. In computing analysis, a workflow is used as a way to interpret, communicate and inform a path to achieve a computational analysis (Hettne et al., 2012). A reproducible workflow could provide a trustworthy diagram to explore stored geosocial streams; this workflow will include tasks like data storing, data management and data analysis, with techniques of filtering, cleaning or tagging. The importance of developing a workflow is to ensure its usability by other users or researchers. This is an opportunity to explore the possibilities of storage and processing in the cloud, and also to design and implement a workflow with geoprocessing capabilities.
It is important to explore and define the databases and tools to use in each scenario; one of the challenges is to define the parameters of the scenarios, the differences in the filtering and clustering, and the storage of the data set. It is also important to review the available information. In the literature, some studies have explored topics such as modeling geosocial media with big data, identifying risks based on crowdsourced information, or extracting valuable information from social media (Mazhar Rathore, Ahmad, Paul, Hong, & Seo, 2017b; F. O. Ostermann et al., 2015; Sun & Rampalli, 2014). Only a few papers incorporate geoprocessing into their research, and they face their own challenges, such as insufficient geotagged information to accomplish a scientific analysis (Yue et al., 2015).
Technical challenges such as the usage of a non-relational database, the creation of an efficient batch process, spatial indexing with big data, generating overviews with sampling reduction strategies, and the flexibility of computing and storage resources in the cloud will be encountered during this research. Some of these challenges are described by Yang (2016), who reported geospatial studies that addressed significant geospatial challenges related to the Big Data V's and cloud computing processing or storage.
This study has the potential to contribute a beginner's guide to exploring unexplored data sets in different scenarios, such as the cloud environment.
1.3. Research Objectives and Questions
In this section, the objectives and research questions are established. Briefly, the first objective relates to the workflow, the second to the infrastructure, the third to a prototype system, and the last to the reproducibility of the research. In total, nine research questions were established, and each question tries to answer specific challenges of the study.
To specify the principal tasks and techniques to transform regular social media into geosocial information and to incorporate them in a workflow proposal.
To analyze the relationship between stored geosocial media and infrastructure characteristics to define architecture scenarios.
To implement a prototype system that analyzes the case study information.
To evaluate the reproducibility of the workflow and the performance of the prototype.
Research Questions
1.
1) Which tasks and techniques are necessary to incorporate geosocial media, geoprocessing and a cloud environment in the same workflow?
2) How can the workflow be operationalized, integrating the required tasks and techniques?
2.
1) Which scenarios can be defined based on the stored data, geoprocessing tasks, and system infrastructure?
2) Which different types of technologies are required for each scenario, and why?
3) What are the advantages and disadvantages of the selected scenarios?
3.
1) Which types of limitations from the case study data set and the proposed scenarios will affect the prototype?
4.
1) How do the characteristics of the input data and scenarios affect the reproducibility of the workflow and the re-usability of the prototype?
2) Which techniques or benchmarks are the most feasible to evaluate the performance of the prototype?
3) How can the results from the social media spatial analysis be used for further research?
1.4. Thesis Outline
The project is divided into three main stages: the first describes the design of the workflow based on the required tasks and techniques; the second phase is the creation of the prototype and the analysis of the available data set; and the last step is dedicated to evaluating the prototype and the workflow.
The document is divided into seven chapters. The first chapter describes the motivation of the research combined with a brief description of the research and the research questions and objectives. The second chapter depicts the basic concepts and research associated with social media, the cloud environment, and big data. Chapter three reports the design of the workflow with detailed information on the suggested tasks and sub-tasks. Chapter four reports the implementation of the tasks and sub-tasks in the prototype. Chapter five describes the results of applying the prototype to a case study data set. Chapter six critically discusses the outcomes of the research. Finally, chapter seven provides a brief conclusion and answers each research question.
Figure 1: Brief description of chapter contents
2. RELATED WORK
Social media has become an interesting topic for social science researchers who are seeking valuable information provided by the users. To comprehend the importance of this research, there exist studies where social media data was used and helped to understand the stages of an outbreak: in Nigeria in 2014, social media information reflected an Ebola outbreak before the official announcement (Edd & Rn, 2015).
In 2016, a Zika outbreak in Latin America was tracked with Google searches and Twitter information. This approach may be useful to track a virus and forecast new spreading areas from social media information, and could be incorporated to enhance the usual epidemiological attention to an outbreak (McGough, Brownstein, Hawkins, & Santillana, 2017). These types of research have a direct impact on communities, but to achieve this it is necessary to study the characteristics of the data produced by social media. It is essential to explore how to transform the data efficiently, and in which cases it is essential to analyze, process and store data when big data and social media are in the picture.
Batrinca and Treleaven (2014) published a review for social science researchers interested in the techniques, platforms, and tools necessary to analyze social media. In the report they describe the initial procedure to explore social media (SM): the file formats expected, such as HTML or CSV; the main social media providers, divided into free and commercial sources; tools to process and analyze text from a language perspective; and storing data in a file or in a Database Management System (DBMS), to mention a few examples. In 2013, Croitoru developed a system prototype to explore information from geosocial media, reporting a conceptual model based on two systems: the first to ingest feeds into a system, and a second, multi-step approach to analyze the social media information. Another example is the automatic event detection using big data and a machine learning approach developed by Suma (2018), which concludes that there are improvements in the management and processing of data, but that challenges remain in event detection. Both projects are useful, and their methodologies could be applied in this research as examples of prototypes with social media characteristics for processing massive amounts of data.
Twitter is one of the most used social media microblog sources to obtain information provided by the users. Some analyses incorporate trending words to predict specific behavior of financial markets, or traffic congestion in crowded areas, or even analyze data to detect events without any prior information (Lansley & Longley, 2016). Some of these studies are based on tasks to process text; managing and processing text generated by users can be a risky task due to the number of users and the complexity of each text. Luckily, research has been done in this field. Associations and journals such as the American Association for Artificial Intelligence, the Stanford Natural Language Processing Group, Machine Learning, or the Journal of Artificial Intelligence Research have been working on speech and language processing in computing science for decades; this field is called Natural Language Processing (NLP) (Jurafsky & Martin, 2007). Some of the techniques and tasks used to process text from social media are related to cleaning text for a more specific analysis; the tools and platforms vary depending on the purpose of the research or project. Lately, some studies focus on the generation of massive amounts of text from Twitter, studying issues such as data management and query frameworks, or challenges like the necessity of an integrated solution that combines analysis on Twitter with a big data management perspective (Goonetilleke, Sellis, Zhang, & Sathe, 2014).
Nowadays, there exist several NLP tasks, such as tokenization, language detection or stemming, to analyze social media microblogs like Twitter. These powerful tools make it possible to monitor user activities that were impossible to observe until now; some of these tools are described by Preotiuc-Pietro (2012), who provides an open source framework to process text, describing some tasks implemented on Twitter. Other research has focused on tracking fake accounts (bots) on Twitter by cleaning the information and locating patterns in the fake accounts (Wetstone, Edu, & Nayyar, 2017). These solutions are not bulletproof, but so far some of them have shown good performance analyzing text from the web, and research continues by addressing the challenges generated in this field (Sun, Luo, & Chen, 2017).
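The cleaning and tokenization sub-tasks mentioned above can be sketched with a few regular expressions. This is a minimal, illustrative example with a hypothetical tweet; real pipelines would rely on dedicated libraries such as NLTK or spaCy and on more careful cleaning rules.

```python
import re

def clean_and_tokenize(tweet: str):
    """Toy sketch of common NLP sub-tasks applied to a tweet:
    lowercasing, removal of URLs, mentions and hashtags, and tokenization."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[@#]\w+", "", text)        # remove mentions and hashtags
    return re.findall(r"[a-z]+", text)         # keep alphabetic tokens only

# Hypothetical tweet text, for illustration only.
tokens = clean_and_tokenize("Tick bite near the #camping site! @user https://t.co/xyz")
# tokens == ['tick', 'bite', 'near', 'the', 'site']
```

Stop-word removal and stemming would follow the same pattern, each sub-task transforming the token list produced by the previous one.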
A percentage of social media records is geolocated. When a record includes this type of information from the source, it is called geotagged data; when the spatial information needs to be translated from a text source into coordinates, the process is called geocoding. Some studies have used geolocation and geocoding to study mobility and dynamics between bordering countries (Blanford, Huang, Savelyev, & MacEachren, 2015), to use geolocated tweets as a proxy to determine global human mobility (Hawelka et al., 2014), or to check land uses in cities by their tweet activity (Frias-Martinez, Soto, Hohwald, & Frias-Martinez, 2012).
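The geocoding idea can be illustrated with a toy gazetteer lookup: matching tokens from a text against a table of known place names. The place names and coordinates below are hypothetical simplifications, not real gazetteer data.

```python
# Toy unigram gazetteer: place-name token -> (latitude, longitude).
# Entries are illustrative only.
GAZETTEER = {
    "enschede": (52.22, 6.90),
    "amsterdam": (52.37, 4.90),
}

def geocode_tokens(tokens):
    """Return coordinates for the first token found in the gazetteer,
    or None when the text contains no known place name."""
    for token in tokens:
        coords = GAZETTEER.get(token.lower())
        if coords is not None:
            return coords
    return None

print(geocode_tokens(["tick", "bite", "in", "Enschede"]))  # (52.22, 6.9)
```

A geotagged record, by contrast, would carry its coordinates directly from the source and needs no such lookup.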
2.1. Cloud Environment
In computing, it is vital to comprehend concepts such as node, core, processor, and cluster. A node refers to one individual machine in a collection of machines that are connected and form a cluster (Roca & Cited, 2001). Each node typically has one Central Processing Unit (CPU), which in turn has one or more cores. A core executes the instructions that perform tasks. The performance of a computing cluster depends on its components, but the most efficient choice of number of nodes, number of cores per node, and size of node memory is not always straightforward. A common, practical way to select the components of a cluster is to monitor the time spent to achieve one task with a representative sample of the data and to check the usage of the nodes in the cluster. Another approach is to compare the size of the data set with the calculated capacity of the considered nodes (Amazon, 2018). The cloud has proved to provide the tools to process massive amounts of information; using the available cloud services, it is possible to manage, access, process and analyze social media (C. Yang, Yu, Hu, Jiang, & Li, 2017).
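The capacity-based sizing approach mentioned above amounts to a back-of-the-envelope calculation. The data set size, replication factor and per-node capacity below are hypothetical figures chosen only to illustrate the arithmetic.

```python
import math

def nodes_needed(dataset_gb: float, node_capacity_gb: float, replication: int = 1) -> int:
    """Estimate how many cluster nodes are needed to hold a data set,
    given a usable storage capacity per node and a replication factor."""
    return math.ceil(dataset_gb * replication / node_capacity_gb)

# Hypothetical example: a 500 GB data set, stored 3 times for redundancy,
# on nodes with 200 GB of usable capacity each.
print(nodes_needed(500, 200, replication=3))  # 8 nodes (1500 / 200 = 7.5, rounded up)
```

The time-based alternative, monitoring a representative sample job, complements this estimate by revealing whether CPU or memory, rather than storage, is the limiting factor.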
In cloud computing there exist four types of environments: public, private, hybrid and community. The public cloud environment is defined as a variety of services offered by organizations that specialize in providing infrastructure and technologies for different purposes; this type of environment is focused on external consumers (INAP, 2016). The private environment is often deployed within companies; similar to an intranet, it is usually used internally to manage internal affairs and can be classified as the most secure form of cloud computing. The hybrid approach is a mix of private, public or even community environments, where each member remains a unique entity but is bound to the others with standardized technologies. Jimenez (2018) combined public cloud services such as IaaS and SaaS with a secure private network environment to provide multimedia services for mobiles. The community environment is a combination of private and public with the aspiration of combining resources to provide grid computing and sustainable green computing.
Public cloud service providers offer several products and services, such as databases, gaming, media services, analytics or computing. A possible application is the virtualization of data centers using infrastructure as a service (Moreno-Vozmediano, Montero, & Llorente, 2012). A clear example of SaaS is pictured in learning platforms that offer an interface for teaching and learning (Gurunath & Kumar, 2015).
Garg (2013) developed an index to evaluate different services that are available in public clouds. His evaluation is based on the following attributes: accountability, agility, cost, performance, assurance, security and privacy, and usability; these attributes may be considered when selecting a cloud service provider.
One of the services that the cloud provides relates to infrastructure as a service, which includes the provision of architecture for processing; the selection of architectural resources is based on the needs of each user. The architecture selection in the cloud is associated with the processing time, the need for scalability, the management of groups or profiles, and the availability of particular frameworks or services such as Spark (Kirschnick, Alcaraz Calero, Wilcock, & Edwards, 2010). In recent years, Hadoop and Spark have been used to process and analyze massive amounts of data. Hadoop has been an option to process this type of information in the cloud. International Business Machines (IBM) defines Apache Hadoop as an "open source software framework that can be installed on a cluster of commodity machines so the machines can communicate and work together to store and process large amounts of data in a highly distributed manner." It is common to use Hadoop together with an object file system for storage and scalability of the clusters or nodes in the cloud. Apache Hadoop relies on a MapReduce model that separates the tasks into mapping and reducing. MapReduce, developed by Google in 2004, proved a reliable model to significantly reduce the execution time in a cluster. A combination of cloud services and models such as MapReduce can reduce the utilization of computing resources and handle workload spikes automatically by increasing the resources when required (Z. Li, Yang, Liu, Hu, & Jin, 2016).
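The two MapReduce phases can be sketched in plain Python with the classic word-count example. This is a single-machine illustration of the model only; a real Hadoop job would distribute the map and reduce calls across cluster nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["tick bite", "tick season"]
result = reduce_phase(chain.from_iterable(map_phase(line) for line in lines))
# result == {'tick': 2, 'bite': 1, 'season': 1}
```

In a distributed setting, the framework also shuffles the intermediate pairs so that all pairs for the same word reach the same reducer.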
2.2. Big data Framework
Three characteristics define big data: the massive amount of information, unstructured data, and the requirement to process this information in real time; one common generator of this type of information is social media users (Maynard, Bontcheva, & Rout, 2012). There exist different approaches to handle and examine big data; some important aspects to consider are the data source, the data management, the available data analysis tools, the type of analyses and, very importantly, the framework used to process big data, such as Hadoop or Spark (Rao, Mitra, Bhatt, & Goswami, 2018). Big data infrastructure is defined as the mechanisms to collect the information, the computer programs and physical storage to collect it, the framework and environment that allow processing and channeling the information, and finally the foundation where the results will be backed up and stored (Tozzi, 2018).
Spark was developed in 2010 by the AMPLab at the University of California, Berkeley; it is a flexible in-memory framework that allows both batch and real-time processing. The framework is compatible with the MapReduce model and can also be seen as a complement to the Hadoop framework. Spark uses a master/workers schema for the nodes of the cluster, allowing parallel computing. The innovation of Spark is the inclusion of a distributed collection of objects across a set of computers, named Resilient Distributed Datasets (RDDs), together with a Directed Acyclic Graph (DAG) of transformations. The main difference between Spark and Hadoop lies in the method of processing information: Spark processes the information in memory (RAM) with the use of RDDs, while Hadoop reads and writes the information on disk in the Hadoop Distributed File System (HDFS). An RDD is a collection of read-only objects partitioned across different machines in a cluster (Zaharia & Chowdhury, 2010). HDFS reorganizes files into small chunks and distributes them over different nodes (Verma, Hussain, & Jain, 2016). Both frameworks, Spark and Hadoop, use a technology called YARN for resource management and job scheduling; this technology distributes the work between the different cluster nodes.
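The lazy, chained transformations behind RDD lineage can be mimicked with plain Python generators. This is only a conceptual illustration on one machine; an actual Spark job would express the same pipeline with PySpark transformations and an action such as collect().

```python
def run_pipeline(lines):
    """Mimics an RDD lineage: map (tokenize) -> filter (keep lines
    mentioning 'tick') -> collect. The two generator expressions are
    lazy, like RDD transformations; list() plays the role of the
    collect() action that finally triggers the computation."""
    tokenized = (line.split() for line in lines)        # like rdd.map, lazy
    with_tick = (t for t in tokenized if "tick" in t)   # like rdd.filter, lazy
    return list(with_tick)                              # like collect(): runs the work

print(run_pipeline(["tick bite", "sunny day", "tick season"]))
# [['tick', 'bite'], ['tick', 'season']]
```

In Spark, this laziness is what lets the DAG scheduler fuse transformations and recompute lost partitions from lineage instead of replicating data.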
When spatial data and big data are in the picture, some recommendations are to implement data reduction strategies or to provide a computational method that minimizes the computational requirements in the cluster (Armstrong, Wang, & Zhang, 2018). Among the technologies available for huge spatial data sets is SpatialHadoop, which enriches the MapReduce framework by adding two levels of spatial indexes and contains spatial operations such as spatial joins or kNN and range queries (Eldawy & Mokbel, 2015).
The other is GeoSpark, which also provides a MapReduce model and support for geometrical and spatial objects with data partitioning and indexing, and also supports spatial querying. In practice, some researchers have evaluated the performance of platforms like GeoSpark and SpatialHadoop, testing their scalability and performance (Yu, Jinxuan, & Mohamed, 2015); some interesting findings are related to the incorporation of spatial indexes instead of traditional indexing (Eldawy & Mokbel, 2015). Others explore index efficiency for spatiotemporal frameworks by dividing the input file into nodes, mimicking a spatiotemporal index (Alarabi, Mokbel, & Musleh, 2018). Some of these studies revealed that Spark has better performance, scalability, and stability compared with Hadoop (Reynold, 2014). But this type of processing and analysis is still in a development stage; challenges remain in 1) the spatial indexing of, and models to process, real-time information, 2) quick assessments to calculate the propagation of errors, and 3) the study of efficient methods to visualize big data sets in understandable and communicable displays (S. Li et al., 2015).
3. CONCEPTUAL DESIGN
The workflow design is described in two steps: a general and an implemented workflow. The first describes social media processing and analysis in a conceptual manner, embracing general concepts that apply to a broad range of social data sets and user concerns; it is called the "General Workflow". The second step focuses on identifying and reporting the relevant techniques and technologies and on how to implement a workflow for the current case study; it is called the "Implemented Workflow". The general workflow aims for simplicity, predictability, and reproducibility in its design, providing implementation suggestions for the cloud environment depending on the user's needs, whereas the implemented workflow is more rigid in structure and more specific in its sub-tasks and technology suggestions.
A workflow gives infrastructure for initializing a process in an ordered way, promising, among other benefits, to increase productivity and to reduce, diagnose, and cut down errors in the process (Integrify, 2017). Workflow design should aim for clarity, simplicity, recordability, reportability, and reusability; with these characteristics a problem can be analyzed in a systematic manner and intensive computing analyses can be integrated (McPhillips, Bowers, Zinn, & Ludäscher, 2009). Workflows are commonly used to express pictorially an abstraction of an automated process in sequential order. A workflow provides a tool for further analysis and, with a good design, opens the possibility of reproducing the work in new research. In social media, cloud computing, and big data, several researchers have developed workflows that address the problems described above. Zhang, Bu, & Yue (2017) developed a workflow and a tool that provide an open environment for geoprocessing rasters and vectors in a user-friendly context. The WIFIRE tool implements a workflow model to analyze wildfires based on data-driven modeling and geospatial information (Altintas, 2015), and Wachowicz (2016) developed a workflow for geotagged tweets focused on data ingestion, data management, and data querying. Suma et al. (2018) focused on detecting spatiotemporal events in cities using Twitter and developed a workflow for this type of event detection.
To develop a workflow it is necessary to incorporate and define tasks and sub-tasks; in this case, storing, processing, and analyzing the geosocial media data. As part of the workflow it is essential to evaluate characteristics of the storage and processing environments: in general, features such as the supported cloud services, networking capabilities, and the performance of the workflow solution need to be reviewed and evaluated in the context of geosocial media. One crucial reflection, based on the work of Hettne (2012), concerns avoiding workflow decay, described as the missing factors that prevent a workflow from being executed or reproduced; one recommendation to avoid decay is to compare functional workflows and reuse them, extracting the main idea and restating it in the current work. The main tasks proposed in the general and implemented workflows are based on the research of Wachowicz (2016), Suma (2018), and Mazhar (2017), in which the principal tasks are collection or data pooling, preprocessing, processing or data management, and data querying or data analysis. Previous work identifies four main tasks; for this research they will be data management, pre-processing, processing, and analysis.
The reason for designing two workflows is to propose separate perspectives on geoprocessing geosocial media in a cloud environment. The general workflow takes a broadly applicable approach and is the simpler, more flexible, and more robust proposal; the implemented workflow describes in detail the technical and technological tools needed to carry out the analysis.
The design of both workflows consists of two scenarios, with a sampled and a complete data set, and several branches on a local machine or in the cloud. One scenario in each workflow is based on a sampled data set and an optional local machine architecture; the second is designed to analyze the complete data set in an optional cloud environment. The branches target specific users: a novice whose objective is to explore a data set with limited resources and knowledge, and a user who has more computational resources and is familiar with cloud computing, big data frameworks, and the conceptual analysis of social media. The branches differ in implementation difficulty, which depends on the selected architecture, the characteristics of the data set, and the research question.
3.1. General Workflow
This section gives a general overview of the workflow. Conceptually it is defined along two dimensions: the availability of resources, such as architecture and skills, and the type of data, evaluated by its volume, variety, and velocity. This project considers two workflows: a general workflow designed for a broad spectrum of applications, and an applied workflow that is more exhaustive and detailed; its tasks are described in the following sections.
The workflow (Figure 2) assumes a stored social media data set; the main process seeks to answer a research question with that data set. Briefly, a set of questions splits the workflow into three sections: one for processing and analysis on a local machine, a second that uses the cloud, and a third that uses the cloud and also integrates a big data environment. The first question concerns the current resources of each user; a second question focuses on skills and financial means, at which point the workflow divides into two branches: a cloud branch, which examines specific characteristics of the data set such as its volume, variety, and velocity requirements, and a sampled branch, which tries to answer the research question with a portion of the data set. The third question, in both scenarios, asks whether the research question has been answered. The research question can be defined as a question that needs to be answered with respect to the data set; some questions may have a specific topic, while others may be more general. The question should also relate to the fields or characteristics of the data set.
First Question: The first question provides a branching in which the user can process and analyze all the information on a local machine and continue the analysis there; this scenario is designed for relatively small data sets.
Second Question: Depending on the characteristics of each data set, the second question provides two branches: one with a sampled data set and limited resources, and another with the complete data set and extra resources, depending on budget and experience. The sampled branch uses the same methodology as the local machine branch.
If a cloud environment is selected, the branching asks three further questions about the characteristics of the data set. Since each data set and user has very specific needs, the workflow only suggests some cloud scenarios, and the next branching is based on these.
In the cloud division, three factors affect all the questions: time, resources, and user experience. Depending on the data set and the user's needs, these factors influence the branching decision, and the subsequent questions in the workflow are affected by them as well. The questions concern (a) the volume of the data set, for which the contemplated volumes are gigabytes, terabytes, and petabytes, (b) the structure of the data, and (c) a time-related aspect. The answers split the workflow between cloud processing and analysis or a big data environment. It is important to note that these characteristics are only suggestions; given the specificity of each user's data set, they may change. The workflow does not contemplate a semi-structured data set among the suggested characteristics, because of the complexity of combinations this option would add, but it is depicted in the question and suggested options.
Figure 2 shows the complete workflow; the third question splits the figure into two possible outcomes, the upper part designed for local machine analyses and the lower part for a cloud approach. The cloud branching was designed to support processing and analysis with additional computational resources, with a second approach for the case in which the data is massive and unstructured.
Third Question: This question provides an optional resolution related to the data set topic; in the case of a negative answer, the suggestion is to start collecting data with the new insights from the previous analysis. If the analysis is not successful in the sampled branch, the recommendation is to save resources in order to test the complete data set in the cloud.
3.2. Implemented Workflow
The main tasks of the implemented workflow, displayed in Figure 3, are data management, pre-processing, processing, and analysis; each task comprises one or several sub-tasks. The tasks and sub-tasks differ between the general and the implemented workflow. Both workflows share the same objective but serve different purposes: the general workflow provides a route for choosing the appropriate architecture for each user, while the implemented workflow provides details for the processing and analysis of the data.

Figure 2: General Workflow. The general questions are numbered from 1 to 3 and the cloud questions are lettered from a to c.

The architectural limitations and necessities led to the idea of two scenarios: one uses a sample, only the geotagged tweets, on a local machine, while the other includes the complete data set within a cloud environment.
The collected data set contains tweets with selected keywords that may represent a topic or an event; one objective of this work is to verify whether these words are related to the initial research question. The data set proposed for this workflow is classified as Retrospective Event Detection (RED) because it was stored and then analyzed in batch processing mode. Common approaches to exploring the topic of a data set are a word frequency analysis, an unsupervised methodology exploiting NLP techniques to analyze topics, or a supervised method to classify topics. In this workflow, word frequencies and an unsupervised methodology are the suggested additions to the workload.
Figure 3: Implemented Workflow
The following sections describe the scenarios, the geotagged sample and the complete data set, with local and cloud environment characteristics. Each part describes a scenario in detail, adding sub-tasks and technology suggestions to the workflow: first an introduction to the scenarios, followed by a description of data management, and finally the detailed description of each scenario.
3.3. Scenarios Introduction
Two scenarios are described in the following paragraphs: one on a local machine with a sample of the data set, designed to explore the data, and a second in the cloud with the complete data set, designed to process massive amounts of stored social media information.
The workflow represented in Figure 3 is divided into two branches. The first branch assumes a local machine scenario (depicted with a laptop logo in each workflow form) with a sampled geotagged data set (depicted with the letter "a"), while the second branch represents the cloud environment (depicted with a cloud logo) with the complete data set (depicted with the letter "b"). Some sub-tasks depend on the result of the previous task, while others could be reordered or even skipped; the numbering of the sub-tasks in the workflow is therefore only a suggested order.
The local machine scenario contemplates querying and filtering the data, using NLP techniques, and applying basic analysis techniques. The data set was sampled by its spatial characteristics, i.e., the geotagged records; typically only 1-3% of tweets are geotagged, and this is the information used in the sampled scenario. The complete scenario has a similar design but differs in some tasks and sub-tasks. Its initial purpose is to extract spatial information embedded in the text in order to increase the number of geotagged tweets, and then to use this improved geotagged data set to address the research question with new spatial information that may provide extra spatial detail.
The implemented workflow consists of a few questions whose main purpose is to separate the scenarios and provide options to each user. The first question focuses on the management of the data, the second on the size of the data set, the third (only in the complete data set branch) on the search for spatial information within the text, and the final question on answering the research question. The last question includes an option to provide insights: it is a loop question, designed so that a negative answer improves the data set with insights from the first pass of the analysis; this is represented with a dotted line in Figure 3. Since the scenario with only geotagged tweets is relatively easy to apply, running it first might provide valuable information without the need to apply the complete scenario.
Data Management Task
First Question: The first question splits the management task. It asks whether the current system can load the information collected from the social media source. In both branches, data management is composed of storage and retrieval; these sub-tasks differ according to the needs of each user, the current system, and the size of the data set. In the workflow, data management is divided between the local machine and the cloud, the latter further divided into structured and unstructured data management. A database management system (DBMS) is usually used to retrieve information; it can be installed on a local machine, a cluster, or the cloud, depending on the resources and the volume of the data set.
The characteristics of each system determine whether a DBMS, and which associated technology, is needed. In some cases a DBMS is not required and the information can be processed from a single file; nevertheless, the first question of the workflow opens the possibility of loading the data set on a local machine or in a cloud environment, depending on the characteristics of the data set and the requirements of the user. Most cloud providers offer integrated DBMSs as services, both structured and unstructured, and files can also be stored and retrieved from the cloud without a DBMS.
A DBMS organizes the data set into logical structures defined by a model, which makes it possible to query the database and filter information from the source, two sub-tasks of the workflow. Without a database, querying and filtering become more complex and time-consuming. The disadvantages of a DBMS are its complexity, size, and performance overhead; some DBMSs require additional resources and even an administrator, a designer, and a developer, making them expensive and complex.
Since social media is associated with the generation of massive amounts of information, a DBMS is a good option for storing and retrieving it. Selecting a DBMS involves considerations such as the structure of the data set. There are three types of data sets: structured, semi-structured, and unstructured. A structured data set is typically associated with a relational SQL structure, an unstructured data set with a non-relational NoSQL structure, and a semi-structured data set can be associated with both. NoSQL databases are classified by data model: column, document, key-value, or graph. A SQL approach follows the traditional rows-and-columns schema. For a structured data set, recognized technologies include Oracle SQL and PostgreSQL; for an unstructured data set, MongoDB and Cassandra are among the most popular.
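As a minimal sketch of the structured option, the example below uses SQLite from the Python standard library as a lightweight stand-in for a relational DBMS such as PostgreSQL; the table layout and the records are hypothetical:

```python
import sqlite3

# In-memory SQLite database as a lightweight stand-in for a relational DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tweets (
    id   INTEGER PRIMARY KEY,
    text TEXT,
    lat  REAL,
    lon  REAL
)""")
rows = [
    (1, "Got a tick bite while hiking", 52.22, 6.89),
    (2, "Beautiful day in Enschede", 52.22, 6.90),
    (3, "Tick bite warning issued", None, None),  # not geotagged
]
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", rows)

# A simple keyword query, as the workflow's querying sub-task suggests.
cur = conn.execute("SELECT id, text FROM tweets WHERE text LIKE '%tick%'")
results = cur.fetchall()
print(results)
```

The same filter against a document store such as MongoDB would be expressed as a query document rather than SQL, but the role in the workflow, reducing the records before processing, is identical.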
The cloud provides services that make it relatively easy to set up, maintain, and administer a DBMS. The advantages of a DBMS in the cloud are scalability and performance: it is dynamic, allowing simple or complex data sets, and its resources are easily scaled, unlike a physical server or cluster where regular database upgrades and administration can be expensive. Among the disadvantages of the cloud are that the administrator loses complete control of the servers, the data set is totally dependent on the provider, transferring information requires a good internet connection, and switching between providers raises several issues.
The workflow contemplates cloud storage and retrieval as a service; the complete data set scenario has two options, structured and unstructured data management, and the selection depends on characteristics of the data set such as the type of structure, the volume, and the velocity of retrieval.
3.3.1. Sampled Geotagged Scenario
Second Question: The second question of the implemented workflow splits the workflow into the sampled and complete scenarios. The sampled scenario contains three tasks, each focused on processing and providing inputs for the next: pre-processing, NLP, and analysis. The geotagged information represents a fraction of the complete data set; this scenario contemplates information collected from the streaming API, and depending on the characteristics of the search and keywords, the information may be distributed across different parts of the globe. The main goal of this scenario is to answer the research question or provide insights for further analysis.
Pre-processing Task
The pre-processing task focuses on querying, filtering, removing, and cleaning. In the workflow these sub-tasks are numbered in the upper left corner; the numbering is only a recommendation and may change depending on the characteristics of each data set.
The first sub-task consists of querying the database or file to retrieve only the records with valid latitude and longitude coordinates. The advantage of querying is that it reduces the number of records by requesting only the information that is required, which saves time in the following procedures. A query may be simple or complex, depending on the user's needs; in some cases an efficient query may help solve the research question almost at the first stages of the workflow. It is important to mention that if there are no geotagged tweets, more information will have to be collected.
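A minimal sketch of this sub-task, assuming the records are available as Python dictionaries with hypothetical lat/lon fields, keeps only the records whose coordinates are present and fall within valid WGS84 ranges:

```python
def has_valid_coordinates(record):
    """True if the record carries a usable latitude/longitude pair."""
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

records = [  # toy records; real ones would come from the DBMS or file
    {"text": "tick bite near the park", "lat": 52.22, "lon": 6.89},
    {"text": "no location on this one", "lat": None, "lon": None},
    {"text": "corrupted coordinates", "lat": 999.0, "lon": 6.89},
]
geotagged = [r for r in records if has_valid_coordinates(r)]
print(len(geotagged))  # 1
```

In a DBMS this filter would instead be pushed into the query itself (a WHERE clause), which is generally faster because fewer records ever leave the database.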
The sub-task named removing focuses on locating specific patterns from certain social media users, who are then excluded from the analysis. Some accounts may be fake or may skew the sample by adding the same tweet several times with the same sentence and words; for this reason it is necessary to locate such users or records, and the analysis may then focus on the users and their behavior.
The last suggested sub-task in the workflow is called cleaning; it looks for special characters, symbols, numbers, and URLs that are not essential for the NLP analysis and removes them from the text.
It is important to remember that the data set comes from social media; this type of information has several sources and therefore tends to be noisy. Tweets may contain words, symbols, URLs, emoticons, or numbers that carry no contextual message, and removing and cleaning such data can be challenging. Fortunately, many institutions have been working on the comprehension of human language by computers, and this work includes methods to analyze text and strip the unnecessary additions from it.
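A sketch of the cleaning sub-task with the standard re module, applied to the example tweet of Table 1; the exact patterns to strip (URLs, mentions, digits, symbols) depend on each data set and are only illustrative here:

```python
import re

def clean_tweet(text):
    """Strip URLs, @mentions, digits and stray symbols, keeping plain words."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # digits, punctuation, emoticons
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

raw = ("@miley This tweet is an examples of Natural Processing Language "
       "12345, Http://www.google.com.mx")
print(clean_tweet(raw))
# This tweet is an examples of Natural Processing Language
```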
Natural Language Processing Task
The processing task uses NLP techniques such as language detection, tokenization, stop-word removal, stemming, and lemmatization; a brief explanation of each method and its use in this workflow follows. It is essential to clean and filter the text before this task is applied.
The language detection sub-task classifies the information by the language of each sentence or document. Language detection is applied before stop-word removal and tokenization, because the stop-word sub-task looks up specific words in a per-language list; before a stop-word list can be applied, the language must be detected. Some words contribute no information to the analysis, and these are removed from each line of text in the tweets.
The tokenization sub-task separates every word in each tweet, a necessary step to simplify the text and analyze the message each tweet contains. Tokenization splits the text into separate terms and can also separate the specific symbols used on microblogs such as Twitter. Several pre-trained tools are capable of tokenizing a sentence.
Typically, the stop-word (or stop-list) technique is applied after tokenization and language detection; the removed words are widespread and do not contribute to the semantic analysis. In the workflow it is suggested after tokenization.
The stemming and lemmatizer techniques (Figure 3) have the same order number in the top left corner, denoting that their order is not significant in the workflow. Both techniques normalize the text of each tweet and are suggested after the stop-word technique. Stemming shortens or reduces words to their morphological root, while a lemmatizer consults specific dictionaries to return the lemma of each word. Table 1 displays examples of the suggested sub-tasks for the workflow.
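The ordering of these sub-tasks can be sketched in plain Python; in practice, libraries such as NLTK or spaCy supply trained tokenizers, stop-word lists, stemmers, and lemmatizers, so the tiny word lists and suffix rules below are illustrative stand-ins only:

```python
# Illustrative stand-ins: real projects would use NLTK/spaCy resources instead.
STOP_WORDS = {"this", "is", "an", "of", "the"}
LEMMAS = {"examples": "example"}  # a real lemmatizer consults full dictionaries

def tokenize(text):
    return text.lower().split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

def stem(tokens):
    # Crude suffix stripping; a Porter-style stemmer applies many such rules.
    return [t[:-2] if t.endswith("es") else t[:-1] if t.endswith("s") else t
            for t in tokens]

tokens = tokenize("This tweet is an examples of Natural Processing Language")
no_stops = remove_stop_words(tokens)
print(lemmatize(no_stops))
# ['tweet', 'example', 'natural', 'processing', 'language']
```

The outputs mirror the tokenization, stop-list, and lemmatizer rows of Table 1; the stem function here is far cruder than the real stemmer behind that table's last row.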
Analysis Task
The task contains three sub-tasks, frequencies, topic modeling, and visualization, numbered and applied in that order. The goal of the frequencies sub-task is to understand the distribution of the collected words: the most mentioned words, the most common hashtags, word combinations with n-grams, and per-sentence combinations with a co-occurrence matrix. The expected result is to visualize and comprehend the weight and behavior of each queried word in the data set.
The frequency analysis provides a first insight into the data set and is a powerful tool for finding information related to the research question. The suggested analyses of the most frequently mentioned words are hashtags, n-grams, and specific combinations of words; they help to identify the most frequent terms in the data set and their relationship with the initially queried keywords. The co-occurrence matrix, part of the frequency analysis, finds the terms that most commonly appear together with a specific word.
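A minimal sketch of these frequency analyses with the Python standard library, on a toy set of already-tokenized tweets (the terms are invented for illustration):

```python
from collections import Counter
from itertools import combinations

tweets = [  # already tokenized and cleaned
    ["tick", "bite", "forest"],
    ["tick", "bite", "dog"],
    ["forest", "walk", "dog"],
]

# Word frequencies across the data set.
word_freq = Counter(w for tweet in tweets for w in tweet)

# Bigrams (n-grams with n = 2) within each tweet.
bigrams = Counter((a, b) for tweet in tweets for a, b in zip(tweet, tweet[1:]))

# Co-occurrence: how often two terms appear in the same tweet.
cooccurrence = Counter(
    pair for tweet in tweets for pair in combinations(sorted(set(tweet)), 2)
)

print(word_freq["tick"])                  # 2
print(bigrams[("tick", "bite")])          # 2
print(cooccurrence[("bite", "tick")])     # 2
```

For hashtags the same Counter approach applies, counting only tokens that begin with "#" before cleaning strips the symbol.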
The workflow incorporates unsupervised machine learning clustering as a suggestion, to analyze the text of the data set and test whether it contains valuable information for the case study. Based on the experience of Yang (2016), this method proved adequate for clustering the information and finding a topic given the distribution of the words. Because little is known about the topic of the data set beforehand, the topic modeling sub-task categorizes all the terms into different topics. The suggested model is LDA, which is typically used in NLP for unsupervised topic modeling. It treats documents as collections of words that are categorized into semantic groups; based on the distribution of each semantic group, a word is assigned to a set of words forming a clustered topic. The purpose here is to find the diversity of topics the data contains. Nevertheless, this technique requires some understanding of probability distributions, such as the Poisson distribution, and defining the input parameters of the model is not an easy task (Blei, Jordan, & Ng, 2003).
It is essential to visualize the information generated by the frequency analyses as well as the spatial information; social media data is typically skewed toward cities where population density is high. The main purpose of the visualization sub-task is to represent the social media information in its space and time context. Nevertheless, visualizing this type of information may reflect a different behavior in
Table 1: Examples of the suggested NLP sub-tasks.

Sub-task            Example
Original tweet      @miley This tweet is an examples of Natural Processing Language 12345, Http://www.google.com.mx
Cleaning            This tweet is an examples of Natural Processing Language
Language Detection  english
Tokenization        ['this', 'tweet', 'is', 'an', 'examples', 'of', 'natural', 'processing', 'language']
Stop List           ['tweet', 'examples', 'natural', 'processing', 'language']
Lemmatizer          ['tweet', 'example', 'natural', 'processing', 'language']
Stemming            ['tweet', 'exampl', 'natur', 'process', 'languag']