Developing a reproducible workflow for batch
geoprocessing social media in a cloud environment
RICARDO MORALES TROSINO March 2019
SUPERVISORS:
Dr. F.O. Ostermann
Dr. O. Kounadi
Thesis submitted to the Faculty of Geo-Information Science and Earth
Observation of the University of Twente in partial fulfilment of the requirements for the degree of Master of Science in Geo-information Science and Earth Observation.
Specialization: Geoinformatics
THESIS ASSESSMENT BOARD:
Dr. M.J. Kraak (Chair)
Dr. E. Tjong Kim Sang (External Examiner, Netherlands eScience Center, Amsterdam)
Enschede, The Netherlands, March, 2019
DISCLAIMER
This document describes work undertaken as part of a programme of study at the Faculty of Geo-Information Science and Earth
Observation of the University of Twente. All views and opinions expressed therein remain the sole responsibility of the author, and do
not necessarily represent those of the Faculty.
ABSTRACT
The main objective of this research is to deliver workflow scenarios that can process and geoprocess batch social media data. The research focused on defining useful tasks and sub-tasks to explore and analyze batch social media data, and on delivering a prototype able to reproduce the workflow. Two architectural scenarios were identified: one designed for newcomers on a local machine, and another for more advanced users in a cloud environment. The local machine scenario was developed to explore a sample of a stored data set, and the more complex scenario to explore the complete data set in the cloud with a big data framework such as Spark. A prototype was designed to test the workflow and to achieve reproducibility. To test the prototype, a data set was provided with the intention of searching for tick bite events in the Netherlands. The results showed that, following the workflow, the example data set contains some noisy words, and that the processing in the cloud environment was relatively cheap and efficient.
ACKNOWLEDGEMENTS
I want to thank the ITC teachers and staff for sharing all their knowledge with me, and my fellow students who worked with me side by side.
To my supervisors Frank and Rania: without your guidance this work would not have been possible. Thank you for your advice, but mostly for sharing your time and knowledge with me. To Luis Calixto and Raúl Zurita, who shared with me some of their experience with Hadoop. To Rosa and Mowa, who helped me improve my proposal.
To Massyel, for being with me in my brightest days and darkest nights, thank you gatita. To my mom, sister, and father, who gave me remote spiritual support.
I want to acknowledge my sponsors, CONACyT and Alianza FIIDEM, for giving me the opportunity to join their international scholarship program, and for providing all the resources necessary to get to the Netherlands and the chance to grow professionally.
And to my friends of GFM: sorry guys for interrupting every class with my silly questions and comments. I just wanted to break my own bubble to understand and learn more about our world.
TABLE OF CONTENTS
1. Introduction
1.1. Motivation and Problem Statement
1.2. Research Identification
1.3. Research Objectives and Questions
1.4. Thesis Outline
2. Related Work
2.1. Cloud Environment
2.2. Big Data Framework
3. Conceptual Design
3.1. General Workflow
3.2. Implemented Workflow
3.3. Scenarios Introduction
3.3.1. Sampled Geotagged Scenario
3.3.2. Complete Data Set Scenario
4. Prototype Description and Case Study Characteristics
4.1. Case Study Description and Data Set Characteristics
4.2. Sample Scenario in a Local Machine
4.3. Complete Data Set Scenario in a Cloud Environment
4.3.1. AWS Introduction
4.3.2. Prototype Application
5. Implementation Results
5.1. Local Machine Scenario with Sample Data Set
5.2. Cloud Scenario with Complete Data Set
6. Discussion
7. Conclusions
7.1. Research Questions Answered
7.2. Further Work
List of References
LIST OF FIGURES
Figure 1: Brief description of chapter contents
Figure 2: General Workflow
Figure 3: Implemented Workflow
Figure 4: Example of the interactive visualization created by the prototype
Figure 5: Map of geotagged tweets of the data set
Figure 6: Language classification
Figure 7: Initial searched word mentions in Dutch
Figure 8: Map of geotagged tweets in the Netherlands 2015 - 2018
Figure 9: Complete data set most mentioned terms
Figure 10: Filter with the most mentioned words with 'camping'
Figure 11: Filter with the most mentioned words with 'tekenbeten' and 'tekenbeet'
Figure 12: Geocoded records by township and provinces
Figure 13: Geocoded records of term 'tekenbeten' by township and provinces
Figure 14: Comparing the results of 'tekenbeten' map and Tekenradar
LIST OF TABLES
Table 1: Tweet processing example of NLP sub-tasks
Table 2: Twitter Object Attributes
Table 3: Set of words extracted from Twitter (initial searched words)
Table 4: Example of language classification Python tools
Table 5: Cleaning sub-task tweet example
Table 6: Tweet example with elements split but with the track of the original tweet
Table 7: Example of a unigram gazetteer data
Table 8: Example of tokens with matched elements
Table 9: Example of initial words counter
Table 10: Twitter initial searched words co-occurrence matrix
Table 11: Complete data set co-occurrence matrix
LIST OF ABBREVIATIONS
API Application Programming Interface
AWS Amazon Web Services
CLI AWS Command Line Interface
DAG Directed Acyclic Graph
DBMS Database Management System
DNS Domain Name System
EC2 Amazon Elastic Compute Cloud
EMR Elastic MapReduce
GPU Graphics Processing Unit
HDD Hard Disk Drive
IaaS Infrastructure as a Service
IAM Identity and Access Management
NLP Natural Language Processing
PaaS Platform as a Service
RAM Random Access Memory
RDD Resilient Distributed Dataset
S3 Amazon Simple Storage Service
SaaS Software as a Service
SSH Secure Shell
VM Virtual Machine
VPN Virtual Private Network
1. INTRODUCTION
1.1. Motivation and Problem Statement
Social media (SM) has become a channel to share information between users and has experienced exponential growth in recent years. Researchers have been interested in the study of this type of information due to its complexity, its immediate generation of knowledge, its spatial and temporal characteristics, and the volume that social media can generate. For a spatial scientist, it is relevant to study how to extract information from this type of source, which spatial procedures can be applied and on which platforms, and what valuable spatial information can be extracted from this type of data source. It is important to generate usable workflows to make SM information reproducible and more accessible to people with minimal experience in spatial analysis or computer science. Over the last decade, the information generated by social media has evolved, with changes triggered by several factors like mobile phone accessibility, global positioning services or WiFi/4G connectivity. Nowadays, users share immediate data which may be personal, informative or even a geolocated event. Some of these data can provide valuable information or services that can be used as a product. Social networks like Foursquare, Flickr or Twitter can be used as powerful tools to analyze trends in behavior, disasters, events, or health outbreaks within a geographical context (Mazhar Rathore, Ahmad, Paul, Hong, & Seo, 2017).
Social media and cloud computing conceptually seem related, but they are quite distinct. The cloud is described as a model that offers several computing resources like storage, applications, networks or services. By definition, the cloud has five essential characteristics, three service models and four deployment models (Mell & Grance, 2011). The five characteristics are on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service; the four deployment models are the private, community, public and hybrid cloud; and the three service models are Cloud Software as a Service (SaaS), Cloud Platform as a Service (PaaS) and Cloud Infrastructure as a Service (IaaS) (Rountree & Castrillo, 2014). Social media is defined by the various platforms that exchange information; these platforms offer several channels of interaction, like blogging, networking or multimedia content, and they share the same goal: to provide services for exchanging information between users. Emerging computing paradigms like cloud computing show high penetration rates in social and geosocial developments due to their structure, costs, and flexibility (Ramos, Mary, Kery, Rosenthal, & Dey, 2017). Big Data as a Service (BDaaS) is a term conceived by cloud providers to give access to common big data-related services such as processing or analytics. These types of services are based on the user's needs, such as infrastructure, platform or software, but with the difference that the framework, tools, and environment are designed to process massive amounts of information (Neves, Schmerl, Camara, & Bernardino, 2016).
Organizations and researchers are looking for new models that unify real-time and easily accessible information. Providers like Amazon, Azure or Google can offer tools to implement different types of cloud models depending on the needs of the users; these can become elastic and scalable, with promises of increased productivity and reduced costs. The large amount of information that the cloud can store and process is one of its most significant benefits (Chris Holmes, 2018), but the cloud also has some drawbacks, such as security concerns, format inflexibility or the unexpected downtime that the cloud might have (Larkin Andrew, 2018).
In recent years, the words Volume, Variety, Velocity, Veracity, and Value have been described as the five V's. These words are associated with the concept of Big Data, which relates to the generation, processing, and storage of vast amounts of information (Neves et al., 2016). At some point the generated data can surpass the capacities of a local machine or even a computing cluster; social media streams usually create this amount of data. This enormous amount of information has raised several research questions in recent years: how to process, store and extract valuable information from it, how to make the processing more efficient, and how to make this information more accessible. The social media data produced by the mixed social networks has been termed Social Media Big Data (Stieglitz, Mirbabaie, Ross, & Neuberger, 2018).
Social media data can be georeferenced by the service provider or even by the user. This leads to a new generation of data sets with spatial information generated by users or providers, called geosocial media. When the information contains a spatial reference, a spatial researcher can analyze it with a geoprocessing procedure, which can be defined as a framework of tasks and tools that process and automate information from a Geographical Information System (ESRI, n.d.). Previous studies utilized geoprocessing tasks in their research (Ostermann, García-Chapeton, Kraak, & Zurita-Milla, 2018; Yue, Zhang, Zhang, Zhai, & Jiang, 2015); they suggest advisable geospatial analysis tasks for points, such as pattern analysis and clustering, methods of spatial association, and pattern recognition or classification, to mention a few. This information may be a starting point to explore the geoprocessing possibilities.
Geosocial media streams are relatively new in studies of big data in the cloud, and combined with geoprocessing they have opened unexplored research paradigms. Due to their characteristics, these types of data have sparked research questions such as: what are the possible tools to process social media in a spatial context, and how can this information be made more accessible and reproducible to a wider audience of researchers? Ostermann and Granell (2017) examined replicability and reproducibility in 58 papers from 2008-2009 and mention that 58 percent of the papers declare null or limited reproducibility and replicability. The challenge then consists in defining the appropriate tasks and procedures to ensure replicability.
This analysis raises questions such as: which infrastructures are best suited to start processing social media data, which geoprocessing methods are available in the cloud, how to process large-scale data sets of stored data, and how to make this research usable for other researchers. To answer these questions, it is necessary to study cloud environments, workflow design and big data spatial analysis, and to analyze the data set on a local machine and in the cloud. Therefore, research on a workflow design and how it could be implemented in different scenarios will allow a first scouting of important or relevant data; in further research stages, this can provide proven, automated platforms that facilitate the analysis of the information.
1.2. Research Identification
The main objective of this research is to analyze and design workflow scenarios capable of processing and geoprocessing stored geosocial media data. This research identifies the lack of information about the technicalities and limitations of geoprocessing social media: what the options are to process this type of information, and how to geoprocess it on a local machine or in the cloud. In computing analysis, a workflow is used as a way to interpret, communicate and inform a path to achieve a computational analysis (Hettne et al., 2012). A reproducible workflow could provide a trustworthy diagram to explore stored geosocial streams; this workflow will include tasks like data storing, data management and data analysis, with techniques of filtering, cleaning or tagging. The importance of developing a workflow is to ensure its usability by other users or researchers. This is an opportunity to explore the possibilities of storage and processing in the cloud, and also to design and implement a workflow with geoprocessing capabilities.
It is important to explore and define the databases and tools to use in each scenario; one of the challenges is to define the parameters of the scenarios, the differences in the filtering and clustering, and the storage of the data set. It is also important to review the available information. In the literature, some studies have explored topics such as modeling geosocial media with big data, identifying risks based on crowdsourced information, or extracting valuable information from social media (Mazhar Rathore, Ahmad, Paul, Hong, & Seo, 2017b; F. O. Ostermann et al., 2015; Sun & Rampalli, 2014). Only a few papers incorporate geoprocessing into their research, and they face their own challenges, such as insufficient geotagged information to accomplish a scientific analysis (Yue et al., 2015).
Technical challenges such as the usage of a non-relational database, the creation of an efficient batch process, spatial indexing with big data, generating overviews with sampling reduction strategies, and the flexibility of computing and storage resources in the cloud will be encountered during this research. Some of these challenges are described by Yang (2016), who reported geospatial studies that addressed significant geospatial challenges related to the Big Data V's and cloud computing processing or storage.
This study has the potential to contribute a beginner's guide to exploring unexplored data sets in different scenarios, such as the cloud environment.
1.3. Research Objectives and Questions
In this section, the objectives and research questions are established. Briefly, the first objective relates to the workflow, the second to the infrastructure, the third to a prototype system, and the last to the reproducibility of the research. In total, nine research questions were established, and each question tries to answer specific challenges of the study.
To specify the principal tasks and techniques to transform regular social media into geosocial information and to incorporate them in a workflow proposal.
To analyze the relationship between stored geosocial media and infrastructure characteristics to define architecture scenarios.
To implement a prototype system that analyzes the case study information.
To evaluate the reproducibility of the workflow and the performance of the prototype.
Research Questions
1.
1) Which tasks and techniques are necessary to incorporate geosocial media, geoprocessing and a cloud environment in the same workflow?
2) How can the workflow be operationalized, integrating the required tasks and techniques?
2.
1) Which scenarios can be defined based on the stored data, geoprocessing tasks, and system infrastructure?
2) Which different types of technologies are required for each scenario, and why?
3) What are the advantages and disadvantages of the selected scenarios?
3.
1) Which types of limitations from the case study data set and the proposed scenarios will affect the prototype?
4.
1) How do the characteristics of the input data and scenarios affect the reproducibility of the workflow and the re-usability of the prototype?
2) Which techniques or benchmarks are the most feasible to evaluate the performance of the prototype?
3) How can the results from the social media spatial analysis be used for further research?
1.4. Thesis Outline
The project is divided into three main stages: the first describes the design of the workflow based on the required tasks and techniques; the second phase is the creation of the prototype and the analysis of the available data set; and the last step is dedicated to evaluating the prototype and the workflow.
The document is divided into seven chapters. The first chapter describes the motivation of the research combined with a brief description of the research and the research questions and objectives. The second chapter depicts the basic concepts and research associated with social media, the cloud environment, and big data. Chapter three reports the design of the workflow with detailed information on the suggested tasks and sub-tasks. Chapter four reports the implementation of the tasks and sub-tasks in the prototype. Chapter five describes the results of applying the prototype to a case study data set. Chapter six critically discusses the outcomes of the research. Finally, chapter seven provides a brief conclusion and answers each research question.
Figure 1: Brief description of chapter contents
2. RELATED WORK
Social media has become an interesting topic for social science researchers who are seeking valuable information provided by the users. To comprehend the importance of this research, there exist studies where social media data was used and helped to understand the stages of an outbreak: in Nigeria in 2014, social media information reflected an Ebola outbreak before the official announcement (Edd & Rn, 2015).
In 2016, a Zika outbreak in Latin America was tracked with Google searches and Twitter information. This approach may be useful to track a virus and forecast new spreading areas from social media information, and could be incorporated to enhance the usual epidemiological attention to an outbreak (McGough, Brownstein, Hawkins, & Santillana, 2017). These types of research have a direct impact on communities, but to achieve this it is necessary to study the characteristics of the data produced by social media. It is essential to explore how to transform the data efficiently, and in which cases it is essential to analyze, process and store data when big data and social media are in the picture.
Batrinca and Treleaven (2014) published a review for social science researchers interested in the techniques, platforms, and tools necessary to analyze social media. In the report they describe the initial procedure to explore social media (SM): the file formats expected, such as HTML or CSV; the main social media providers, divided into free and commercial sources; tools to process and analyze text from a language perspective; and storing data in a file or in a Database Management System (DBMS), to mention a few examples. In 2013, Croitoru developed a system prototype to explore information from geosocial media, reporting a conceptual model based on two systems: the first to ingest feeds into a system, and a second, multi-step approach to analyze the social media information. Another example is the automatic event detection using big data and a machine learning approach developed by Suma (2018), which concludes that there are improvements in the management and processing of data, but that challenges remain in event detection. Both projects are useful, and their methodologies could be applied in this research as examples of prototypes with social media characteristics for processing massive amounts of data.
Twitter is one of the most used social media microblog sources to obtain information provided by the users. Some analyses incorporate trending words to predict specific behavior of financial markets, or traffic congestion in crowded areas, or even analyze data to detect events without any prior information (Lansley & Longley, 2016). Some of these studies are based on tasks to process text; managing and processing text generated by users can be a risky task due to the number of users and the complexity of each text. Luckily, research has been done in this field. Associations and journals such as the American Association for Artificial Intelligence, the Stanford Natural Language Processing Group, Machine Learning, or the Journal of Artificial Intelligence Research have been working on speech and language processing in computing science for decades; this field is called Natural Language Processing (NLP) (Jurafsky & Martin, 2007). Some of the techniques and tasks used to process text from social media are related to cleaning text for a more specific analysis; the tools and platforms vary depending on the purpose of the research or project. Lately, some studies focus on the generation of massive amounts of text from Twitter, studying issues such as data management and query frameworks, or challenges like the necessity of an integrated solution that combines analysis on Twitter with a big data management perspective (Goonetilleke, Sellis, Zhang, & Sathe, 2014).
Nowadays, there exist several NLP tasks, such as tokenization, language detection or stemming, to analyze social media microblogs like Twitter. These powerful tools make it possible to monitor user activities that were impossible to observe until now; some of these tools are described by Preotiuc-Pietro (2012), who provides an open source framework to process text, describing some tasks implemented on Twitter. Other research has focused on tracking fake accounts (bots) on Twitter by cleaning the information and locating patterns in the fake accounts (Wetstone, Edu, & Nayyar, 2017). These solutions are not bulletproof, but so far some of them have shown good performance analyzing text from the web, and research continues by addressing the challenges generated in this field (Sun, Luo, & Chen, 2017).
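The cleaning and tokenization sub-tasks mentioned above can be sketched with a few regular expressions. This is a minimal, illustrative example with a hypothetical tweet; real pipelines would rely on dedicated libraries such as NLTK or spaCy and on more careful cleaning rules.

```python
import re

def clean_and_tokenize(tweet: str):
    """Toy sketch of common NLP sub-tasks applied to a tweet:
    lowercasing, removal of URLs, mentions and hashtags, and tokenization."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[@#]\w+", "", text)        # remove mentions and hashtags
    return re.findall(r"[a-z]+", text)         # keep alphabetic tokens only

# Hypothetical tweet text, for illustration only.
tokens = clean_and_tokenize("Tick bite near the #camping site! @user https://t.co/xyz")
# tokens == ['tick', 'bite', 'near', 'the', 'site']
```

Stop-word removal and stemming would follow the same pattern, each sub-task transforming the token list produced by the previous one.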
A percentage of social media records is geolocated. When a record includes this type of information from the source, it is called geotagged data; when the spatial information needs to be translated from a text source into coordinates, the process is called geocoding. Some studies have used geolocation and geocoding to study mobility and dynamics between bordering countries (Blanford, Huang, Savelyev, & MacEachren, 2015), to use geolocated tweets as a proxy to determine global human mobility (Hawelka et al., 2014), or to check land uses in cities by their tweet activity (Frias-Martinez, Soto, Hohwald, & Frias-Martinez, 2012).
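The geocoding idea can be illustrated with a toy gazetteer lookup: matching tokens from a text against a table of known place names. The place names and coordinates below are hypothetical simplifications, not real gazetteer data.

```python
# Toy unigram gazetteer: place-name token -> (latitude, longitude).
# Entries are illustrative only.
GAZETTEER = {
    "enschede": (52.22, 6.90),
    "amsterdam": (52.37, 4.90),
}

def geocode_tokens(tokens):
    """Return coordinates for the first token found in the gazetteer,
    or None when the text contains no known place name."""
    for token in tokens:
        coords = GAZETTEER.get(token.lower())
        if coords is not None:
            return coords
    return None

print(geocode_tokens(["tick", "bite", "in", "Enschede"]))  # (52.22, 6.9)
```

A geotagged record, by contrast, would carry its coordinates directly from the source and needs no such lookup.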
2.1. Cloud Environment
In computing, it is vital to comprehend concepts such as node, core, processor, and cluster. A node refers to one individual machine in a collection of machines that are connected and form a cluster (Roca & Cited, 2001). Each node typically has one Central Processing Unit (CPU), which in turn has one or more cores. A core executes the instructions that perform tasks. The performance of a computing cluster depends on its components, but the most efficient choice of number of nodes, number of cores per node, and size of node memory is not always straightforward. A common, practical way to select the components of a cluster is to monitor the time spent to achieve one task with a representative sample of the data and to check the usage of the nodes in the cluster. Another approach is to compare the size of the data set with the calculated capacity of the considered nodes (Amazon, 2018). The cloud has proved to provide the tools to process massive amounts of information; using the available cloud services, it is possible to manage, access, process and analyze social media (C. Yang, Yu, Hu, Jiang, & Li, 2017).
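The capacity-based sizing approach mentioned above amounts to a back-of-the-envelope calculation. The data set size, replication factor and per-node capacity below are hypothetical figures chosen only to illustrate the arithmetic.

```python
import math

def nodes_needed(dataset_gb: float, node_capacity_gb: float, replication: int = 1) -> int:
    """Estimate how many cluster nodes are needed to hold a data set,
    given a usable storage capacity per node and a replication factor."""
    return math.ceil(dataset_gb * replication / node_capacity_gb)

# Hypothetical example: a 500 GB data set, stored 3 times for redundancy,
# on nodes with 200 GB of usable capacity each.
print(nodes_needed(500, 200, replication=3))  # 8 nodes (1500 / 200 = 7.5, rounded up)
```

The time-based alternative, monitoring a representative sample job, complements this estimate by revealing whether CPU or memory, rather than storage, is the limiting factor.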
In cloud computing there exist four types of environments: public, private, hybrid and community. The public cloud environment is defined as a variety of services offered by organizations that specialize in providing infrastructure and technologies for different purposes; this type of environment is focused on external consumers (INAP, 2016). The private environment is often deployed within companies; similar to an intranet, it is usually used internally to manage internal affairs and can be classified as the most secure form of cloud computing. The hybrid approach is a mix of private, public or even community environments, where each member remains a unique entity but is bound to the others with standardized technologies. Jimenez (2018) combined public cloud services such as IaaS and SaaS with a secure private network environment to provide multimedia services for mobiles. The community environment is a combination of private and public with the aspiration of combining resources to provide grid computing and sustainable green computing.
Public cloud service providers offer several products and services, such as databases, gaming, media services, analytics or computing. A possible application is the virtualization of data centers using infrastructure as a service (Moreno-Vozmediano, Montero, & Llorente, 2012). A clear example of SaaS is pictured in learning platforms that offer an interface for teaching and learning (Gurunath & Kumar, 2015).
Garg (2013) developed an index to evaluate different services that are available in public clouds. His evaluation is based on the following attributes: accountability, agility, cost, performance, assurance, security and privacy, and usability; these attributes may be considered when selecting a cloud service provider.
One of the services that the cloud provides relates to infrastructure as a service, which includes the provision of architecture for processing; the selection of architectural resources is based on the needs of each user. The architecture selection in the cloud is associated with the processing time, the need for scalability, the management of groups or profiles, and the availability of particular frameworks or services such as Spark (Kirschnick, Alcaraz Calero, Wilcock, & Edwards, 2010). In recent years, Hadoop and Spark have been used to process and analyze massive amounts of data. Hadoop has been an option to process this type of information in the cloud. International Business Machines (IBM) defines Apache Hadoop as an "open source software framework that can be installed on a cluster of commodity machines so the machines can communicate and work together to store and process large amounts of data in a highly distributed manner." It is common to use Hadoop together with an object file system for storage and scalability of the clusters or nodes in the cloud. Apache Hadoop relies on a MapReduce model that separates the tasks into mapping and reducing. MapReduce, developed by Google in 2004, proved a reliable model to significantly reduce the execution time in a cluster. A combination of cloud services and models such as MapReduce can reduce the utilization of computing resources and handle workload spikes automatically by increasing the resources when required (Z. Li, Yang, Liu, Hu, & Jin, 2016).
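The two MapReduce phases can be sketched in plain Python with the classic word-count example. This is a single-machine illustration of the model only; a real Hadoop job would distribute the map and reduce calls across cluster nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["tick bite", "tick season"]
result = reduce_phase(chain.from_iterable(map_phase(line) for line in lines))
# result == {'tick': 2, 'bite': 1, 'season': 1}
```

In a distributed setting, the framework also shuffles the intermediate pairs so that all pairs for the same word reach the same reducer.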
2.2. Big data Framework
Three characteristics define big data: the massive amount of information, unstructured data, and the requirement to process this information in real time; one common generator of this type of information is social media users (Maynard, Bontcheva, & Rout, 2012). There exist different approaches to handle and examine big data; some important aspects to consider are the data source, the data management, the available data analysis tools, the type of analyses and, very importantly, the framework used to process big data, such as Hadoop or Spark (Rao, Mitra, Bhatt, & Goswami, 2018). Big data infrastructure is defined as the mechanisms to collect the information, the computer programs and physical storage to collect it, the framework and environment that allow processing and channeling the information, and finally the foundation where the results will be backed up and stored (Tozzi, 2018).
Spark was developed in 2010 by the AMPLab at the University of California, Berkeley; it is a flexible in-memory framework that allows both batch and real-time processing. The framework is compatible with the MapReduce model and can also be seen as a complement to the Hadoop framework. Spark uses a master/workers schema for the nodes of the cluster, allowing parallel computing. The innovation of Spark is the inclusion of a distributed collection of objects across a set of computers, named Resilient Distributed Datasets (RDDs), together with a Directed Acyclic Graph (DAG) of transformations. The main difference between Spark and Hadoop lies in the method of processing information: Spark processes the information in memory (RAM) with the use of RDDs, while Hadoop reads and writes the information on disk in the Hadoop Distributed File System (HDFS). An RDD is a collection of read-only objects partitioned across different machines in a cluster (Zaharia & Chowdhury, 2010). HDFS reorganizes files into small chunks and distributes them over different nodes (Verma, Hussain, & Jain, 2016). Both frameworks, Spark and Hadoop, use a technology called YARN for resource management and job scheduling; this technology distributes the work between the different cluster nodes.
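The lazy, chained transformations behind RDD lineage can be mimicked with plain Python generators. This is only a conceptual illustration on one machine; an actual Spark job would express the same pipeline with PySpark transformations and an action such as collect().

```python
def run_pipeline(lines):
    """Mimics an RDD lineage: map (tokenize) -> filter (keep lines
    mentioning 'tick') -> collect. The two generator expressions are
    lazy, like RDD transformations; list() plays the role of the
    collect() action that finally triggers the computation."""
    tokenized = (line.split() for line in lines)        # like rdd.map, lazy
    with_tick = (t for t in tokenized if "tick" in t)   # like rdd.filter, lazy
    return list(with_tick)                              # like collect(): runs the work

print(run_pipeline(["tick bite", "sunny day", "tick season"]))
# [['tick', 'bite'], ['tick', 'season']]
```

In Spark, this laziness is what lets the DAG scheduler fuse transformations and recompute lost partitions from lineage instead of replicating data.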
When spatial data and big data are in the picture, some recommendations are to implement data reduction strategies or to provide a computational method that minimizes the computational requirements in the cluster (Armstrong, Wang, & Zhang, 2018). Among the technologies available for huge spatial data sets is SpatialHadoop, which enriches the MapReduce framework by adding two levels of spatial indexes and contains spatial operations such as spatial joins or kNN and range queries (Eldawy & Mokbel, 2015).
The other is GeoSpark, which also provides a MapReduce model and support for geometrical and spatial objects with data partitioning and indexing, and also supports spatial querying. In practice, some researchers have evaluated the performance of platforms like GeoSpark and SpatialHadoop, testing their scalability and performance (Yu, Jinxuan, & Mohamed, 2015); some interesting findings are related to the incorporation of spatial indexes instead of traditional indexing (Eldawy & Mokbel, 2015). Others explore index efficiency for spatiotemporal frameworks by dividing the input file into nodes, mimicking a spatiotemporal index (Alarabi, Mokbel, & Musleh, 2018). Some of these studies revealed that Spark has better performance, scalability, and stability compared with Hadoop (Reynold, 2014). But this type of processing and analysis is still in a development stage; challenges remain in 1) the spatial indexing of, and models to process, real-time information, 2) quick assessments to calculate the propagation of errors, and 3) the study of efficient methods to visualize big data sets in understandable and communicable displays (S. Li et al., 2015).
3. CONCEPTUAL DESIGN
The workflow design is described in two steps: a general and an implemented workflow. The first describes social media processing and analysis in a conceptual manner, embracing general concepts that apply to a broad range of social data sets and user concerns; it is called the "General Workflow". The second step focuses on identifying and reporting the relevant techniques and technologies and on how to implement a workflow for the current case study; it is called the "Implemented Workflow". The general workflow aims for simplicity, predictability, and reproducibility in its design, providing implementation suggestions for the cloud environment depending on the user's needs, whereas the implemented workflow is more rigid in structure and more specific in its sub-tasks and technology suggestions.
A workflow gives infrastructure for initializing a process in an ordered way, promising, among other benefits, to increase productivity and to reduce, diagnose, and cut down errors in the process (Integrify, 2017). Workflow design should aim for clarity, simplicity, recordability, reportability, and reusability; with these characteristics a problem can be analyzed in a systematic manner and intensive computing analyses can be integrated (McPhillips, Bowers, Zinn, & Ludäscher, 2009). Workflows are commonly used to express pictorially an abstraction of an automated process in sequential order. A workflow provides a tool for further analysis and, with a good design, opens the possibility of reproducing the work in new research. In social media, cloud computing, and big data, several researchers have developed workflows that address the problems described above. Zhang, Bu, & Yue (2017) developed a workflow and a tool that provide an open environment for geoprocessing rasters and vectors in a user-friendly context. The WIFIRE tool implements a workflow model to analyze wildfires based on data-driven modeling and geospatial information (Altintas, 2015), and Wachowicz (2016) developed a workflow for geotagged tweets focused on data ingestion, data management, and data querying. Suma et al. (2018) focused on detecting spatiotemporal events in cities using Twitter and developed a workflow for this type of event detection.
To develop a workflow it is necessary to incorporate and define tasks and sub-tasks; in this case, storing, processing, and analyzing the geosocial media data. As part of the workflow it is essential to evaluate characteristics of the storage and processing environments: in general, features such as the supported cloud services, networking capabilities, and the performance of the workflow solution need to be reviewed and evaluated in the context of geosocial media. One crucial reflection, based on the work of Hettne (2012), concerns avoiding workflow decay, described as the missing factors that prevent a workflow from being executed or reproduced; one recommendation to avoid decay is to compare functional workflows and reuse them, extracting the main idea and restating it in the current work. The main tasks proposed in the general and implemented workflows are based on the research of Wachowicz (2016), Suma (2018), and Mazhar (2017), in which the principal tasks are collection or data pooling, preprocessing, processing or data management, and data querying or data analysis. Previous work identifies four main tasks; for this research they will be data management, pre-processing, processing, and analysis.
The reason for designing two workflows is to propose separate perspectives on geoprocessing geosocial media in a cloud environment. The general workflow takes a broadly applicable approach and is the simpler, more flexible, and more robust proposal; the implemented workflow describes in detail the technical and technological tools needed to carry out the analysis.
The design of both workflows consists of two scenarios, with a sampled and a complete data set, and several branches on a local machine or in the cloud. One scenario in each workflow is based on a sampled data set and an optional local machine architecture; the second is designed to analyze the complete data set in an optional cloud environment. The branches target specific users: a novice whose objective is to explore a data set with limited resources and knowledge, and a user who has more computational resources and is familiar with cloud computing, big data frameworks, and the conceptual analysis of social media. The branches differ in implementation difficulty, which depends on the selected architecture, the characteristics of the data set, and the research question.
3.1. General Workflow
This section gives a general overview of the workflow. Conceptually it is defined along two dimensions: the availability of resources, such as architecture and skills, and the type of data, evaluated by its volume, variety, and velocity. This project considers two workflows: a general workflow designed for a broad spectrum of applications, and an applied workflow that is more exhaustive and detailed; its tasks are described in the following sections.
The workflow (Figure 2) assumes a stored social media data set; the main process seeks to answer a research question with that data set. Briefly, a set of questions splits the workflow into three sections: one for processing and analysis on a local machine, a second that uses the cloud, and a third that uses the cloud and also integrates a big data environment. The first question concerns the current resources of each user; a second question focuses on skills and financial means, at which point the workflow divides into two branches: a cloud branch, which examines specific characteristics of the data set such as its volume, variety, and velocity requirements, and a sampled branch, which tries to answer the research question with a portion of the data set. The third question, in both scenarios, asks whether the research question has been answered. The research question can be defined as a question that needs to be answered with respect to the data set; some questions may have a specific topic, while others may be more general. The question should also relate to the fields or characteristics of the data set.
First Question: The first question provides a branching in which the user can process and analyze all the information on a local machine and continue the analysis there; this scenario is designed for relatively small data sets.
Second Question: Depending on the characteristics of each data set, the second question provides two branches: one with a sampled data set and limited resources, and another with the complete data set and extra resources, depending on budget and experience. The sampled branch uses the same methodology as the local machine branch.
If a cloud environment is selected, the branching asks three further questions about the characteristics of the data set. Since each data set and user has very specific needs, the workflow only suggests some cloud scenarios, and the next branching is based on these.
In the cloud division, three factors affect all the questions: time, resources, and user experience. Depending on the data set and the user's needs, these factors influence the branching decision, and the subsequent questions in the workflow are affected by them as well. The questions concern (a) the volume of the data set, for which the contemplated volumes are gigabytes, terabytes, and petabytes, (b) the structure of the data, and (c) a time-related aspect. The answers split the workflow between cloud processing and analysis or a big data environment. It is important to note that these characteristics are only suggestions; given the specificity of each user's data set, they may change. The workflow does not contemplate a semi-structured data set among the suggested characteristics, because of the complexity of combinations this option would add, but it is depicted in the question and suggested options.
Figure 2 shows the complete workflow; the third question splits the figure into two possible outcomes, the upper part designed for local machine analyses and the lower part for a cloud approach. The cloud branching was designed to support processing and analysis with additional computational resources, with a second approach for the case in which the data is massive and unstructured.
Third Question: This question provides an optional resolution related to the data set topic; in the case of a negative answer, the suggestion is to start collecting data with the new insights from the previous analysis. If the analysis is not successful in the sampled branch, the recommendation is to save resources in order to test the complete data set in the cloud.
3.2. Implemented Workflow
The main tasks of the implemented workflow, displayed in Figure 3, are data management, pre-processing, processing, and analysis; each task comprises one or several sub-tasks. The tasks and sub-tasks differ between the general and the implemented workflow. Both workflows share the same objective but serve different purposes: the general workflow provides a route for choosing the appropriate architecture for each user, while the implemented workflow provides details for the processing and analysis of the data.

Figure 2: General Workflow. The general questions are numbered from 1 to 3 and the cloud questions are lettered from a to c.

The architectural limitations and necessities led to the idea of two scenarios: one uses a sample, only the geotagged tweets, on a local machine, while the other includes the complete data set within a cloud environment.
The collected data set contains tweets with selected keywords that may represent a topic or an event; one objective of this work is to verify whether these words are related to the initial research question. The data set proposed for this workflow is classified as Retrospective Event Detection (RED) because it was stored and then analyzed in batch processing mode. Common approaches to exploring the topic of a data set are a word frequency analysis, an unsupervised methodology exploiting NLP techniques to analyze topics, or a supervised method to classify topics. In this workflow, word frequencies and an unsupervised methodology are the suggested additions to the workload.
Figure 3: Implemented Workflow
The following sections describe the scenarios, the geotagged sample and the complete data set, with local and cloud environment characteristics. Each part describes a scenario in detail, adding sub-tasks and technology suggestions to the workflow: first an introduction to the scenarios, followed by a description of data management, and finally the detailed description of each scenario.
3.3. Scenarios Introduction
Two scenarios are described in the following paragraphs: one on a local machine with a sample of the data set, designed to explore the data, and a second in the cloud with the complete data set, designed to process massive amounts of stored social media information.
The workflow represented in Figure 3 is divided into two branches. The first branch assumes a local machine scenario (depicted with a laptop logo in each workflow form) with a sampled geotagged data set (depicted with the letter "a"), while the second branch represents the cloud environment (depicted with a cloud logo) with the complete data set (depicted with the letter "b"). Some sub-tasks depend on the result of the previous task, while others could be reordered or even skipped; the numbering of the sub-tasks in the workflow is therefore only a suggested order.
The local machine scenario contemplates querying and filtering the data, using NLP techniques, and applying basic analysis techniques. The data set was sampled by its spatial characteristics, i.e., the geotagged records; typically only 1-3% of tweets are geotagged, and this is the information used in the sampled scenario. The complete scenario has a similar design but differs in some tasks and sub-tasks. Its initial purpose is to extract spatial information embedded in the text in order to increase the number of geotagged tweets, and then to use this improved geotagged data set to address the research question with new spatial information that may provide extra spatial detail.
The implemented workflow consists of a few questions whose main purpose is to separate the scenarios and provide options to each user. The first question focuses on the management of the data, the second on the size of the data set, the third (only in the complete data set branch) on the search for spatial information within the text, and the final question on answering the research question. The last question includes an option to provide insights: it is a loop question, designed so that a negative answer improves the data set with insights from the first pass of the analysis; this is represented with a dotted line in Figure 3. Since the scenario with only geotagged tweets is relatively easy to apply, running it first might provide valuable information without the need to apply the complete scenario.
Data Management Task
First Question: The first question splits the management task. It asks whether the current system can load the information collected from the social media source. In both branches, data management is composed of storage and retrieval; these sub-tasks differ according to the needs of each user, the current system, and the size of the data set. In the workflow, data management is divided between the local machine and the cloud, the latter further divided into structured and unstructured data management. A database management system (DBMS) is usually used to retrieve information; it can be installed on a local machine, a cluster, or the cloud, depending on the resources and the volume of the data set.
The characteristics of each system determine whether a DBMS, and which associated technology, is needed. In some cases a DBMS is not required and the information can be processed from a single file; nevertheless, the first question of the workflow opens the possibility of loading the data set on a local machine or in a cloud environment, depending on the characteristics of the data set and the requirements of the user. Most cloud providers offer integrated DBMSs as services, both structured and unstructured, and files can also be stored and retrieved from the cloud without a DBMS.
A DBMS organizes the data set into logical structures defined by a model, which makes it possible to query the database and filter information from the source, two sub-tasks of the workflow. Without a database, querying and filtering become more complex and time-consuming. The disadvantages of a DBMS are its complexity, size, and performance overhead; some DBMSs require additional resources and even an administrator, a designer, and a developer, making them expensive and complex.
Since social media is associated with the generation of massive amounts of information, a DBMS is a good option for storing and retrieving it. Selecting a DBMS involves considerations such as the structure of the data set. There are three types of data sets: structured, semi-structured, and unstructured. A structured data set is typically associated with a relational SQL structure, an unstructured data set with a non-relational NoSQL structure, and a semi-structured data set can be associated with both. NoSQL databases are classified by data model: column, document, key-value, or graph. A SQL approach follows the traditional rows-and-columns schema. For a structured data set, recognized technologies include Oracle SQL and PostgreSQL; for an unstructured data set, MongoDB and Cassandra are among the most popular.
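As a minimal sketch of the structured option, the example below uses SQLite from the Python standard library as a lightweight stand-in for a relational DBMS such as PostgreSQL; the table layout and the records are hypothetical:

```python
import sqlite3

# In-memory SQLite database as a lightweight stand-in for a relational DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tweets (
    id   INTEGER PRIMARY KEY,
    text TEXT,
    lat  REAL,
    lon  REAL
)""")
rows = [
    (1, "Got a tick bite while hiking", 52.22, 6.89),
    (2, "Beautiful day in Enschede", 52.22, 6.90),
    (3, "Tick bite warning issued", None, None),  # not geotagged
]
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", rows)

# A simple keyword query, as the workflow's querying sub-task suggests.
cur = conn.execute("SELECT id, text FROM tweets WHERE text LIKE '%tick%'")
results = cur.fetchall()
print(results)
```

The same filter against a document store such as MongoDB would be expressed as a query document rather than SQL, but the role in the workflow, reducing the records before processing, is identical.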
The cloud provides services that make it relatively easy to set up, maintain, and administer a DBMS. The advantages of a DBMS in the cloud are scalability and performance: it is dynamic, allowing simple or complex data sets, and its resources are easily scaled, unlike a physical server or cluster where regular database upgrades and administration can be expensive. Among the disadvantages of the cloud are that the administrator loses complete control of the servers, the data set is totally dependent on the provider, transferring information requires a good internet connection, and switching between providers raises several issues.
The workflow contemplates cloud storage and retrieval as a service; the complete data set scenario has two options, structured and unstructured data management, and the selection depends on characteristics of the data set such as the type of structure, the volume, and the velocity of retrieval.
3.3.1. Sampled Geotagged Scenario
Second Question: The second question of the implemented workflow splits the workflow into the sampled and complete scenarios. The sampled scenario contains three tasks, each focused on processing and providing inputs for the next: pre-processing, NLP, and analysis. The geotagged information represents a fraction of the complete data set; this scenario contemplates information collected from the streaming API, and depending on the characteristics of the search and keywords, the information may be distributed across different parts of the globe. The main goal of this scenario is to answer the research question or provide insights for further analysis.
Pre-processing Task
The pre-processing task focuses on querying, filtering, removing, and cleaning. In the workflow these sub-tasks are numbered in the upper left corner; the numbering is only a recommendation and may change depending on the characteristics of each data set.
The first sub-task consists of querying the database or file to retrieve only the records with valid latitude and longitude coordinates. The advantage of querying is that it reduces the number of records by requesting only the information that is required, which saves time in the following procedures. A query may be simple or complex, depending on the user's needs; in some cases an efficient query may help solve the research question almost at the first stages of the workflow. It is important to mention that if there are no geotagged tweets, more information will have to be collected.
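A minimal sketch of this sub-task, assuming the records are available as Python dictionaries with hypothetical lat/lon fields, keeps only the records whose coordinates are present and fall within valid WGS84 ranges:

```python
def has_valid_coordinates(record):
    """True if the record carries a usable latitude/longitude pair."""
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

records = [  # toy records; real ones would come from the DBMS or file
    {"text": "tick bite near the park", "lat": 52.22, "lon": 6.89},
    {"text": "no location on this one", "lat": None, "lon": None},
    {"text": "corrupted coordinates", "lat": 999.0, "lon": 6.89},
]
geotagged = [r for r in records if has_valid_coordinates(r)]
print(len(geotagged))  # 1
```

In a DBMS this filter would instead be pushed into the query itself (a WHERE clause), which is generally faster because fewer records ever leave the database.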
The sub-task named removing focuses on locating specific patterns from certain social media users, who are then excluded from the analysis. Some accounts may be fake or may skew the sample by adding the same tweet several times with the same sentence and words; for this reason it is necessary to locate such users or records, and the analysis may then focus on the users and their behavior.
The last suggested sub-task in the workflow is called cleaning; it looks for special characters, symbols, numbers, and URLs that are not essential for the NLP analysis and removes them from the text.
It is important to remember that the data set comes from social media; this type of information has several sources and therefore tends to be noisy. Tweets may contain words, symbols, URLs, emoticons, or numbers that carry no contextual message, and removing and cleaning such data can be challenging. Fortunately, many institutions have been working on the comprehension of human language by computers, and this work includes methods to analyze text and strip the unnecessary additions from it.
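A sketch of the cleaning sub-task with the standard re module, applied to the example tweet of Table 1; the exact patterns to strip (URLs, mentions, digits, symbols) depend on each data set and are only illustrative here:

```python
import re

def clean_tweet(text):
    """Strip URLs, @mentions, digits and stray symbols, keeping plain words."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # digits, punctuation, emoticons
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

raw = ("@miley This tweet is an examples of Natural Processing Language "
       "12345, Http://www.google.com.mx")
print(clean_tweet(raw))
# This tweet is an examples of Natural Processing Language
```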
Natural Language Processing Task
The processing task uses NLP techniques such as language detection, tokenization, stop-word removal, stemming, and lemmatization; a brief explanation of each method and its use in this workflow follows. It is essential to clean and filter the text before this task is applied.
The language detection sub-task classifies the information by the language of each sentence or document. Language detection is applied before stop-word removal and tokenization, because the stop-word sub-task looks up specific words in a per-language list; before a stop-word list can be applied, the language must be detected. Some words contribute no information to the analysis, and these are removed from each line of text in the tweets.
The tokenization sub-task separates every word in each tweet, a necessary step to simplify the text and analyze the message each tweet contains. Tokenization splits the text into separate terms and can also separate the specific symbols used on microblogs such as Twitter. Several pre-trained tools are capable of tokenizing a sentence.
Typically, the stop-word (or stop-list) technique is applied after tokenization and language detection; the removed words are widespread and do not contribute to the semantic analysis. In the workflow it is suggested after tokenization.
The stemming and lemmatizer techniques (Figure 3) have the same order number in the top left corner, denoting that their order is not significant in the workflow. Both techniques normalize the text of each tweet and are suggested after the stop-word technique. Stemming shortens or reduces words to their morphological root, while a lemmatizer consults specific dictionaries to return the lemma of each word. Table 1 displays examples of the suggested sub-tasks for the workflow.
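The ordering of these sub-tasks can be sketched in plain Python; in practice, libraries such as NLTK or spaCy supply trained tokenizers, stop-word lists, stemmers, and lemmatizers, so the tiny word lists and suffix rules below are illustrative stand-ins only:

```python
# Illustrative stand-ins: real projects would use NLTK/spaCy resources instead.
STOP_WORDS = {"this", "is", "an", "of", "the"}
LEMMAS = {"examples": "example"}  # a real lemmatizer consults full dictionaries

def tokenize(text):
    return text.lower().split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

def stem(tokens):
    # Crude suffix stripping; a Porter-style stemmer applies many such rules.
    return [t[:-2] if t.endswith("es") else t[:-1] if t.endswith("s") else t
            for t in tokens]

tokens = tokenize("This tweet is an examples of Natural Processing Language")
no_stops = remove_stop_words(tokens)
print(lemmatize(no_stops))
# ['tweet', 'example', 'natural', 'processing', 'language']
```

The outputs mirror the tokenization, stop-list, and lemmatizer rows of Table 1; the stem function here is far cruder than the real stemmer behind that table's last row.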
Analysis Task
The task contains three sub-tasks, frequencies, topic modeling, and visualization, numbered and applied in that order. The goal of the frequencies sub-task is to understand the distribution of the collected words: the most mentioned words, the most common hashtags, word combinations with n-grams, and per-sentence combinations with a co-occurrence matrix. The expected result is to visualize and comprehend the weight and behavior of each queried word in the data set.
The frequency analysis provides a first insight into the data set and is a powerful tool for finding information related to the research question. The suggested analyses of the most frequently mentioned words are hashtags, n-grams, and specific combinations of words; they help to identify the most frequent terms in the data set and their relationship with the initially queried keywords. The co-occurrence matrix, part of the frequency analysis, finds the terms that most commonly appear together with a specific word.
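A minimal sketch of these frequency analyses with the Python standard library, on a toy set of already-tokenized tweets (the terms are invented for illustration):

```python
from collections import Counter
from itertools import combinations

tweets = [  # already tokenized and cleaned
    ["tick", "bite", "forest"],
    ["tick", "bite", "dog"],
    ["forest", "walk", "dog"],
]

# Word frequencies across the data set.
word_freq = Counter(w for tweet in tweets for w in tweet)

# Bigrams (n-grams with n = 2) within each tweet.
bigrams = Counter((a, b) for tweet in tweets for a, b in zip(tweet, tweet[1:]))

# Co-occurrence: how often two terms appear in the same tweet.
cooccurrence = Counter(
    pair for tweet in tweets for pair in combinations(sorted(set(tweet)), 2)
)

print(word_freq["tick"])                  # 2
print(bigrams[("tick", "bite")])          # 2
print(cooccurrence[("bite", "tick")])     # 2
```

For hashtags the same Counter approach applies, counting only tokens that begin with "#" before cleaning strips the symbol.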
The workflow incorporates unsupervised machine learning clustering as a suggestion, to analyze the text of the data set and test whether it contains valuable information for the case study. Based on the experience of Yang (2016), this method proved adequate for clustering the information and finding a topic given the distribution of the words. Because little is known about the topic of the data set beforehand, the topic modeling sub-task categorizes all the terms into different topics. The suggested model is LDA, which is typically used in NLP for unsupervised topic modeling. It treats documents as collections of words that are categorized into semantic groups; based on the distribution of each semantic group, a word is assigned to a set of words forming a clustered topic. The purpose here is to find the diversity of topics the data contains. Nevertheless, this technique requires some understanding of probability distributions, such as the Poisson distribution, and defining the input parameters of the model is not an easy task (Blei, Jordan, & Ng, 2003).
It is essential to visualize the information generated by the frequency analyses as well as the spatial information; social media data is typically skewed toward cities where population density is high. The main purpose of the visualization sub-task is to represent the social media information in its space and time context. Nevertheless, visualizing this type of information may reflect a different behavior in
Table 1: Examples of the suggested NLP sub-tasks.

Sub-task            Example
Original tweet      @miley This tweet is an examples of Natural Processing Language 12345, Http://www.google.com.mx
Cleaning            This tweet is an examples of Natural Processing Language
Language Detection  english
Tokenization        ['this', 'tweet', 'is', 'an', 'examples', 'of', 'natural', 'processing', 'language']
Stop List           ['tweet', 'examples', 'natural', 'processing', 'language']
Lemmatizer          ['tweet', 'example', 'natural', 'processing', 'language']
Stemming            ['tweet', 'exampl', 'natur', 'process', 'languag']