Tweetviz: Presenting Automatically Analyzed Tweets for the Extraction of Business Intelligence

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

BAS SIJTSMA
10206612

MASTER INFORMATION STUDIES
HUMAN-CENTERED MULTIMEDIA
FACULTY OF SCIENCE
UNIVERSITY OF AMSTERDAM

July 10, 2015

1st Supervisor 2nd Supervisor       3rd Supervisor

Dr. Francine Chen Dr. Pernilla Qvarfordt       Dr. Frank Nack


Tweetviz: Presenting Automatically Analyzed Tweets for the Extraction of Business Intelligence

Master Thesis for MSc IS - HCM

Bas Sijtsma

University of Amsterdam, Science Park 904, 1098 XH Amsterdam

email@private.com

ABSTRACT

Social media offers substantial opportunities for businesses to extract business intelligence. This paper presents Tweetviz, a social media analytics tool to help businesses extract actionable information from a large set of noisy Twitter messages. We evaluate the system’s ability to provide a valid overview of a company’s issues and sentiment. We find no significant improvement on a number of information retrieval tasks. However, we find that the more interactions users perform with topics, the better their performance. We believe that topic sorting is a promising visualization technique worth exploring further.

1. INTRODUCTION

Today’s widespread use of social media offers new opportunities for businesses to extract business intelligence. Consumers freely share their opinions about products and services at a large scale on platforms such as Facebook and Twitter. This provides a valuable resource that businesses can leverage for a competitive advantage when mining customer data. In particular, marketers can dig into the vast amount of data to detect and discover new knowledge, such as insights into changing customer interests, understanding what competitors are doing, or detecting problems with a service, and use these insights to realize value and competitive intelligence. Traditionally, businesses spend a lot of effort obtaining customer opinions through focus groups, interviews, and so on. Hence, social data monitoring tools that assess consumers’ opinions are of major interest to businesses that recognize the benefits of social media.

However, despite the benefits of having access to customer-generated social media data, extracting practical and useful information is challenging. The high volume of data makes manual evaluation infeasible. Twitter, a social network where users can send 140-character messages called ‘tweets’, generates over 500 million messages every day1. Automated tools that assist in mining, filtering, ordering, and visualizing the data are therefore necessary. In addition, the unstructured and noisy nature of social media data makes it difficult for companies to identify actionable areas of improvement.

1 http://www.twitter.com/about

Many commercial social media tracking products, such as Tableau2, Hootsuite3, and Sproutsocial4, focus on offering customer interaction statistics. They track metrics related to the number of responses to a company’s messages, attempt to define the most successful response length, and report other types of aggregated daily activity statistics. While this information is valuable from an online marketing perspective, it does not help businesses understand their customers’ expressions at the venue level.

In this research, we explore how automatically analyzed and noisy geo-tagged tweets can be presented to simplify the task of examining a large number of tweets. To address this problem, we present Tweetviz, an interactive social media analytics system for exploring large collections of geo-tagged Twitter data. The system was developed based on requirements obtained from interviews with business owners and social media marketeers. Tweetviz addresses three major problems: 1) identifying potential issues at a company’s stores; 2) obtaining demographic information based on the social media behavior of its customers; and 3) finding information about its customers’ visits to competitors.

The remainder of the paper is organized as follows: section 2 provides a review of related work in social media analytics. Section 3 describes the research goals and contributions of this thesis. Section 4 describes Tweetviz, a tool developed to help businesses with their social media analytics. Section 5 discusses the system evaluation, describing the experiment design and procedure. Section 6 details the results of the interface evaluation. The discussion in section 7 provides insight into the implications of the outcome. Finally, the conclusion and future research follow in section 8.

2. RELATED WORK

2.1 Social Media Analytics

The recent growth of social media data has led to an expanding body of research. As a consequence, social media analytics has emerged as a new field of research [8]. Social media analytics is concerned with developing and evaluating informatics tools and frameworks to collect, monitor, analyze, summarize, and visualize social media data. The field aims to facilitate interaction between communities and to extract useful patterns and intelligence from social data [17]. Social media analysis incorporates a variety of elements from different fields, such as sentiment analysis and home location estimation, that are applied in this research.

2 http://www.tableau.com/
3 https://hootsuite.com/
4 http://sproutsocial.com/

2.2 Sentiment Analysis

One of the main technologies behind many existing analysis tools is sentiment classification. Sentiment classification is concerned with the automatic extraction of emotional sentiment in text using computational semantic analysis [15]. Generally, sentiment classification aims to compute the strength of the sentiment expressed in a word, sentence, or text and classify it with a positive, neutral, or negative label. Various methods have been developed, often relying on machine learning techniques [15].
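To make this concrete, the following is a minimal sketch of such a supervised classifier, using scikit-learn on a tiny hypothetical training set; it is illustrative only, and real systems such as those surveyed in [15] are trained on far larger labeled corpora.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny hypothetical training set; production classifiers use thousands
    # of labeled texts.
    texts = ["love the coffee here", "terrible service today",
             "just ok, nothing special", "the staff is amazing",
             "worst line ever, so slow"]
    labels = ["positive", "negative", "neutral", "positive", "negative"]

    # TF-IDF features fed into a logistic regression, in the same spirit as
    # the logistic-regression-based classifier of Chen et al. [4].
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(texts, labels)
    print(model.predict(["love the staff"]))  # e.g. ['positive'] on this toy data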

Sentiment classification has been utilized in a number of applications to enhance business intelligence. Lipizzi et al. [11] included sentiment classification in their comparison of Twitter users’ reactions to the launch of Apple and Samsung products. They find that the sentiment between the two products can be informative about the likelihood of early product adoption and subsequent market success. Bollen et al. [2] have looked at the predictive power of tweets for stock market behavior, and found that Twitter mood can predict stock trends in certain scenarios with good accuracy. Abrahams et al. [1] attempted to discover vehicle defects in discussion posts sampled from a car-enthusiast forum. These examples of leveraging sentiment analysis for business performance insights inspired the design of Tweetviz.

2.3 Estimating Home Location

Cheng et al. [5] have proposed and evaluated a framework that can estimate a Twitter user’s location at the city level based on the content of their tweets. Their framework is able to place 51% of Twitter users within 100 miles of their actual location. Mahmud et al. [12] have developed an algorithm that is able to infer the location of Twitter users at different levels, including city, state, time zone, or geographic region. It is based on statistical and heuristic classifiers, using a geographic gazetteer dictionary to identify place-name entities. By using a classifier that predicts whether a user was travelling during a certain period of time, they further improved the algorithm. However, these methods attempt to find users’ home locations at a city level. A more granular level of information, such as home locations by neighborhood, allows more specific analysis and potential targeting of customers. The method proposed in this paper addresses this problem specifically.

2.4 Social Media Analysis Tools

A branch of research particularly relevant to this work concerns social media analysis tools. Diakopoulos et al. built a timeline-based graphical interface displaying current events mined from Twitter [7]. Their research intends to support journalists in extracting news from aggregated social media data. The interface is specifically designed to allow for journalistic investigation of real-time responses to news events. Similarly, Marcus et al. [13] created TwitInfo, a platform for exploring real-time events occurring on Twitter. Both these systems are timeline based, and extract noteworthy elements based on peaks of tweet volume and word-frequency-based heuristics within a small time frame.

O’Connor et al. developed Tweetmotif [14], an application to extract topics and summarize sentiment from tweets. This system allows exploratory search of any query given by the user. The three systems above contain elements that are potentially valuable for a business analyst. However, none of these systems consider the geographical origin of messages, thereby losing a substantial level of context that could be used to gather business intelligence. The goal of this research is to leverage the geographical information to provide actionable, location-specific information.

Conversely, Chen et al. [4] used social media data mining techniques to profile businesses at specific locations. In their work, the authors matched geo-tagged tweets against venues from Foursquare. Chen et al. propose setting a distance threshold and performing density-based clustering to identify the specific business location of a tweet. Then, using a logistic-regression-based sentiment classifier, the average sentiment in tweets at each location is computed, thereby providing a sentiment profile of the business. This research has received much of its inspiration from Chen et al., and integrates their methods with interactive visualization and additional information extracted from a large volume of social media data. Additionally, the data set used in their research serves as a basis for the interface presented in this paper.

There are a number of commercial products that help businesses manage their reputation and customer interaction on social media, such as Tableau5, Hootsuite6, and Sproutsocial7. These products mostly offer interaction statistics and identification of significant contributors to the social media network. Instead of offering general statistics about customer interaction, the Tweetviz system provides location-specific actionable information that analysts can use to improve their business.

Finally, Keim [10] offers a number of information visualization techniques valuable for designing data exploration tools. Notably, he describes the Information Seeking Mantra: provide an overview of the data, enable drilling into the data through zooming and filtering techniques, and provide details on demand. This three-step process serves as a basis for the design process.

5 http://www.tableau.com/
6 https://hootsuite.com/
7 http://sproutsocial.com/

3. RESEARCH GOALS

As described, businesses face a number of problems when trying to identify negative and positive aspects of their business expressed on social media, and current commercial applications fail to fill this gap. This research aims to provide a system that supports businesses in dealing with these problems, and focuses on the following questions:

• How can we give businesses an overview of issues and sentiment voiced by customers with automatically analyzed and noisy topic categorization?

• How willing are users to reduce noise and correct erroneous estimations made by sentiment classification in order to perform social media analysis?

• How can we extract and present relevant demographic information about a business’ customers based on their localized social media behavior?

• How can we identify and present our customers’ visits to competitors based on localized social media behavior?

4. TWEETVIZ

This section describes the Tweetviz design process and implementation, beginning with the requirements obtained after a number of field interviews with the system’s target users. This is followed by a description of the data sets used for the visualizations. Finally, this section concludes with a summary of the features and methods used to develop the functionality.

4.1 Design Process

The Tweetviz interface was built to allow multi-level analysis of content, in consideration of Keim’s Information Seeking Mantra [10]. Starting with an overview of all available data, users can gradually dig deeper into the available information. Rapid prototyping was used during development to test functionality and to iterate and refine based on testers’ feedback. An additional tool was prototyped to test the performance of the home location estimation functionality, which proved invaluable for selecting the appropriate algorithm parameters.

4.2 System Requirements

To identify the system’s requirements, we interviewed two business owners and two social media marketeers involved in decision-making or social media management. They were asked to describe the problems they face dealing with social media, and to clarify what tasks they would reasonably require from a social media analytics system. Three key takeaways were identified in these interviews: 1) It is difficult to prioritize issues because of the large amount of data that is generated. Is the opinion expressed on social media merely a single occurrence from a vocal visitor, or are the voiced issues shared by a large number of customers? 2) A customer’s purchase life cycle is divided into multiple stages. Identifying problems as well as positive aspects during a customer’s store experience helps businesses identify concrete improvements to their service. 3) Companies can obtain great value from knowing who their customers are and what they like. This helps them tweak their service and target the right customers.

We identify two requirements following these takeaways. First, the system must support the analyst in discovering the most important topics discussed at their business, both positive and negative. Second, the system must help the analyst obtain a profile of their customers and their behavior. This must be supported regardless of the volume of messages available.

4.3 Datasets

Various datasets were collected in order to support the functionality provided by the Tweetviz system. To present venue-level issues, geo-tagged Twitter messages and business locations from Foursquare were obtained. Fortunately, a large dataset of tweets and venues was made available by Chen et al. This dataset was used in their work on profiling business locations [4].

The Twitter messages in this dataset were also used for the estimation of a visitor’s home location. To identify the neighborhoods that visitors live in, the coordinates of each visitor’s estimated home location had to be associated with a neighborhood. Zillow and Flickr provided the polygon shapefiles of neighborhood boundaries. Finally, for businesses to become more familiar with their customers through neighborhood information, Zillow’s property value estimations for the respective neighborhoods were retrieved.

4.3.1 Tweets and Venues

The original dataset gathered by Chen et al. [4] contained around 16 million geo-tagged tweets, collected between June 4, 2013 and April 7, 2014. This dataset was extended with additional tweets mined until March 23, 2015. In total, the dataset contains over 24 million geo-tagged tweets, mined using the Twitter streaming API8. All of the tweets were posted by users inside latitudes [37.10, 38.15] and longitudes [-122.6, -121.6], which covers a large portion of the San Francisco Bay Area. In total, 656,098 distinct Twitter users have been recorded in the dataset.

Foursquare venues are locations crowd-sourced by users who check in to the platform. These venues can be retrieved through Foursquare’s publicly accessible venue search API9. The dataset used in this research, contributed by Chen et al., contains a total of 337,991 venues in the San Francisco Bay Area.

8 https://dev.twitter.com/streaming/overview
9 https://developer.foursquare.com/overview/

4.3.2 Neighborhood Data



Figure 1: Screen capture of the Tweetviz interface, numbered with functionality labels.

Zillow10 has made a high-quality dataset available containing 809 records of neighborhood polygons in California. A shapefile is a data format containing geospatial vector data, generally intended for use in geographic information system software. Unfortunately, the dataset does not contain neighborhood boundaries for the entire San Francisco Bay Area. Hence, this dataset was supplemented with crowd-sourced boundary files from Flickr.

Zillow also provides an API to retrieve region-specific demographic information, such as an average property value of houses sold in each region. This information was collected for each neighborhood in the Tweetviz database. The data is based on the median estimated value of houses sold in the region.

In 2008, Flickr released a large database of shapefiles, generated from metadata associated with photographs. Each photograph’s metadata contained a Where On Earth ID, an ID created and maintained by Yahoo that corresponds to the place a photo was taken. Flickr found that by plotting all of the photographs in their database, they could generate approximate contours of geographic regions11. They made this data publicly available through their API.

10 http://www.zillow.com/howto/api/
11 http://code.flickr.net/2008/10/30/the-shape-of-alpha/

Later, a community project called Zetashapes12 launched, built on Flickr’s original boundary shapes. Through crowd-sourcing, the accuracy of the estimated boundaries improved, and as a fairly complete dataset it is suitable for supplementing Zillow’s missing neighborhood data.

4.4 Interface Implementation

Tweetviz is developed as a web application for modern web browsers. The front-end was built using the Angular JavaScript framework, supported by a back-end written in Node.js. All data is stored in a MySQL server. The system is designed to allow browsing the available data with direct updates, without reloading the page, whenever an interaction is performed. Figure 1 displays a screen capture of the system.

4.4.1 Venue Selection and Filtering

Users select the company they wish to analyze from the company dropdown (see Figure 1, item 1). Then, all the company’s venues and their tweet statistics are aggregated and displayed, giving an overview of the available data (item 4).

Locations can be viewed by city, neighborhood, or branch (item 3). The user can group multiple branches together, or zoom into single businesses to view their available data. There are a multitude of factors by which the analyst can group, filter, and search branches (item 2), such as the number of (positive or negative) tweets at each location, or the average sentiment.

12 http://zetashapes.com/

4.4.2 Tweet Sentiment at Business Locations

Each selected location is displayed on the map with a colored marker that indicates the location’s average sentiment on Twitter (item 8). The average sentiment is based on the sentiment scores of the tweets at the location, as given in the dataset supplied by Chen et al. [4]. The user can leverage the average sentiment of venues to compare performance and identify stores that may need improvements or that can serve as exemplars for other locations.

The bottom level of the interface (item 11 ) displays Tweets sent from the selected venues. Having identified tweets at each location, a number of possibilities open up for deeper analysis. Specifically, users can filter on venues that have a poor (or excellent) average sentiment score, and learn of topics that need to be addressed.

4.4.3 Topic Identification

To assist in the identification of positive and negative aspects about the business’ service repeatedly voiced on Twitter, Tweetviz displays tweets at a business grouped by topic (item 10).

Topics were automatically generated by identifying popular nouns in Foursquare tips. Thus, some of the identified topics may be what people talk about when at a business without being relevant to the business itself. The text in each tweet was parsed using the Stanford NLP parser13, and the nouns were matched against the topic terms. When a match occurred, the tweet was categorized under that topic. As tweets can contain multiple terms, they can be tagged with multiple topics. Any tweets that do not contain a topic term are grouped into a ‘Miscellaneous’ category.
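As an illustration of this matching step, the sketch below tags a tweet with topics by intersecting its tokens with a hypothetical topic-term list; the deployed system matched nouns extracted by the Stanford parser rather than raw tokens.

    import re

    # Hypothetical topic terms; the real list was mined from popular nouns
    # in Foursquare tips.
    TOPIC_TERMS = {"coffee", "line", "staff", "parking", "wifi"}

    def tag_topics(tweet):
        # Approximate the parser's noun extraction with lowercased word tokens.
        tokens = set(re.findall(r"[a-z']+", tweet.lower()))
        matched = TOPIC_TERMS & tokens  # a tweet can match several topics
        # Tweets without any topic term fall into the catch-all category.
        return matched if matched else {"Miscellaneous"}

    print(tag_topics("The line was long but the staff was friendly"))
    # prints a set containing both 'line' and 'staff'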

When a topic is clicked, the list of tweets (item 11) refreshes to show only those tweets belonging to that topic. Topics are split into positive, negative, and irrelevant tweets (item 9), and users view each sentiment class separately. If a topic does not contain any tweets relevant to the business, an analyst can set the topic and all its tweets to irrelevant.

The sentiment classification is not perfect: negative tweets may have been classified as positive, and vice versa. Therefore, the user can reclassify tweets into the opposite sentiment class. Individual tweets can be reclassified as irrelevant as well. This functionality allows an analyst to organize and improve the quality of their company’s data in order to wade through the high volume of tweets.

13 http://nlp.stanford.edu/software/lex-parser.shtml

4.4.4 Estimating Home Locations

An analyst can select the demographic information tab (item 6) to view the neighborhoods their customers live in and profile their customers’ demographics. To obtain this information, an experimental algorithm was designed for the estimation of a user’s home location based on their tweeting behavior.

First, a user’s geo-tagged tweets are retrieved. Density-based clustering (DBSCAN) is performed on the longitude and latitude values of their tweets. The cluster centroids hence represent the average location (in longitude and latitude) of the tweets within each cluster.

Figure 2: Visualization for home location estimation.

A prototype interface, shown in Figure 2, was made to experiment with the parameters of the clustering and to visualize users’ Twitter behavior, showing the location of each tweet and the computed cluster centroids. The majority of users had 1 to 4 areas in which the bulk of their tweets were located. Through experimentation with the parameters, we found that requiring a minimum of 10 tweets to form a cluster resulted in centroids that represented these 1 to 4 areas.

After the clusters for the tweets were computed, the distribution of tweet frequency in each cluster was considered. Two simple assumptions were made in the estimation of a user’s home location: 1) a user posts a lot of tweets from home; and 2) a user is at home more often between 5:00 PM and 9:00 AM than between 9:00 AM and 5:00 PM.

Comparing a user’s clusters, we tested whether a single cluster contained over 40% of the user’s tweets while the other clusters contained less than 15%. If so, that cluster’s centroid represents the user’s home location. These numbers were selected through experimentation with the prototype interface in Figure 2, in line with the first assumption. When this condition was not met, the number of tweets with a timestamp between 5:00 PM and 9:00 AM within each cluster was counted and aggregated. The cluster with the highest number of tweets in this time period was selected as the user’s home location, in line with the second assumption.
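A minimal sketch of this procedure, using scikit-learn’s DBSCAN: the minimum cluster size of 10 and the 40%/15% thresholds come from the text above, while the eps radius is an assumed parameter not reported in the paper.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def estimate_home(coords, hours, eps=0.01):
        """coords: (n, 2) array of (lat, lon); hours: (n,) local hour per tweet.
        eps (in degrees) is an assumption; the paper does not state its value."""
        labels = DBSCAN(eps=eps, min_samples=10).fit_predict(coords)
        clusters = [c for c in set(labels) if c != -1]  # -1 marks DBSCAN noise
        if not clusters:
            return None  # no area with at least 10 tweets: no estimate
        frac = {c: np.mean(labels == c) for c in clusters}
        dominant = [c for c in clusters if frac[c] > 0.40]
        if len(dominant) == 1 and all(
                frac[c] < 0.15 for c in clusters if c != dominant[0]):
            home = dominant[0]  # rule 1: one cluster clearly dominates
        else:
            # Rule 2: the cluster with the most tweets between 5 PM and 9 AM.
            at_home = (hours >= 17) | (hours < 9)
            home = max(clusters, key=lambda c: np.sum(at_home & (labels == c)))
        return coords[labels == home].mean(axis=0)  # centroid of home cluster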


Table 1: Verification of estimated home address

Measure                                users   % users
Home location details found online      63      63%
Home location details not online        37      37%
Estimated address match                 38      60.3%
Estimated address mismatch              25      39.7%


The algorithm estimated 131,288 home locations out of 656,098 distinct users. In order to protect users’ privacy, the granularity of the estimation was reduced: the coordinates of each estimated home location were matched to a neighborhood in the Zillow and Flickr neighborhood shapefiles.
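The coordinate-to-neighborhood step amounts to a point-in-polygon test. A sketch with shapely, substituting a toy rectangle for the Zillow/Flickr shapefile polygons:

    from shapely.geometry import Point, Polygon

    # Toy stand-in for the Zillow/Flickr neighborhood polygons (illustrative).
    neighborhoods = {
        "Mission": Polygon([(-122.43, 37.75), (-122.40, 37.75),
                            (-122.40, 37.77), (-122.43, 37.77)]),
    }

    def to_neighborhood(lon, lat):
        point = Point(lon, lat)
        for name, polygon in neighborhoods.items():
            if polygon.contains(point):
                return name  # report the neighborhood, not raw coordinates
        return None  # estimated home falls outside all known neighborhoods

    print(to_neighborhood(-122.42, 37.76))  # -> 'Mission'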

The assumptions used in this algorithm are biased, and likely do not hold for a considerable portion of the users in our database. In addition, if a user has not tweeted at least 10 times from home, their home location will be misrepresented. In an attempt to evaluate the algorithm, 100 users with an estimated home location were randomly sampled. Their Twitter accounts were observed, and any details pertaining to their home locations were registered. In many cases, users publicly reveal their city in their profile. In some cases, a user’s real name was used on Twitter, and search engines were used to attempt to find information about the respective user’s home address. The neighborhood of the user’s estimated home location was subsequently matched to the location found online. The results are described in Table 1.

This verification is not perfect: the locations that were matched are cities, not neighborhoods. Secondly, the location indicated on a user’s profile may have changed after the majority of their tweets were generated. However, the estimated home location represents a location that a user has spent considerable time at, which may still have value for a business analyst. This is represented in the interface by not exaggerating the accuracy of the home location.

The previously discussed papers [5] [6] review methods to find home locations for the average social media user. The set of social media users incorporated in this study, however, is much narrower: only Twitter users who post with geo-tags were included. This allows exploring home location estimation based on the geo-coordinates in their tweets, data that the previously mentioned research did not have access to.

4.4.5 Identifying Competitor Visits

Tweetviz defines competitor visits (item 7) as any visit to a business that is not part of the company currently being analyzed. Although clearly not all businesses compete in the same field, finding out what other places your customers go to can be very valuable information.

To determine what other businesses their customers visit, all tweets for each distinct customer are retrieved. Tweets that are not sent from a business location are discarded. Then, the numbers of visits to competitor businesses are extracted and aggregated. The system keeps track of both the unique number of customers visiting competitors and the total number of visits. The results are filtered based on the list of selected venues (item 4), displaying only information for customers of those respective branches.
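A sketch of this aggregation, assuming tweets have already been matched to venues and are available as a list of (user_id, venue_id) pairs; all names are illustrative.

    from collections import Counter, defaultdict

    def competitor_visits(venue_tweets, own_venues):
        """venue_tweets: list of (user_id, venue_id) pairs for tweets sent
        from a business location; own_venues: the analyzed company's venue ids."""
        # Customers are users seen at least once at one of our own venues.
        customers = {u for u, v in venue_tweets if v in own_venues}
        total_visits = Counter()
        unique_visitors = defaultdict(set)
        for user, venue in venue_tweets:
            if user in customers and venue not in own_venues:
                total_visits[venue] += 1          # total visits by our customers
                unique_visitors[venue].add(user)  # distinct customers per venue
        return {v: (len(unique_visitors[v]), total_visits[v]) for v in total_visits}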

5. SYSTEM EVALUATION

In a between-subject design, we monitored participants’ responses to five information extraction tasks. These tasks served to assess: 1) the system’s ability to provide a valid overview of issues and sentiment voiced by customers, and its required workload for users; 2) the system’s usability, measured by effectiveness, efficiency, and satisfaction; 3) the users’ willingness to correct erroneous classifications made by the system; and 4) the type of inferences that participants draw given the available demographic and competitor information.

There were two conditions. In the experimental condition, participants were presented with the Tweetviz system discussed in section 4.4. The baseline condition was modified to exclude the grouping of tweets by topic (item 10 in Figure 1). Instead, the full list of tweets for each business is shown, along with a search query input field for filtering tweets. The dependent variables include usability measures, task performance, and workload assessment. The two systems and their data sources are identical apart from these features. The interface in the baseline condition represents the large collection of tweets that analysts are normally faced with when collecting tweets related to their business.

5.1 Experimental Design

Two types of tasks were designed: four tasks focused on finding positive or negative topics within tweets for different businesses, and one task asked users to profile a business’ customers using the available demographic and competitor information. The experiment was designed to represent real scenarios a business analyst may be working towards, as indicated by the interview subjects. All questions are open-ended and required the participant to dive into the interface with multiple interactions. The tasks were defined as follows:

1. Safeway headquarters is interested in the complaints customers have about their stores in San Francisco. Please describe in a few words the top 2 complaints.

2. Starbucks is interested in improving branches with poor customer sentiment. They are interested in the branches in the entire San Francisco Bay Area with an average sentiment of less than 0.2. Describe in a few words the 3 most common complaints.


3. Starbucks is interested in finding out what customers are happy about. They are interested in the 5 most positive branches in the San Francisco Bay Area. Briefly describe in a few words the 3 most positive aspects of those branches.

4. Philz Coffee is interested in what their customers like. Please describe in a few words the top 3 positive aspects about Philz Coffee in Palo Alto.

5. You are doing customer profiling for Philz Coffee. They would like to know more about their customers at their San Francisco stores. Use the ‘demographic information’ and ‘competitor information’ tabs to describe a typical Philz Coffee customer.

The companies in each task were selected for their abundance of available data. The tasks were assigned to participants using a Latin square, avoiding the excessive number of participants required to test all possible permutations of the tasks while still controlling for variation in the order of execution to balance a potential learning effect.

To rate the answers, all tweets relevant to the tasks were queried from the database, and a list of complaints and positive aspects was manually compiled from the full set of tweets. Each topic was ranked according to its frequency, with the most occurring issue given a rank of 1. When multiple topics have the same frequency, they share the same rank.
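This corresponds to competition ranking. A small sketch (assuming, as one reading of the text, that the rank after a tie continues from the ordinal position):

    def competition_ranks(topic_counts):
        """topic_counts: {topic: frequency}. Topics with equal frequency share
        a rank; the next distinct frequency gets its ordinal position."""
        ordered = sorted(topic_counts.items(), key=lambda kv: -kv[1])
        ranks, prev_count, prev_rank = {}, None, 0
        for position, (topic, count) in enumerate(ordered, start=1):
            rank = prev_rank if count == prev_count else position
            ranks[topic] = rank
            prev_count, prev_rank = count, rank
        return ranks

    print(competition_ranks({"wifi": 9, "line": 9, "staff": 4}))
    # -> {'wifi': 1, 'line': 1, 'staff': 3}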

The response time was recorded for each task, complemented by a self-reported assessment of the required workload using the NASA Task Load Index (TLX) questionnaire. The NASA-TLX is a subjective six-dimensional scale, widely used in the evaluation of interface design [9]. For the purpose of this study, the simplified Raw TLX (RTLX) was used, eliminating the lengthy and tedious weighting process of subscales to reduce the time required and to make the process simpler for participants. Evidence has been found supporting this shortened version as potentially increasing experimental validity [9].

In addition, participants were asked to fill in a System Usability Scale (SUS) questionnaire after completing all five tasks. SUS measures usability along three dimensions: effectiveness, efficiency, and satisfaction. SUS was selected because it is easy to administer, has good reliability and validity even with relatively small sample sizes, and can effectively differentiate between usable and unusable systems [3]. In a systematic comparison of five commonly used website assessment questionnaires, Tullis & Stetson [16] stated that SUS yields the most reliable results across sample sizes.

5.2 Procedure

The baseline and experimental versions of the interface were alternated between participants. All of the experiments were performed in the presence of the researcher. After the participants sat down, they were told to read the study instructions, which briefed them about the procedure and duration of the study, and informed them that the researcher would not assist them during their tasks.

The experiments were performed on a Mac Thunderbolt display. The Tweetviz interface was presented in a large window on the left side of the screen. The experiment instructions were displayed in a smaller window on the right side of the screen.

Participants performed a number of training tasks to introduce the functionality of the interface. The instructions were identical for both conditions, apart from the topic classification in the experimental condition and the use of search queries to filter tweets in the baseline condition. Users were instructed not to spend longer than approximately 10 minutes on the training.

Following the training, the users were shown one task description at a time. The instructions informed them not to spend more than 10 minutes per task. To view the input field, users were required to press a ‘Start task’ button when they finished reading the description, in order to ensure that the time recorded was related to solving the task, not reading the instructions. The NASA-TLX questionnaire was shown after each task. The last task was followed up with the SUS questionnaire. Finally, participants provided feedback on potential features for a future version of the application, as well as any comments about the test procedure they wanted to share.

5.3 Participants

24 participants (5 females, 19 males) were recruited to evaluate the interface. The participants included interns and researchers at the FXPAL research lab. This industry research lab focuses on working with many new types of technology; hence, it is fair to say that the participants were at least competent with technology in general. Participants were aged 21 to 45. Of these participants, 20 (83.3%) indicated that they felt familiar with Twitter and the language used on the platform.

5.4 Data Collection and Analysis

The system logged interactions with the interface in order to track differences between conditions and to measure the willingness of users to correct erroneous classifications. All clicks on topic categories were logged in the experimental condition. All search terms were recorded in the baseline condition. For both conditions, the system logged tweet reclassification. Actions were logged for all but the first four participants, due to erroneous logging code.

6. RESULTS

All 24 participants performed all five tasks within the given time limit.


Figure 3: Distribution of topic ranks retrieved per condition.

Figure 4: Time spent on topic identification tasks.

The first four tasks, related to the identification of topics, are similar in goal and are therefore discussed separately from the fifth task.

6.1.1 Task Completion Time

The total time spent on all four tasks for the experimental interface was a minimum of 556.5s, a maximum of 1247.4s, a mean (µ) of 947.5s, and a standard deviation (σ) of 209.3, versus a minimum of 519.7s, a maximum of 1390.3s, µ = 1017.1s, and σ = 253.6 for the baseline interface. A breakdown of the completion times for each task is shown in Figure 4. Although on average the tasks were completed faster on the experimental interface, an independent-samples t-test revealed that the difference is not statistically significant (t(22) = 0.734, ns).
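For reference, the comparison is an independent-samples t-test; the sketch below uses SciPy on synthetic data drawn to match the reported group means and standard deviations (the actual per-participant times are not reproduced here).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Synthetic completion times (s), 12 participants per condition, drawn to
    # match the reported means/SDs; illustrative only, not the study data.
    experimental = rng.normal(947.5, 209.3, 12)
    baseline = rng.normal(1017.1, 253.6, 12)

    t, p = stats.ttest_ind(experimental, baseline)
    print(f"t({len(experimental) + len(baseline) - 2}) = {t:.3f}, p = {p:.3f}")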

6.1.2 Task Performance

The answers provided by the participants were scored based on the rank of the retrieved item: if a participant found the top 3 most occurring issues, they were given the scores of 1, 2, and 3, thereby obtaining a mean of 2. If no answer was given (2 instances), or an answer was given that is not represented in any of the tweets (1 instance), the participant received the score of the lowest attainable topic rank. Figure 3 displays the distribution of answered topic ranks over all 4 tasks.

Figure 5: Task scores per condition (lower is better)

The experimental interface performed slightly better on the mean task performance over all tasks with µ = 4.24 and σ = 0.86, versus the baseline µ = 4.51, σ = 0.86. However, the mean scores do not differ significantly between interface conditions (F(1) = 0.6036, ns). Figure 5 displays the mean scores obtained in each task.

Exploring how many times the top three highest ranked topics were found reveals a potentially interesting pattern. Apart from task 4, participants perform better using the experimental interface, as displayed in Table 2. However, a Chi-square test reveals that the null hypothesis cannot be rejected (χ2(3) = 2.0521, ns).

The same pattern is repeated when the scores are compared over all four tasks combined. Table 3 shows how often users were able to find the top three most occurring topics for all tasks combined. A Chi-square test showed that the differences between the two conditions were not significant (χ2(2) = 0.082, ns).

According to these results, grouping tweets by topic does not significantly increase task performance. However, there are clear indications in the final comments left by participants that this is caused by the performance of the topic classifier, rather than by the functionality of grouping by topics itself. One participant, reacting to the experimental interface’s functionality of topic grouping, mentioned in their final comment: “For most categories of tweet information the top topic was Misc topics - that was annoying because I didn’t want to read through all the tweets that were not properly categorized” (Participant 12). By not reading through the Miscellaneous category, he has missed important topics not represented in a unique category. Many participants (7 out of 12) in the baseline condition requested functionality to organize tweets: “It would be nice if there were a feature that grouped together tweets with similar semantics to make it easier to identify the top concerns or praise of customers, rather than requiring the user to examine every individual tweet when trying to find common trends in the data.” (P3).


Table 2: Number of rank 1, 2 and 3 topics answered per task.

              Task 1   Task 2   Task 3   Task 4
Baseline        13       13        9       20
Experimental    18       17       14       16

Table 3: Number of top topics answered per rank.

              Rank 1   Rank 2   Rank 3
Baseline        31       19        6
Experimental    35       24        5

One could assume that time spent on tasks would have a positive influence on task scores: more time spent reading all the tweets allows for a better overview of the topics discussed. However, an ANOVA showed no significant relationship between these two factors (F(1) = 1.1909, ns).

6.1.3 System Usability Scale and Comments

The results of the SUS questionnaire revealed that participants testing the baseline interface had a more positive attitude towards the system (overall µ = 66.5, σ = 15.6) than the participants testing the experimental interface (overall µ = 57.7, σ = 16). An ANOVA showed, however, that the difference is not statistically significant (F(1) = 1.8254, ns). Individually, only the statement “I thought there was too much inconsistency in this system” differed significantly (F(1) = 6.4101, p < 0.05; µ = 1.83 for the baseline versus µ = 2.25 for the experimental interface). The overall mean SUS score in itself indicates average performance on usability.

The results of the SUS questionnaire indicate that users did not rate the usability of the experimental interface significantly lower than the baseline interface. Given that the experimental interface had additional functionality built in that could potentially complicate the user experience, this can be considered a relatively positive outcome.

One participant in the experimental condition who rated the system particularly low (SUS score of 20) stated: “The sentiment classification and clustering provided very noisy information. Most of the time, either sentiment classification was wrong, or it was irrelevant for the recommended cluster” (P2). This is another indication that by improving the accuracy of the classification by topics, the functionality of grouping tweets by topics itself could be rated more positively.

6.1.4 Assessment of Workload

The physical workload dimension was excluded in our evaluation of the NASA-TLX scale, since including this dimension would substantially skew the mean workload in Tweetviz’s favor. This was considered appropriate on account of Tweetviz not containing any physical activities, and is a commonly performed practice when using the Raw TLX questionnaire [9].

Figure 6: Mean workload score per task (lower is better).

Responses to the NASA-TLX scale show similar evaluations for both conditions, with the baseline scoring µ = 54.3 (σ = 14.3), versus µ = 55.6 (σ = 16.2) for the experimental interface on all four tasks combined. A breakdown of task workload is shown in Figure 6. An ANOVA showed no significant difference in the mean workload for all four tasks (F(1) = 0.0453, ns).

6.1.5 System Interaction

On average, participants in the experimental condition clicked on 36.6 categories, with σ = 23.2. Looking at the number of searches in the baseline condition reveals that the search functionality was not well used, with µ = 4.1 (σ = 4.87) searches performed per participant. Participants were trained on the search functionality, so this was not for lack of explanation (confirmed by some of the comments left by users: “The excersices are a great way to get to know the system” (P22)). This could demonstrate that a simple search filter was not deemed useful by participants for wading through a large collection of tweets. This was declared by 3 out of 12 participants, with comments such as: “[...] Also, an advanced search where we could select all the variables (like, in SF, average sentiment 0.2, positive) and then just hit search would be a good thing.” (P5).

We find a strong, significant relationship (r = −0.7781, p < 0.05) in a correlation test between topic clicks and task performance. Hence, participants who clicked on more topics found, on average, more of the most important topics, and obtained a lower (better) task score. This is evident in some of the comments left by participants: “It was good at interacting with a bunch of tweets at once” (P6). The implication is that showing tweets by topic has the potential to be of great value for discovering the most important topics in a large collection of tweets, even with noisy classification.

One participant, who clicked a total of only 11 topics over all four tasks, showed no indication of effort to explore the different topics. Interestingly, this user scored better than the condition mean, with a task score of µ = 3.58. This could mean that he simply selected the top categories shown in the list of topics, and estimated what they would be about without reading the actual tweets themselves. This could indicate that the list of topic categories actually provides a reasonably good overview of the expressions posted in tweets.

The functionality to classify tweets into the irrelevant or opposite sentiment class was not commonly used (experimental µ = 3.5, σ = 8.6 versus baseline µ = 8.4, σ = 8.7), with no statistical significance found in an ANOVA test between conditions (F(1) = 2.62, ns). However, as the relatively large standard deviation indicates, some participants made heavy use of it, and in fact seemed to enjoy it: “I did find it useful to switch the irrelevant label to rellevant and see the results update based on that, I liked that and I think I got better results that way” (P12).

The sparse use of the reclassification functionality indicates that participants were apparently not willing to invest effort in organizing their data. A possible explanation for this could be found in the study context. As one participant remarked: “Categorizing and reclassifying tweets felt rewarding. In such a short session I don’t reap the full benefits of it but I can clearly see the value for long term use. It would be great if I cuold reorganize topic categories”. This indicates that the functionality may have potential in real business scenarios where long-term use of the system is appropriate.

6.2 Analyzing Demographics and Competitors

We summarize the types of conclusions that participants drew from the available data in the final task. As the demographic and competitor information is similar for both conditions, the differences between groups are not considered.

17 out of 24 participants did not attempt to perform analysis beyond retrieving the top statistics from the available charts and tables and stating simple observations. They looked for the neighborhood with the highest number of customers, its respective median home value, and the top one-to-three competitors. This gives limited insight into the information that users could potentially extract if more of the information were explored. However, this is not because of a lack of effort on the participants’ part: examining the mean time spent on task 5 (µ = 262.0, σ = 124.84) showed no significant difference with any of the other tasks’ durations. In a worst-case scenario, this may indicate that the interface was not appropriate for digging deeper into the data. Alternatively, the participants may not have found the data very interesting. Another explanation is that the manner in which the question was phrased did not encourage participants to find more valuable information. A final explanation could be that because this task was given last, and the participant had at this point spent over 40 minutes, giving the simplest answer was satisfying enough.

Some interesting observations can be made from the answers of the remaining participants. 3 participants tried finding the average distance customers were traveling to reach business locations. They experimented with selecting and deselecting branches in various parts of the region to determine which branches had the most customers who lived further away. 3 participants did a more thorough analysis of competitors, finding not only the most often visited location (which was easily observed from the table), but also the least visited competitor in the same field of business. 3 participants tried profiling a business’ customers, with descriptions such as “[..] middle/high-class income, heavy coffee-user” (P8), “Customers are trendy and like to spend time in coffee shops, the park, and shops at whole Foods” (P11), and “They have some brand loyalty (more check-ins at Philz than any other coffee companies), but they do also check in at competitors, such as [...]” (P6).

These observations indicate that there is potential for this functionality to extract valuable information when effort is spent digging deeper into the data. The large number of participants who did not attempt any further analysis may indicate that this functionality is not yet fully explored.

7. DISCUSSION

Tweetviz offers a method for exploring large volumes of social media data using topic categories combined with tweet sentiment, estimated customer demographics, and information about customers’ visits to competitors.

We offer future designers of social media analytics systems a number of takeaways. First, there is no improvement in performance on information retrieval tasks when grouping tweets by noisy categories. However, the performance of participants does improve as they invest more effort into interacting with topic categories. Therefore, participants should be encouraged to explore topics thoroughly. Finally, even though topic grouping introduces more complexity, it does not negatively impact the usability score.

We found that users were not willing to reclassify erroneous data points. However, we believe that the short-term nature of the study had an important impact on the participants’ behavior. The effort required to reclassify the data in this scenario does not outweigh the short-term benefits obtained by having better organized data. Furthermore, the possibilities for reorganization were limited: users could not relabel topic categories or change a tweet’s category. Allowing this may have increased the possibilities for organization, although the added complexity and potential reduction in usability must be carefully weighed against the benefits.

Overall, the functionality of the demographic and competitor information was not incredibly valuable. This may have to do with the tasks that were given. It is possible that more specific tasks could provide better value, such as finding competitors for a single business location. The system would have to support different interactions, such as searching and filtering on competitors, in order to support this.

A first limitation is that a single-session user study makes it difficult to understand how people may use the interface in the long term. Participants have indicated that organizing the data and reclassifying tweets would benefit them in the long run, but leads to overhead in the short-term context of the study. Secondly, the average participant in this study may not represent the target audience of business owners and analysts. Data collection with analysts over a longer period of time may yield different results with respect to performance and reclassification behavior.

8. CONCLUSION AND FUTURE RESEARCH

In this paper we have presented Tweetviz, a social media analytics tool designed to help businesses explore large volumes of geographical Twitter data in order to extract actionable information. In a user study, we evaluated the system’s ability to provide a valid overview of a company’s issues and the sentiment expressed by its customers, assessed users’ willingness to correct erroneous classifications, and summarized the inferences that users draw when looking at estimated home locations and visits to a company’s competitors. We demonstrate that by interacting with the topic categories, a user’s ability to discover the most important issues in the data improves. This shows that grouping tweets is valuable even in scenarios where the topic categories are very noisy. When introducing additional complexity in the form of topic categories, the workload and usability assessments do not suffer.

We believe that sorting tweets by categories is a promising visualization technique. However, based on comments left by participants, more attention should be focused on improving the topic classification. In its current state, the grouping of topics does not outperform a traditional search bar on task performance, usability, or required workload.

A number of limitations are identified in this research. Firstly, a single-occurrence user study makes it difficult to understand how people may use the system to manage their business’ analytics in the long term. This subsequently impacts the results regarding a person’s willingness to reorganize their data. It is possible that the short-term nature of the study leads users not to want to invest in the effort and overhead required to reorganize data. Furthermore, a business owner or social media marketeer may be more aware of potential issues at their business than the participants, and could therefore be much more capable of identifying issues in their business. An analyst may also use the system differently from the participants in this study.

In the future, we believe that helping users organize their data by reclassification of topic labels is worth exploring. Additionally, the value of demographic and competitor information is not fully explored; the use case in which this proves beneficial to businesses requires further attention. Finally, a longitudinal experiment with business owners or social media marketeers to study differences in behavior and willingness to reorganize data would prove valuable for future social media analytics systems.

9. ACKNOWLEDGEMENTS

The author would like to thank Francine Chen and Pernilla Qvarfordt for their invaluable efforts and supervision. Additionally, we would like to thank FXPAL for providing the opportunity of a lifetime to perform research at their state-of-the-art research lab in Palo Alto. Finally, we would like to thank Frank Nack for his guidance and support.

References

1. Abrahams, A. S., Jiao, J., Wang, G. A., and Fan, W. Vehicle defect discovery from social media. Decision Support Systems, 54(1), 2012: 87–97.

2. Bollen, J., Mao, H., and Zeng, X. Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 2011: 1–8.

3. Brooke, J. SUS: a quick and dirty usability scale. Usability Evaluation in Industry, 189(194), 1996: 4–7.

4. Chen, F., Joshi, D., Miura, Y., and Ohkuma, T. Social media-based profiling of business locations. Proceedings of the 3rd ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia. 2014, 1–6.

5. Cheng, Z., Caverlee, J., and Lee, K. A content-based approach to geo-locating Twitter users. Proceedings of the 19th ACM International Conference on Information and Knowledge Management. 2010, 759–768.

6. Cheng, Z., Caverlee, J., Lee, K., and Sui, D. Z. Exploring millions of footprints in location sharing services. ICWSM, 2011: 81–88.

7. Diakopoulos, N., Naaman, M., and Kivran-Swaine, F. Diamonds in the rough: social media visual analytics for journalistic inquiry. Visual Analytics Science and Technology (VAST), 2010 IEEE Symposium on. 2010, 115–122.

8. Fan, W., and Gordon, M. D. The power of social media analytics. Communications of the ACM, 57(6), 2014: 74–81.

9. Hart, S. G. NASA-Task Load Index (NASA-TLX); 20 years later. Proceedings of the Human Factors and Ergonomics Society Annual Meeting. 2006, 904–908.

10. Keim, D. A. Information visualization and visual data mining. Visualization and Computer Graphics, IEEE Transactions on, 8(1), 2002: 1–8.

11. Lipizzi, C., Iandoli, L., and Marquez, J. E. R. Extracting and evaluating conversational patterns in social media. International Journal of Information Management, 35(4), 2015: 490–503.

12. Mahmud, J., Nichols, J., and Drews, C. Home location identification of Twitter users. ACM Transactions on Intelligent Systems and Technology (TIST), 5(3), 2014: 47.

13. Marcus, A., Bernstein, M. S., Badar, O., Karger, D. R., Madden, S., and Miller, R. C. TwitInfo: aggregating and visualizing microblogs for event exploration. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2011, 227–236.

14. O’Connor, B., Krieger, M., and Ahn, D. TweetMotif: exploratory search and topic summarization for Twitter. ICWSM. 2010.

15. Pang, B., and Lee, L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2), 2008: 1–135.

16. Tullis, T. S., and Stetson, J. N. A comparison of questionnaires for assessing website usability. Usability Professionals Association Conference. 2004, 1–12.

17. Zeng, D., Chen, H., Lusch, R., and Li, S.-H. Social media analytics and intelligence. Intelligent Systems, IEEE, 25(6), 2010: 13–16.
