
The social domain on Twitter: exploring social media as a data source for policy research.

MA thesis: New Media and Digital Cultures
Name: David de Bie
Student number: 10121021
E-mail: david.debie@student.uva.nl
Word count: 17,295
Supervisor: Erik Borra


Abstract

Since social data are increasingly produced by digital media, and often embedded in (private) technological infrastructures, empirical sociology is claimed to be facing a crisis. This calls for new research methods to complement, or be implemented in, the traditional ones. This study therefore maps the possibilities for using Twitter as a data source for empirical social research. First, qualitative and quantitative methods for analyzing online (social) media, and Twitter more specifically, are discussed. Second, a case study was conducted in collaboration with the Netherlands Institute for Social Research (SCP). A relevant social theme was translated into search queries to retrieve and analyze Twitter data through a variety of tools. A comparison of the outcomes of the methods traditionally used at the SCP with those of the methods used to consult Twitter data outlines the role that the latter can play in empirical social research.

Keywords: Social media, Twitter, Big Data, Social research, Empirical sociology, SCP, Policy research.


Table of contents

Introduction
Theoretical framework
The crisis in empirical sociology
Big Data
Social Media
Privacy
Twitter
Online content analysis
Sentiment analysis
Research approaches at SCP
Survey research
In-depth interviews
Focus groups
SCP and social media
Why should the SCP want to use social media?
Methodology
Scope
Research methods
Tools
Query design
Quantitative approaches
Qualitative approaches
Results
Datasets
Quantitative analysis
Qualitative analysis
Conclusion
Discussion
Bibliography
Appendix


Introduction

With the emergence and development of digital media, the amount of digital data has increased exponentially. Since these data are digital, and, from the 1990s onward, increasingly online, they can quite easily be collected by digital devices and software. Among these digital data, an increasing share consists of 'social' data, partly due to the ever-growing popularity of social media. While empirical sociology long held a monopoly on collecting and analyzing social data, the fact that social data are now to a large extent produced by digital (social) media has shifted the landscape: much social data is now owned by private companies and can be tracked by numerous third parties. This 'crisis' in empirical sociology calls for new perspectives and methods that can complement, or be implemented in, current (traditional) research methods.

The traditional methods currently used in empirical sociology were established in the early 1960s, with survey research as their most important asset. In survey research, a questionnaire is presented to a group of respondents (via e-mail, telephone, a web page, or face to face). These respondents form a sample that is representative of the researcher's entire target group, so that conclusions about the target group can be drawn from the answers of the smaller group of respondents. Survey research has traditionally been subject to complications that may harm the validity and completeness of its results, such as falling response rates (fewer respondents return a completed questionnaire) and sampling frame issues (the (technical) means of assembling and reaching a representative sample start to fail). While these complications are not contemporary phenomena per se, they reinforce the call to reconsider the methods currently used in empirical sociology, or at least to look for additional approaches that can complement them.

To back up this research, a case study at the Netherlands Institute for Social Research (SCP) was conducted. The SCP is a governmental organization that conducts socio-economic research and functions as an advisory body for the Dutch government. It is officially part of the Ministry of Health, Welfare and Sport, but operates interdepartmentally and independently. Its most important tasks are describing and predicting the social and cultural situation in the Netherlands, assisting in setting well-considered policy objectives, and evaluating the execution of government policy in the area of social and cultural welfare (About SCP).

The SCP relies heavily on empirical sociology in conducting social research, and uses surveys and other traditional methods such as in-depth interviews to collect social data. But now that social data are ubiquitous on the web and already used in all kinds of commercial and public fields – such as marketing – there is growing pressure to investigate if and how the SCP can use these relatively easy-to-collect data in its social research. From the start, it is clear that data gathered from social media will, for now, at best play a supportive role alongside traditional means of data collection, since social media data cannot yet meet the standards of quality and completeness that the SCP upholds. The goal of this study is therefore to investigate if and how social media data can play such a supportive role, and what kinds of useful information they can provide to add value or new perspectives to data gathered via traditional means. To limit the scope of this study, the focus will be on one specific social platform: Twitter. Twitter as a research medium offers some major advantages over other social media, which will be explained further in the theoretical part of this paper.

Twitter, and digital media in general, generate large amounts of quantitative data. In fact, they mutually produce and capture their own data, with web pages that contain numerous counters and other trackers that continuously aggregate data about users' behavior (for example Google Analytics and Facebook's Like button). Social networks keep track of all user statistics to analyze their own business, but also to provide insights into social behavior. These social data are often sold for commercial purposes, but can also be used for research. Practically all social networks have Application Programming Interfaces (APIs) via which their data are accessible for various purposes. Different (online) tools use these APIs to capture and visualize these data, and in that way provide insights into social human behavior. These tools primarily allow quantitative analysis, such as the number of users over time, often-used (key)words, or the most popular users. Social networks also produce various demographic data, such as age, gender, nationality, and education; data that are generally recorded on users' profile pages. These data can also be captured to provide insights into social behavior, such as in which country people talk most about a particular subject.
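As a minimal illustration of this kind of quantitative analysis, the sketch below counts hashtags and active users in a small set of tweets. The tweet records are invented stand-ins for the fields a capture tool would return; it is not the implementation of any particular tool.

```python
from collections import Counter

# Invented stand-ins for captured tweet records; real tools pull these
# fields (author, text) from a social network's API.
tweets = [
    {"user": "alice", "text": "Youth care is moving to municipalities #socialdomain"},
    {"user": "bob", "text": "Debate tonight on #youthcare #socialdomain"},
    {"user": "alice", "text": "New report on labor participation #participation"},
]

hashtag_counts = Counter()
user_counts = Counter()
for tweet in tweets:
    user_counts[tweet["user"]] += 1
    # Treat every token starting with '#' as a hashtag.
    hashtag_counts.update(
        word.lower().strip(".,!?")
        for word in tweet["text"].split()
        if word.startswith("#")
    )

print("Most used hashtags:", hashtag_counts.most_common(3))
print("Most active users:", user_counts.most_common(3))
```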

The SCP mainly conducts quantitative research, but is increasingly exploring the added value of a more qualitative approach. Therefore, this study will also investigate the possibilities for conducting qualitative research with Twitter data. It is difficult to draw a sharp line between qualitative and quantitative information. One could say that surveys are quantitative and in-depth interviews, for example, are qualitative. However, survey questions can be very qualitative or subjective, such as: "On a scale from 1 to 10, how happy are you?". On the one hand, the outcome is a number eventually used to calculate averages (thus quantitative), but the information it asks of the respondent is based on feeling or opinion, making it qualitative and subjective. Still, this study will treat quantitative and qualitative information somewhat separately. In the context of Twitter data, quantitative information will concern countable (meta-)data, such as the number of tweets about a particular subject, the most used hashtags, the most active users, etcetera. Qualitative information will concern the textual content of tweets.

Although extracting qualitative information from the massive amount of Twitter data is a challenging task, the potential value of this information cannot be ignored. Therefore, different methods for online content analysis will be discussed and applied to Twitter data.

This paper is divided into two main sections. First, the theoretical framework will be outlined. The current situation in empirical sociology will be discussed, after which the role of Big Data and social network analysis in relation to social research will be described. After that, two methods for qualitative content analysis of online media will be explained and their pros and cons discussed. The second part consists of the case study for the Netherlands Institute for Social Research. After a brief introduction of the institute, its current methods for social research, and its standards for data quality, this paper will attempt to apply the quantitative and qualitative social media analysis methods discussed earlier to current themes at the SCP. This will be done using two online analysis tools, OBI4wan and TCAT, which will be introduced briefly. After performing the analysis and presenting the results, it will be discussed whether the methods used and the results obtained are of any value to the SCP, and how the results can fit into its current structures and methods. The general research question is as follows:

RQ: To what extent can Twitter provide useful data in conducting quantitative and qualitative empirical social research?

Theoretical framework

The crisis in empirical sociology

As early as 2007, Savage and Burrows warned of a coming crisis in empirical sociology. According to them, sociology and its methods acquired a leading position in access to social knowledge in the second half of the 20th century. They point to the sample survey and the in-depth interview as the empirical methods that put sociology in this leading position. However, from the 1990s onward, social data are increasingly produced by, and embedded in, (private) technological infrastructures, and used for commercial purposes instead of social research – a phenomenon they call 'knowing capitalism' (886). Now that social data are gathered by numerous other parties and collected in very large amounts, they argue that the role of the empirical sociologist has become uncertain, since the methods applied within empirical sociology are not aimed at capturing and analyzing these new forms of social data.

Similarly, but somewhat less radically, Venturini and Latour state that digital media mark the beginning of a necessary shift in the social sciences. They compare the emergence of digital media with the introduction of the printing press in the 15th century, and how this triggered the Scientific Revolution of the 16th century. But, above all, they emphasize that this revolution did not happen overnight, and that it took some time for the printing press to mature before it started to have a real impact on the natural sciences. They state that the social sciences are now in a similar position with the emergence of digital media: "[Social sciences] are still trying to pass them (digital data, ed.) off as new terrains for old methods. […] And yet, the very speed at which digital technologies infiltrate modernity makes such resistance more and more untenable" (2). In order for the social sciences to really benefit from these new forms of (social) data, they state that digital media should not be seen as just another field in which to apply existing methods, but as a possibility to restructure the social sciences.

In addition to the 'threat' (or, more positively, the possibilities) of digital media for traditional empirical sociology, there are the more classical issues with traditional methods such as survey research. According to Hill et al., survey research is imperiled by "several coincidental trends" (11) that result in declining data quality. One of these trends is the falling response rate: the number of completed surveys or interviews, divided by the total eligible sample population, is declining. This is a direct threat to the quality of survey data, because not having all members of a sample participate can result in biased estimates and thus flawed conclusions. A second major potential source of error in survey research is what they call "frame coverage errors" (12): the means for assembling and reaching a representative sample are starting to fail, mostly because of changing infrastructures and technologies. They take the shift from the landline telephone to the cell phone, and the resulting increase in cell-phone-only households, as an example. The traditional method of using random-digit-dialing on landline telephones to reach potential sample members now runs the risk of missing entire segments of the population of interest (i.e. the subgroups that have typically abandoned their landline telephones). There are different ways to try to limit these errors, but they remain inherent to survey research. For example, in the Netherlands, representative samples can be drawn from the population register. However, this is only possible in joint projects with Statistics Netherlands. The SCP is involved in a number of these projects, but even then a number of groups – e.g. the elderly in institutions, recent migrants, and young people not living at their registered address – remain out of scope.

Marres and Weltevrede elaborate on the new methods that are aimed at capturing the social data that are ubiquitous on the web. One of these methods is scraping, which, according to them, is widely seen as offering new opportunities for digital social research (313). The method of scraping is performed by ‘scrapers’, which are “bits of software code that makes it possible to automatically download data from the Web, and to capture some of the large quantities of data about social life that are available on online platforms like Google, Twitter and Wikipedia” (313). According to Marres and Weltevrede, scraping as a technique is able to show how the often problematized ‘dirty data’ on the web, i.e. data that comes with a lot of unsolicited meta-data, can be turned into a “virtue” (324), since these meta-data are actually “ordering devices” (326) such as links, date/time stamps, and hashtags.

Indeed, it cannot be ignored that the ways of gathering social data have shifted with the rise of digital media, now that a large part of the world's population is connected to the internet. While questioning the role of the empirical sociologist may be somewhat overdone – and, as Venturini and Latour stated, a revolution does not happen overnight – it is important to look at the methods used in social research institutions and at how they can be improved or complemented by more modern methods and the availability of new types of social data. This also means examining the quality and completeness of the data gathered via these more modern methods, and comparing them to the data collected via traditional means such as the surveys and in-depth interviews on which current empirical sociology is built.

Big Data

The increasing amount of social data fits into a trend that has developed with the rise of digital media: Big Data. Although the term 'Big Data' is older and broader than data produced by digital media, it has become a somewhat distinct field ever since it was associated with such data. As humans make more and more use of different digital media – first on personal computers, and later also on mobile devices – they leave behind a trail of data containing useful information for all kinds of purposes. The increasing popularity of social media, which have also extended from the personal computer to mobile devices, has further augmented this "digital footprint" (Madden et al.).


The idea of Big Data has evoked a new belief in the power of quantitative research. According to Anderson (2008), Big Data could even end the need for theory, hypotheses and models, because the massive amount of data and ever-increasing computing power 'speak for themselves'. Massive amounts of data would be available at virtually zero financial cost and could be used in all kinds of research fields, such as economics, computer science, the health sector and sociology. In an attempt to temper these high expectations and dreams somewhat, boyd and Crawford set up six statements that shine a critical light on (the promises of) Big Data. One of their most important critiques of Big Data concerns the quality and reliability of the data, as they state: "Bigger data are not always better data" (668) and "Taken out of context, Big Data loses its meaning" (670). The questionable quality of Big Data is also addressed by Rogers, as he states: "Questions now arise about the robustness of so-called user-generated data such as social bookmarks, tags, comments, likes, and shares" (Digital Methods 204). Elaborating on Mike Thelwall and David Lazer, he states that the issues surrounding web data are messiness, wholeness, offline grounding of data, and anonymization.

The quality of web data is not the only obstacle that has to be overcome in order to turn Big Data into usable and valuable data. While the trail of data created by digital devices and social media has provided new opportunities for collecting information, the methods for doing so have proven difficult. Since all the data are digital, they can be captured relatively easily, in large amounts, in little time. These large amounts are, however, part of the difficulty. The overwhelming quantity of data produced by digital devices contains information of all sorts, of which often only a fraction is needed. In other words, these data contain a lot of noise. Dealing with this noise is one of the most challenging tasks in gathering useful information from Big Data. Besides being easy to capture, another advantage of the data being digital is their 'searchability': by using search queries, particular information can be extracted while other data are omitted. This is already a big step in dealing with the noise. The more specific the search query, the more specific and selective the remaining data will be.

To Lewis, Zamith and Hermida, however, this ‘computational’ operation is only part of selecting and analyzing the content of Big Data. They suggest a hybrid approach toward new media content analysis and Big Data:

“We argue that in many cases, scholars may see more fruitful results from a blended approach to content-analyzing Big Data—one that combines computational and manual methods throughout the process. Such an approach can retain the strengths of traditional content analysis while maximizing the accuracy, efficiency, and large-scale capacity of algorithms for examining Big Data” (36).

This hybridity of online content analysis has earlier been emphasized by Bauer: “In the quantity/quality divide in social research, content analysis is a hybrid technique that can mediate in this unproductive dispute over virtues and methods” (132). Content analysis of Big Data should thus be a combination of computational methods, like search queries, to make a rough distinction between usable and unusable information; and manual methods to make a final selection and start making sense of the data.


Social Media

The increasing amount of social data can to a large extent be attributed to the growing popularity of social networks (or social media), alongside other developments on the web such as search engines, trackers, and cookies. The use of social media has risen rapidly since they first appeared. In January 2014, 74% of all adults worldwide with an internet connection used some kind of social networking site, compared to 8% in 2005 (Pew Research Center, n.d.).

The term ‘social media’ has had numerous different descriptions over the years, and has thus proven to be a term that is hard to define. Hill et al. state that there is no single agreed-upon definition of social media, but, for the purposes of surveys, they define a specific working definition for social media as “the collection of websites and web-based systems that allow for mass interaction, conversation, and sharing among members of a network” (3). Since this study is also aimed at using social media data in the context of surveys, Hill et al.’s definition is referred to when the term ‘social media’ is used.

Social media as a data source provide a number of key advantages that make them interesting and valuable for social research. Web data in general, but social media data in particular, contain many pieces of personal information. While the amount largely depends on the user's and the social network's privacy settings, there is always a minimum amount of public information available. For example, Facebook will always provide the full name and miniature profile picture of a user, regardless of their privacy settings. Moreover, the default settings of social networks are often configured to disclose more information than this minimum, so it requires additional steps from users to limit the amount of public personal information. The personal information on users' profiles may include name, age, gender, geographic location, hobbies, interests, attitudes, and opinions. This results in a potential goldmine of easily accessible data that could be harvested to provide insights in the manner that survey data have traditionally been employed (Hill et al. 18). Another key advantage of using social media as a data source is directness. Posts on social media are online within seconds after the user writes them, and can be captured nearly immediately after appearing online. This means the data are always up to date. Using the right tools, the latest social media data can be captured at any moment, while data acquired via traditional surveys may be several weeks or even months old. This is not only beneficial for the timeliness of the data; it also means that research can be done more efficiently, since there are no periods in which researchers have to wait for their data.

A third advantage of using social media as a data source, is the elimination of the so-called intrusion effect. This means the researcher does not need to intrude into the research field to acquire the data, which often changes respondents’ behavior, for example because they want to look good in the eyes of the researcher or because they get tired of filling in questionnaires. Not having to intrude into the research context (also called ‘unobtrusive measures’) allows the researcher to observe behavior in its natural flow (Research Methods Knowledge Base).

Privacy

A big concern with using social media data is always the question of privacy. Are we allowed to simply use people's information without even informing them? The legal answer to this question is relatively simple, since privacy is determined on the side of the user, who has agreed to the social network's terms and conditions. This means the data that are collected can only be collected because they are not private. Twitter users can decide not to put their tweets out in the open and, for example, only send direct messages. But that explicitly means that these data will not be collected, since the social network does not allow it.

Besides the legal question, there is the question of the moral and ethical correctness of using and publishing social media data. Even if a user on a particular social network has set their privacy settings to 'public', would it be acceptable to broadcast the user's profile on national television? And in what context would a user's public post be placed? Perhaps one that completely opposes the author's intentions. As boyd and Crawford state in their aforementioned critique of Big Data: "Just because it is accessible does not make it ethical" (671).

One of these ethical issues concerns looking at users individually, and using data about individuals to support the research. One might, for example, use a particular post or biography as an example to fortify certain claims or arguments. While there are no specific rules regarding this concern, it is generally seen as unacceptable to use recognizable individuals as examples, as described in this attempt to set up an ethics code for social data: "Members of Big Boulder Initiative believe that, in addition to honoring explicit privacy settings, organizations should honor implicit privacy preferences where possible. This may mean broadcasting a post without attribution, or with a blurring of the name. Specifically, the best practice is to preserve content within its original context so as not to surprise the user, and first and foremost respecting the end user's voice" (Gowans 2014).

Researchers in social media have always tried to anonymize their data as much as possible, as a moral duty, but this has proven difficult, especially when researchers want to share their dataset for further research. A striking example of this process is addressed by Zimmer (2010). He describes an earlier study in which a dataset containing around 1700 Facebook profiles was gathered from students of "an anonymous, northeastern American university" (315). The dataset was then publicly released for other researchers to work with, accompanied by a "Terms and Conditions of Use" statement that obligated other researchers to use the dataset for research purposes only, along with a few other, more specific rules and regulations. The researchers also set up a codebook, in which they described how they attempted to ensure the privacy of the subjects in the dataset, for example by storing the students' names and identification numbers on a secured local server only, promising to delete these data immediately after they were processed. According to Zimmer, however, they were being overly optimistic, because the source of the dataset (Harvard University) was identified only a few days after its publication. The researchers had also made public data that was normally only visible to Harvard students, on the grounds that, in their opinion, the data was 'already public' (318).

Twitter

Since this study will focus on Twitter as a data source, it is important to understand the workings and basic characteristics of this social platform. Although social media have been introduced above, Twitter distinguishes itself from other social media in a number of ways that form its key characteristics. These characteristics are also fundamental to why Twitter lends itself to research better than other social networks, and why this study focuses on Twitter specifically.

What is Twitter and how is it used?

Twitter is a microblogging platform that was launched at the end of 2006. Since then, it has grown significantly, with 302 million monthly active users after the first quarter of 2015 (Twitter Investor Relations n.d.). It allows users to send short messages (limited to 140 characters), called 'tweets'. Users can follow other users, which means the messages of those they follow will appear in their 'feed'. Within this feed, all tweets are ordered reverse-chronologically, i.e. the latest tweets appear at the top of the feed. Tweets are usually – and by default – public messages that appear in the timelines of all followers and are visible to any user, although private messages – also called direct messages – can also be sent. The platform can be used directly from the Twitter website, but it is mostly used via Twitter's smartphone application or other, third-party applications that make use of Twitter's public and free API. According to Twitter, about 80% of its monthly active users access the platform from a mobile device (About Twitter, Inc).

What are retweets, mentions, hashtags and how do they structure conversations?

Communication via Twitter relies on a number of user-invented conventions, which have later been picked up and codified by Twitter itself. While a standard tweet contains no conventions except the technical limitation of 140 characters, tweets that are intended to engage in some sort of conversation follow a set of conventions that enable users to address other users or tweets. According to boyd et al., these conventions were mostly initiated by users themselves, before they were officially adopted by Twitter. Users can address other users by using the ‘@user’ syntax (boyd et al. 2), also called mentioning. Using this syntax will create a link to the addressed user’s profile page, as well as enable the user’s tweet to appear in the ‘mentions’ tab of the addressed user (Twitter Help Center). Mentioning or replying to other users shows the intention of starting or participating in a directed conversation.

As opposed to mentioning and replying, hashtags are generally used to start an open conversation. Twitter's hashtag functionality was introduced in July 2009 (Wikipedia 2008), but the idea of using hashtags as a method for labeling groups and topics is older, stemming from particular practices in computer programming (boyd et al. 2, Wikipedia 2008) and usage in IRC channels and 'folksonomies' (Bruns and Burgess 2). On Twitter, the hashtag (#), in combination with a keyword, is used to label or categorize a tweet and thus to assign it to a particular topic. It also means that tweets can be directed to take part in a particular discussion or conversation. For example, the hashtag '#globalwarming' places a tweet within the broad conversation around global warming. Hashtags are also used for smaller topics, such as events: '#elclasico' would, for example, put all tweets about 'El Clásico', the football match between FC Barcelona and Real Madrid, in a single collection. Twitter automatically hyperlinks each hashtag to the complete set of tweets with that hashtag, which also makes the hashtag an entry point to the conversation it is part of. To highlight the most popular topics in different countries, Twitter uses an algorithm called 'trending topics'. This algorithm decides – via numerous factors of which only a few are publicly known – which hashtag is the most popular in each country (Gillespie).
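Because these conventions are plain-text markers inside the tweet itself, they can be recovered with simple pattern matching. The sketch below illustrates this with deliberately simplified regular expressions for the '@user', '#keyword', and 'RT @' syntaxes; Twitter's own parsing handles many more edge cases.

```python
import re

# Deliberately simplified patterns for the conventions described above.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
RETWEET = re.compile(r"^RT @(\w+)")  # the informal 'RT @author' prefix

tweet = "RT @alice: municipalities take over youth care #socialdomain #youthcare"

print("mentions:", MENTION.findall(tweet))   # ['alice']
print("hashtags:", HASHTAG.findall(tweet))   # ['socialdomain', 'youthcare']
match = RETWEET.match(tweet)
print("retweet of:", match.group(1) if match else None)  # alice
```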


Retweeting is the practice of copying another user's tweet and posting it again, which means that the tweet is then spread across the retweeting user's network. To mark a tweet as a retweet, users generally put 'RT @', followed by the original author, in front of the tweet, although the conventions for retweets are varied and inconsistent (boyd et al. 3). Retweeting has become more common over the years; as of early 2014, around 25% of all tweets were retweets (McGregor 2014). Following up on the increasing number of retweets, Twitter has codified this behavior and implemented mention and retweet buttons in its interface.

Why is Twitter interesting as a (social) research object?

Since its launch in 2006, Twitter has increasingly become an interesting object and useful tool for research purposes. According to Rogers, Twitter as a medium, as well as a source for research, has developed through three distinguishable phases: Twitter as banal, phatic, and shallow; Twitter as a news medium for event-following; and Twitter as (archived) data set. Rogers demarcates a seemingly small change that marks a great shift in Twitter as a research object: in 2009, Twitter's slogan changed from "What are you doing?" to "What's happening?" (4). This symbolizes and emphasizes Twitter's shift from a banal and shallow medium towards a news/event-following medium. Because of this shift, the content on Twitter, and thus the data that can be extracted from it, has also changed. Where in the beginning Twitter contained a lot of useless, 'banal' content, such as what Twitter users were having for lunch, Twitter as a news/event-following medium provides more interesting and relevant content, as well as more valuable meta-data.

One of the main reasons for the popularity of Twitter as a research object is the openness of its data. While on other social platforms, such as Facebook, much of the data is hidden due to privacy settings or commercial intents, Twitter data are out in the open. This is primarily because of Twitter's ontology: the medium is intended to spread the user's messages to as many people as possible, and private messaging is generally seen as at odds with that purpose. Additionally, Twitter makes its own data accessible via different Application Programming Interfaces (APIs) (see API Overview). This means that via automated processes – in other words, a computer program – Twitter data can be requested from Twitter's servers. The APIs range from the free-to-use Search API, which provides access to the last few weeks of 'relevant' data for a particular search term, to the likewise free Streaming API, which works as a real-time monitor that gives access to a 1% sample of all data in real time, to the paid Firehose APIs, where one can get a 10% or 100% sample of all data in real time. The choice of API depends on research purpose and budget. The Search API will, according to Twitter's developer documentation, not provide a complete dataset, but return the tweets most relevant to the researcher's query over a set period. This is particularly useful when investigating how tweets have helped organize events in the past. The Streaming API, on the other hand, will not provide Twitter data back in time, but it will provide a 1% sample of all real-time data; it is aimed more at real-time monitoring of tweets and can prove useful for following events, for example elections or football matches. It may also be useful to use both: the Search API to create an initial dataset going back about two weeks, and the Streaming API to collect data from that moment on to further fill the database (140Dev). The 'full' Firehose API, being 'the best of both worlds', provides the option to search back in time like the Search API, as well as the less limited or even unlimited real-time stream of the Streaming API, and is used when both the Search and Streaming API (or a combination of the two) fall short in providing the right (amount of) data.
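As an impression of how such API access works in practice, the sketch below queries the version 1.1 Search API that was current at the time of writing, using Python's requests library with application-only authentication. The bearer token and the query are placeholders; this is a minimal sketch, not the retrieval setup of the tools used later in this study.

```python
import requests

# Placeholder token; obtained via Twitter's application-only OAuth flow.
BEARER_TOKEN = "..."

# Query the v1.1 Search API for recent tweets matching a search term.
response = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    headers={"Authorization": "Bearer " + BEARER_TOKEN},
    params={"q": "#socialdomain", "count": 100, "lang": "nl"},
)
response.raise_for_status()

for status in response.json()["statuses"]:
    # Each status carries the text plus ordering meta-data such as the
    # time stamp and the author.
    print(status["created_at"], status["user"]["screen_name"], status["text"])
```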

Another important reason for the popularity of Twitter as a research object is the uniformity of tweets. While messages on other social platforms can vary from only a few words to 2000-word blog posts, tweets never exceed the 140-character limit and are thus easy to read and can be rendered neatly in, for example, an Excel sheet. As Rogers emphasizes: "Given its character limit and the fact that each tweet in a collection is relatively the same length, it also lends itself well to textual analysis, including co-word analysis" (7). The uniformity of tweets is particularly useful if tweets are to be analyzed manually to increase the quality and reliability of content analysis.

Scraping

As already mentioned, scraping is one of the typical digital media methods that can bring structure to the huge pile of digital data. Scrapers are software applications that are used to (automatically) capture online content. As emphasized by Marres and Weltevrede, scraping is a medium-specific method. This means that it approaches data as natively digital, as opposed to 'offline' data that has been digitized, and therefore makes use of their benefits, such as meta-data that helps to order them. The method of scraping thus makes use of the characteristics inherently embedded in digital data to help structure them. Hyperlinks, time stamps, hashtags, and metrics, for example, are means that structure data and, at the same time, are already embedded in them. Thus, as Marres and Weltevrede state, the assumption in social research that digital online data are an "informational mess" (326) needs to be rejected. Instead, social research should benefit from the structured or pre-formatted data that are distinctive of digital online media.

Online content analysis

In order to analyze online content effectively and efficiently, it is useful to look at traditional content analysis methods. Using coding schemes to analyze content of different kinds is a technique that was in use long before digital media arose.

According to Bauer, “content analysis is a systematic technique for coding symbolic content (text, images, etc.) found in communication, especially structural features (e.g., message length, distribution of certain text or image components) and semantic themes” (Bauer, 2000 in Herring, p. 234).

In an attempt to come up with a technique for structurally analyzing online content, Herring suggests two different approaches: the traditional approach construed to the web, symbolized as [web[content analysis]]; and the analysis of web content, symbolized as [web content[analysis]]. The former should be seen as traditional content analysis with certain adaptations to make it suitable for analyzing web content; the latter moves away from the traditional forms of content analysis towards more modern methods, targeted more specifically at analyzing web content.

The traditional approach to content analysis takes five steps: (1) formulating a research question/hypothesis, (2) (randomly) selecting a sample, (3) defining categories for coding, (4) coding the content, and (5) analyzing/interpreting the data (235). In practice, this approach can be problematic because it is almost never followed strictly (e.g. the study is more exploratory than pre-focused, or non-random samples are used).

As Herring argues that the traditional approach applied to web content (i.e. [web[content analysis]]) falls short of providing an appropriate method for analyzing web content, she proposes two non-traditional approaches (i.e. [web content[analysis]]): Computer-Mediated Discourse Analysis (CMDA) and Social Network Analysis (SNA). For the former, she lays out a five-step process similar to the traditional approach: (1) articulate research question, (2) select computer-mediated data sample, (3) operationalize key concepts, (4) apply methods of analysis to data sample, (5) interpret results (238). She emphasizes that this approach is best applied to the analysis of e-mail, discussion forums, chatrooms, text messaging, mediated speech, and monologue text on web pages. She states that it can also "offer insight into the hypertextual nature of websites" (238), but that such content is better approached by social network analysis.

Social network analysis is more focused on link analysis and is, for example, used to locate web spheres (clusters of web pages addressing a common theme, see Schneider and Foot). A well-known application of this kind of analysis is Google's PageRank (see Brin and Page). As links are often considered the essence of the web, link analysis should provide a better means of analyzing web content than computer-mediated discourse analysis.

Providing a more tangible example, Rui, Chen, and Damiano conducted a content analysis of 1500 tweets sent by health organizations. Although they do not mention Herring in their article, they follow the five-step process of Herring's Computer-Mediated Discourse Analysis fairly strictly. First, they pose two research questions: "What types of social support do health organizations provide through Twitter?" and "What types of social support do health organizations seek through Twitter?" (670). After that, they explain how they selected their data sample: a list of 1400 health organizations set up by the U.S. Department of Health and Human Services was obtained, from which they selected 60 organizations using a random number generator. Two were excluded from the sample for not having a Twitter account. Using a Python script, they downloaded 1500 tweets sent by these 58 health organizations. They then move to Herring's third step by setting up a coding scheme of six different types of content, varying from "providing informational support" to "seeking emotional support", plus an "other" category (670-671). The fourth step consisted of coding each tweet with one of the seven categories. For the final step they conclude that health organizations' use of Twitter is limited, because the tweets mostly consisted of one-way, information-providing content.
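The computational side of such a workflow is modest; the analytical weight sits in the manual coding. The sketch below shows only the bookkeeping of the sampling and coding steps, with hypothetical organization names and category labels standing in for the lists used by Rui, Chen, and Damiano (only two of their categories are quoted above; the rest here are invented).

```python
import random

# Hypothetical stand-ins for the study's source list and coding scheme.
organizations = ["health_org_%d" % i for i in range(1400)]
CATEGORIES = [
    "providing informational support",  # quoted in the study
    "seeking emotional support",        # quoted in the study
    "other",                            # remaining labels are invented here
]

# Step 2: randomly select 60 organizations from the full list.
sample = random.sample(organizations, 60)

# Steps 3-4: downloaded tweets are coded one by one; a human coder's
# judgment supplies the category, the script only records it.
coded = {}
for tweet_id, text in [("t1", "Flu shots available at all clinics this week.")]:
    coded[tweet_id] = CATEGORIES[0]

print(len(sample), "organizations sampled;", len(coded), "tweets coded")
```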

A comparable study was performed by Daas et al. In order to investigate whether Twitter could be a usable data source for Statistics Netherlands, they collected around 12 million Dutch tweets and categorized them by topic. They did this by first separating the tweets containing a hashtag (around 1.7 million) from those that did not. From the tweets containing a hashtag, they filtered out the tweets with the 1000 most used hashtags (making up about 35% of all tweets with a hashtag) and manually classified them according to themes that Statistics Netherlands investigates (e.g. economy, politics, sports). The messages that did not contain hashtags were – after automated text analysis of the remaining 10.3 million tweets proved too ambiguous – reduced to a random sample of 1050 tweets that were also manually categorized. Their findings show that tweets certainly contain useful content; however, "collecting tweets and analyzing their content is not the same as using them for official statistics" (13). They state that relating tweet content to opinions of the Dutch population will be difficult without additional sources.

It thus appears that, although Herring stated that Computer-Mediated Discourse Analysis is best applied to the analysis of e-mail, discussion forums, chatrooms, text messaging, mediated speech, and monologue text on web pages, this method also lends itself to tweet content. Looking at the characteristics of tweets, this is not surprising. Regarding content, tweets share many commonalities with chatrooms, text messaging, and discussion forums (users react to each other), but also with monologue text on web pages (of which the blog is one common example), since tweets are also a means of spreading thoughts, ideas, and opinions (and Twitter is referred to as a micro-blogging platform).

Herring's concluding remark is that the traditional approach should benefit from scholarly understandings of the web in order to provide a method for web content analysis. She proposes "a methodologically plural paradigm under the general umbrella of Web Content Analysis" (245) that includes different methods, such as image analysis, link analysis, and language analysis, which can all address the characteristics of web content. This means the traditional paradigm of content analysis has to expand – a move that will meet some resistance from within the field, but that is essential for remaining relevant in the web era.

Sentiment analysis

Sentiment analysis can be seen as a particular form of content analysis. According to Thelwall et al., there are three approaches to (automated) sentiment analysis of online texts: full-text machine learning, lexicon-based methods, and linguistic analysis. Full-text machine learning involves training an algorithm on a set of texts annotated by human coders, by which the algorithm learns which features of texts should be associated with the positive, negative, or neutral category. The lexicon-based method starts with a list of pre-coded words, each assigned a particular sentiment and strength, after which the algorithm can determine a weighted score to associate a text with the positive, negative, or neutral category. Linguistic analysis focuses on the structure of texts, often also making use of a lexicon. By looking at the textual structure, this method can also distinguish (double) negations, sarcasm, superlatives, idioms, and other complicated textual features. In practice, most systems use a combination of these approaches (408).
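To make the lexicon-based approach concrete, here is a minimal sketch that scores a text against a tiny invented word list. Real systems use lexicons with thousands of weighted terms and add linguistic handling for negation and intensifiers.

```python
# Minimal lexicon-based sentiment sketch: every pre-coded word carries a
# sentiment strength, and a text's score is the sum over its words.
# The lexicon below is invented for illustration.
LEXICON = {
    "good": 2, "great": 3, "helpful": 2,
    "bad": -2, "terrible": -3, "slow": -1,
}

def classify(text):
    score = sum(
        LEXICON.get(word.strip(".,!?").lower(), 0) for word in text.split()
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("Great and helpful staff at the youth care office"))  # positive
print(classify("Terrible waiting times, really slow service"))       # negative
```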

However, such analyses have traditionally been subject to issues of accuracy and reliability. There are many companies that offer sentiment analysis of (tweet) content, and naturally all of them claim to reach a high rate of accuracy or accordance with established methods of measurement (e.g. internet polls, trained human coders). Some studies also claim to provide accurate means of automated sentiment analysis of (tweet) content. For example, O'Connor et al. state that their sentiment analysis of Twitter messages correlates with poll-based public opinion measurements; the correlation reached around 80% at particular moments. To determine sentiments from tweets, they used a dataset of around 1 billion tweets gathered via the Search API and 'Gardenhose' (now Streaming) API from Twitter. Although the software proved to produce a lot of errors in determining sentiments, they state that these errors would cancel each other out (i.e. the number of mistakenly positively determined tweets would be roughly the same as the mistakenly negatively determined tweets).

Kim et al. conducted another study to test the accuracy and reliability of automated sentiment analysis of tweets. They compared the Radian6 automated sentiment software with a team of trained coders, which they set as the "gold standard" (143). As Table 1 shows, and as Kim et al. conclude, the Radian6 algorithm scores poorly in correctly determining the sentiment of tweets. To put this in perspective, they also used the crowdsourcing platform CrowdFlower to have the same set of tweets coded by an unknown number of 'workers'. As Table 2 shows, crowdsourcing proves to be a much more reliable method for coding the sentiment of tweets than automated methods.

                     Gold: Positive   Gold: Neutral   Gold: Negative
Coded as Positive         17.7%            8.3%             3.3%
Coded as Neutral          76.5%           91.7%            93.3%
Coded as Negative          5.9%            0.0%             3.3%

Table 1: Agreement of the Radian6 algorithm with the gold standard (diagonal cells should show a high rate, off-diagonal cells a low rate)

                     Gold: Positive   Gold: Neutral   Gold: Negative
Coded as Positive         82.4%            4.2%             0.0%
Coded as Neutral          17.7%           87.5%             0.0%
Coded as Negative          0.0%            8.3%           100.0%

Table 2: Agreement of crowdsourcing with the gold standard

The outcomes also show that O'Connor et al.'s argument that wrongly coded negative and positive tweets cancel each other out does not apply here, since the rate of wrongly coded negative tweets differs from the rate of wrongly coded positive tweets.

In accordance with Lewis et al. and Bauer, Hill et al. also state that online content analysis, and in particular sentiment analysis, should benefit from a hybrid form of both computational and manual methods:

“Given the possibility that both machines and humans will occasionally misinterpret author intent, one may assume that sentiment analysis is too error-prone to add value to analysis; indeed, this is often the case. But for those situations where textual datasets are very large, even with large error rates, machine-based sentiment analysis (especially coupled with other text analytic techniques) provides a singular way to understand otherwise unexamined data” (38).

An important argument they provide for using textual sentiment analysis is that, although sentiment analysis suffers from reliability and accuracy issues, it is able to create a sample of texts with a particular sentiment that can be investigated further. To illustrate this process, they use the example of a clothing retailer who lets each customer fill in a simple survey that ends with an open question: "Do you have any other comments you would like to share with us about the retail store?" By subjecting all answers to a sentiment analysis, the retailer is able to separate the negative answers from the rest, which could be around 20% of the total number of answers (Hill et al. 39). While this smaller dataset is still as error-prone as the initial dataset, the retailer is left with a set that allows for a manual exploration of common themes, which can then expose particular issues that the retailer was unaware of before the analysis, and which can be solved. The accuracy of the sentiment analysis in this case is not even that relevant, since the exploration of common themes gets rid of the 'noise', or wrongly coded answers. In the light of Big Data and its messiness (as discussed by Marres and Weltevrede and by Rogers), sentiment analysis functions as an ordering device that helps to structure the enormous pile of available data.

The example shows that sentiment analysis will never allow for a 100% accurate classification of texts – according to Hill et al., around 30% or more of the responses were inaccurately coded (39) – but that it can function as a means to discover common issues: a rough separation by sentiment can reduce the initial dataset to about a fifth of its original size, after which zooming in on the negative responses helps to identify those issues. It also shows how a hybrid method – automated analysis combined with manual analysis – can be an effective approach, using both the processing benefits of computational methods and the accuracy and reliability of manual methods.
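A sketch of that winnowing step, under the same assumptions as the lexicon sketch above (an invented word list, with misclassifications expected): the point is not label accuracy but that the negative subset becomes small enough to read manually.

```python
# Winnowing step from the retailer example: a crude, error-prone filter
# shrinks the dataset; the small negative subset is then read manually.
NEGATIVE_WORDS = {"terrible", "slow", "bad", "rude"}  # invented toy lexicon

def looks_negative(text):
    return any(
        word.strip(".,!?").lower() in NEGATIVE_WORDS for word in text.split()
    )

answers = [
    "Great store, helpful staff",
    "Fitting rooms were terrible and the queue was slow",
    "Nothing to add",
]

negative = [a for a in answers if looks_negative(a)]
print(len(negative), "of", len(answers), "answers kept for manual review")
```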

Sentiment analysis has thus proven to be a tricky method, and should be used with care. Simply letting a sentiment algorithm run on a set of texts and taking the outcomes as a reflection of reality is an easily made mistake, especially since sentiment analysis tools appear reliable and claim to reach particular accuracy scores. Sentiment analysis does, however, succeed in making a rough distinction between positive, negative, and neutral texts, which makes datasets more manageable and allows for easier (manual) analysis to find common themes or problems that can be dealt with further.

With the methods introduced above, which address both quantitative and qualitative analysis of online (social) media, this study aims to provide a partial answer to the problems for empirical social research outlined by Savage and Burrows and by Hill et al. Although the direct threat to empirical social research posed by Big Data and social media deserves some nuance (as does the 'new belief' in quantitative data voiced by Anderson), the possibilities for using these new forms of social data in social research need to be considered and explored. While methods for quantitative analysis of social media are already widely used in all kinds of fields, methods for qualitative analysis are still in an experimental phase and in some respects lack reliability and accuracy. However, as discussed above, they can still be of value when used in the right way. Rui, Chen, and Damiano have shown how traditional, structural coding of content can be applied to tweets. Additionally, Hill et al. have shown how sentiment analysis, despite its error-prone character, can be used in a valuable way by separating opinionated messages from neutral ones, thus creating a more manageable dataset that can be analyzed manually.

Research approaches at SCP

The methods and approaches discussed above show how analysis of online content can generally be conducted. However, to investigate if and how Twitter data can be of use in complementing the traditional research performed by the SCP, it is important to outline the standards and requirements of this traditional research. The SCP uses different methods to acquire information, such as survey research, in-depth interviews, and retrieving administrative information from different institutions. Since these traditional methods differ substantially from the methods that can be used to analyze online (social) media, it is important to map both approaches and their differences and (possible) overlaps. By doing so, it will become clearer how both approaches can benefit from each other and how they may be used complementarily.

Survey research

Survey research is the most important means for the SCP to collect information in its different fields of inquiry. According to De Leeuw, Hox and Dillman, there are many different definitions of what constitutes a survey. In an attempt to capture these definitions in one sentence, they describe the survey as "a research strategy in which quantitative information is systematically collected from a relatively large sample taken from a population" (2). This definition emphasizes the two most important aspects of survey research: samples and quantitative information. Questions that arise in designing a survey are: "How many people need to be surveyed in order to be able to describe fairly accurate the entire group? How should the people be selected? What questions should be asked and how should they be posed to respondents?" (De Leeuw, Hox and Dillman 1).

Defining a sample is a crucial step that has to be taken with care and precision. This step starts with creating a sample frame, which defines the means by which the sample will be assembled. This can vary from simple, pre-existing lists of people (e.g. all employees of a particular company) to area probability sampling (geographical selection of sample members) to all kinds of algorithms that mathematically select sample members. When constructing a sample, it is key that its representativeness is known, as this will influence the types of inferences and conclusions that can be drawn from the data.

In-depth interviews

Besides surveys, the SCP makes use of in-depth interviews with people who are specially selected for each subject (for example married immigrants or employees in particular sectors). These interviews are conducted face-to-face and are aimed more at qualitative research than surveys are, and thus focus more on content. In these interviews, personal matters and deeper topics are often discussed, frequently revealing information that people are hesitant to share.

Focus groups

Focus groups are used as a sort of middle path between survey research and in-depth interviews. A group is formed of people with the same background or ideas, after which a questionnaire is presented. The topics are more superficial than those of in-depth interviews, and conclusions are often drawn by aggregating the answers of the entire group.

SCP and social media

Why should the SCP want to use social media?

The theoretical part has already summed up the most important advantages of using social media in social research. Additionally, the Netherlands has the highest share of social media users in Europe: in 2012, 65% of the Dutch population between 16 and 75 years old was active on social media, compared to a European average of 40% (Office for National Statistics 2013). Since the SCP conducts research solely among the Dutch population, this high rate is an important argument in favor of using social media data. To focus on Twitter more specifically: in January 2014, the total number of monthly active Dutch Twitter accounts – that is, accounts that post a tweet at least once a month – was 1.3 million (PeerReach). It should be taken into account that this figure also includes Twitter accounts belonging to companies, institutions, and other organizations. Still, this forms an enormous pool of potentially useful sources.

Correlation: to what extent does social media analysis conform to the SCP standards?

As may have become clear in previous sections, there are major differences between survey research and social media analysis. Where surveys rely on carefully selected samples to ensure the most representative target group, social media only partly let the researcher create a sample. There are possibilities for selecting members based on particular properties, such as age, location, gender, level of education, or nationality, but there is a lack of consistency in these data, since they all depend on users' profile information and privacy settings. Thus, when selecting a group based on one or more of these properties, the people who have not included these data in their profiles (or have them hidden by their privacy settings) are excluded from the sample frame. Another sampling issue is that the people who use social media are still a limited group that excludes a significant share of the population, such as elderly people, people with no internet connection, or people who deliberately choose not to use social media for various personal reasons. These issues are hard to overcome, as is also emphasized by Murphy et al.: "Although social media research can accurately reflect activity online, more research is needed to determine whether we can create a frame of social media users from which we can sample individuals for research with a known and non-zero probability" (7).

Besides the issues with representativeness, there is a large difference in the data that are collected. This difference is, in the first place, caused by the entirely different method of collecting data. While survey data are ‘pulled’ out of respondents by asking targeted questions that should directly deliver the needed information, social media data are ‘pushed’ out by users in all forms and on all subjects. This means that data collection via social media analysis essentially works in the opposite direction from data collection via surveys and interviews. Where survey data collection starts with qualitative data that are quantified, data collection via social media analysis starts with quantitative information that needs to be qualified. This leaves the researcher with two major challenges. The first is to filter the relevant information out of the enormous mass of data that is shared via social media. Since all social media data are digital, and can thus be processed by software, a large part of this challenge can be alleviated by choosing the right tools. However, the design of search queries can still be challenging and is crucial in finding the right data. The second challenge is making sense of the data. Again, this is where social media data differ substantially from survey data. Survey data are purely quantitative and directly aimed at specific questions, while social media data can be both qualitative and quantitative, and have to be analyzed further to make sense of them and to make them useful.
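To make the first challenge concrete, a minimal sketch in Python is given below. The keyword set and the example posts are hypothetical and serve only to illustrate the principle of filtering a mass of posts down to the relevant ones; they are not part of the SCP workflow or of any of the tools discussed later.

```python
# Minimal sketch of the first challenge: filtering relevant posts out of
# a larger mass of social media data. Keywords and posts are illustrative.
KEYWORDS = {"jeugdzorg", "decentralisatie", "transitie"}

def is_relevant(text):
    """Keep a post if it mentions at least one of the topic keywords."""
    words = set(text.lower().split())
    return bool(words & KEYWORDS)

posts = [
    "De decentralisatie van de jeugdzorg baart mij zorgen",  # on-topic
    "Lekker weertje vandaag!",                               # off-topic
]
relevant = [p for p in posts if is_relevant(p)]
print(len(relevant), "of", len(posts), "posts kept")  # 1 of 2 posts kept
```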


Methodology

Scope

As mentioned in the introduction, this research is backed up by a case study at the Netherlands Institute for Social Research. Within the SCP, there is a variety of themes in which research is conducted. One of these fields is called the ‘social domain’, and this is the field in which the SCP thinks new means of collecting social data can be the most relevant. The social domain is a collective term for a number of facilities that have been handed over from the national government to municipalities due to planned decentralization. These facilities are aimed at citizens who need special attention because they are not (fully) able to take care of themselves, due to, for example, poverty, mental or physical disabilities, addiction, depression, or child-related issues. The main facilities that are handed over from the national government to municipalities are youth care, care of the handicapped, participation in the labor market, and social guidance. The national government remains fully responsible for these facilities; it is therefore important to know how municipalities perform in these areas. To accomplish this, the national government has set up an overall monitor that is divided into different sections. This monitor is set up to investigate the problems occurring in cities, how municipalities handle these problems, and the expectations of citizens regarding their municipalities.

While within the social domain there are a number of topics that can be investigated in terms of how they are discussed on social media, the topic of youth care stands out in several respects. It is the sector that has faced the most radical reorganization and, at the same time, the one that has the most clients. Searches on Twitter for the different topics also made clear that youth care was the most discussed, and that many tweets contained emotional or passionate messages from clients themselves, worried about what the reorganization would mean for the care of their children. Since this is the type of information that concerns the SCP, this study will mainly focus on the discussions around youth care on Twitter. For the sake of structure and clarity, a separate research question will be used for the empirical part of this study.

RQ: How do Twitter users engage with the topic of decentralization of youth care in the Netherlands and what methods can be used to map this debate?

Research methods

Since this study will research the possibilities for using Twitter – a typical natively digital medium – as a data source in empirical research, it will make use of methods that benefit from the pre-structured character of the natively online data that Twitter provides. This is primarily useful in conducting quantitative research. The method of scraping, as discussed by Marres and Weltevrede, is included in this research through two online tools that enable the capturing and analysis of Twitter data. These tools enable different analysis approaches, which will be introduced and explained in the section ‘Quantitative approaches’; the tools themselves are described in the ‘Tools’ section.
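As a brief illustration of what ‘pre-structured’ means here: every tweet is delivered by Twitter’s API as a machine-readable JSON object, so its fields can be processed directly by software. The sketch below uses a made-up tweet in the classic v1.1 format; it is an illustration, not part of the tools used in this study.

```python
# Sketch: a tweet as delivered by Twitter's API is pre-structured JSON.
# The fields "text", "created_at" and "user" follow the classic v1.1
# format; the tweet itself is a made-up example.
import json

raw = ('{"text": "De transitie komt eraan",'
       ' "created_at": "Mon Mar 16 10:00:00 +0000 2015",'
       ' "user": {"screen_name": "voorbeeld"}}')

tweet = json.loads(raw)
print(tweet["user"]["screen_name"], "tweeted:", tweet["text"])
```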


Since this study also researches the possibilities for qualitative analysis, the theoretical part has introduced two approaches to online content analysis, each with its pros and cons. The method of online content analysis described by Herring provides a structured approach that is divided into separate steps. By deriving this method from traditional content analysis, Herring emphasizes the link between new and established approaches. Rui, Chen, and Damiano have shown how such an approach can be applied to the content of tweets, following Herring’s steps quite strictly. There are, however, a number of factors that complicate Herring’s and Rui, Chen, and Damiano’s approach for this research. The first is that there are no specific research questions regarding the content that is to be analyzed. In other words: this research is not aimed at answering specific questions about tweet content that arise at the SCP directly, but rather examines whether and how the SCP can use Twitter as an additional data source. While different search queries, regarding different SCP-related themes, will be used to explore the possibilities of Twitter as a data source, the main goal is not to answer specific questions by analyzing tweet content, as Rui, Chen, and Damiano have done. The second, more practical complication is that such an analysis requires trained coders to objectively analyze tweet content, which is not possible within this research.

Sentiment analysis, on the other hand, is practically very feasible within this research. The tools to perform sentiment analysis are often free or low-cost, and require no particular skills or training. Unfortunately, the lack of accuracy and reliability that all automated sentiment analysis tools suffer from makes sentiment analysis a problematic approach. As Kim et al. have shown, the Radian6 algorithm scores poorly in determining sentiment when compared to human coders in the form of crowdsourcing. Different tactics show how sentiment analysis can still be useful despite being inaccurate. When working with a dataset that is large enough (which is still hard to define in exact figures), automated sentiment analysis can be useful to make a gross distinction between positive, negative, and neutral texts, which makes a dataset more manageable and allows other approaches to be used to further analyze content and, for example, discover recurring issues that can be dealt with.
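The gross distinction mentioned above can be sketched with a simple lexicon-based classifier. The word lists below are illustrative rather than a real Dutch sentiment lexicon, and such an approach suffers from exactly the accuracy problems discussed; it only shows how a coarse positive/negative/neutral triage could work.

```python
# Sketch of a coarse lexicon-based sentiment triage. The word lists are
# illustrative only, not a validated Dutch sentiment lexicon.
POSITIVE = {"goed", "blij", "fijn", "geweldig"}
NEGATIVE = {"slecht", "boos", "zorgen", "bezuiniging"}

def classify(text):
    """Return 'positive', 'negative' or 'neutral' based on word counts."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("Ik maak me zorgen over de jeugdzorg"))  # negative
print(classify("Wat een goed plan!"))                   # positive
```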

Tools

OBI4wan

At the start of this research, the SCP had already embraced an online tool to try and analyze different online sources: OBI4wan. The tool is a paid service that acquires posts from a number of different online (social) media sources, such as Twitter, Facebook, Pinterest, blogs, forums, and news sites. Although the tool is primarily aimed at webcare services (i.e. communication with clients via different networks), it has a separate section that allows the user to collect data from up to nine months back in time. In this section, the user can create reports in which a dashboard is used to view and visualize data. The different visualization options are presented as widgets that can be placed on an ‘empty sheet’. The options vary from line graphs, pie charts, and cumulative column graphs to tag clouds or a list of all complete posts. OBI4wan presents a list of filters with which the user can adjust the sources, time period, sentiment, et cetera. Within each report, the user can create and select multiple search queries, in which the same filters can be used as on the dashboard. Search queries allow Boolean operators and some additional parameters, such as language (e.g. “lang:EN”) or specific usernames.

OBI4wan supports different kinds of data export. Each widget can be exported as a PNG (.png) image file, allowing the user to include visualized results in other documents. Also, the entire report can be exported as a PDF (.pdf) or Word (.docx) file. All messages (i.e. posts resulting from the search query) can also be exported to PDF or Excel (.csv) formats, albeit limited to 1000 posts.

Teezir

To map the advantages and disadvantages of the OBI4wan tool, the Teezir tool was tested. Teezir is mainly aimed at brand monitoring and, as a dashboard, is very similar to OBI4wan. Its basic functionalities are the same and it also works with widgets on an empty background. There were, however, a number of key differences that helped to map the qualities and deficiencies of both tools.

A major difference between the two tools is in the data export possibilities. While Teezir offers unlimited export of raw data in Excel and Word format, and export of ‘widgets’ as images, the export of raw data in OBI4wan is limited to 1000 posts. This is a key point, because social research should preferably be backed up by all of the raw data that is examined. In the case of the SCP, all inquiries should be delivered with the raw data to ensure the reproducibility and transparency of analyses.

A second major difference is the Twitter API that each tool uses. OBI4wan gets its Twitter data from a third party (see DataSift) that uses the full Firehose API. This means that all Twitter data can be retrieved, with a small delay of around ten minutes. Teezir does not make use of the full Firehose API, which results in an undefined sample of tweets rather than the full stream, and thus in a dataset that is at least three times smaller. For example, a simple search for “Shell” with Twitter as the only source, the language set to Dutch, and data from January 1st, 2015 to March 17th, 2015, led to 18,246 results in OBI4wan and only 4,963 in Teezir. Over a longer period (June 1st, 2014 to March 17th, 2015), the query “Obama” led to 97,110 results in OBI4wan, compared to 17,518 in Teezir. The same settings for the query “Tennis” led to 42,648 results in OBI4wan and 8,701 in Teezir.

Twitter Capturing and Analysis Toolset

To be able to take a different perspective on the data, as well as to obtain more raw data, the Twitter Capturing and Analysis Toolset (TCAT) was also used (see Borra and Rieder). This tool was developed at the University of Amsterdam and is part of the larger toolset of the Digital Methods Initiative. TCAT is more aimed at research purposes than OBI4wan and Teezir, and provides a number of options that are not available in the other tools. The most important option is the ability to export Twitter data in a number of different forms. TCAT works with separate datasets, which first have to be configured with particular search queries before tweets can be captured via Twitter’s APIs. When a dataset is selected, the user has a number of options to analyze and export the data. Within the dataset, filters can be used to create a more selective dataset. For the output, the user can choose between comma-separated values (CSV) and tab-separated values (TSV), which are both spreadsheet formats. The user then has the option of grouping tweets overall, per hour, per day, per week, etc. Different lists of data can then be exported as a spreadsheet, such as most used hashtags and most mentioned users, but also overall statistics such as user stats and tweet stats from the dataset. (Suitable) data can also be exported in different network file formats, which allows the user to visualize the data in other tools, such as Gephi.
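As a sketch of what can be done with such an export: the CSV output could, for example, be post-processed with pandas to reproduce a per-day grouping. The file name and the column names (‘created_at’, ‘text’) are assumptions about the export format and may differ between TCAT versions.

```python
# Sketch: post-processing a hypothetical TCAT CSV export with pandas.
# The file name and column names are assumptions about the export format.
import pandas as pd

df = pd.read_csv("tcat_export.csv", parse_dates=["created_at"])

# Tweet volume per day, analogous to TCAT's own "per day" grouping.
per_day = df.set_index("created_at").resample("D")["text"].count()
print(per_day.head())
```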

One of the disadvantages of TCAT is that there are limitations in the amount of data that can be retrieved, both back in time and in real time. Using Twitter’s Search API, it is possible to capture tweets from the past, but with limitations in both the number of requests per minute and the period of time over which data can be retrieved (about two weeks back in time). As this research started using TCAT on March 16th, 2015, some important dates are missing from the dataset, for example the moment the new youth care system was introduced (January 1st, 2015). Also, as mentioned before, the Streaming API will only provide a 1% sample of all tweets meeting the search query.

Tool selection

Out of the three tested tools, this research will continue using two: OBI4wan and TCAT. Since Teezir shares a lot of features with OBI4wan, but does not make use of Twitter’s Firehose API, its benefits are too small to be of value in this research. The two tools that will be used differ significantly in their capturing and analysis capacities, which makes them suitable for different analyses. OBI4wan has the advantage of using Twitter’s Firehose API, which provides access to an unlimited amount of tweets, and its dashboard interface provides a clear overview of tweet content and statistics. TCAT, on the other hand, provides more options for analysis and full export possibilities for different datasets. The goal of using both tools side by side is to profit from the advantages of each, and to limit their disadvantages by using them in a complementary way.

Query design

OBI4wan

Using OBI4wan, different search queries have been used to try to identify social issues in the area of the social domain. While there are a number of potentially interesting topics that are currently treated within the SCP, one stood out for a number of reasons. Since the youth care system was subject to major changes due to the decentralization, a brief look at the tweets sent about this topic immediately showed a diversity in opinions and sentiments. It also became clear quite quickly that this dataset was of a manageable size, which allows for manual analysis of tweets to improve the quality and reliability of content analysis, as discussed in the theoretical part. After digging deeper into the subject of youth care, and consultation with SCP employees who were already familiar with the subject, a list of keywords was set up to capture as many relevant tweets as possible. Since OBI4wan’s search function allows the use of wildcards (*) at any location in the search query, a large part of the keywords in the initial list could be combined into a single keyword. This has resulted in the following search query: “jeugd* AND (decentralisatie OR transitie OR overheveling OR #3d)”. This means that anything that has to do with youth (jeugd), in combination with the transition (in its various frequently occurring Dutch terms), is captured. After performing this query, different filters were enabled to manage the relevance and size of the dataset. First, the source filter was set to only use Twitter as a resource. This results in a dataset that is about
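To illustrate how this wildcard query operates, it could be approximated in Python with regular expressions, as in the sketch below. OBI4wan’s exact wildcard and Boolean semantics are not documented here, so this is an approximation for explanatory purposes only.

```python
# Sketch approximating the OBI4wan query
#   jeugd* AND (decentralisatie OR transitie OR overheveling OR #3d)
# with regular expressions; OBI4wan's exact semantics may differ.
import re

JEUGD = re.compile(r"\bjeugd\w*", re.IGNORECASE)  # the "jeugd*" wildcard
TOPIC = re.compile(r"\b(decentralisatie|transitie|overheveling)\b|#3d",
                   re.IGNORECASE)

def matches(text):
    """True if a tweet satisfies both parts of the Boolean query."""
    return bool(JEUGD.search(text)) and bool(TOPIC.search(text))

print(matches("De transitie van de jeugdzorg is een feit"))  # True
print(matches("Jeugdvoetbal vanavond in het park"))          # False
```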
