Gatekeepers among themselves? A study of the role of homophily for source diversity in journalists’ Twitter networks

Jonathan Dudek 12263192

Master’s Thesis

Graduate School of Communication

Research Master’s programme Communication Science

Supervisor: Dr. J. E. (Judith) Möller

Date of completion: June 26, 2020

Abstract

This study investigates how homophily among journalists on Twitter may be related to the diversity of sources encountered and shared by them. In this, I assume that journalists’ Twitter networks may ‘gatekeep’ the sources journalists are exposed to. To address this research objective, I used a sample of 633 Twitter accounts of Dutch journalists and their respective Twitter followees, and extracted four months of tweeting activity for those accounts. The followees were classified in order to determine the share of other journalists among them. Diversity of sources was identified among the tweets of the followees on the basis of the domain names of the URLs shared. On average, 38% of journalists’ followees (that can be linked to a person) were journalists as well. Furthermore, a higher share of other journalists among the followees was negatively, although only weakly, related to the overall diversity of sources tweeted by the followees. Finally, out of their followees’ tweets, journalists tended to retweet other journalists more than other accounts belonging to a person. These results testify to the presence of homophily in journalists’ following behavior on Twitter and show how such homophily can limit the utility of Twitter as a sourcing tool. They also demonstrate that journalists’ activity on Twitter needs to be considered in the broader context of, and as subject to, the network activities of all actors present.

Introduction

For a frequent user of the social media network Twitter, it is hard to miss one particular “group” of users: journalists. Indeed, their presence on the microblogging platform is considerable. Hanusch and Bruns (2017) identify approximately 40-50% of the whole Australian population of journalists on Twitter, and 67% of journalists in the US reported using microblogs (i.e., Twitter and Snapchat) on a weekly basis in 2017 (Cision, 2017). Use cases of Twitter in the professional context of a journalist are numerous, ranging from collecting information and contacting sources, to engaging with the public and interacting with peers (Barnard, 2016). This development runs parallel to the changes and challenges the rise of social media has brought about for the traditional news media. Herein, the affordances of social media define new conditions for what should be news content worth attention (Hermida, 2016), casting a new light on previously pivotal news production and dissemination mechanisms such as gatekeeping (Meraz & Papacharissi, 2016).

In the midst of this blending of the traditional news media and social media, the potential of the latter as a source of news stands out. Stories sourced from social media have increasingly made it into the news (Von Nordheim, Boczek, & Koppers, 2018), with the news even inferring public opinion from current trends voiced on a platform such as Twitter (McGregor, 2019). These are observations that do not simply describe changes or augmentations to classical news sourcing practices, but run deeper: In some cases, journalists perceive the newsworthiness of Twitter content as on a par with content emitted by news agencies (McGregor & Molyneux, 2018). This indicates how far Twitter has become a central element of journalists’ working practices.

There are obvious downsides to this integration of content sources on Twitter into the news. Worrying outcomes emerge, e.g., when insights into public opinion are not filtered thoroughly, giving room to intentionally misleading voices in the news (Lukito et al., 2020). On the other hand, journalists’ Twitter networks are tailored around actors of their own profession, as testified by journalists themselves (McGregor, 2019). Not only that, but journalists also show a tendency to interact mostly with each other (Hanusch & Nölleke, 2019; Molyneux & Mourão, 2019). Such preponderance of their own profession in journalists’ Twitter networks warrants a critical reflection on the importance of Twitter as a news source among journalists. In particular, it calls into question how far the Twitter ‘experience’ of a journalist is shaped by other journalists’ Twitter activity. This seems even further justified given that ideological inclinations apparent in journalists’ news writing have been found to be related to partisanship in the Twitter accounts followed by journalists (Wihbey, Joseph, & Lazer, 2019).

In sum, Twitter occupies a crucial role in journalistic practices. This might not necessarily benefit balanced news coverage, but may lead to an overemphasis of the voice of social media (McGregor, 2019), or even the inclusion of wrongful sources (Lukito et al., 2020). These findings make a point regarding the outcome of journalistic work, i.e., news pieces, and identify Twitter as a newly established provenance of news content. However, it remains open how exactly journalists could be affected by using Twitter in the earlier stage of their work – the sourcing of content. One possible outcome could be a limitation of the diversity of sources encountered. In this context, journalists might also be affected by the specific constellation of their Twitter networks. Concerning the latter, we know that other journalists constitute an essential part of journalists’ networks, an observation referred to as ‘homophily’ (Hanusch & Nölleke, 2019). Connections between journalists are established on the basis of the various affordances offered by Twitter. These can be functionalities such as retweeting others’ content (e.g., Hanusch & Nölleke, 2019) or replying to and quoting each other (e.g., Molyneux & Mourão, 2019). The aforementioned study by Wihbey et al. (2019) took into account the aspect of following the Twitter accounts of other journalists. It is especially the last affordance that is central in the study presented here. Serving as an indication of homophily, following behavior is observed in regard to a possible relation with the diversity of sources that journalists encounter on Twitter. Accordingly, the central research question raised is:

How does homophily in journalists’ networks on Twitter relate to the diversity of sources that can be encountered and that are shared in those networks?

This study demonstrates the challenges around meaningful research with social media data, combining a perspective on actors with a perspective on the content shared by them. This said, I did not take into account the ‘real-world’ products of journalistic work, but focused on insights obtainable from Twitter alone. At the same time, the emphasis on source diversity as related to the variety of accounts followed adds to the understanding of the factors that may influence journalists’ sourcing practices.

Theoretical Background

The surge of social media in the context of the Web 2.0 has brought about considerable changes for the news industry. This includes the profession of journalists, who find themselves in the very midst of the changed media conditions (Hermida, 2016). In fact, they take a particularly active role in the recently evolved social media environments, remarkably so in the case of Twitter (e.g., Hanusch, 2018). This makes Twitter as a social media platform, and journalists as actors on this platform, a good starting point for studying the interrelation of social media networks and the news media. Adding to that, the affordances offered by Twitter make it a social media platform that is particularly apt for studying interactions between journalists and the public (Gil De Zúñiga, Diehl, & Ardèvol-Abreu, 2018).

Considering extant concepts of news production and dissemination in the context of social media necessarily involves the question of how journalists might be affected as well. In fact, while journalists have so far been central actors in the news media landscape, digital social networks may affect them like any other user. This becomes apparent when turning to gatekeeping theory. While research has considered how journalists might exert a modified gatekeeping role on, e.g., Twitter (Molyneux, 2015), this does not touch upon the question of how they, in return, might be ‘gatekept’ here. This, however, is what the concept of ‘networked gatekeeping’ would suggest and will be further explored here.

Gatekeeping in times of social media

The concept of gatekeeping has been used extensively to describe which content makes it into the news. It holds that the actors involved in news production, i.e., journalists, news reporters, or editors, as well as the news organizations themselves, define the selection of content for the news. To paraphrase Shoemaker and Vos (2009, p. 1), gatekeeping refers to the refinement of information into the messages that eventually reach the news consumer in the form of news items.

With the advent of social media platforms, the mechanisms of gatekeeping have not suddenly become obsolete. Still, new ways of distributing content became available, giving new kinds of actors the possibility of having a voice. The production, distribution and usage of journalistic content (Klinger & Svensson, 2015) are all confronted with new mechanisms and are no longer under the control of the formerly central actors. In other words, equivalent to a “mass media logic”, social media yields a “network media logic”, in which “intermediaries” take the place of the gatekeepers (Klinger & Svensson, 2015, p. 1246). An example of this might be the aforementioned study by Molyneux (2015), which discusses the connection between the retweeting behavior of journalists and the softening of their gatekeeping function. It finds that the traditional gatekeeping considerations are being replaced by individual preferences for which content to retweet, and which not.

Meraz and Papacharissi (2016) describe gatekeeping as it becomes manifest in social media networks. This concept of ‘networked gatekeeping’ is defined as a “process through which actors are crowdsourced to prominence through the use of conversational, social practices that symbiotically connect elite and crowd in the determination of information relevancy” (Meraz & Papacharissi, 2013, p. 21). With this definition placing particular emphasis on the crowd, both the originators and the distributors of content are disentangled from the conditions imposed by traditional news production. Instead, as Meraz and Papacharissi (2016) show, the network itself defines which pieces of content will surface and surge in popularity. It is not only content, though, that can achieve prominence: As the above quote suggests, network actors can be put in temporary positions of influence. This, as the authors explicate, will depend on the content emitted by them and the popularity it consequently gains in the network. In this, it is common to observe what resembles a power law distribution, with a select few gaining most of the prominence (Meraz & Papacharissi, 2016). The means through which this happens are, for example, the retweeting and the mentioning functions on Twitter, but giving rise to prominent actors is not their only achievement: they also support the filtering of relevant content (Meraz & Papacharissi, 2016).

The observations made by Klinger and Svensson (2015) and Meraz and Papacharissi (2016), while referring to a similar situation, are not exactly two sides of the same coin. Nonetheless, it becomes apparent that on social media any user could (theoretically) become both a central actor, and an entity receiving content from other influential actors. The very same situation may be assumed for journalists when they are active on, e.g., Twitter, and make use of it by sourcing content.

Twitter as a sourcing tool for journalists

The phenomenon of Twitter being used as a source in news reporting has received particular attention in research. Von Nordheim, Boczek, and Koppers (2018) conclude from a content analysis of news articles that the numbers of references to social media sites – and Twitter in particular – have surged in recent years. Adding to that, McGregor (2019) investigated how far the news tries to infer current public opinion from social media sites such as Twitter. She found evidence for that, with respective practices apparently playing a role in the production of the news. In an early exploratory study, Paulussen and Harder (2014) investigated the extent to which references to social media are included in Belgian newspapers. They find that referring to social media has become part of journalistic work. As a caveat to that, they note that social media is specifically being used for representing non-official, non-expert, or non-elite sources, thus enhancing the variety of voices in news reporting. Deprez and Van Leuven (2018) studied the behavior of health journalists. Their finding hints at Twitter being used by journalists as a general source of information, but also for following up on other media actors. Johnson, Paulussen, and Van Aelst (2018) took a closer look at the practices of economic journalists. Comparing the kinds of sources on Twitter with the offline sources used by journalists, they find that the two seem to overlap. The authors conclude that the additional means of consulting sources via Twitter does not constitute a changed situation for the process of news production. This finding might appear to downplay the relevance of Twitter as a sourcing tool, contradicting, e.g., McGregor (2019) or Von Nordheim et al. (2018). On the other hand, this may also show that the overall picture is not quite clear yet. Bouvier (2019) describes a special case of sourcing in which journalists follow trending stories on Twitter and report these in the news. This, the author states, can be problematic insofar as it may lead to an overrepresentation of the voices of those who know how to game the trends on Twitter. In another case, Lukito et al. (2020) find that tweets with the apparent purpose of manipulation found their way into news reporting. There, they were then used as evidence of public opinion.

These findings show: Using Twitter as a source of content, and being present on Twitter in order to enhance news reporting, is not free of bias. This bias is introduced insofar as journalists making use of Twitter are subject to how this social media network functions and to what it may bring forward. In that context, it should be noted that journalists in general have a broad range of sources to choose from, whereas the selection of those sources can itself be subject to bias. This is not restricted to Twitter: In the case of economic journalists, for example, certain elites from the world of finance can achieve particular prominence as sources (Johnson et al., 2018). When reporting on crises, journalists tend to give room to those sources they are already familiar with, e.g., news agencies (Van Der Meer, Verhoeven, Beentjes, & Vliegenthart, 2017). Furthermore, source selection is tied to gatekeeping practices insofar as the latter define not only the content that makes it into the news, but the kinds of sources as well (Shoemaker & Vos, 2009).

In regard to Twitter used as a sourcing tool, the question is how exactly bias might be introduced here. This calls attention to how journalists encounter their sources on Twitter. One possibility would be to explore trending topics or to follow discussions evolving around a certain hashtag (e.g., Hermida, 2016). Also, automated tools may be used for capturing the current sentiment apparent in general tweeting activity (McGregor, 2019). At the same time, one’s timeline is normally populated with the content shared by the Twitter accounts followed (henceforth referred to in this paper as “followees”).1 Several studies have already hinted at the role journalists’ followees could potentially play in sourcing activities, and among those followees, other journalists in particular (Deprez & Van Leuven, 2018; Johnson et al., 2018; McGregor, 2019). Respective sample sizes observed, however, were small. This justifies further research into whom journalists follow on Twitter, but also seems to necessitate a better understanding of how journalists interact with each other on Twitter.

1 Friends is another term used for these kinds of accounts, e.g., in the Twitter API documentation (Follow,

Collegial bonds on Twitter

The affordances offered by Twitter for connecting and interacting with other actors allow for personally tailored uses and experiences. This context needs to be considered for a complete picture of how journalists use Twitter.

Journalists show a tendency to amplify each other’s voices and to interact mostly with each other on Twitter (Hanusch & Nölleke, 2019; Molyneux & Mourão, 2019). This pattern may even be described as resembling “social media echo chambers” (Molyneux & Mourão, 2019, p. 248). The principle behind this can be described as one of homophily (Hanusch & Nölleke, 2019): Similarity between people is constitutive of the foundation and maintenance of connectivity between them. Accordingly, the networks people are part of are made homogenous due to shared social characteristics, such as the profession (McPherson, Smith-Lovin, & Cook, 2001). The very same principle plays an important role in networked gatekeeping as well (Meraz & Papacharissi, 2016): A functionality such as retweeting of content may reach considerable parts of the entire network. However, networked gatekeeping gains considerable impetus particularly in more confined (sub-)networks: Within those, homophily brings about coherence in regard to the content shared and enables the pursuit of consistent content agendas among users (Meraz & Papacharissi, 2016).

Due to homophily, journalists tend to interact more with colleagues from, e.g., their own news outlet, the same locality, or the same topical field on Twitter (Hanusch & Nölleke, 2019). This happens in various ways, and by making use of the affordances Twitter provides. Hanusch and Nölleke (2019) show that this can include mentioning other journalists (by adding “@username”) in a tweet, but also retweeting the tweets of other journalists. While these communicative events indicate homophily, one might wonder whether the same principle applies regarding the following behavior of journalists as well. In the face of limited empirical knowledge (see, e.g., Deprez & Van Leuven, 2018 and Johnson et al., 2018), the actual extent to which this is the case remains open. Thus, as a first step towards an understanding of how journalists might rely on other journalists in their Twitter use, I am asking:

RQ: Out of all the Twitter accounts followed by journalists, what is the share of Twitter accounts by other journalists?

Content curation in social media

Even with knowledge about whom journalists follow, it still remains open how the composition of the respective followee networks might define the selection of content journalists get to see on Twitter. At this point, further insights are needed from research into general social media use. Some of the mechanisms that have been described for the usage of social media platforms may apply to journalists as well.

The question of what leads to encountering content on social media has received broad attention, with research focusing on, among other aspects, the incidental exposure to the news (Fletcher & Nielsen, 2018), or the influence of individual perceptions towards finding the news on social media (Gil de Zúñiga, Weeks, & Ardèvol-Abreu, 2017). Accordingly, a number of factors play a role in the exposure to and the effects of what one gets to see on social media. In this situation, a social media platform such as Twitter is not, e.g., an arbitrary tool for exploring topics, but may rather be conceived as aligning with an individual user’s preferences and ways of usage. Thorson and Wells (2016) provide an account of how the exposure to (social) media can be defined by different (co-occurring) “curated flows”. Among those, social curation seems especially pertinent in the context of journalists’ Twitter diet and against the backdrop of the extent to which they use Twitter for interacting with other journalists.

Social curation, as described by Thorson and Wells (2016), builds on the idea that one’s social network has an influence on the selection of content someone is confronted with. This includes friends and family, or other acquaintances, who, as in the two-step flow model of communication, are thought to forward content to its recipients, serving as intermediaries in the flow of information (Lazarsfeld, Berelson, & Gaudet, 1948; Thorson & Wells, 2016). Thorson and Wells (2016) explain that in this situation, two sources of homogenization (of content encountered) are possible: First, the upfront choice for a certain network, including the members of this network. Secondly, the choices made by the members in one’s network.

One might conclude that for a journalist to use Twitter indicates a choice for a network that is particularly populated by users of the same profession. Moreover, affordances such as following each other or retweeting each other’s content might allow for even further emergence of and encapsulation in “niche networks” (Klinger & Svensson, 2015, p. 1250). Regarding the content shared in those networks, it is not guaranteed, though, that it is representative of the interests of the network’s members in equal parts. Instead, Thorson and Wells (2016) suggest that the network logic described by Klinger and Svensson (2015) could be at play as well. According to this logic, it is particularly the content with the potential for becoming successful in a network that will be promoted by the users of a network. Following the authors, this emphasizes the importance of the preferences of users for content distribution in a network. In this context, Klinger and Svensson refer to the network users as being “like-minded” (2015, p. 1246): successful spreading of content in social media networks requires some level of familiarity and closeness among users.

In the case of journalists, who share a profession and show homophily, this could mean the following: It is likely that the content they are exposed to on Twitter depends on the composition of their personal network. Furthermore, it may be expected that this content is tailored towards the particular interests and preferences of the users in this network. As a consequence, the content shared by the followees of journalists might not at all constitute a balanced source of information.

Research objectives

Building on the insights gathered, it is fair to assume that content does not pass to journalists on Twitter in unfiltered ways, but is filtered and directed by both the workings of the networks in question, and the interests and preferences of the users present. Moreover, given homophily and self-reinforcing sharing behaviors, it stands to reason that networked gatekeeping might result in the homogenization of the content shared within the networks in question.

These theoretical findings lead to a couple of assumptions. The curative effects of networked gatekeeping might determine (some of) the content (and the sources thereof) that journalists consult or get influenced by. If journalists indeed show a tendency to predominantly follow each other, this could eventually be related to a rather restricted, homogenous selection of sources. This could be stronger for those journalists who follow other journalists to a higher degree. Therefore, I hypothesize:

H1: The higher the share of other journalists among journalists’ Twitter followees, the less diverse the selection of sources will be that journalists may potentially encounter in their Twitter timelines referred to by their followees.

Furthermore, when it comes to interactions with sources referred to by journalists’ followees, the network logic at work may have an attenuating effect on the diversity of those sources in subsequent sharing. This follows from the assumption made by Thorson and Wells (2016), based on the concept of network logic (Klinger & Svensson, 2015). Among the content, and herein, the sources provided by their friends, it could be that journalists select that kind of content that will eventually “make the cut” for popularity on Twitter. Regarding tweeting behavior linking to other Twitter users’ content (e.g., in the form of retweeting or replying), we already know that journalists preferably reshare their own kind (Hanusch & Nölleke, 2019). Retesting this finding and restricting it to followees only, I hypothesize:

H2: When journalists re-share sources referred to by their followees, they will predominantly re-share those referred to by other journalists.

Methodology

This study builds on the tweeting activity of a sample of journalists affiliated with five newspapers from the Netherlands: Het Financieele Dagblad, NRC, De Telegraaf, Trouw, and de Volkskrant. More specifically, two aspects in journalists’ activity on Twitter were of interest: the following of other Twitter accounts, and tweeting, i.e., sending tweets. The Twitter accounts followed by the journalists (their ‘followees’) were restricted to those that represent an individual; this selection was further split into a set of accounts that may be related to a journalist or not, respectively. For both the journalists in the initial sample, and the followees, tweets sent within a defined time period were extracted from the Twitter API. Diversity of sources was then operationalized by the diversity of the domain names of URLs shared in the tweets by the followees. Finally, statistical analyses were conducted in order to address the research question and the two hypotheses stated above.
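The operationalization of source diversity via domain names can be illustrated with a minimal sketch. It assumes that the URLs are already available in expanded form, and it uses the number of distinct domains and Shannon entropy purely as illustrative diversity measures; the measure actually used in this study is not implied by the sketch. The function names and the example URLs are hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def extract_domain(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_counts(urls):
    """Count how often each domain occurs among the shared URLs."""
    return Counter(d for d in map(extract_domain, urls) if d)


def shannon_diversity(counts):
    """Shannon entropy (natural log) of a domain frequency distribution.

    Illustrative assumption: entropy is one possible diversity measure.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical URLs shared by the followees of one journalist.
urls = [
    "https://www.nrc.nl/nieuws/a",
    "https://www.nrc.nl/nieuws/b",
    "https://www.volkskrant.nl/c",
    "https://example.org/report",
]
counts = domain_counts(urls)
print(len(counts), round(shannon_diversity(counts), 3))
```

Reducing each URL to its host means that many articles from the same outlet count as a single source, which matches the idea of treating the domain name as the source.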

Sample and data collection

The Netherlands was chosen as country of origin of the journalists observed to set the study apart from previously studied samples from Belgium (Deprez & Van Leuven, 2018; Johnson et al., 2018), Australia (Hanusch & Bruns, 2017; Hanusch & Nölleke, 2019), or the US (McGregor, 2019; Molyneux & Mourão, 2019). The time period of analysis was set to four months, from October 19, 2019, to February 18, 2020, allowing for a reasonable amount of tweeting activity to be included within the restrictions set by the data source, the Twitter API, as well as the time available for the data queries.

In order to obtain a broad sample of Dutch journalists with an active Twitter presence, I focused on nationwide newspapers. This choice was meant to ensure that the sources shared on Twitter by the journalists are not bound by regional relevance, which might have limited their comparability.

The approach described by Hanusch and Bruns (2017) for identifying the Twitter accounts of journalists relies on so-called ‘Twitter lists’ (How to use Twitter Lists, n.d.): These lists can be created by any Twitter account and, put simply, combine other Twitter accounts on the basis of some criterion of classification. For example, the Twitter account of a news outlet may provide a list of journalists affiliated with it. This was also the case with the five newspapers selected for this study. In contrast to the procedure by Hanusch and Bruns (2017), however, I could not refer to additional sources of information, such as a registry of journalists. Once the Twitter lists by the Dutch newspapers had been identified, the user ids and account information of the accounts included were scraped with the Twitter API in early February 2020. A manual check filtered out any account that did not actually represent a journalist, as evident from the account user description. Tweets sent by the journalists (including original tweets, retweets, and replies) were queried from the Twitter API on February 19, 2020. In total, 636 accounts by journalists could be retrieved in this way; together, they sent 158,470 tweets in the time period defined.

Next, the followees of the journalists from the sample were collected with the Twitter API on February 21-22, 2020. The query failed in the case of three accounts, reducing the set of journalists to 633. Subsequently, detailed account information for the accounts followed was queried from the Twitter API. Then, tweets sent by these accounts were queried from the Twitter API, with queries running from February 19, 2020, until May 03, 2020. This included the original tweets, retweets, and replies sent by a user.2 Tweeting activity was again limited to the time period defined. Furthermore, getting these tweets was subject to the restrictions imposed by the Twitter API, according to which a maximum of 3,200 tweets can be obtained per account. This required two additional criteria: The account of a followee was only included if it had either tweeted fewer than 3,200 tweets overall (as indicated by the ‘statuses-count’ in the account meta data), or if the earliest tweet found in the collected data dated before the time period of analysis started (i.e., October 19, 2019).3 This resulted in 112,199 followees. The steps for identifying Twitter accounts by a person described in the next part further reduced this number to a final total of 49,926.

2 The Twitter API distinguishes between one’s home timeline, i.e., the tweets a user gets to see on the start page, the mentions timeline, i.e., the collection of tweets in which a user was mentioned, and the user timeline, i.e., the collection of all tweets sent by a user (Get Tweet timelines, n.d.).
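The two inclusion criteria for followee accounts described above can be sketched as a simple filter. This is an illustration only, not the code used for this study; the function name, the field names of the example records, and the example accounts are hypothetical.

```python
from datetime import date

API_TWEET_LIMIT = 3200             # maximum number of tweets retrievable per account
PERIOD_START = date(2019, 10, 19)  # first day of the period of analysis


def covers_period(statuses_count, earliest_collected):
    """Return True if the retrieved timeline reaches back to the period start.

    statuses_count: total number of tweets the account has sent overall.
    earliest_collected: date of the oldest retrieved tweet, or None if no
    tweets were retrieved.
    """
    # Criterion 1: the account stayed below the API limit, so nothing is cut off.
    if statuses_count < API_TWEET_LIMIT:
        return True
    # Criterion 2: the oldest retrieved tweet predates the period of analysis.
    return earliest_collected is not None and earliest_collected < PERIOD_START


# Hypothetical followee records.
followees = [
    {"name": "a", "statuses_count": 150, "earliest": date(2015, 3, 1)},
    {"name": "b", "statuses_count": 48000, "earliest": date(2019, 9, 30)},
    {"name": "c", "statuses_count": 48000, "earliest": date(2020, 1, 5)},
]
kept = [f["name"] for f in followees
        if covers_period(f["statuses_count"], f["earliest"])]
print(kept)
```

Account "c" is dropped because its 3,200 most recent tweets do not reach back to the start of the period, so its activity within the period would be incomplete.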

Classification of Twitter accounts

Among the accounts followed by the journalists, any kind of account could be present: automated bot accounts, accounts by organizations, news outlets, individual persons, and, likely as well, accounts by other journalists. In order to address the research objectives of this study, it was necessary to identify accounts that can be related to an individual person. Secondly, among the accounts by a person, it was necessary to find those that can be related to a journalist.

In order to facilitate both manual and automatic steps of classification, accounts needed to have a detectable identity. Therefore, only those with an existing user description written in either English, Dutch, German, or French were considered. The langid-Python package was used to identify the language of user descriptions (Lui & Baldwin, 2012). From the accounts that remained after this pre-selection, I drew random samples of user descriptions to be coded manually as references for the automated approaches to follow. This included a sample of 1,000 user descriptions for each of the four languages considered, i.e., 4,000 in total. The coding followed a single criterion: Do the terms in the user description imply that this account represents a person? If answered positively, user descriptions were labelled with ‘1’, if answered negatively with ‘0’.

³ Due to the restriction of the Twitter API, in a large sample of Twitter accounts some of the accounts might never reach a count of 3,200, while accounts with high tweeting activity may reach this number within a short period of time. In addition to that, the Twitter API restricts the number of queries that may be sent, which affects the time it takes for retrieving data for a large sample of accounts. Since in this study, querying the user timelines of the accounts followed by the journalists spanned more than two months, the user timelines retrieved towards the end of the queries might well have amounted to 3,200 tweets, but it was possible for the oldest of these tweets to have been sent only after the queries started. Thus, it would not have been reasonable to consider respective accounts any longer.
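The language pre-selection described above can be illustrated with a minimal sketch. It is not the code used in this study: the account structure (a dictionary with a `description` key) and all function names are assumptions made for illustration. The only library call relied upon is `langid.classify`, which returns a pair of language code and score; the language detector is passed in as a parameter so that another detector could be substituted.

```python
from typing import Callable, Iterable, List

# Languages retained in the pre-selection (ISO 639-1 codes).
ALLOWED_LANGUAGES = {"en", "nl", "de", "fr"}


def detect_with_langid(text: str) -> str:
    """Return the language code that langid.py predicts for `text`."""
    import langid  # third-party package, imported only when actually used

    language, _score = langid.classify(text)
    return language


def preselect_accounts(
    accounts: Iterable[dict],
    detect: Callable[[str], str] = detect_with_langid,
) -> List[dict]:
    """Keep accounts with a non-empty description in an allowed language.

    The 'description' key is a hypothetical field name for the user
    description of an account.
    """
    kept = []
    for account in accounts:
        description = (account.get("description") or "").strip()
        if description and detect(description) in ALLOWED_LANGUAGES:
            kept.append(account)
    return kept
```

Accounts without a description are dropped before any language detection takes place, which mirrors the requirement of a detectable identity.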

Concerning the automated classification, I combined two different approaches. First, I trained a Logistic Regression classifier based on each of the four samples coded manually, as described. In this case, the best result was achieved when combining all four samples into a single training set and training a Logistic Regression classifier accordingly, with an accuracy of 86% and an AUC⁴ (area under the curve) of 0.85. The reason for that might be that some user descriptions use multiple languages, so that increasing the sample size increases the number of cases the classifier is trained on.

In addition to that, I applied ‘Named Entity Recognition’ (short: NER). With these algorithms it is possible to infer the kind of object that a name represents. Common object types can be, e.g., organizations, persons, or geographical regions. Applied to the user names of Twitter accounts, NER can give an indication of whether the user name represents a person or another type of entity. I used the NER functionality in the spaCy module, based on the en_core_web_lg-2.2.5 statistical model.⁵ Compared to the combined sample of 4,000 user descriptions obtained from the manual coding described above, NER returned an accuracy of 79% and an AUC of 0.77. Finally, combining both approaches and comparing the result with the manually coded sample of 4,000 user descriptions improved the result further, resulting in an accuracy of 90% and an AUC of 0.92. Based on these results, although restrictive, it seemed reasonable to further consider only accounts that are classified as a person by both NER and the trained classifier. Accordingly, the two approaches were applied to the full set of Twitter accounts.
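The decision rule of keeping only accounts that both approaches classify as a person can be sketched as follows. This is an illustration under assumptions, not the original implementation: the function names are hypothetical, and `nlp` stands for a loaded spaCy pipeline (e.g., obtained with `spacy.load("en_core_web_lg")`), whose documents expose recognized entities via `doc.ents` and their type via `ent.label_`. The label `PERSON` applies to the English models; other spaCy models may use a different label.

```python
from typing import Callable, List, Sequence


def name_is_person(user_name: str, nlp: Callable) -> bool:
    """True if the NER pipeline tags any entity in the user name as a person."""
    doc = nlp(user_name)
    return any(ent.label_ == "PERSON" for ent in doc.ents)


def combine_both(ner_labels: Sequence[int], clf_labels: Sequence[int]) -> List[int]:
    """Label an account '1' only if NER and the classifier both say '1'."""
    return [int(a == 1 and b == 1) for a, b in zip(ner_labels, clf_labels)]


def accuracy(predicted: Sequence[int], coded: Sequence[int]) -> float:
    """Share of predictions matching the manually coded labels."""
    matches = sum(p == c for p, c in zip(predicted, coded))
    return matches / len(coded)
```

Requiring agreement of both approaches trades recall for precision: fewer accounts are retained, but those retained are less likely to be organizations or bots.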

In order to identify Twitter accounts that might represent a journalist, I trained another Logistic Regression classifier. I took a sample (n = 1,500) from the accounts by a person identified with the steps above, with all four languages represented. This time, the coding followed two selection criteria: 1. Does the user description include terms that clearly refer to a journalistic profession (such as ‘journalist’, ‘correspondent’, ‘reporter’)? and/or 2. Does the user description indicate affiliation with a news outlet (such as a description like ‘covering politics for @bbc’)? Again, if answered positively, user descriptions were labelled with ‘1’, if answered negatively with ‘0’. The classifier trained on the manually coded sample achieved an accuracy of 93% and an AUC of 0.87. Subsequently, this classifier was applied to the whole set of Twitter accounts identified earlier as accounts by a person.

⁴ The AUC score indicates how far results obtained with the classifier deviate from a random classification: While a score of 1 would be perfect, a score of 0.5 would mean only a random chance of being correct (Trilling, Tolochko, & Burscher, 2017).

⁵ A description of Named Entity Recognition with spaCy can be found here:
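To make the AUC values reported for the classifiers more tangible, the score can be computed as the probability that a randomly chosen positive case receives a higher classifier score than a randomly chosen negative case, with ties counted as one half. The following pure-Python sketch illustrates this reading; the function name and the input format (manually coded labels and predicted scores) are assumptions, and in practice a library routine would typically be used instead.

```python
from itertools import product
from typing import Sequence


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: manually coded classes (1 = positive, 0 = negative).
    scores: predicted probabilities (or any ranking score) per case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for pos, neg in product(positives, negatives):
        if pos > neg:
            wins += 1.0   # positive case ranked above negative case
        elif pos == neg:
            wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

A classifier that assigns the same score to every case obtains 0.5 under this definition, which corresponds to the random baseline mentioned in the footnote on the AUC.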

Inferring content diversity

For each account of a journalist, the diversity of sources this journalist might have been confronted with in the user timelines of the followees was inferred. The corresponding measure was based on the URLs shared in the tweets of the followees: For each Twitter account by a journalist, all URLs occurring in the tweets of the respective followees were combined into one list. Then, I took into account the domain names in the URLs, assuming that those return more unified results than full URLs.⁶ It can be expected that large sets of (unique) URLs show great variety, especially if taken from various Twitter accounts. Focusing on domain names reduces this variety, since many different URLs can point to the same domain.
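Reducing a URL to its domain name, understood here as the compound of second-level and top-level domain, can be sketched with the standard library. The function name is hypothetical, and the rule of keeping the last two labels of the hostname is a simplifying assumption: it treats multi-part suffixes such as ‘co.uk’ as a domain, which a lookup against a public suffix list would avoid.

```python
from urllib.parse import urlsplit


def domain_name(url: str) -> str:
    """Return the last two labels of the hostname, e.g. 'zorgkrant.nl'."""
    parts = urlsplit(url)
    if not parts.netloc:
        # URLs without a scheme are re-parsed so that the first
        # segment is recognized as the host rather than as a path.
        parts = urlsplit("//" + url)
    host = (parts.hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    return ".".join(labels[-2:])
```

Applied to every URL in a journalist's list, this maps different articles from the same outlet to one and the same entry.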

For each list of domain names, the source diversity (Nwala, Weigle, & Nelson, 2018)⁷ accounts for the number of unique domain names and the total number of domain names (including duplicates). It is calculated with the following formula (Nwala et al., 2018, p. 68):

$$d = \frac{U - 1}{C - 1}$$

(In this case, $d$ represents the source diversity based on domain names, $U$ refers to the number of unique domain names, and $C$ stands for the total number of domain names included. The result is normalized.) An index of ‘0’ indicates that either only one domain name was present overall, or that among two or more domain names in a list, all domain names are the same. An index of ‘1’ indicates that of all the domain names present, none is a duplicate of another domain name (Nwala, 2018b).

⁶ An example taken from the dataset may illustrate the difference between URLs and domain names: In the URL string zorgkrant.nl/wetenschap-en-onderwijs/11225-4d-techniek-laat-de-bloeddoorstroming-zien/, zorgkrant.nl is the domain name. Strictly speaking, what is referred to here as domain name is the compound of the top-level domain (‘nl’), and the second-level domain (‘zorgkrant’).

⁷ While other measures of diversity exist as well (e.g., Simpson’s diversity index), the one used here was chosen since it can be applied not only to URLs, but to domains (and hostnames) as well. Also, in contrast to, e.g., Simpson’s diversity index, it emphasizes absolute diversity by focusing only on the ratio of unique to overall URLs present, not considering the distribution of unique URLs (Nwala, 2018a).
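The behavior of the index described above (0 if all domain names are identical, 1 if none is repeated) is reproduced by the ratio of unique domain names minus one to the total number of domain names minus one. The following sketch implements this ratio; treating it as equivalent to the formula in Nwala et al. (2018) is an assumption, as are the function name and the choice to return 0 for lists with fewer than two entries.

```python
from typing import Sequence


def source_diversity(domains: Sequence[str]) -> float:
    """Normalized diversity: (unique - 1) / (total - 1), between 0 and 1."""
    total = len(domains)
    if total <= 1:
        # With zero or one domain name the ratio is undefined;
        # following the description in the text, 0 is returned.
        return 0.0
    unique = len(set(domains))
    return (unique - 1) / (total - 1)
```

For example, the list ‘nos.nl’, ‘nos.nl’, ‘nrc.nl’ contains two unique names among three entries, giving (2 - 1) / (3 - 1) = 0.5.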

The source diversity index was calculated for each of the 633 journalists based on Nwala (2018a). Prior to that, shortened URLs indicated by the strings ‘bit.ly’, ‘tinyurl’, or ‘goo.gl’ were removed. Also, all URLs including the string ‘https://twitter.com’ were removed, since those occur in the context of retweets of the kind that Twitter calls ‘quote tweets’ (Tweet objects, n.d.). As such, they only indicate content within the Twitter network. In this part of the analysis, only tweets containing a URL were considered. This included a total of 2,577,333 URLs, occurring in 3,457,351 tweets sent by 45,674 followees (for some followees, no tweets containing a URL were found).
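The URL filtering described above amounts to a simple substring check. A minimal sketch, assuming each tweet is available as a dictionary with a hypothetical `urls` key holding the expanded URLs it contains:

```python
from typing import Iterable, List

# Substrings that mark a URL for removal: link shorteners and
# links that only point to content within Twitter itself.
EXCLUDED_MARKERS = ("bit.ly", "tinyurl", "goo.gl", "https://twitter.com")


def keep_url(url: str) -> bool:
    """True if the URL contains none of the excluded substrings."""
    return not any(marker in url for marker in EXCLUDED_MARKERS)


def collect_urls(tweets: Iterable[dict]) -> List[str]:
    """Combine the retained URLs of all tweets into one list."""
    urls: List[str] = []
    for tweet in tweets:
        urls.extend(url for url in tweet.get("urls", []) if keep_url(url))
    return urls
```

Tweets without any URL contribute nothing to the list, so restricting the analysis to tweets containing a URL follows implicitly.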

Final statistical analysis

The RQ was addressed by calculating, for each journalist from the initial sample, the ratio of followed accounts that are by a journalist to all personal accounts followed. This included only Twitter accounts with tweeting activity in the time period of analysis.
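The ratio computed for the RQ can be sketched as follows. The field names `is_person`, `is_journalist`, and `was_active` are hypothetical; they stand for the outcomes of the two classification steps and for tweeting activity in the period of analysis.

```python
from statistics import mean
from typing import Iterable, List, Optional


def journalist_share(followees: Iterable[dict]) -> Optional[float]:
    """Share of journalists among the active personal accounts followed."""
    persons = [f for f in followees if f["is_person"] and f["was_active"]]
    if not persons:
        return None  # no personal accounts: the share is undefined
    journalists = sum(1 for f in persons if f["is_journalist"])
    return journalists / len(persons)


def mean_share(shares: Iterable[Optional[float]]) -> float:
    """Average share across journalists, skipping undefined shares."""
    defined: List[float] = [s for s in shares if s is not None]
    return mean(defined)
```

Accounts by organizations, bots, and inactive accounts do not enter the denominator, so the share refers to personal accounts only.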

The analysis conducted for H1 was again restricted to tweeting activity in the time period defined and to tweets originating from followees that are by a person. Here, two variables were taken into account: the share of other journalists among the accounts a journalist follows on Twitter, as derived for the RQ, and the diversity index calculated for each journalist on the basis of the domains of the URLs in the tweets sent by the followees. Then, Spearman’s rank-order correlation was calculated for the two variables.⁸
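Spearman’s rank-order correlation is the Pearson correlation of the ranks of two variables, with tied values receiving the average of the ranks they span. The following pure-Python sketch illustrates the computation; the function names are hypothetical, no p-value is computed, and in practice a routine such as `scipy.stats.spearmanr` would normally be used.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks of x and y.

    Raises ZeroDivisionError if one variable is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Here `x` would hold the share of journalists among the followees and `y` the diversity index, with one value per journalist in each list.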

For testing H2, I took into account the retweets by the journalists in the sample. These retweets needed to include a URL, and the account authoring the original tweet behind the retweet needed to be a followee of the retweeting journalist. Both retweets and original tweets were exclusively sent in the time period defined. These restrictions limited the number of journalists present to 383. I split the retweets sent by the journalists into two groups, depending on whether the account of the original tweet was by a journalist, or not. The counts of retweets per journalist and group were subsequently compared with a Wilcoxon signed-ranks test.⁹
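The logic of the Wilcoxon signed-ranks test on paired retweet counts can be sketched as follows: differences of zero are dropped, the absolute differences are ranked, and the smaller of the two rank sums is compared with its expectation under the null hypothesis. This is a simplified illustration under assumptions, not the procedure used by any particular statistics package: the function names are hypothetical, the normal approximation is used without a correction for ties in the variance, and no p-value is computed. A library routine such as `scipy.stats.wilcoxon` would normally be preferred.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def wilcoxon_signed_rank(
    first: Sequence[float], second: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (W+, W-, z) for paired samples, using a normal approximation."""
    diffs = [a - b for a, b in zip(first, second) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    expected = n * (n + 1) / 4
    spread = sqrt(n * (n + 1) * (2 * n + 1) / 24)  # no tie correction (simplification)
    z = (min(w_plus, w_minus) - expected) / spread
    return w_plus, w_minus, z
```

Here `first` would hold, per journalist, the count of retweets of tweets by journalists and `second` the count of retweets of tweets by other persons. Because the smaller rank sum enters the statistic, z is never positive in this sketch.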

Results

The RQ asked for the share of Twitter accounts that can be related to a journalist among all the Twitter followees of a journalist that are by a person. Considering all the journalists in the initial sample (n = 633), on average 38.07% of the personal accounts followed by them could be related to another journalist. The maximum share observed was 82.61%, and the minimum share 0.00% (SD = .12).

H1 assumed that higher shares of journalists among the personal Twitter accounts followed could be related negatively to the diversity of sources that journalists may encounter. Indeed, there is a highly significant, weak negative correlation (rs(633) = -.19, p < .001) between the share to which a journalist follows other journalists, and the diversity of domain names shared in tweets by the followees. According to H1, this negative correlation was to be expected: The higher the share of accounts by other journalists followed, the lower would be the diversity of sources shared by the followees. Accordingly, the result provides weak support for H1.

⁸ A Shapiro-Wilk test for normality had indicated that the values for the diversity measure are distributed non-normally (W(633) = .87, p < .001), hence, I opted for a non-parametric correlation test.

⁹ The two groups of retweet counts were dependent, since both originate from the same journalist and add up to the full count of retweets per journalist. According to a Shapiro-Wilk test, the differences between the two counts were non-normally distributed (W(383) = .72, p < .001), requiring a non-parametric test equivalent to the paired t-test (McDonald, 2014, p. 187).

H2 expected that journalists predominantly re-share content that originates from accounts by other journalists. In other words: I expected that content from other journalists is re-shared to a higher extent than the content from non-journalists. Comparing the two groups in a Wilcoxon signed-ranks test revealed that the median difference between retweet counts for tweets by journalists (Mdn = 2.00) and retweet counts for tweets not by journalists (Mdn = 1.00) was significantly greater than zero, Z = -6.44, p < .001. This result supports H2, meaning that journalists re-share content originating from other journalists more so than content from other accounts by a person. However, the share of retweets of other journalists among all retweets correlates positively (albeit weakly) with the share to which journalists follow the accounts of other journalists (rs(383) = .17, p < .001). Thus, a weak association exists between the extent to which other journalists are followed, and journalists being retweeted more.

Conclusion & Discussion

Results show that journalists make up a considerable share of the accounts (that can be related to a person) followed by journalists, albeit not constituting a majority. Accordingly, there is still considerable leeway for variety among the accounts followed by journalists. However, it should not be forgotten that in some cases, journalists do primarily follow other journalists, indicating that the extent of homophily in following behavior can vary.

This variation might not seem concerning at first sight. The result returned for H1 hints at a detrimental outcome regarding source diversity, though. Given the weak correlation found, other factors might play a role as well. Still, the link between the two variables exists and may hint at a particular form of gatekeeping, meaning a pre-filtering of sources that sets in even before the journalists themselves could make a selection. Actually, their choice has taken place earlier, when they chose to follow certain accounts on Twitter. The composition of this collection of followees has eventually led to a gatekeeping to which the journalists themselves became subject. This situation may be considered in line with the ‘networked gatekeeping’ described by Meraz and Papacharissi (2016).

The result obtained for H2 provides an additional perspective on the pre-filtering of sources observed: As far as journalists tend to share tweets by other journalists more, they may carry forward the already limited diversity apparent in the tweets by their followees. This outcome, though, may be more concerning when the followees are more homogeneous. Furthermore, the finding that journalists tend to retweet each other more than other (personal) Twitter users is in line with the observations by Hanusch and Nölleke (2019) and Molyneux and Mourão (2019). In contrast to them, the present study focused on retweets of tweets including a URL only, placing emphasis on the sharing of sources. Also, the result shows that such retweeting behavior applies to retweets of tweets sent exclusively by journalists’ followees: Even if a person’s Twitter account belongs to a journalist’s followees, it may be retweeted less simply because the person in question is not a journalist. Accordingly, journalists appear to have some preference for content shared by someone from their own profession. This could hint at the presence of the network logic described by Klinger and Svensson (2015). Arguably, though, a Twitter user may not necessarily distinguish between tweets from a followee and tweets from accounts not followed when deciding which tweets to retweet.

Finally, the positive correlation found between the share of journalists among the followees, and the share of retweets originating from other journalists is not surprising: If journalists are followed more, they should be expected to be retweeted more. That this relation is not very strong, though, tentatively provides further support for assuming a network logic (Klinger & Svensson, 2015) to be at play: For a journalist, it might seem most convenient to share content from someone of the same profession.

Implications

This study tried to expand scientific perspectives on journalists’ behavior on Twitter by considering the potential impact that using Twitter has on journalists themselves. In particular, this concerns the question of how journalists are affected as network actors among other network actors. As the results presented have shown, journalists can indeed be seen as embedded in this broader context of their network, which thus should not be disregarded.

The results call on journalists to be sensitive about their selection of accounts to follow. Exclusively following other journalists might eventually limit the diversity of sources encountered, an outcome that may not be desired when Twitter is used as a sourcing tool.

Limitations

The study presented was subject to a number of methodological limitations. To begin with, the limitations for data retrieval imposed by the Twitter API meant that user timelines could not be recreated completely. Accounts with high tweeting activity had to be omitted. Future research projects aiming to cover the timelines of larger sets of users over a certain period of time should mitigate these limitations by first getting an indication of the tweeting activity of those accounts. Timeline requests should then be adjusted to that, e.g., by recurrently querying timelines for users with high tweeting activity. Instead of exhausting the full limit of statuses, though, those requests should opt for fewer tweets each. On the other hand, any research design needs to be aware of the trade-off between the effort of getting complete data of Twitter activity (if completeness can ever be assumed in that context), and designing a dataset from which meaningful insights can still be inferred.

Related to that is the low number of retweets obtained from the accounts of the journalists in the initial sample. With a larger timeframe of analysis, more retweets by those accounts could have been observed. This would not only have validated results further, but might also have yielded data suitable for more sophisticated statistical analyses. On the other hand, a larger initial sample would have considerably increased the number of followees and respective timelines.

The classifiers used in this study were not able to correctly identify all accounts by a person or a journalist (or not). While this is an unavoidable issue when applying a machine learning method, the classifiers still achieved acceptable test results. Nonetheless, it can be expected that there may have been even more journalists among the followees, not only due to false negatives, but also since the classifier required user descriptions to be available. For the sake of more exact results, it might be necessary to identify the accounts (by journalists, at least) with the help of registries listing journalists, as used by Hanusch and Bruns (2017) or Molyneux and Mourão (2019).

Summary & Outlook

This study set out to understand how homophily among journalists on Twitter could affect the diversity of sources that they encounter, as well as how they interact with those. Results reveal the presence of homophily in journalists’ following behavior on Twitter. Cases of comparatively high homophily can lead to less diversity in regard to the sources that may be encountered by journalists. Furthermore, as far as journalists are more likely to share content by other journalists, this may be considered an additional constraint on the breadth of the content (and the sources) shared in respective networks. While findings could provide only a limited quantification of the impact of sources to be encountered on Twitter by journalists, future research may further investigate the ‘early’ phase of sourcing activities of journalists. On the part of journalists (and of any Twitter user, for that matter) and for the sake of diversity of sources to encounter on Twitter, one remark remains: Homogeneous collections of followees cannot be advised.

References

Barnard, S. R. (2016). “Tweet or be sacked”: Twitter and the new elements of journalistic practice. Journalism, 17(2), 190–207. https://doi.org/10.1177/1464884914553079

Bouvier, G. (2019). How Journalists Source Trending Social Media Feeds: A critical discourse perspective on Twitter. Journalism Studies, 20(2), 212–231. https://doi.org/10.1080/1461670X.2017.1365618

Cision. (2017). 2017 Global Social Journalism Study. Retrieved from https://www.cision.com/content/dam/cision/Resources/white-papers/SJS_Interactive_Final2.pdf

Deprez, A., & Van Leuven, S. (2018). About Pseudo Quarrels and Trustworthiness. Journalism Studies, 19(9), 1257–1274. https://doi.org/10.1080/1461670X.2016.1266910

Fletcher, R., & Nielsen, R. K. (2018). Are people incidentally exposed to news on social media? A comparative analysis. New Media and Society, 20(7), 2450–2468. https://doi.org/10.1177/1461444817724170

Follow, search, and get users. (n.d.). Retrieved from https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-friends-ids

Get Tweet timelines. (n.d.). Retrieved from https://developer.twitter.com/en/docs/tweets/timelines/overview

Gil De Zúñiga, H., Diehl, T., & Ardèvol-Abreu, A. (2018). When Citizens and Journalists Interact on Twitter. Journalism Studies, 19(2), 227–246. https://doi.org/10.1080/1461670X.2016.1178593

Gil de Zúñiga, H., Weeks, B., & Ardèvol-Abreu, A. (2017). Effects of the News-Finds-Me Perception in Communication: Social Media Use Implications for News Seeking and Learning About Politics. Journal of Computer-Mediated Communication, 22(3), 105–123. https://doi.org/10.1111/jcc4.12185

pages: A comparative analysis in four Westminster democracies. New Media and Society, 20(4), 1488–1505. https://doi.org/10.1177/1461444817698479

Hanusch, F., & Bruns, A. (2017). Journalistic Branding on Twitter: A representative study of Australian journalists’ profile descriptions. Digital Journalism, 5(1), 26–43. https://doi.org/10.1080/21670811.2016.1152161

Hanusch, F., & Nölleke, D. (2019). Journalistic Homophily on Social Media. Digital Journalism, 7(1), 22–44. https://doi.org/10.1080/21670811.2018.1436977

Hermida, A. (2016). Social Media and the News. In T. Witschge, C. W. Anderson, D. Domingo, & A. Hermida (Eds.), The SAGE Handbook of Digital Journalism (pp. 81–94). London: SAGE Publications Ltd.

How to use Twitter lists. (n.d.). Retrieved from https://help.twitter.com/en/using-twitter/twitter-lists

Johnson, M., Paulussen, S., & Van Aelst, P. (2018). Much Ado About Nothing? Digital Journalism, 6(7), 869–888. https://doi.org/10.1080/21670811.2018.1490657

Klinger, U., & Svensson, J. (2015). The emergence of network media logic in political communication: A theoretical approach. New Media and Society, 17(8), 1241–1257. https://doi.org/10.1177/1461444814522952

Lazarsfeld, P. F., Berelson, B., & Gaudet, H. (1948). The people’s choice: how the voter makes up his mind in a presidential campaign. New York: Columbia University Press.

Lui, M., & Baldwin, T. (2012). langid.py: An Off-the-shelf Language Identification Tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea.

Lukito, J., Suk, J., Zhang, Y., Doroshenko, L., Kim, S. J., Su, M. H., … Wells, C. (2020). The Wolves in Sheep’s Clothing: How Russia’s Internet Research Agency Tweets Appeared in U.S. News as Vox Populi. International Journal of Press/Politics, 25(2), 196–216. https://doi.org/10.1177/1940161219895215

McDonald, J. H. (2014). Handbook of Biological Statistics (3rd ed.). Baltimore, Maryland: Sparky House Publishing.

McGregor, S. C. (2019). Social media as public opinion: How journalists use social media to represent public opinion. Journalism, 20(8), 1070–1086. https://doi.org/10.1177/1464884919845458

McGregor, S. C., & Molyneux, L. (2018). Twitter’s influence on news judgment: An experiment among journalists. Journalism, 21(5), 597–613. https://doi.org/10.1177/1464884918802975

McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology, 27(1), 415–444.

Meraz, S., & Papacharissi, Z. (2013). Networked Gatekeeping and Networked Framing on #Egypt. International Journal of Press/Politics, 18(2), 1–29. https://doi.org/10.1177/1940161212474472

Meraz, S., & Papacharissi, Z. (2016). Networked Framing and Gatekeeping. In T. Witschge, C. W. Anderson, D. Domingo, & A. Hermida (Eds.), The SAGE Handbook of Digital Journalism (pp. 95–112). London: SAGE Publications Ltd.

Molyneux, L. (2015). What journalists retweet: Opinion, humor, and brand development on Twitter. Journalism, 16(7), 920–935. https://doi.org/10.1177/1464884914550135

Molyneux, L., & Mourão, R. R. (2019). Political Journalists’ Normalization of Twitter. Journalism Studies, 20(2), 248–266. https://doi.org/10.1080/1461670X.2017.1370978

Nwala, A. C. (2018a). An exploration of URL diversity measures [Blog post]. Retrieved from https://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html

Nwala, A. C. (2018b). url-diversity [GitHub repository]. Retrieved from https://github.com/

Nwala, A. C., Weigle, M. C., & Nelson, M. L. (2018). Bootstrapping web archive collections from social media. HT 2018 - Proceedings of the 29th ACM Conference on Hypertext and Social Media, 64–72. https://doi.org/10.1145/3209542.3209560

Paulussen, S., & Harder, R. (2014). Social Media References in Newspapers: Facebook, Twitter and YouTube as sources in newspaper journalism. Journalism Practice, 8(5), 542–551. https://doi.org/10.1080/17512786.2014.894327

Shoemaker, P. J., & Vos, T. P. (2009). Gatekeeping Theory. New York: Routledge.

Thorson, K., & Wells, C. (2016). Curated Flows: A Framework for Mapping Media Exposure in the Digital Age. Communication Theory, 26(3), 309–328. https://doi.org/10.1111/comt.12087

Trilling, D., Tolochko, P., & Burscher, B. (2017). From Newsworthiness to Shareworthiness: How to Predict News Sharing Based on Article Characteristics. Journalism and Mass Communication Quarterly, 94(1), 38–60. https://doi.org/10.1177/1077699016654682

Tweet objects. (n.d.). Retrieved from https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

Van Der Meer, T. G. L. A., Verhoeven, P., Beentjes, J. W. J., & Vliegenthart, R. (2017). Disrupting gatekeeping practices: Journalists’ source selection in times of crisis. Journalism, 18(9), 1107–1124. https://doi.org/10.1177/1464884916648095

Von Nordheim, G., Boczek, K., & Koppers, L. (2018). Sourcing the Sources. Digital Journalism, 6(7), 807–828. https://doi.org/10.1080/21670811.2018.1490658

Wihbey, J., Joseph, K., & Lazer, D. (2019). The social silos of journalism? Twitter, news media and partisan segregation. New Media & Society, 21(4), 815–835.