Deanonymisation Attacks: The Illusions of Anonymous Data and User Traffic

Lars Knitel 10019006

Supervisor: Robin Boast

Second reader: Lonneke van der Velden

MA Thesis: New Media and Digital Culture

University of Amsterdam

July 16, 2017

Abstract

Deanonymisation attacks on anonymous data sets and user traffic are of tremendous concern when using any electronic device or application that is or can be connected to the internet. Studies demonstrate the possibility of undoing anonymisation security by comparing anonymised data to even a small amount of auxiliary data. Ubiquitously monitored data such as locations, social network interrelations and web browsing histories are all useful in cross-referencing data sets and can have a significant impact on the deanonymisability of anonymised user data. In addition, measures that people take to prevent eavesdropping on their devices, such as using anonymous internet connections through a virtual private network (VPN), Tor or other tracker-blocking software, are not sufficient in maintaining online anonymity, because tracking strategies are becoming more pervasive and intrusive. Browser “fingerprinting” methods can uniquely identify persons by misusing the hardware and protocols of their devices. Moreover, new fingerprinting techniques have emerged that can capture human characteristics such as how people type (keystroke dynamics) or use their mouse. Browsing the web using a Tor network is one of the most secure ways for users to remain anonymous online, but the network is highly targeted by US intelligence agencies such as the National Security Agency (NSA) and the Federal Bureau of Investigation (FBI), who have achieved successes in the deanonymisation of Tor users and services by abusing the protocols the network relies on.

Keywords. anonymity, deanonymisation, ubiquitous data monitoring, cross-data set analysis, behavioural tracking, biometrics


Contents

1 Introduction 3

2 Deanonymisation and re-identification 4

2.1 Data mining, anonymous data and online anonymity . . . 4

2.2 Anonymous Internet Traffic . . . 5

2.2.1 VPN and Tor . . . 6

2.2.2 Ad Blockers . . . 7

2.2.3 End-to-End Encryption (E2EE) and HTTPS . . . 8

2.3 Deanonymisation attacks . . . 9

2.4 Ubiquitous data . . . 10

2.5 Behavioural tracking . . . 11

2.5.1 Cookies, Evercookies and Cookie Syncing . . . 12

2.5.2 Browser Fingerprinting . . . 13

2.6 Why does this all matter? . . . 13

3 Methodology 14

3.1 The black box syndrome . . . 14

3.2 Black boxing . . . 15

3.3 Opening the black box . . . 16

3.4 How to open the black boxes of deanonymisation? . . . 16

4 Case Study 1: Cross-Data Set Analysis 17

4.1 Personal health records, search queries and movie reviews . . . . 18

4.2 Location based services . . . 20

4.2.1 Deanonymisation of mobility traces . . . 21

4.3 Social networks and browsing-behaviour histories . . . 22

4.4 Cross-device tracking . . . 24

4.4.1 Google and Facebook . . . 25

4.5 No such thing as anonymised data . . . 26

5 Case Study 2: Deanonymising User Traffic 26

5.1 User tracking (Third-party tracking) . . . 27

5.1.1 Browser fingerprinting: side channel information . . . 27

5.1.2 Biometric fingerprinting . . . 29

5.1.3 Keystroke dynamics and mouse movements . . . 30

5.1.4 Vulnerabilities of Tor . . . 31

5.1.5 Future of fingerprinting . . . 32

5.2 Tor Deanonymisation . . . 33

5.2.1 Tor node vulnerabilities . . . 33

5.2.2 Hacking techniques . . . 35

1 Introduction

The advertising ecosystem of the contemporary web, wherein people are continuously and thoroughly tracked everywhere and at any time in exchange for (free) services, contributes to an information database of unprecedented size (Schneier 1; Greenfield 48; Granick 17). Technological development has therefore provided a growing number of pervasive techniques with which to monitor and collect data hidden in even the smallest corners to create a complete profile of people’s online behaviour. Due to the global surveillance disclosures of NSA whistleblower Edward Snowden, the extent to which governments can track citizens has become apparent (Granick 55-6). The foundation of the NSA tracking strategies is revealed in an NSA presentation: “collect it all,” “know it all,” and “exploit it all”; a conceptual framework which acutely demonstrates the current state of governmental control: “[m]odern surveillance is mass surveillance” (Snowden; Greenwald; Granick 145).

Due to the abovementioned developments, people are becoming more aware of the vulnerability of their online presence, reinforcing their desire for anonymity (Madden). However, absolute anonymity is unattainable. Companies or governments only share anonymised (limited) data sets and people might use strategies to remain anonymous on the internet, but deanonymisation techniques are responding to the general public’s ignorance of those methods. Several researchers have illustrated that, by cross-referencing anonymous data with other data sources, they could re-identify the anonymous data source (Su et al. 1; Englehardt and Narayanan 1; Sweeney; Ohm 1723). This becomes easier, as the number of possible identifiers increases significantly if a search is conducted within the colossal web of ubiquitous data that has been tracked and stored by search engines, online social networks and various smart devices (Greenfield 48). Wherever people travel, their devices are continuously monitored and data is created and stored. Re-identifying (anonymous) user traffic is possible through ever-developing online tracking techniques (Nikiforakis 541; Acar et al. 1129; Cao et al. 1). Traditional techniques such as cookies and browser fingerprinting continue to evolve into more advanced versions that are even able to follow users on anonymised internet connections via a virtual private network (VPN) or Tor (Watson; Narayanan and Englehardt 12; Shieh and Uberti; Schleizer). Fingerprint methods are changing into deanonymisation attacks on the underlying hardware of users’ devices, wherein trackers can uniquely profile characteristics of the device’s battery or CPU usage or even create a biometric profile regarding the manner in which users type or move their mouse (Schleizer; Narayanan and Englehardt 11-3; Zhong et al. 117).

This thesis examines the current landscape of deanonymisation techniques intended to reverse the anonymity of limited data sets or re-identify users in anonymised internet traffic. The dissertation is structured as follows. Chapter two first describes anonymised data sets and internet traffic, followed by a brief outline of the deanonymisation techniques which are explained more extensively in the case studies. Furthermore, the ubiquity of both (unobtrusive) data tracking and mining practices and their functions in deanonymisation practices are discussed. The behavioural tracking mechanisms that companies apply to follow users are subsequently listed and analysed. The chapter ends by asking why these data tracking and deanonymisation techniques matter. Chapter three presents the methodology of black box theories required to investigate and indicate the insufficiencies of online anonymisation methods. Chapter four offers a case study in which cross-data set analysis, using different studies in the deanonymisation field, compares data sets to discover similarities and re-identify users in anonymised data sets. This chapter ends by discussing cross-device tracking, a method that aggregates all the collected data sets from the different devices belonging to one user. Chapter five analyses and maps deanonymisation methods that apply even when users themselves employ strategies to prevent behavioural tracking, such as techniques identifying users through the misuse of a device’s underlying hardware to extract technical characteristics or reveal how devices are used (biometrics). Finally, using several cases, this thesis investigates deanonymisation attacks that have been used to break the anonymity of the Tor network.

2 Deanonymisation and re-identification

In order to describe the possible methods of deanonymisation and re-identification, we should first paint a picture of the current landscape of tools that internet users may use to, albeit partly, anonymise their traffic and usage on the network. These tools encapsulate strategies and services such as ad blocking, end-to-end encryption and VPN or Tor network usage. Moving on, I can then describe the concepts of deanonymisation and re-identification that could undo the anonymity of these online data and internet traffic strategies. Subsequently, I discuss processes like data mining and data tracking and the role of ubiquitous data in the trivial, and therefore rapid, expansion of these. Concluding this chapter, I describe the relevance of exposing and understanding deanonymisation and re-identification methods.

2.1 Data mining, anonymous data and online anonymity

To know what deanonymisation attacks are, the concepts of ‘data mining’, ‘anonymous data’ and ‘online anonymity’ must first be explained. Data mining is the practice of collecting data into large online databases and analysing those data sets to acquire new knowledge from them (Witten et al. 4). By recognizing patterns in ‘big’ data you could, the argument goes, describe or predict the behaviour of individuals or groups, gain insights and provide possible solutions to various problems (Witten et al. 4). The process of data mining is automated, which means that it could search which (meta)data belongs to which individual data subject in a digital database and store that data (Witten et al. 4). These digital data subjects could eventually be connected to certain persons (Solove 13). For example, data mining can be useful for a web company to analyse user choices. By finding patterns in people’s behaviour, such as visited web pages, one could find out what products or services the visitors are interested in. When the data demonstrates people are interested in a specific topic, the company is able to anticipate and provide more products or content in that range, which eventually could create more revenue or brand awareness for that company.
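To make the pattern-finding step concrete, the following minimal sketch (in Python, with an invented page-visit log) tallies which topic a visitor looks at most often; it only illustrates the general idea, not any particular company’s mining pipeline.

from collections import Counter

# Hypothetical page-visit log: (user_id, topic of the visited page).
visits = [
    ("u1", "travel"), ("u1", "travel"), ("u1", "sports"),
    ("u2", "finance"), ("u2", "finance"), ("u2", "travel"),
]

def top_interest(log, user):
    """Return the most frequently visited topic for one user."""
    topics = Counter(topic for uid, topic in log if uid == user)
    return topics.most_common(1)[0][0] if topics else None

print(top_interest(visits, "u1"))  # -> "travel"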

Data mining could eventually lead to data sets with all kinds of personal information that could lead back to real individuals. The next paragraph explains that these data sets can be modified in order to provide a ‘sense’ of anonymity. One of the assumptions made by companies and governments is that online data could be anonymous if the data is stripped of any personally identifiable information (PII) (Raghunathan 4-5). This PII could consist of names, (IP) addresses or other identifying data (3). Other anonymisation algorithms could also be used in order to hide the data subject’s source and make it legal to share data sets without the permission of the connected users (El Emam and Arbuckle 3).8 These anonymisation techniques, including k-anonymity, are discussed in chapter 4, “Cross-data set analysis”.

Globally, an increasing number of these open data sets are becoming publicly available, most being made public by governments to stimulate open data initiatives. This allows companies, groups or individual volunteers to analyse the data sets by data mining them (Ohm 1708). For example, the Open Government Directive is an initiative in the United States that pushes federal agencies to publish their data to fulfil the Freedom of Information Act (FOIA) (digitalgov.gov). There is a page with limited data sets from the Department of Health and Human Services (https://www.hhs.gov/open/), with which health care or other academic researchers could gain interesting new knowledge for medical treatments (El Emam and Arbuckle 1-2). It provides significant opportunities to academics, as it would be either difficult or impossible to conduct research without such an abundant source of data. In addition, websites can sell their anonymised transaction data to advertising companies and phone firms provide their data sets to law enforcement agencies (Ohm 1708). Large organisations where security and privacy standards are important could anonymise data sets before sharing them within their organisation. For instance, in order to protect the privacy of their customers’ transactions, bank departments need to anonymise their data before they can share them with marketing departments (Ohm 1708).

2.2 Anonymous Internet Traffic

As explained in the previous section, anonymised data sets facilitate researchers in their studies, involve companies and individuals in research, or allow large companies to share sensitive information internally or externally. This is done while assuming that personal information is safeguarded by traditional anonymisation standards. Another aspect of online anonymity is where users themselves use strategies to hide their internet traffic in order to create a feeling of anonymity. Several means of doing so are employed, such as using VPNs, Tor, HTTPS, end-to-end encryption and ad blockers. The internet provides network connections that could be vulnerable to eavesdropping from whatever adversary: company, government or criminals. With rising user awareness of the fact that the above parties might track their online behaviour for different purposes, many users attempt to find methods that will keep their traffic as anonymous as possible.

2.2.1 VPN and Tor

In order to create a secure internet connection, one can use networks that hide one’s IP address. In this section, I discuss VPNs and Tor, which can provide such security, and the controversies that come into play.

In March 2017, the U.S. Congress voted for a new privacy bill that gives internet service providers (ISPs) the right to sell their users’ data to the highest bidder (Brodkin). This was previously prevented by a bill from the Obama period (Brodkin). A VPN makes sure that the ISP cannot see the websites you visit. It uses a tunnel protocol which provides a security layer when sending data over the internet connection (Ferguson and Hudson 9). The moment the network detects an unsafe connection, it stops and the data is sent back to the server. Subsequently, it tries to find another safe route to deliver the data to the designated destination. Many websites, such as online streaming service Netflix, have banned VPN connections from their websites, though. The company tries to ban internet traffic from other countries, because much of the website’s content is not legally available in the countries the VPN traffic is routed through. Users cannot be sure what VPNs will do with their data, but VPNs will most probably respect their online data better than most ISPs do (Robertson; Finley, VPN). Also, according to the Snowden files, the NSA flags users that indicate they want to use a VPN (Angwin and Tigas; Zetter, NSA). To do this, the NSA uses a computer program called ‘Toygrippe’. Another tool, ‘XKeyscore’, is able to fingerprint both VPN and Tor users. Both computer programs detect when users are visiting information pages or blogs about the secured networks (Zetter, NSA).

The Tor network is another service that can hide an IP address, and thus secure internet traffic. The network, maintained by The Tor Project, is built to encrypt and anonymise internet traffic (torproject.org). Instead of using a single provider like a VPN does, it makes use of a distributed network whereby the traffic is sent through different servers, called ‘relays’ or ‘nodes’, across the globe (Dingledine et al. 1). Data packets are always encrypted between those different relays in the network. After a series of nodes, the traffic arrives at an exit node, which is able to decrypt the final layer and deliver the message to the intended receiver. Tor is the abbreviation for ‘The Onion Router’. The onion symbolises the structure of the service, because the network consists of different encrypted layers akin to an onion. Every node obtains a key to decrypt one layer of the onion and pass it through the circuit. That is also the reason why the routing process of the digital packets is called ‘onion routing’ (1). The IP address could be any address in the world, which provides a secure internet connection and increases anonymity. In fact, the traffic is obscured by noise: all the different IP addresses it goes through (1).
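The layered structure can be illustrated with a minimal sketch. The code below (Python, using symmetric Fernet keys from the third-party cryptography package as stand-ins for the per-relay keys) wraps a message in one encryption layer per relay and lets each relay peel exactly one layer; real Tor negotiates its keys with its own handshake and circuit protocols, so this is only the general idea.

from cryptography.fernet import Fernet  # pip install cryptography

# One symmetric key per relay in the circuit (a simplification of Tor's
# per-hop keys, which are negotiated via its own handshake protocol).
relay_keys = [Fernet.generate_key() for _ in range(3)]

def wrap(message: bytes, keys) -> bytes:
    """Encrypt the message once per relay, innermost layer first."""
    for key in reversed(keys):
        message = Fernet(key).encrypt(message)
    return message

def unwrap(onion: bytes, keys) -> bytes:
    """Each relay peels exactly one layer with its own key."""
    for key in keys:
        onion = Fernet(key).decrypt(onion)
    return onion

packet = wrap(b"GET / HTTP/1.1", relay_keys)
assert unwrap(packet, relay_keys) == b"GET / HTTP/1.1"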


The Tor network was originally developed by the United States Naval Research Laboratory (NRL) in order to provide online anonymity for people working for the government, dissidents of governments, whistle-blowers and journalists (Levine). Tor could offer them a secure connection to the internet to communicate messages that could bring them harm if intercepted by an adversary. It was necessary for the NRL to make the project an open project, because only this method would provide a high enough number of relay points in the network to keep the internet traffic anonymous. If the NRL accommodated the relay points alone, it would be problematic to ensure anonymous communication, because all of its servers would function as the nodes in the network and thus almost all of the traffic would be its own. The network is now maintained by The Tor Project, but is still funded by government institutions, for example by the Department of State Bureau of Democracy, Human Rights, and Labor1.

The Tor Project became controversial because the network offers anonymity to obscure hidden services as well. The Silk Road was one of those illicit services, where internet users could buy drugs, guns or even hire assassins (Palmer; Van Hout and Bingham). The original Silk Road was closed by the FBI in 2013, but after the shutdown alternative websites took over and Silk Road 2.0/3.0 were established (Greenberg). Moreover, one of the main reasons that the Tor network is being attacked by intelligence agencies is that child pornographers have free rein on the network.

The Tor network is a free alternative to most VPNs, but internet connections on the Tor network are slower than VPNs because traffic must be routed through all kinds of networks/nodes (Brodkin). The ‘Tor Browser’ is the most consumer-friendly set-up for users that want to be anonymous on the web. This is a Firefox browser that comes pre-installed with Tor and has preference settings that provide a high degree of online anonymity. Those settings can decrease the comfort of web browsing, because the Tor Browser does not support Flash and, when used properly, JavaScript2 is also disabled. JavaScript is a programming language that enables websites to work or interact with website visitors. The browsing experience is considerably worse because websites currently rely largely on JavaScript code (Finley).

2.2.2 Ad Blockers

A user can also choose to take less drastic measures, such as enabling an ad blocker to prevent eavesdropping while browsing the web. Most of the time, websites include tools on their pages that try to improve the experience of visitors by including engaging functionalities (social sharing, commenting) or offer advertising space to third parties to generate revenue (Gerlitz and Helmond 1352; Schneier 43). At the same time, these tools also function as trackers and leak behavioural data to other companies, not only the website owner itself (Falahrastegar et al. 104). An ad blocker can be standalone software or a browser extension that prevents websites or tracking companies from showing advertisements, or from tracking visitors with technologies that collect data about their behaviour and send (or leak) it to other companies. A browser extension such as uBlock3 is able to block trackers by disabling code that could collect user data on a page. The application also provides its users a list including the third-party trackers that are trying to gather data from them and offers them the option to add websites to a whitelist to support those websites.

1 https://www.torproject.org/about/sponsors.html.en
2 In the Tor browser, JavaScript is enabled by default.
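As a rough illustration of the blocking decision itself, the sketch below (Python, with an invented blocklist) checks whether the host of an outgoing request appears on a tracker list; real blockers such as uBlock use community-maintained filter lists with much richer matching rules.

from urllib.parse import urlsplit

# Hypothetical blocklist of known tracker domains.
BLOCKLIST = {"tracker.example", "ads.example"}

def is_blocked(request_url: str) -> bool:
    """Block a request if its host (or a parent domain) is on the list."""
    host = urlsplit(request_url).hostname or ""
    parts = host.split(".")
    return any(".".join(parts[i:]) in BLOCKLIST for i in range(len(parts)))

print(is_blocked("https://tracker.example/pixel.gif"))   # True -> request dropped
print(is_blocked("https://news.example.org/article"))    # False -> request loaded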

2.2.3 End-to-End Encryption (E2EE) and HTTPS

Anonymity for users can also be provided by web companies themselves through technologies such as end-to-end encryption and HTTPS. End-to-end encryption is not easy to crack (Jesdanun and Liedtke). A Wikileaks article4 from March 2017 reveals that it poses significant difficulties for intelligence agencies to decrypt end-to-end encrypted (E2EE) messaging systems (Newman). Facebook Messenger, iMessage and WhatsApp use this encryption method to protect messages sent with their services. In 2013, when Edward Snowden revealed the US government’s cyber espionage programs, the tech companies started working on these encrypted services in order to provide security and privacy for their users, preventing eavesdroppers from intercepting messages.

How does this work? Only the people who are sending messages to each other can read the messages. The messages are protected with a lock, an encryption key, and only the sender and the receiver have access to the key. Internet providers, telecom providers, or even the company owning the message application itself, are all unable to access the message. There are other encryption systems available, but those require trusting the third parties that process the messages because they could access the content. This makes E2EE much safer than regular encryption methods (EFF).
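The property can be sketched in a few lines. The example below (Python, using the PyNaCl library’s public-key Box) is not the Signal protocol that the messaging apps named above implement, but it shows the same end-to-end idea: only the two endpoints hold keys that can open the message.

from nacl.public import PrivateKey, Box  # pip install pynacl

# Each endpoint generates its own key pair; private keys never leave the device.
alice_private = PrivateKey.generate()
bob_private = PrivateKey.generate()

# Alice encrypts with her private key and Bob's public key.
sending_box = Box(alice_private, bob_private.public_key)
ciphertext = sending_box.encrypt(b"meet at 9")

# Anything in between (ISP, server, telecom) only ever sees the ciphertext.
# Only Bob, holding his private key, can open it.
receiving_box = Box(bob_private, alice_private.public_key)
print(receiving_box.decrypt(ciphertext))  # b'meet at 9'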

In 2015, the U.S. government put Apple under pressure to help it collect data from the iPhone of one of the San Bernardino5 attackers (Jesdanun and Liedtke; Kerr). Apple refused the request, which prompted the FBI to hire hackers to breach the iPhone. However, in order to decrypt the encrypted messages, they had to undertake targeted attacks. Those attacks are costly and are comparable to classic methods of spying: the message needs to be intercepted before it is encrypted and sent away to the other device (Jesdanun and Liedtke). It is similar to wiring the location where a suspect is present with microphones (Jesdanun and Liedtke).

3 https://github.com/gorhill/uBlock
4 https://wikileaks.org/#EXAMPLES
5 In 2015, there was a mass shooting in San Bernardino: http://www.npr.org/sections/alltechconsidered/2016/12/03/504130977/a-year-after-san-bernardino-and-apple-fbi-where-are-we-on-encryption

Today, half of all websites on the World Wide Web are equipped with secured HTTPS connections to protect users’ internet traffic (Gebhart). This internet communication protocol is the (most probably) familiar HTTP, Hypertext Transfer Protocol, but with a Transport Layer Security (TLS) security layer on top of it (Callegati et al. 78). The URLs are connected to an encrypted connection with a website’s server, which provides authentication between the client and the server of the website being visited, rendering it difficult for eavesdroppers to track the data (Naylor et al. 133). At first, HTTPS was only used to secure online bank transactions, e-mails and other sensitive data exchanges. However, in the past decade it has also been used to protect and secure internet connections on other types of websites to help them keep accounts safe and protect the identities and browsing data of users.

HTTPS also ensures that ISPs get less information about people’s web behaviour. An HTTPS connection makes sure the ISP can only see the server a user is connected to and not the actual webpages he or she visits. For example, when visiting an article from The Guardian, the ISP can only see the server’s name, TheGuardian.com, and not the deeper subpage of the article.
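A small sketch (Python, hypothetical URL) makes the distinction explicit: over HTTPS the hostname is still observable (through DNS or the TLS SNI field), while the path and query string travel inside the encrypted tunnel.

from urllib.parse import urlsplit

def visible_to_isp(url: str) -> dict:
    """Split a URL into the part an on-path observer still sees over HTTPS
    (the hostname) and the part that is encrypted inside the tunnel."""
    parts = urlsplit(url)
    return {
        "visible": parts.hostname,
        "encrypted": parts.path + ("?" + parts.query if parts.query else ""),
    }

print(visible_to_isp("https://www.theguardian.com/world/2017/some-article"))
# {'visible': 'www.theguardian.com', 'encrypted': '/world/2017/some-article'}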

2.3 Deanonymisation attacks

As explained in the previous paragraphs, there are different strategies to anonymise online data or provide anonymous web traffic for internet users (through VPNs or Tor). This provides users with the feeling of online anonymity. The problem arises when these strategies seem not to be working anymore, or at all. Different researchers and security experts have illustrated problems in the actual anonymity of the strategies of both anonymised data sets and internet traffic. Ostensibly anonymous data and online traffic on VPN or Tor networks can still be deanonymised by various techniques. This could result in a false sense of online anonymity. This section briefly discusses two different types of deanonymisation techniques that are further considered in the case studies of this thesis.

Anonymised data can be deanonymised by cross-referencing multiple data sets to reveal PII. To put it simply, anonymised data sets are connected with external information to attempt to uncover the identity of the data subjects. For example, if there is a data set of anonymised location check-ins from Foursquare6 and personal online reviews of these locations, it is possible to cross-reference timestamps and reveal the persons behind those check-ins. Data sets consist of ‘rows’, which represent the data items, and ‘columns’ with the data values. With overlapping (meta)data in the data values, one could compare and discover the missing pieces in the anonymised data set. This deanonymisation technique has been significantly researched by computer scientist Arvind Narayanan of Princeton University (which is discussed further in case study 1). Narayanan teamed up with Vitaly Shmatikov to write a famous and influential research paper about a deanonymisation attack on publicly available data sets from Netflix and IMDb (1-2). (This paper is discussed in case study 1.)
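The check-in example can be sketched as a simple join. The code below (Python, with invented records) links an ‘anonymised’ check-in to a named review when both mention the same place at nearly the same time; actual attacks of this kind use far larger data sets and more tolerant matching.

from datetime import datetime, timedelta

# "Anonymised" check-ins: only a pseudonym, place and timestamp survive.
checkins = [
    {"pseudonym": "user_817", "place": "Cafe Central",
     "time": datetime(2017, 5, 3, 18, 42)},
]

# Public reviews carry a real name, the same place and a posting time.
reviews = [
    {"name": "J. de Vries", "place": "Cafe Central",
     "time": datetime(2017, 5, 3, 18, 47)},
]

def link(checkins, reviews, tolerance=timedelta(minutes=15)):
    """Match check-ins to named reviews on place and close-by timestamps."""
    for c in checkins:
        for r in reviews:
            if c["place"] == r["place"] and abs(c["time"] - r["time"]) <= tolerance:
                yield c["pseudonym"], r["name"]

print(list(link(checkins, reviews)))  # [('user_817', 'J. de Vries')]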

All the data that could help distinguish a data set from another data source can be used for deanonymisation. It would be ideal to gather as many data sets as can be found that could help to uncover correlations and overlapping information in the metadata of the data sets. The more metadata collected, the more anonymous data could potentially be deanonymised. Reversing the anonymisation of data is mostly referred to as deanonymisation or re-identification (Ohm 1703). However, this research also considers deanonymisation through advanced online fingerprinting techniques that are adopting forms of ‘biometrics’ (which is discussed in case study 2: Deanonymising User Traffic) and attacks on traffic at anonymous internet connections such as VPN, Tor, HTTPS and E2EE (Watson; Englehardt and Narayanan 1; White; Shieh and Uberti; Noubir and Sanatinia 2; Kwon et al. 287). There has been considerable research concerning how browsers, networks or applications that provide anonymity can be circumvented or attacked. Internet traffic on hidden services, like VPN and Tor, is still deanonymisable. Researchers and intelligence agencies have successfully cracked the anonymity of the Tor network with several attacks, abusing the protocols that the network relies on. These deanonymisation and re-identification techniques are further explored in the case studies.

2.4 Ubiquitous data

Many of these deanonymisation attacks depend on the ubiquity of data, because ubiquity is an important factor in deanonymisation and re-identification attacks. In this era of pervasive technology, digital traces are ever present and constitute data that eventually could be used to deanonymise data. Almost all the devices people wear, locations they visit and encounters they make in the world could lead to more exposure of personal data. Those data points are all nodes in data networks that might be linked to each other, rendering the data they carry vulnerable to possible deanonymisation attacks.

Ubiquity is about pervasive coverage; consider, for example, the way Wi-Fi networks work. Wi-Fi devices are always listening to check whether the devices detected have authorised access to their networks, asking: have the correct credentials been provided to log into the network? This causes constant trivial (background) monitoring of devices’ locations. Every action individuals take is, in a sense, being trivially monitored because it must be tracked and stored, leading to a manner of connectivity in which everything is continuously examined (Greenfield 48). While such actions are usually monitored inconsequentially, that constant trivial (low-level) observation can easily become non-trivial and evolve into an intrusion, such as when the ubiquitous data collected is used to track online behaviour.

Whereas non-trivial uses of monitoring exist, such as weather instruments or traffic monitoring, ubiquity uses monitoring to make all the connections, forcing the entire system to depend on almost completely pervasive and continuous observation. Even when an application or network is not interested in a particular person, it still has to check for authorised connections, and this results in most web traffic involving checking in with devices to provide useful information to users. As an illustration, Google Maps can offer its users advice on how to drive from A to B and how much time this would take by constantly checking the location (GPS) and time with users’ connected Google accounts or data acquired by third-party apps (Schneier 8).


In essence, those are innocent questions like, ‘How do I move from A to B?’ Be that as it may, might that information be employed for something other than transport advice? That innocent question will become a very non-innocent one when a government wants to know exactly where a user is going and what he or she is doing. The underlying data system does not make those decisions, but it becomes a medium in which those questions can be asked very easily. At this moment, this innocent or trivial network of constant connectivity becomes a sinister data mining medium through which the behaviour of individuals can be tracked and unique user profiles created. “They may believe they are doing this anonymously, or pseudonymously, but often they are incorrect. There is huge commercial interest in making money by mining the Web” (Witten, Frank and Hall 22). This matter is examined in the next section about ‘behavioural tracking’, where the possible online tracking mechanisms that can collect data for user profiles are explained.

2.5 Behavioural tracking

In the previous section, we discussed the different deanonymisation techniques that are facilitated by the ubiquitousness of data. In this section, several techniques used to collect data and create user data profiles will be discussed. Tracking technologies such as cookies, evercookies, cookie syncing and fingerprinting are known to be used by advertising companies. Mining users’ data can be exceedingly valuable to the online advertising business because such data can provide insights into users’ behaviour and predict possible customers. “Dataveillance is the systematic use of personal data systems in the investigation or monitoring of the actions or communications of one or more persons” (Clarke). All that data can be collected and stored in the online data profiles of users and eventually be analysed and used in order to provide targeted advertising (Clarke). It is then part of the so-called ‘data-double’; a digital profile of someone’s identity (Poster). This branch of advertising is called online behavioural advertising (OBA), wherein companies use the information from data profiles to display tailored adverts to individual users (Chester 55).

For example, if data collected about a person reveals that he or she is constantly visiting web pages about ‘Barcelona’, or using search queries which include “spanish courses online” and “hotel barcelona”, these data suggest that there is a quite plausible chance this person is interested in going to Barcelona, Spain. This information can be valuable for advertisers to sell advertisement space to travel agencies offering flights to Barcelona. Then, if that same person visits a website and is exposed to an advertisement about cheap Barcelona tickets, the person could be tempted to buy tickets via the travel agency.

The above represents a brief explanation regarding how OBA works. The upcoming paragraphs describe several behavioural tracking techniques for collecting data used in this form of advertising on the web.


2.5.1 Cookies, Evercookies and Cookie Syncing

Cookies are probably the most well-known tracking mechanisms used by websites to collect users’ data while they are browsing the web. Cookies are able, by placing small unique codes (IDs) on the user’s computer, to keep track of what users’ preferences are, to remember their login credentials or to recall items a person has put in his or her online shopping cart (Raley 122). When visiting a webpage, the website will check if there is already a cookie available from its visitor; if not, it will create a new cookie to identify them. There are different types of cookies. The so-called HTML cookies are easy to delete by users themselves or can be circumvented by some browser extensions. Flash cookies, officially called ‘local shared objects’ (LSOs), are placed on the Flash memory of the computer via the Adobe Flash extension and can regenerate a removed HTML cookie. European internet users no longer encounter HTML cookies and LSOs unknowingly, after a European privacy law7 made it mandatory for websites to inform their visitors about the cookies they use to collect data about them. There is a difference between first-party and third-party cookies. The cookies that are placed by the visited website itself are called first-party cookies. The third-party kind are those that other parties place on partner websites and use to follow users around on them. Facebook and Google are both major players in the third-party field (Brookman et al. 143). The case studies present how third-party tracking services function.
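A minimal server-side sketch (Python, hypothetical cookie name ‘uid’) shows how a site assigns such an identifier on the first visit and simply re-reads it afterwards.

import uuid
from http import cookies

def assign_cookie(request_cookie_header: str) -> str:
    """Return a Set-Cookie header value if the visitor has no ID cookie yet."""
    jar = cookies.SimpleCookie(request_cookie_header)
    if "uid" in jar:
        return ""                      # returning visitor: ID already known
    jar["uid"] = uuid.uuid4().hex      # new visitor: mint a unique ID
    jar["uid"]["max-age"] = 60 * 60 * 24 * 365
    return jar["uid"].OutputString()

print(assign_cookie(""))  # e.g. uid=3f2c...; Max-Age=31536000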

The previous paragraph discussed the traditional tracking mechanisms, HTML cookies and LSOs, which are both easy to delete. As a response to this, tracking companies implemented a new cookie into their tracking mechanisms that is not quickly removed: the evercookie (Acar et al. 676). These cookies are stored in various locations that the website (or client) can access. When a user deletes the cookie from his or her browser, the evercookie can respawn and be accessed again using a JavaScript method (Acar et al. 677). Evercookies are also known as supercookies and zombie cookies, which indicates how difficult they are to remove (Berghel 105). In a published NSA presentation, the NSA claims it has used evercookies to track Tor users (Snowden).

First- or third-party trackers are able to recognize a user’s cookie by a unique ID. This user ID allows web businesses to search for the associated profile in their own user database. With cookie syncing, trackers can exchange unique IDs in order to target the same user across platforms. This makes tracking much easier for companies (Acar et al. 681). Cookie syncing is mainly used to offer real-time advertising, which allows advertisers to sell advertising space to the highest bidder (Englehardt). “Once two trackers sync cookies, they can exchange user data between their servers” (Englehardt). For users, it is difficult to discover cookie syncing. It remains invisible to them with whom the tracking companies share the cookies, which could affect user anonymity (Acar et al. 674).
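Conceptually, the result of a sync is nothing more than a shared mapping between the two trackers’ IDs, as the sketch below (Python, invented IDs and profiles) illustrates.

# After a sync call (tracker A redirects the browser to tracker B with A's ID
# in the URL), both parties can store the correspondence between their IDs.
sync_table = {}  # tracker A's user ID -> tracker B's user ID

def record_sync(id_a: str, id_b: str) -> None:
    sync_table[id_a] = id_b

def share_profile(id_a: str, profiles_b: dict) -> dict:
    """Tracker A can now pull B's profile for 'its' user via the mapping."""
    return profiles_b.get(sync_table.get(id_a), {})

record_sync("A-42", "B-9001")
profiles_at_b = {"B-9001": {"interests": ["travel", "running"]}}
print(share_profile("A-42", profiles_at_b))  # {'interests': ['travel', 'running']}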


2.5.2 Browser Fingerprinting

‘Browser fingerprinting’ is a technique that goes beyond the concept of website cookies and is almost impossible for users to circumvent. Many internet companies that want to track user traffic use this method to collect unique profiles of online behaviour, because it is much more reliable than cookies. Browser fingerprinting can identify users by analysing and collecting information about the browser version, installed plugins, operating system and screen resolution used (Omer and Polenetsky 295). This fingerprint profile constitutes a unique data set that could distinguish a certain user from others. The Electronic Frontier Foundation (EFF) made a web application where users can check how unique their profile is8. This makes it possible for companies or data tracking sites to have a precise picture of who is using the web(site) and in what way.
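The core of such a fingerprint is simply a stable hash over the collected attributes. The sketch below (Python, with made-up browser values) shows the idea; real fingerprinters combine many more signals, such as fonts, canvas rendering and audio processing.

import hashlib
import json

def fingerprint(attributes: dict) -> str:
    """Hash a set of browser attributes into a single stable identifier."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Hypothetical values a script could read out of a visiting browser.
print(fingerprint({
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/54.0",
    "screen": "1920x1080x24",
    "timezone": "Europe/Amsterdam",
    "plugins": ["PDF Viewer", "Shockwave Flash"],
    "language": "nl-NL",
}))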

At first, users tried to circumvent this form of browser tracking by using different browsers for different contexts. This method is currently obsolete because researchers have recently demonstrated a new technique of identifying people on multiple browsers, aptly named ‘cross-browser tracking’ (Cao et al. 1). However, those techniques can be even more advanced. Microdata about the way users type on their keyboard or how they control their mouse or trackpad can be used to identify anonymous users and fingerprint them (Norte). These practices are explained further in case study 2.

2.6 Why does this all matter?

“We should treat personal electronic data with the same care and respect as weapons-grade plutonium—it is dangerous, long-lasting and once it has leaked there’s no getting it back.” —Cory Doctorow

Critics warn us about adversaries that might have reason to deanonymise data and re-identify users (Ohm 1721). “Narayanan and Shmatikov suggests stalkers, investigators, nosy colleagues, employers, or neighbours. To this list we can add the police, national security analysts, advertisers and anyone else interested in associating individuals with data” (Ohm 1724). If countries with repressive governments are considered, it is eminently possible that citizens could face life-threatening situations when their internet traffic is deanonymised by their government.

Data passes through multiple companies or entities before reaching its destination, making it hard to prove or even detect that it originated from a deanonymised database. There are lots of companies known to sell “anonymised” customer data: for example, Practice Fusion “subsidizes its free EMRs by selling de-identified data to insurance groups, clinical researchers and pharmaceutical companies.” On the other hand, companies carrying out data aggregation/deanonymisation are a lot more secretive about it (Narayanan, The secret).

Another disturbing problem is that, as Doctorow stated, data can end up anywhere because of the connections with all kinds of third-party trackers. Those third-party trackers all have their own terms of agreement. Scott Taylor from Hewlett Packard has mentioned that data is gone after the first contact with their application (Narayanan, The secret). They do not know exactly which data third parties will gather and where that data will finally end up. Narayanan suggests the term ‘data laundering’ for this, which defines the idea well. Thus, people who believe they are anonymous online are actually not as anonymous as they think. If anonymised data can be deanonymised, then people could have the feeling they have forever lost control over their personal data and internet traffic. These deanonymisation attacks are a problem this thesis addresses. Deanonymisation attacks can be seen as ‘black boxes’, because most of the people that use the web do not know how deanonymisation works or are not even aware of these practices. In the next chapter, I will discuss how this obscurity of deanonymisation strategies is related to black box theories.

3 Methodology

Deanonymisation of online data and internet traffic are concepts that fall within the scope of sociotechnical studies. Sociotechnical research is the study of “the mutual constitution of people and technologies” (Sawyer and Jarrahi 1-2). “Sociotechnical research is premised on the interdependent and inextricably linked relationships among the features of any technological object or system and the social norms, rules of use and participation by a broad range of human stakeholders” (Sawyer and Jarrahi 2). In this thesis, I will use this sociotechnical approach because deanonymisation and re-identification deal with digital technology practices which have influence on individuals and their systems, and vice versa. The previous chapter discussed the anonymity of people that could be affected by the technological practices of deanonymisation and ubiquitous data. Sociotechnical researchers try to understand the relations and interconnections between technology and humans. There is a mutual constitution in which humans and digital technologies, independently of each other, have a power to act on their own; both have agency in their own activities (Sawyer and Jarrahi 2).

In this chapter, I will discuss the concept of black boxes in sociotechnical systems and how it is relevant to deanonymising data and user traffic. Literature by, among others, Etienne Wenger, Bruno Latour and Susan Leigh Star, who have written about black boxes, will be discussed. Concluding this chapter, I describe the relevance of exposing the black boxes of deanonymisation.

3.1 The black box syndrome

In Etienne Wenger’s paper —‘Toward a Theory of Cultural Transparency: Elements of a Social Discourse of the Visible and the Invisible’— he describes a problem in the social world wherein documentation and systems are organized through simplifications of processes, which limits the understanding of the different activities within a system (85). Wenger explains how proceduralization of certain activities has become a problem. For example, someone somewhere in an organization has developed a specific procedure. But to simplify a process for other employees in other departments, a worksheet is developed where the employees deliver input and where the worksheet, with its designed internal system, provides the output. A considerable number of working procedures would be simpler and faster if the understanding of a previous procedure could be left out for the next employee (94-5). Wenger argues that this simplification of processes is not by definition a bad development, but the problem “lies in the fact that the process of proceduralization becomes a relation between two separate communities” (95). The moment these systems are embedded within each other, it becomes even more complex. Wenger calls this condition the ‘black box syndrome’ (Wenger 96; Star 400). “In technological jargon, a black box is a device which performs some useful function, but whose internal mechanisms are not available to inspection. Arguably, the world in which we live is increasingly becoming a set of black boxes” (Wenger 98).

As I discussed in chapter two, tracking systems are omnipresent and the data could end up anywhere. Users and companies lose sight of all the user data they scatter or leak. Data begins life as trivial metadata, but can become non-trivial when they end up in unintended databases which are used to cross-reference with anonymised datasets. Therefore, the concept of the black box syndrome applies to the area of ubiquitous data monitoring and deanonymisation.

3.2 Black boxing

Bruno Latour wrote about ‘black boxing’ in his book —‘Pandora’s Hope: Essays on the Reality of Science Studies’— (183). He mentions black boxes in a chapter where he writes about technical mediation; how humans are transformed by technology (183). When something is not functioning anymore or broken, it could result in a sort of crisis (183). When someone has a crisis in what is coming in or out of a device, when it is not making sense anymore, one wants to open this black box. Latour explains this by giving an example of a broken overhead projector (183). At first, things are working fine and people are not worrying about how the device works. The moment that nothing is projected, you will have to open the black box (Pandora’s box) to actually see what has caused the damage. “Black boxing: a process that makes the joint production of actors and artifacts entirely opaque” (183). “We no longer focus on an object but see a group of people gathered around an object. A shift has occurred between actant and mediator” (183). With actant, Latour refers to an individual or object that participates in a process. Only the inputs and outputs of a technological device or system remain available for inspection, and machines become more blurred and more complicated when more technology and science are embedded in their processes (Latour 185; Winner 365).

The processes of the black box syndrome and black boxing are also related to actor-network theory (ANT), because it involves the simplification of processes (Callon 29). An actor-network consists of different individual entities or nodes, but those nodes are all complex networks in themselves as well. Complicated processes are reduced to discrete nodes, without looking at the systems within all those individual nodes (Callon 29-30). There is a network within the individual nodes themselves, which makes almost infinite possibilities available. ANT is also similar to Latour’s broken black box. When the simplified black box is not completely functioning in the bigger system, the actor-network, it has to be opened, which delivers more new actors.

3.3 Opening the black box

Susan Leigh Star was a well-known researcher in sociotechnical studies. She did research into organizations and systems that use artificial intelligence (AI) and human-computer interaction (HCI) and into the kind of influence these technological fields have on social systems, trying to understand the relations between humans and technology (Star 395). In her paper —‘The Trojan Door: Organizations, Work, and the Open Black Box’— she discusses a new way of research wherein social systems are studied in relation to technologies (395): “[a] new view of systems for social actors, information, knowledge, and technology” (395). In the paper she mentions the technique of opening the black box. With black boxes in technology, researchers mean systems, objects or devices of which someone knows the input that goes into the black box and the output that comes out of it, while for outsiders it is unknown what happens inside that box (396). Star refers to a book written by Latour and Woolgar, ‘Laboratory Life’, wherein they discuss the ‘deletion of modalities’. Deletion of modalities happens when something is scientifically proven and the things that qualify, modify or say something else about it are stripped away from this fact (Latour and Woolgar 79-80). “These are made partially visible under the ethnographic gaze, although of course they will never be restored, nor is that gaze itself exempt from the simplifying process” (Star 399). Technologies today are becoming more interwoven with human lives and the economy. Computer technologies take a lot of work away from humans. For example, very complex technologies are now involved when people want to do home banking, “which requires translation across several electronic media, software packages, and accounting convention” (Star 400).

While Star’s paper dates from 1992, her concepts are still relevant to contemporary ubiquitous data monitoring environments. As the devices people use become smarter and their internal systems more sophisticated, more black boxes result. The tasks outsourced by users and the actions performed by the devices become more complex and autonomous. This creates an environment wherein black boxes are becoming invisible to people. As for anonymous data, the black boxes exist, but people do not notice the presence of these black boxes of deanonymisation methods.

3.4 How to open the black boxes of deanonymisation?

In this chapter, I discussed black box theories from various academics, and how technologies keep evolving and becoming more complex since they are constantly embedded in each other. Systems are created with technical layers on top of other layers. People lose sight of how devices, with their integrated technologies and installed applications, are actually working, and how those same devices create a large network of ubiquitous data that eventually could be of use in deanonymisation attacks.

In the case studies, I am not trying to break the black box open as in Star’s or Latour’s work, who both say: when there is a crisis in what is coming in or out of the black box, and it is not making sense anymore, the black box has to be opened. In this thesis I am not doing either of those techniques. I am not opening the box, but rather trying to create a view from the side of the black box, describing the practices of deanonymisation. I will examine what goes in and what comes out from a different place which is more unnoticed by most internet users, and reveal the black box of ubiquitous data which includes our online identity. I will use a somewhat hacker-like approach, where I want to find a way through the practices into the black box of deanonymised data and anonymised traffic. A hacker approach involves checking what works or what can be improved (Bratus 73). Specifying what is wrong with anonymisation techniques could provide insights into how to improve such strategies.

Anonymised data are black boxes in the sense that they should provide anonymous data, but in practice they are not anonymous, since by using cross-references we can deanonymise these sets. Thus, they are inherently broken. Where are the weakest links of these data sets? And how does ubiquitous data make this process of ensuring anonymisation even more complex? Black boxes like VPNs, Tor and tracker blockers, which promise a ‘sense’ of anonymity, are not working as intended either; they, too, are inherently ‘broken’. What is broken and how is it broken? In the upcoming case studies I will uncover several inner workings of these black boxes to pinpoint where the ‘broken’ pieces are.

4 Case Study 1: Cross-Data Set Analysis

A data anonymisation process can proceed as follows. In a method called k-anonymity, data attributes such as names are either removed (suppression) from the data set, or data subjects are generalised into groups (generalisation) (Ohm 1707, 1714). For example, when a person is 25 years old, they could be generalised into a ‘20-29 years old’ category (Samarati and Sweeney 3). The ‘k’ in k-anonymity represents the minimum number of data subjects that share the same combination of attribute values. As an illustration, in 3-anonymity there are at least three rows in the data set that have the same values, which makes it harder to deanonymise data (9-10). However, as is discussed in this case study, it still is possible to deanonymise through the help of so-called ‘quasi-identifiers’ (Dalenius 329). These data are not identifiers on their own, but when combined with other quasi-identifiers, they could eventually produce a unique identifier.
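A minimal sketch of suppression and generalisation (Python, with invented records) looks as follows; the final line computes the smallest group size in the release, which is the k the release actually achieves.

from collections import Counter

records = [
    {"name": "Alice", "age": 25, "zip": "1012AB", "diagnosis": "flu"},
    {"name": "Bob",   "age": 27, "zip": "1013CD", "diagnosis": "asthma"},
    {"name": "Carol", "age": 43, "zip": "1071ZZ", "diagnosis": "flu"},
]

def anonymise(row):
    """Suppress the direct identifier and generalise the quasi-identifiers."""
    return {
        "age": f"{row['age'] // 10 * 10}-{row['age'] // 10 * 10 + 9}",  # 25 -> '20-29'
        "zip": row["zip"][:2] + "**",                                   # coarsen ZIP
        "diagnosis": row["diagnosis"],                                  # kept as payload
    }

released = [anonymise(r) for r in records]
groups = Counter((r["age"], r["zip"]) for r in released)
print(released)
print("k =", min(groups.values()))  # here k = 1: one person is still unique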

Computer scientist Arvind Narayanan is a pioneer in deanonymisation studies. He has conducted a significant amount of research into online privacy, security and anonymity. He also maintains a blog9 about the subject in which he writes his thoughts about such matters as “The end of anonymous data and what to do about it”. In one of the earlier research papers Narayanan undertook with professor Vitaly Shmatikov, they present a definition of what they think deanonymisation is: “Deanonymisation is a strategy in data mining in which anonymous data is cross-referenced with other sources of data to re-identify the anonymous data source” (1). One data set is compared with another and checked for overlapping pieces in the microdata. In research papers, these kinds of additional data are often referred to as ‘auxiliary information’ (Dwork 1-2).

Deanonymisation practices are first explored through analysing several famous cross-data set studies. Data sets including personal health records, online search queries or movie preferences are cross-referenced with other, auxiliary, data sets. The large number of first and third parties that collect users’ location data via mobile applications and the vulnerabilities that those mobility traces have in relation to deanonymisation attacks are subsequently discussed. Afterwards, there is an examination of deanonymisation that uses social network data as auxiliary data. The next section considers cross-device tracking, describing methods concerning how companies track the multiple devices of one single user. Finally, based on the studies discussed, the reliability of anonymisation methods will be assessed.

4.1 Personal health records, search queries and movie reviews

In 2000, one early deanonymisation demonstration was undertaken by Latanya Sweeney, a computer scientist who re-identified persons in an anonymised data set which included data regarding all hospital visits made by state employees (2-3). The Massachusetts Group Insurance Commission (GIC), which released the data set, had removed personal information such as names, addresses and Social Security numbers to make the records publicly available to researchers, assuming that the data could not lead back to the individuals in the original data. This was also confirmed by a statement of Massachusetts’ former governor, William Weld (Greely 352). Unfortunately, both the GIC and the governor were wrong. Sweeney ordered a data set with the electoral rolls from Cambridge, MA, which included names, ZIP codes, birth dates and sexes of the city’s voters. She used this set to cross-reference with the anonymised data set and found Governor Weld’s medical information with ease: six people shared his birth date, three of them were men, and just one man lived in his ZIP code, who was obviously the governor. Sweeney sent Governor Weld his own personal health records to demonstrate that he and the GIC were wrong about the safety of the anonymised data set (Greely 352). Thus, Sweeney proved it was still possible to deanonymise anonymised health records with limited data sets which included peoples’ ZIP codes, sexes and birth dates.
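Sweeney’s linkage can be sketched as a join on the quasi-identifiers ZIP code, birth date and sex. The records below are invented for illustration; a unique match is what deanonymises the ‘anonymous’ row.

# "Anonymised" hospital records: names removed, quasi-identifiers kept.
hospital = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "..."},
]

# Public voter roll: the same quasi-identifiers next to a name.
voter_roll = [
    {"name": "W. Weld",  "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "A. Smith", "zip": "02139", "dob": "1945-07-31", "sex": "M"},
]

QUASI = ("zip", "dob", "sex")

def reidentify(record, roll):
    """Return every voter whose quasi-identifiers match the hospital record."""
    return [v["name"] for v in roll if all(v[k] == record[k] for k in QUASI)]

for record in hospital:
    matches = reidentify(record, voter_roll)
    if len(matches) == 1:  # a unique match deanonymises the record
        print(matches[0], "->", record["diagnosis"])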

In 2006, the web company America Online10 (AOL) released an enormous data set which included the search query data of all their users over a time span of three months. They stated it was part of an AOL research program and the information would probably be useful for researchers (Barbaro and Zeller Jr.). Unfortunately, it became one of the biggest deanonymisation scandals concerning online anonymised data sets (Arrington). Just moments after the data release, computer experts, bloggers and journalists were investigating the data and found clues in (controversial) search queries that could lead back to real individuals (Frind). For example, journalists from the NY Times identified and contacted a user, the 62-year-old Thelma Arnold, from the anonymised data set (Barbaro and Zeller Jr.). They discovered clues in search queries like “landscapers in Lilburn, Ga”, “several people with the last name Arnold” and “homes sold in shadow lake subdivision gwinnett county georgia” (Barbaro and Zeller Jr.). Thelma confirmed those search queries were hers. Shortly after the NY Times publication, AOL terminated the employment of two persons who were responsible for the leak (Anderson). The incident reveals that search queries are both a useful tool and a harmful means through which to identify people. Some obscure search queries could harm people if not analysed within their contexts (Boutin). Personal search queries could tell much about a user and are easily connected to auxiliary data such as address books.

Arvind Narayanan and Shmatikov wrote a deanonymisation paper— ‘Robust De-anonymisation of Large Datasets’—about deanonymisation using the Netflix Prize data set11. It has attracted a lot of attention in online anonymity studies

(Zetter, World’s Most ). In the research, they used an anonymised data set that Netflix12had made public in order to attract researchers or computer experts to

help them in improving the algorithms for the TV and series recommendations system. Netflix promised the creators of the best solution a $1 million prize. Netflix made clear in their ‘rules’ section13 of the website that the data sets

that they released for the contest were completely anonymous and protected the privacy of subscribers’ data. Narayanan and Shmatikov thought of the Internet Movie Database as an auxiliary data set, because it could give them clues to users’ movie preferences. They assumed that the ratings of movies in Netflix would show similarities with the movies the users rated at IMDb. The researchers were able to create an algorithm that could cross-reference between the data of Netflix and IMDb to look for similar data in the data sets (Nrayanan and Shmatikov, Robust 2-4). When finding matching microdata, they could, with an almost precise accuracy, match them with the anonymised data set (12). They discovered many aspects regarding the records of known

10America Online is a large online company that operates many online services and websites, such as 'The Huffington Post', 'TechCrunch' and 'Engadget', and provides a great deal of internet content.

11http://netflixprize.com/

12Netflix offers movies and TV series via streaming or video-on-demand services

13http://netflixprize.com/rules.html At first, it was mentioned on the FAQ page as well, but Netflix deleted the statement.


They discovered many aspects of the records of known users, and also noted that they could predict subscribers' likely political preferences. For instance, users' opinions of movies that take a stance on issues such as gender equality or terrorism could imply their views of the world (13). In the paper, they stated: "Even though one should not make inferences solely from someone's movie preferences, in many workplaces and social settings opinions about movies with predominantly gay themes such as 'Bent' and 'Queer as Folk' (both present and rated in this person's Netflix record) would be considered sensitive" (13). The researchers showed that even when only a small amount of data is available in an auxiliary data set, users in the anonymised data set can be successfully re-identified.
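The following sketch illustrates, in very simplified form, the kind of cross-referencing step described above: a public rating record is scored against each 'anonymised' record, rare titles weigh more than popular ones, and a match is only claimed when the best candidate clearly stands out from the runner-up. This is not Narayanan and Shmatikov's algorithm; the data, weights and threshold are invented.

# Simplified illustration of cross-referencing sparse rating records, loosely
# inspired by the idea behind the Netflix Prize attack (not the actual algorithm).
# Rare movies get higher weight; a record is claimed only when the best match
# clearly stands out from the runner-up. All data are invented.

import math

# "Anonymised" records: id -> {movie: rating}
anonymised = {
    "user_001": {"Bent": 5, "Queer as Folk": 4, "The Matrix": 4},
    "user_002": {"The Matrix": 5, "Titanic": 3},
}

# Auxiliary (public) record scraped from a hypothetical review site
auxiliary = {"Bent": 5, "Queer as Folk": 5, "The Matrix": 4}

# How many subscribers rated each movie (rarer movies are more identifying)
popularity = {"Bent": 40, "Queer as Folk": 120, "The Matrix": 90000, "Titanic": 80000}

def score(candidate, aux, tolerance=1):
    s = 0.0
    for movie, rating in aux.items():
        if movie in candidate and abs(candidate[movie] - rating) <= tolerance:
            s += 1.0 / math.log(1 + popularity[movie])  # rare titles count more
    return s

scores = sorted(((score(rec, auxiliary), uid) for uid, rec in anonymised.items()), reverse=True)
best, runner_up = scores[0], scores[1]
if best[0] > 1.5 * runner_up[0]:  # crude check: the best match must stand out
    print("Likely match:", best[1])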

The cases discussed above make clear that cross-referencing between data sets works even when little data is available. Deanonymising an individual's movie preferences may seem less harmful than exposing health records or search queries, but it can still have an impact when the data is analysed from a different perspective and wrenched from its context. The next section considers the consequences of ubiquitous location tracking systems and their impact on cross-data analysis. These pervasive systems can provide practical auxiliary data.

4.2 Location based services

The Global Positioning System14 (GPS) is an important aspect of most of the mobile applications people use (Srivatsa 628). Think of a weather app that needs a user's location to provide useful forecasts, a navigation app such as Google Maps, or a quantified-self application such as Strava15 that keeps track of a user's movements in order to analyse them (Gonzalez et al. 779; Gambs et al. 34; Ziegeldorf et al. 17). In these location-based service applications, users' mobility traces are essential data that have to be stored and analysed. Today, however, almost any application can collect and store locations, and pass (or leak) the data to third parties (Almuhimedi et al. 1-2). "[M]any apps shuttle your location information to third-party services that serve ads based on your whereabouts" (Moynihan). Data leakage may occur when app developers use code libraries or application programming interfaces (APIs) from other companies, which could allow those companies to access the data collected by the application. The section on cross-device tracking explains this further.

Anonymised data sets that still contain mobility traces are highly vulnerable to deanonymisation attacks. Combined with data values such as time stamps, one can easily discover when and where an individual was located. Those time stamps, in combination with location data, can be linked and searched in another data set to check whether the location and time data match.

14http://searchmobilecomputing.techtarget.com/definition/Global-Positioning-System
15https://www.strava.com/


It is likely that one can recognise a user who has the same locations and times in both data sets. "[A] user's mobility trace, if revealed, can provide information about habits, interests and activities—or anomalies to them—which in turn may [be] exploited for illicit gain via theft, blackmail, or even physical violence" (Srivatsa 628). In this context, consider all the controversial locations people visit that could indicate the kind of person they are (Blumberg and Eckersley). Questions can be asked such as: did you go to the abortion clinic, or the hospital? Did you attend an anti-government march? These are questions most people do not want answered by a data set containing proof of their location (Blumberg and Eckersley). This makes ubiquitous location tracking harmful to people's anonymity.
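A minimal sketch of this linking step might look as follows: points in two trace sets are coarsened to a grid cell and an hour slot, and an anonymised trace is attributed to the named trace with which it shares the most coarsened (location, time) points. All coordinates, names and parameters are invented for illustration.

# Toy sketch of linking two trace data sets on (location, time) pairs: if an
# anonymised trace shares enough rounded place/time points with a named trace,
# the two are assumed to belong to the same person. All coordinates are invented.

def round_point(lat, lon, ts, grid=0.01, slot=3600):
    """Coarsen a (lat, lon, unix-timestamp) point to a grid cell and an hour slot."""
    return (round(lat / grid), round(lon / grid), ts // slot)

def overlap(trace_a, trace_b):
    return len({round_point(*p) for p in trace_a} & {round_point(*p) for p in trace_b})

anonymised_trace = [(52.3702, 4.8952, 1500000000), (52.3563, 4.9553, 1500010000)]
named_traces = {
    "alice": [(52.3703, 4.8953, 1500000500), (52.3561, 4.9551, 1500009800)],
    "bob":   [(48.8566, 2.3522, 1500000000)],
}

best = max(named_traces, key=lambda name: overlap(anonymised_trace, named_traces[name]))
print(best, overlap(anonymised_trace, named_traces[best]))  # alice 2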

The next section demonstrates, on the basis of three cases, that location traces in limited data sets are vulnerable to deanonymisation attacks. First, a deanonymisation study is discussed which uses mobility signatures to deanonymise people in an anonymised data set. The second study uses the topology of social networks to create a social graph, which in turn can be used as an auxiliary data set to deanonymise a limited data set with locations. Finally, a deanonymisation method targeting Tinder users is discussed, which makes use of triangulation.

4.2.1 Deanonymisation of mobility traces

Sebastien Gambs et al. were able to conduct deanonymisation attacks on geolocated data. In their approach, they created an algorithm capable of cross-referencing data sets with mobility traces (Gambs et al. 6-11). The attack uses a method called a 'Mobility Markov Chain' (MMC): "A[n] MMC is a probabilistic automaton, in which each state corresponds to a point of interest (or several points of interests) characterizing the mobility of an individual and an edge indicates a probabilistic transition between two states (i.e., points of interests)." In the model, they argue that information about an individual's mobility can be seen as a signature (Gambs et al. 2). This mobility signature can help to identify a user in a data set from which the PII have been removed: by checking whether two mobility signatures are the same, the researchers could re-identify the persons in the anonymised mobility-trace data set. They developed their method by first implementing a training phase to build an auxiliary data set, and then applied it to a second, anonymised data set containing some of the same users' mobility traces. With the MMC method, they were able to deanonymise the users in the anonymised data set. In the paper, they demonstrate that the method also works on anonymised data sets from the real world: a Nokia data set, two GeoLife16 data sets and an Arum17 data set were all successfully deanonymised.
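The sketch below gives a rough impression of the 'mobility signature' idea: a transition-probability table over points of interest is built from each trace, and an anonymised signature is attributed to the closest known one. It is a strong simplification of Gambs et al.'s method, with invented point-of-interest sequences and a naive L1 distance.

# Rough sketch of the "mobility signature" idea behind a Mobility Markov Chain:
# estimate transition probabilities between a person's points of interest (POIs)
# and compare an anonymised signature against known ones. This is a simplification,
# not the authors' algorithm; the POI sequences are invented.

from collections import Counter, defaultdict

def mmc(poi_sequence):
    """Transition probabilities between consecutive POIs, e.g. {('home', 'work'): 1.0, ...}"""
    counts = Counter(zip(poi_sequence, poi_sequence[1:]))
    totals = defaultdict(int)
    for (src, _), n in counts.items():
        totals[src] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

def distance(sig_a, sig_b):
    """L1 distance between two transition-probability signatures."""
    keys = set(sig_a) | set(sig_b)
    return sum(abs(sig_a.get(k, 0) - sig_b.get(k, 0)) for k in keys)

known = {
    "alice": mmc(["home", "work", "gym", "home", "work", "home"]),
    "bob":   mmc(["home", "cafe", "home", "cafe", "home"]),
}
anonymised = mmc(["home", "work", "home", "work", "gym", "home"])

print(min(known, key=lambda name: distance(known[name], anonymised)))  # alice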

Social networks often claim that users are anonymous because the identities of users are linked to random identifiers (Wheaton; Coldewey).

16 https://www.microsoft.com/en-us/research/project/geolife-building-social-networks-using-human-location-history/

17A data set of GPS locations of five researchers in the city of Toulouse: https://www.net.t-labs.tu-berlin.de/papers/KRT-BSFCBSMD-10.pdf


Research by Mudhakar Srivatsa and Mike Hicks makes use of the social network relations between users to deanonymise mobility traces in data sets. They created a so-called 'social network graph' by discovering patterns in the data of different social media users; the relationships between those users can be checked in the social media network (2). This way, they could expose PII such as the names of the users behind the anonymised traces. "[C]ontact graph identifying meetings between anonymised users in a set of traces can be structurally correlated with a social network graph, thereby identifying anonymised users" (1). This allows the researchers to deanonymise users by comparing the contact graph with information about when those users met. The scientists tested their hypothesis with three different data set combinations: WiFi-based mobility traces combined with the social network of the University of St Andrews, SmallBlue contact traces combined with Facebook, and Bluetooth traces combined with the social network of conference attendees (2-3). Using these different data sets, the researchers were able to deanonymise the mobility traces of those people with an accuracy of 80% (Srivatsa and Hicks 10). "They use public social networks as auxiliary information, based on the insight that pairs of people who are friends are more likely to meet with each other physically" (Narayanan, New Developments).
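The intuition can be illustrated with a deliberately crude sketch: an anonymised contact graph (who met whom) is aligned with a named friendship graph simply by matching nodes in order of degree. The real attack exploits far more structure than node degree; the graphs and the greedy matching below are invented purely for illustration.

# Very simplified illustration of the intuition behind structural correlation:
# people who meet often are likely to be friends, so the shape of an anonymised
# contact graph can be aligned with a named social graph. Here nodes are matched
# greedily by degree only; the graphs are invented.

contact_graph = {  # anonymised IDs -> IDs met via co-location
    "u1": {"u2", "u3", "u4"},
    "u2": {"u1", "u3"},
    "u3": {"u1", "u2"},
    "u4": {"u1"},
}
social_graph = {  # named users -> friends
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "dave": {"alice"},
}

# Greedy degree-based matching: highest-degree anonymous node -> highest-degree named node
anon_by_degree = sorted(contact_graph, key=lambda n: len(contact_graph[n]), reverse=True)
named_by_degree = sorted(social_graph, key=lambda n: len(social_graph[n]), reverse=True)
mapping = dict(zip(anon_by_degree, named_by_degree))
print(mapping)  # e.g. {'u1': 'alice', 'u2': 'bob', 'u3': 'carol', 'u4': 'dave'}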

In 2014, security consultant Max Veytsman from Include Security found a way to discover the supposedly hidden locations of users on Tinder18. Tinder is an online dating application that uses geolocation services to bring users into contact. A classic method called 'triangulation' (more precisely, trilateration) was used to determine the exact location of a user. Via an API, Veytsman was able to request the distance to a target user, identified by the Facebook ID connected to the Tinder service, and he could change his own reported position by submitting three different geolocations through the API. Once in possession of the radial distance to the Tinder user from three different vantage points, it is possible to calculate the user's exact location. The same technique is used in cellular communication networks, in which the location of a user can be estimated from the radial distances to three base stations: with access to mobile network data, one can reconstruct positions by cross-referencing between different base stations.
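A small sketch of planar trilateration shows how three spoofed vantage points and the distances reported from each are enough to pin down a position. The coordinates are treated as a flat plane, which is a reasonable approximation over short distances; all numbers are invented and this is not Veytsman's actual code.

# Sketch of planar trilateration: given three vantage points and the reported
# distance to a target from each, solve for the target's coordinates.

def trilaterate(p1, d1, p2, d2, p3, d3):
    """Each p is an (x, y) vantage point; each d is the distance from that point to the target."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Subtracting the circle equations pairwise gives two linear equations in (x, y):
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 - x1**2 + x2**2 - y1**2 + y2**2
    a2, b2 = 2 * (x3 - x2), 2 * (y3 - y2)
    c2 = d2**2 - d3**2 - x2**2 + x3**2 - y2**2 + y3**2
    det = a1 * b2 - a2 * b1
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

# Pretend the attacker spoofed three positions and read back the reported distances
print(trilaterate((0, 0), 5.0, (10, 0), 65 ** 0.5, (0, 10), 45 ** 0.5))  # -> (3.0, 4.0)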

The previous sections demonstrated that metadata such as locations are useful for cross-data set analyses. By analysing patterns in mobility traces and the interrelations within social networks, researchers can create auxiliary data sets to cross-reference with anonymised data and re-identify individuals. One case also reveals that, by misusing an application's service, it is possible to use a cross-verification method (triangulation) from three vantage points to expose an anonymised location. In the next section, I analyse how social network data can deanonymise users in anonymised social network data and browsing-behaviour histories.

4.3 Social networks and browsing-behaviour histories

Social networks can gather a great deal of information about their users.


Every post, like, comment, check-in or other interaction with friends on a social network constitutes valuable information about users (Sandoval 148). Social networks are typically free because they can use the data produced by users for advertising (Sandoval 148). On its policy page, Facebook mentions that it shares data to improve personalised advertising19 (Narayanan and Shmatikov, Social 174). In the previous section, the deanonymisation of mobility traces was established by using a social network graph. The next paragraphs demonstrate that anonymised data sets from a social network can also be deanonymised by cross-referencing them with graph data from another social network. It is subsequently examined how social networks were used in a study to deanonymise web-browsing behaviour data.

Narayanan and Shmatikov conducted the first research deanonymising data sets using data from two different social networks: Twitter20 and Flickr21 (173). They were able to successfully deanonymise a third of the user data from Twitter with an auxiliary data set from Flickr (185): data from the anonymised Twitter data set could be cross-referenced with a Flickr data set to identify the Twitter accounts. "We demonstrated [the] feasibility of successful re-identification based solely on the network topology and assuming that the target graph is completely anonymised. In reality, anonymised graphs are usually released with at least some attributes in their nodes and edges, making deanonymisation even easier" (185). In both data sets, they needed the same user names in order to deanonymise the data; therefore, those who were not using the same usernames, which is likely across different social media platforms, were omitted from the research (184). The researchers predicted that, with the expansion of social networks, it would become much easier to deanonymise social network data.

In 2017, researchers were able to deanonymise a data set of anonymous web-browsing histories by connecting it to social media profiles (Su et al. 1261). "Each person has a distinctive social network, and thus the set of links appearing in one's feed is unique. Assuming users visit links in their feed with higher probability than a random user, browsing histories contain tell-tale marks of identity" (Su et al. 1261). The scientists create a kind of behavioural fingerprint to uniquely identify the people in the anonymised data set. They were able to deanonymise 70% of the approximately 400 volunteers who provided their web-browsing history for the research (Su et al. 1262). Twitter was used as the auxiliary data set, but they expect the approach would also work on other social networks such as Facebook or Reddit22. According to the researchers, the difference from other deanonymisation methods applied to browsing-behaviour histories is that this technique is broadly applicable, in contrast to highly targeted attacks on individual users (Su et al. 8; Olejnik et al. 14; Kassner).
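The toy example below captures the underlying idea: an anonymous browsing history is scored against the set of links that appeared in each candidate's feed, with rare links counting more heavily, and attributed to the highest-scoring candidate. It is not the maximum-likelihood model used by Su et al.; the URLs, frequencies and weighting are invented.

# Toy illustration of matching a browsing history to a social feed: links that
# appeared in a user's feed act as a behavioural fingerprint, and rare links are
# weighted more heavily. All URLs and counts are invented.

import math

feeds = {  # candidate -> links that appeared in their feed
    "alice": {"example.com/a", "example.com/b", "niche-blog.net/post1"},
    "bob":   {"example.com/a", "news.site/article", "example.com/c"},
}
browsing_history = {"example.com/a", "niche-blog.net/post1"}

# How many users' feeds contain each link (rarer links are more identifying)
link_frequency = {"example.com/a": 2, "example.com/b": 1,
                  "niche-blog.net/post1": 1, "news.site/article": 1, "example.com/c": 1}

def score(candidate_feed, history):
    shared = candidate_feed & history
    return sum(math.log(len(feeds) / link_frequency[link] + 1) for link in shared)

print(max(feeds, key=lambda c: score(feeds[c], browsing_history)))  # alice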

Both studies demonstrated deanonymisation using social network topologies. Because every user of a social network has a different set of interrelations and a different news feed, this creates a unique fingerprint that can be used to re-identify them in another anonymised data set.

19https://www.facebook.com/policy.php
20https://twitter.com: microblogging site.
21https://flickr.com: photo sharing network.
22https://reddit.com: social news website and forum.


As noted in the AOL deanonymisation case, search queries can be harmful once the original user is re-identified. For example, search queries revealing health complaints could be useful information for insurance companies when calculating rates (Schneier 83).

In the next section, I discuss the strategies companies use to track persons across all the devices they use. By using cross-data set analysis, tracking companies can determine which devices belong to a single individual user, which enables them to build a complete picture of the user's behaviour. Subsequently, I discuss the cross-device tracking abilities of the large tech companies Google and Facebook and explain how their services can contribute to third-party tracking.

4.4 Cross-device tracking

For online marketers, it is profitable to know which devices belong to whom, because people use all kinds of technology interchangeably: mobile phones, tablets, personal computers, laptops and GPS systems (Brookman et al. 133-4). If each device is tracked individually, a tracking company builds a separate data profile per device, and a user's behavioural profile is fragmented across them. Cross-device tracking ensures that those different data profiles are connected to one single individual. Companies can use two different strategies to track data across devices and merge it into one data profile (135).

‘Probabilistic’ tracking is based on identifying unique devices with the methods discussed in chapter two, such as (ever)cookies or digital fingerprinting, and then correlating those data sets with each other to find shared attributes that help to identify the person (135). For example, when a shared attribute such as an IP address is the same at night and on weekends, it is highly likely that the same user is using those different devices. As described earlier, geolocation data is sensitive information with which to identify users (De Montjoye et al. 1-2). When this information is collected from different devices, probabilistic trackers can also compare the geolocations and check whether multiple devices belong to the same user (Brookman et al. 135). Shared attributes in data profiles, such as IP address and geolocation data, do not by themselves prove that one user is behind several devices, because multiple people may be connected to the same (public) Wi-Fi connection. This can be corrected by examining the user's behavioural browsing data: if the devices frequently visit the same websites, the estimate is reinforced (135). The accuracy of cross-device tracking is around 97% (Bilton).
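A hedged sketch of such probabilistic linking is given below: two device profiles are compared on shared night-time IP addresses, coarse locations and frequently visited sites, and linked when a weighted similarity passes a threshold. The signals, weights and threshold are invented for illustration and are not taken from Brookman et al.

# Sketch of probabilistic device linking: compare two device profiles on shared
# signals and link them if the combined similarity passes a threshold. The
# weights, signals and threshold are invented.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(profile_a, profile_b):
    return (0.5 * jaccard(profile_a["night_ips"], profile_b["night_ips"])
            + 0.3 * jaccard(profile_a["locations"], profile_b["locations"])
            + 0.2 * jaccard(profile_a["top_sites"], profile_b["top_sites"]))

laptop = {"night_ips": {"203.0.113.7"}, "locations": {(52.37, 4.90)},
          "top_sites": {"news.example", "mail.example", "forum.example"}}
phone  = {"night_ips": {"203.0.113.7"}, "locations": {(52.37, 4.90), (52.09, 5.12)},
          "top_sites": {"news.example", "mail.example", "maps.example"}}

LINK_THRESHOLD = 0.5
s = similarity(laptop, phone)
print(s, "-> same user" if s >= LINK_THRESHOLD else "-> different users")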

The other kind, ‘deterministic’ tracking, occurs when companies like Google or Facebook ask users to sign in to their services on all of the devices they use. This ensures that Facebook on a laptop can be synced with the Facebook application on a mobile phone; both the desktop site and the mobile app collect data about users and store it in the same Facebook profile (Brookman et al. 136). “Leveraging this identifying information, companies may be able to correlate user activity across other devices where the consumer uses the same
