
Master Thesis New Media and Digital Culture
University of Amsterdam

Leon Meuwese

Web Tracking & Privacy

Student: Leon Meuwese
Student number: 10484418
Phone number: +316 21705274
Email address: lsmeuwese@gmail.com
Supervisor: Anat Ben-David
Second reader: Jan Simons
Date of completion: 02/08/2014


Index

Introduction
Chapter 1: Literature review and theory
1.1 Privacy versus Public
1.2 Privacy Models
1.3 Security and Privacy
1.4 A concept in disorder
Chapter 2: Key aspects of privacy
2.1 Intimacy
2.2 Anonymity
2.3 Control
Chapter 3: Big Data, Data Mining and Web tracking
3.1 Big Data
3.2 Web Tracking
3.3 Web tracking data
Chapter 4: Method
4.1 Dataset
4.2 Method Tools
4.3 The Internet Archive's Wayback Machine
Chapter 5: Findings
5.1 General Findings
5.2 The five different trackers
5.3 A network of web tracker companies
5.4 The top ten web tracker companies
5.5 Web tracking data
Chapter 6: Areas of conflict
Chapter 7: Discussion
Conclusion
Literature


Introduction

Since the beginning of the 21st century the Internet has become a crucial and major part of Western society. It influences our culture, economy, and politics. Millions use the web on a daily basis. A recent study shows that the Internet is the most commonly used media channel in Western Europe (Laurence 2012). This massive usage of the Internet has created enormous amounts of web traffic data. Take, for example, all the Twitter 'Tweets' or all the Google search queries: they all generate a lot of web traffic data. It is estimated that in 2013 600 billion gigabytes of data was transferred over the Internet (Tokmetzis 2013).

The magnitude of this data is so large that it can be seen as part of the development called 'Big Data'. Big Data is a comprehensive term for any data set so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big Data is perhaps best explained as a development which emphasizes perceiving and understanding the relationships between different pieces of information. Big Data, in its ability to analyze large amounts of data, focuses not on causality but on correlations. The ways in which Big Data can be analyzed are numerous. According to professor Viktor Mayer-Schönberger, Big Data has the potential of changing people's understanding of lives, organizations and societies in ways hardly imaginable a couple of years ago (Mayer-Schönberger 2013, 44).

Although Big Data has great economic and scientific potential, there are also negative aspects. The gathering, storage, and mining of digital information pose privacy risks (Rubinstein, Lee, and Schwartz 2008, 267), especially because of the lack of a strong legal framework or a strong public voice to protect online privacy (Solove 2013, 34).

Privacy

The recent revelations made by Snowden about the NSA's spying programs have put privacy on the public agenda again. But privacy is a difficult term. There is no sufficiently comprehensive definition of privacy, nor does privacy have a uniform historical narrative. It is a concept which is formed by different cultural aspects and by different historical experiences. The concept of privacy has broad historical roots in sociological, legal and anthropological discussions about how broadly it is valued and preserved in different cultures. Moreover, philosophers such as Aristotle and John Stuart Mill distinguish between the public sphere of political activity and the private sphere associated with family and domestic life. There are also scholars that argue that privacy can be better explained as a concept than as a term. Within the debate, confusion and dispute remain over the meaning, value and scope of privacy.

In order to create a workable definition, this thesis will formulate three key points of privacy issues. These three key points are used to explore how web tracking can challenge the privacy of Internet users. The three privacy key points are:

1. Intimacy
2. Anonymity
3. Control

The main focus of this research is the relation between web tracking and privacy. The research question is: How has web tracking changed in the last eight years in Western Europe and in what way is web tracking challenging key aspects of privacy?

In order to answer this question and to provide structure to this thesis, there are two sub-questions. Firstly, a workable definition of privacy needs to be formulated. Therefore, the first sub-question is: What are the main different postulated properties of privacy? With this sub-question I am able to extract different key aspects of privacy which can be used to investigate the state of privacy in relation to web tracking.

The second sub-question is: How has the population of web trackers changed over the last eight years in Western Europe? In order to answer this question, the top fifty websites of six countries in Western Europe are examined. The six countries are: France, the Netherlands, Spain, Italy, Germany, and the United Kingdom. The Internet Archive's Wayback Machine enables me to research web infrastructures of the past, and the Ghostery tracker tool enables me to actually locate the trackers. With the help of these digital tools I am able to create an overview of web tracking activities of the last eight years in Western Europe.


Outline

This thesis is divided into six chapters. The first chapter introduces the theoretical framework of this thesis and covers the more political debate about privacy. In this chapter I assert that privacy does not have a clear definition, but that it can be better understood as a concept.

In the second chapter I will explain in what way I have extracted three different privacy key points. With these three key points I am able to define or measure the current state of privacy on the web.

The third chapter starts with an introduction to the Big Data landscape. The different ways in which Big Data and data mining are used are described and explained with the help of some examples, in order to clarify in what way some Big Data practices can result in new privacy issues.

A second part introduces web tracking as a significant form of data mining. A more technical part explains how web tracking actually works. This is followed by a more descriptive section which illustrates what kind of data is being tracked by common web trackers. It also takes a closer look at the differences between the diverse kinds of trackers.

Chapter four describes the method of the empirical part of this research. It explains the boundaries of the dataset and provides information about how the empirical research takes shape. The different research tools, such as the Ghostery extension tracker tool and the Digital Methods Initiative TrackerTracker tool (DMI tool), are also explained in detail. These tools are necessary in order to investigate web tracking practices of the past. The purpose of this chapter is to clarify how the empirical research has been accomplished.

The fifth chapter uses the method developed in chapter four to present the findings of the empirical research. It starts with a general overview of the findings, which provides a clear picture of web tracking practices in Western Europe over the last eight years. A second part presents the findings concerning the different kinds of trackers. A third part looks closer at the different kinds of data being tracked and at the different tracker companies. This chapter also reviews a couple of major web tracking companies: companies such as Google and Yahoo, but also lesser-known companies such as AppNexus and Rubicon.

The sixth chapter links the theory with the empirical research. It describes in what way web tracking can challenge the three established key points of privacy. The three key privacy points, intimacy, anonymity and control, are all influenced differently by web tracking. The last chapter of this thesis describes the limitations and difficulties of this research. It also describes some possible solutions and discusses what the relation between web tracking and privacy could look like in the future. This research ends with a conclusion and hopefully provides an answer to the main question of this research.


Chapter 1: Literature review and theory

Privacy

The term 'privacy' is a troubled term that does not have a comprehensive narrative. In different fields, such as philosophy, politics and law, it has different meanings and generates different discussions. The concept of privacy has, for instance, broad historical roots in sociological and anthropological discussions. The anthropologist Margaret Mead has demonstrated how different cultures all have some form of privacy, protected in different manners, for instance by concealment, seclusion or limiting admission to secret rituals (Mead 1949). Furthermore, there are scholars who claim that all privacy issues are actually based upon property issues; this view is, for example, given by Thomson (Thomson 1975). Professor Alan Westin has shown in his studies that even animals at times seek a form of privacy (Westin 1967). The fact that even animals look for privacy shows how broad and elusive the term is.

This chapter cannot describe the history of privacy, because such a history does not exist. Instead it will give several different historical notions about privacy in the Western world and provide an outline of the most prominent voices in the privacy debate.

1.1 Privacy versus Public

The word 'privacy' originates from the Latin word 'privatus', which means 'separated from the rest', and from the Latin word 'privo', which means 'to deprive'. So in a sense the term privacy speaks of a separation or a distinction. This is clearly visible in the work of the Greek philosopher Aristotle, who made one of the first mentions of the concept of privacy. He made a distinction between a public sphere and a private sphere. He saw a clear difference between public life and private domestic life. The 'polis' was a place for political activity, but the 'oikos' was a place for the family (Trull 2013, 7).

Aristotle's classical distinction between the public and the private sphere influenced the English philosopher John Stuart Mill. He used Aristotle's dichotomy in his work 'On Liberty', which was published in 1859. He also believed that there was a private and a public sphere. The public sphere was an area that was controlled by the public, and thereby by politics. The private sphere was controlled by the individual. Privacy, in this view, is something bipolar (Mill 1859, 25). With this distinction Mill formulated his famous principle of freedom. He argues that there is ultimately only one reason for humans to interfere with other humans: self-protection. Within this reasoning, political actions against individuals are only justified if those individuals' actions are harmful to other people. This liberal view of freedom can almost be seen as a kind of demand for privacy. The state is by default not allowed to interfere with someone's personal life. Within this reasoning it is even unlawful for governments to make wearing a motorcycle helmet obligatory. Making something like that obligatory is, following Mill's liberal view, a violation of personal freedom and in essence illegal. The choice not to wear a helmet is a personal choice; making wearing a helmet obligatory limits the freedom of the individual.

Another scholar who distinguishes two areas of interest is the German scholar Jürgen Habermas. He describes a model of modern society which conceptualizes the relation between the private and the public. According to him, the world we live in is divided into a private sphere, which includes family, private households and intimacy, and a public sphere. The public sphere is described as a communicative network that enables (private) persons to take part in culture and the formation of public opinion (Habermas 1987, 319). Although this dichotomous view of privacy helps explain the term, the technological developments of the 21st century have made this strict division harder to maintain.


1.2 Privacy models

There are various ways of looking at privacy. The most common viewpoint in the new media field is primarily based on the surveillance model. This model is roughly based upon George Orwell's 1984. But there are other interesting ways of looking at privacy. Philip Agre, for example, introduces the capture model, which emphasizes data mining by both governments and private corporations. Another well-known privacy debate concerns the boundary between privacy and security. A last approach is given by Daniel Solove, who uses different privacy issues as milestones in his explanation of privacy.

The surveillance model assumes that humans are being watched, either by governments or some other power, but that no action is taken against an individual until the law is broken. Visual metaphors like Bentham's panopticon (Vaidhyanathan 2011, 111) and Orwell's '1984' dystopia have dominated discussion of contemporary developments in privacy and surveillance issues. The surveillance model is by far the most common view in the literature on privacy. It is found, for example, in definitions of privacy in terms of the right to be left alone, or in concerns over misused private information (Agre 2003, 744). Most literature on technology and privacy is in line with the surveillance vision. The modern history of this model is based upon historical experiences with secret police organizations, like the Soviet KGB or the government of Nazi Germany. The recent eavesdropping scandal around the NSA's PRISM program is another example.

Although the surveillance model offers interesting insights, it is no longer sufficient. The technological developments of the 21st century have outdated this model. Philip Agre introduces a new model which can cope with these developments: the capture model. Agre uses the term capture, a common term among people in IT. The capture model focuses on the 'capture', or gathering, of information, also known as data mining. Increasing correlation of this kind of gathered information has become an acknowledged corporate goal. The capture model is based on the idea that humans leave bits of information everywhere they go. The verb 'to capture' here means gathering or obtaining certain data (Agre 2003, 744).

The capture model, like the surveillance model, is a metaphor-system and not a factual description of privacy. It describes the collection of information about human activities: a situation that results when grammars of action are imposed upon human activities. Human activity is hereby described as a kind of language, for which a good representation scheme provides an accurate grammar. The capture model also actively interferes in and reorganizes human activities. In doing so it marks spaces in structural metaphors: data is being captured and activities are represented by computers in real time (Agre 2003, 746). The capture model, furthermore, procures and stores different data locally, depending on the institution. So, in contrast with the surveillance model, the capture model is decentralized. A final aspect of the capture model is that it is driven by private organizations with 'quasi-philosophical' goals. In other words, the capture model is applied by private corporations to acquire human trends and interests, and not by a single power or government (Agre 2003, 738). Philip Agre's view is a good and necessary update of the surveillance model. It deals better with the technological developments of the 21st century, and it does not reduce surveillance to an outdated idea. Furthermore, the contrast between the two models can help to understand the difficult nature of privacy in the digital age. Different scenarios or cases with different consequences for privacy can best be explained or understood by looking at which model fits the new media issue at hand.

1.3 Security and Privacy

Another problematic issue in the privacy debate is the question where the boundary lies between privacy and security. The best example can be seen in the United States. There, after the attacks on the Twin Towers on September 11th, the United States government demanded more room in its fight against terrorism. The Patriot Act made it easier to collect personal information, especially digital information (Osher 2002, 524). This affected not only the privacy of American citizens. Because online traffic is digital and a lot of web traffic travels through American territory, the Patriot Act affects people all over the world. Popular websites, such as Facebook and Google, have their servers located in the USA. So, American intelligence agencies are (in some cases) allowed to read a Spanish email sent via the popular email service Gmail.

Defenders of the Patriot Act argue that the loss of privacy is compensated by the gain in safety. Daniel Solove calls this "the false trade-off between privacy and security". Sacrificing privacy does not automatically mean that we are more secure. The answer is not to surrender our privacy, but to create security measures with oversight and regulation (Solove 2011, 37). This example shows that privacy issues and privacy legislation are international problems. The Internet transcends national borders, which makes dealing with privacy issues, notably legislation problems, very difficult.

1.4 A concept in disorder

Another scholar with a new approach to the term privacy is the American scholar Daniel Solove. According to Solove, privacy cannot be understood independently from society. This is similar to the view of sociologist Barrington Moore, whom he recalls. Moore argues: "the need for privacy is a socially created need. Without society there would be no need for privacy." (Moore 1984, 73).

More importantly, Solove argues that since information technology has created a new context for privacy, it is better to look at the term in a different way. He claims that privacy is a concept in disorder and that traditional theories are often too abstract to really explain the concept. He offers a different, more practical approach. Instead of focusing on a definition of privacy, he focuses on defining privacy issues. His aim is to define certain activities that fall under the rubric of privacy. He calls this concept the "taxonomy of privacy" (Solove 2006, 483). Shortcomings of existing theories, such as their inability to deal with the technological developments of the 21st century, are solved in this way.

Although his approach is to some extent pragmatic, it does seem to deal with the difficult concept of privacy on some levels. Solove's practical view makes it possible to create sustainable policy and can help with the creation of new laws. His approach is also better suited to cope with the rapid pace of the technological developments of the 21st century. This research follows Daniel Solove's view, although in a slightly different way, in dealing with privacy in the digital age. The next chapter explains how.


Chapter 2: Key aspects of privacy.

As seen in the previous chapter, privacy is a troublesome term. In order to create a workable concept of privacy, this research explains privacy by extracting three key points of privacy interests. The three points are: intimacy, anonymity and control. This chapter explains and describes these three points and simultaneously answers the first sub-question of this research: What are the main different postulated properties of privacy?

2.1 Intimacy

Privacy is often seen as something legal or something cultural, and the debate around privacy often revolves around these kinds of issues. But this misses an important step in the discussion, because privacy is, at a more fundamental level, more than a legal or cultural issue. Privacy refers to what makes us human. In order to have a meaningful discussion about privacy, the debate should start with the observation that privacy, above all, is a human characteristic. In this thesis this fundamental part of privacy is termed 'intimacy'.

Intimacy is the most basic form of privacy. It can be described as the private space inside every human, such as personal thoughts and feelings. The space where personal thoughts and feelings are hidden is considered private in the sense that it is only accessible to that particular person. Nobody else is able to see what a person is thinking, feeling or experiencing. Payton and Claypoole describe this fundamental aspect of privacy as the 'circles of privacy'. In the middle of a circle of privacy are the personal secrets, thoughts, and rituals that people keep to themselves (Payton and Claypoole 2014, 56).

The basic form of privacy is also important when looking at society. Solove, for instance, argues that without society there would be no need for privacy. A society without any form of privacy is, at some level, a world where it is impossible to have secrets, while it is of utmost importance that someone is able to have secrets. Complete transparency sounds like something worth aspiring to, but full transparency also means no secrets. A state where everything is public and open only leaves space for the dominant view. A world where everything is public by default is a world in which everything that is made private will have an association of guilt attached to it. Secrets also play a crucial role as a counterweight to existing power structures.


A well-known argument concerning privacy is 'I have nothing to hide'. But this claim misses a crucial point. It does not matter whether a person has something to hide or not; what matters is that a person is able to hide something when needed. The ability to hide something is very important as a counterweight to prevailing views. If it is not possible to conceal things, everyone would have to always agree with existing ideas and beliefs.

An example concerns homosexuality. Until recently homosexuality was forbidden, and in some countries it still is. Nowadays we find it strange that it was ever prohibited. In those days privacy was crucial for homosexuals to develop and explore their feelings. Moreover, it was important for creating awareness and eventually achieving legalization. The ability to do some illegal things is a requirement for the development of society.

The key point intimacy also plays a role in personal relationships. James Rachels describes in 'Why Privacy Is Important' that privacy is a requirement for having (meaningful) personal relationships. Personal relationships are based on personal trust, and there can only be trust if things can be kept confidential. According to Rachels, personal relationships lose their diversity and depth if we are not able to have secrets (Rachels 1975, 323). No privacy means that everyone is like an open book. Rachels goes even further by claiming that it would be impossible to have any personal relationships at all without privacy (Rachels 1975, 326). Although Rachels's assumptions about maintaining personal relations seem accurate, they are a bit far-fetched. Through blogs, social media and data mining, one could argue, we have already lost some privacy, but the diversity and depth of our personal relationships have not been affected by this loss of privacy in the online world. The American researcher Norman Mooradian also argues that Rachels's theory misses the more important emotional aspects of personal relationships (Mooradian 2009, 165).

That intimacy is an important key aspect of privacy is also visible in the work of Samuel Warren and Louis Brandeis. They described privacy in 1890 as 'the right to be let alone'. They valued privacy as something worth protecting and stated that the law should protect the 'private domain of individuals'. This domain includes things such as a person's body, home, property, thoughts, feelings, secrets and identity. Warren and Brandeis are important legal scholars who have influenced a large part of legislation about privacy.

A lot of legislation about privacy still aims to protect the 'inner' circle of people. Privacy, for example, is recognized as a fundamental human right. In 1948 the Universal Declaration of Human Rights was adopted by the United Nations General Assembly. Article 12 of this important legal document addresses the privacy of individuals and makes it clear that a person's privacy is something worth protecting.1

Furthermore, various national laws protect people's privacy. An example is article 201 of the Dutch Criminal Code, which sets a penalty of a maximum of one year imprisonment if a person illegally opens someone else's letter.2 Although this seems like a clear law, it becomes more difficult to apply when an email is regarded as a letter.

1 The Universal Declaration of Human Rights. United Nations. 14 February 2014. <http://www.un.org/en/documents/udhr/index.shtml>.

2 Wetboek van Strafrecht. Wetboek Online. 20 June 2014.


2.2 Anonymity

Anonymity is the second key point of privacy used in this research. The key point 'intimacy' addressed what 'private' means, or could mean, in relation to the term 'privacy'. This key point takes a more detailed look at the term public in relation to privacy. The term public is sometimes problematic in relation to privacy issues, because public is seen simply as the opposite of private. But this distinction does not always hold. There is a grey area in which it is not always clear whether something is private or public.

‘Being public’ is not the same as ‘no privacy’. When being in public there is still some level of privacy. Helen Nissenbaum refers to this as ‘privacy in public’ (Nissenbaum 1998, 560). The private sphere of a person is not restricted to a person’s neural network (his brain). When someone writes a thought in a diary, this thought is also part of his inner being. The thought written on paper can be considered just as intimate as the actual thoughts in his mind.

Although the web is not the same as a diary, it can function in a similar way. With the Internet being the richest source of information, a lot of people express their thoughts and inner beings in the form of questions. Search queries from the released AOL query dataset give a rich insight into these questions. They cover the complete spectrum of human experience, including the most vulnerable aspects, like anxieties, fears and sexuality. People are even considered more open and honest towards a search engine than towards their own family. This makes the Internet, at some point, both private and public.

Another example that shows the difficulties around the term public in relation to privacy is the California v. Greenwood case of 1988.3 Billy Greenwood was an American citizen accused of drug trafficking. The police had searched the garbage bag in front of Greenwood's house without a warrant. The essence of this case was the question whether the police officers had violated the Fourth Amendment, also known as 'the right to privacy', when they checked the garbage bag of Mr. Greenwood. The court majority declared that the Fourth Amendment was not violated, because a garbage bag placed in front of the house counts as a public place (Nissenbaum 1998, 74).

Another example, which on a more practical level illustrates that anonymity is a key aspect of privacy, is given by danah boyd. She shows that there is a difference between making something public and being public. She argues that a conversation between two persons in a loud bar is a private conversation. If the music suddenly stops, the conversation can be heard by everybody in the bar and is thus not private anymore. boyd calls this 'security through obscurity' (boyd 2008, 14). This private-public problem is also perceptible on Facebook. Facebook's News Feed update of September 2006 showed personal profile changes on every friend's news feed page. Although someone's relationship status on Facebook is public information, that does not mean that Facebook should actively show this status to every Facebook friend or user. The more active news feed broadened the reach of social information and people felt that their privacy was being violated (boyd 2008, 16). So, although Facebook is clearly a public space, people still expect a form of privacy.

Data collecting is done under the assumption that it involves anonymized data. But anonymized data can be de-anonymized. The debacle following AOL's public release of 'anonymized' search records of many of its users highlights the potential risk for individuals when private companies share personal data. In this case, the anonymized search records released by AOL were not hard to de-anonymize. The movie 'I Love Alaska', made by Lernert Engelberts and Sander Plug, illustrates how search records can provide a very personal insight into the life of a person, in this case AOL user #711391.4

4 I Love Alaska. 2009. Minimovies. 20 June 2014.


2.3 Control

A third aspect of privacy is control over personal information. Control is especially important in the relation between privacy and power, because it is important to know what others know about you. This control over personal information has (at some level) been lost in Western society. Gilles Deleuze describes this society as a society of control (Deleuze 1992, 4): a society which is embedded in personalized mass surveillance.

The scholar Kevin Haggerty speaks of 'data doubles'. With this term he explains the separation between the real person and the digital person (Haggerty and Ericson 2000, 606). Companies create consumer profiles to improve their service and to target specific markets. For these companies the profile of the consumer becomes more real than the actual person. Hence the data double is more real than the actual individual, all the more so because only the data double is being checked and judged.

Companies can use the data they collect to discriminate based on what you are instead of who you are. They could reduce a person's agency to, for instance, a likely future heart patient (if genetic data were available), someone prone to depression, or an addict. This data could also be used in decisions about hiring or firing staff. So, control over personal data is vital to control how companies perceive their customers.

Control over personal information is also an important aspect of the view of Warren and Brandeis. Their aim was to change the privacy laws of their time by explaining the meaning of privacy. The legal protection of privacy in their time was no longer sufficient: new technologies made it necessary to create new and more suitable laws in order to protect privacy. The mass media and the photo camera made it possible to violate privacy in new ways. A newspaper, for example, could easily publish private information about an individual. The answer to these new privacy issues was control over personal information. This was an important aspect of the philosophy of Warren and Brandeis (Warren and Brandeis 1890, 14).

The American writer William Prosser also tried to grasp the difficult notion of privacy by focusing on control over personal data. According to Prosser, privacy consists of four different areas. These areas are not stand-alone comprehensive areas, but they clarify the different aspects of privacy. The four areas are: intrusion upon a person's seclusion or solitude, or into his private affairs; public disclosure of embarrassing private facts about an individual; publicity placing one in a false light in the public eye; and the appropriation of one's likeness for the advantage of another (Prosser 1960, 388). These four areas are still usable in the 21st century. Reading someone's email is an example of intrusion upon a person's private affairs. Placing humiliating photos on the Internet is a good example of public disclosure of embarrassing private facts about an individual, and also of publicity placing one in a false light in the public eye. All these seemingly different cases of privacy invasion are mainly based upon access to, and control over, personal information.

At some level the privacy issues described are bigger nowadays than they were in the 19th century. The Internet has made it very easy to distribute personal facts about a person. Prosser could not have foreseen that the privacy issues he described could become so much bigger.

Control over your personal information is furthermore important to ensure that governments cannot easily punish citizens they dislike. Because there are so many rules and laws, everyone is bound to break one or two of them. Without privacy, the government would be able to punish people completely at random. Privacy is therefore important to protect the individual against the state.


Chapter 3: Big Data, Data Mining and Web tracking.

There are multiple ways to collect data on the web. The mere fact that data is collected is in essence not the source of the concerns about privacy. What matters most are the applications made possible by the newly created (almost always digital) databases. This chapter will examine these applications and starts with a short introduction to the field of Big Data and data mining. The second part of this chapter provides a more technical explanation of the practice of web tracking. It answers questions such as: what are web cookies and web trackers, and what do they do?

3.1 Big Data

Big Data is a concept used to indicate the rising use and creation of massive datasets. The importance of Big Data is described by the scholar Rita Raley, who writes about the rise of Big Data practices and speaks of a 'Data Renaissance'. With this term she means that Big Data is increasingly becoming the engine of our information economy (Raley 2013, 123). Big Data has also been described as a 'gold mine' and as the 'new raw oil' of the Internet (Bridgwater 2013). This analogy stands for the process in which an enormous amount of 'raw' data needs to be collected, mined and treated before it becomes useful. This development creates a lot of economic growth and opportunities.

This economic growth is partially based on the ability of Big Data to make things more efficient and to discover new patterns which were hidden before. A well-known illustration of how Big Data is able to make a great impact is the Google Flu project, in which extensive sets of Google search data were used to estimate flu activity in the United States (Mayer-Schonberger and Cukier 2013, 368).

A crucial part of Big Data is data mining. Data mining refers to a series of techniques used to search immense sources of digital information and find meaningful patterns. The term originates in the interdisciplinary field of computer science. The methods used for the analysis include artificial intelligence, machine learning, statistics and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the collecting part, it also involves database and data management (Rubinstein, Lee, and Schwartz 2008, 261).

Data mining has two main practices: pattern-based and subject-based. Pattern-based mining searches and selects individuals for further research by analyzing large data sets for specific data linkages and certain models. If an individual fits the profile, he or she will get investigated. Governments and companies can act upon a particular profile or pattern and, as a consequence, may intrude in known and unknown ways into the lives of innocent people (Rubinstein, Lee, and Schwartz 2008, 262). A schoolboy named Vito was questioned by the FBI after posting a tweet mentioning President Obama and suicide bombers. Vito simply fitted the profile of a possible terrorist.5

Subject-based searches usually start from the predicate of reasonable suspicion. This way of data mining simply accelerates the process by which law enforcement or intelligence agents gather relevant information about subjects they already suspect of a crime or violation (Rubinstein, Lee and Schwartz 2008, 264). This form of data mining is less challenging to the privacy of people, but the collecting of 'public' data still poses privacy risks.

The collecting of data by either companies or governments is not a new practice in information technologies, but it is clear that there have been qualitative as well as quantitative changes in the last decade. The intensity and scale of data mining are growing. The gathering of private information is accelerating as computing becomes ever more integrated with our everyday personal lives. An example is the rise of smartphone usage worldwide: all these phones generate a lot more data. The size of data storage has surpassed conventional metrics (Mayer-Schonberger and Cukier 2013, 89).

There is also a qualitative difference. Data mining at present is not just descriptive monitoring; it also involves predictive speculation and prescriptive acting (Raley 2013, 124). So it is not only possible to look for knowledge in big data sets; computers are also capable of making forecasts and discovering patterns.

Macro and predictive

The applications of data mining can be divided into two types, which will be called 'macro' and 'predictive' analytics. Macro applications use the data of large datasets to see larger trends. The questions this method answers are about economic growth, weather, disease spread and other higher-level phenomena. Economic growth could, for instance, be predicted on the basis of information about the amount of traffic near supermarkets, measured using the location of smartphones. In this type of research the individual is not of special interest; only the combination of all individuals shows the general trends that users of the data are interested in. Another example is the application based on Twitter tweets that predicts the general level of anxiety in society. Based on this level, stock brokers try to predict fluctuations in the market (Siegel 2014, 230).

In contrast to this type is the tactic of predictive analytics. This term is based on the work of Eric Siegel. In his book Predictive Analytics he explains how different (usually mathematical) methods can gain insight into individual people and how corporations can benefit from these methods. An example is the use of a decision tree that determined whether a customer of a mortgage bank should be contacted or not, based on an array of features about that person (Siegel 2014, 387). This means that customers are no longer all treated the same way; the treatment is tailor-made.
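As a minimal sketch of how such a tailor-made decision could be automated, the following Python example trains a small decision tree on hypothetical customer features. The feature names, training data and predicted outcome are invented for illustration only; the models Siegel describes are far more elaborate.

```python
# A minimal, hypothetical sketch of a "contact this customer?" decision tree.
# Feature names and training data are invented for illustration only.
from sklearn.tree import DecisionTreeClassifier

# Each row: [age, outstanding_mortgage_eur, missed_payments, years_as_customer]
X = [
    [34, 180_000, 0, 3],
    [51, 90_000, 2, 12],
    [42, 250_000, 0, 7],
    [29, 140_000, 1, 1],
    [63, 30_000, 0, 20],
    [45, 210_000, 3, 9],
]
# Label: 1 = contacting this customer paid off in the past, 0 = it did not.
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Score a new customer profile: the "data double" is judged, not the person.
new_customer = [[38, 200_000, 0, 5]]
print(tree.predict(new_customer))        # e.g. [1] -> contact this customer
print(tree.predict_proba(new_customer))  # class probabilities
```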


3.2 Web Tracking

Although we can learn a lot from Big Data practices, the gathering of large amounts of (personal) data also enables a whole new kind of control. This is especially the case on the web, where (almost) every transaction is being watched and monitored. The Snowden leaks, for example, have made it clear that a lot of web communication is scanned by the NSA (Macaskill 2013).

But intelligence agencies are not the only ones monitoring Internet traffic. Web tracking is in some ways a hidden kind of surveillance, of which personalized and targeted advertising is just a small visible aspect. 'We are not only watched by governments but also by private corporations. We do not live in a surveillance state but in a surveillance society' (Whitaker 1999, 120). Large Internet companies track web users in order to create digital profiles. These digital profiles provide companies with valuable information about their customers and are used to send them targeted advertisements. The ability to track users and their online behavior can be very lucrative for advertising companies: in 2013 the total online advertisement market was estimated at about 2.5 billion US dollars (Dagar et al. 2013, 3). But there is a serious side effect of this tracking. It challenges the privacy of web users. People online are knowingly or unknowingly sacrificing their privacy in exchange for convenience.

There are multiple ways to collect data on the web. Here it is important to state that, where privacy is concerned, there is no distinction between the tracking of an object and the tracking of a human. Agencies and companies act upon collected data, whether that data belongs to a person or an object (Agre 2003, 743). Actions are sometimes taken based only upon the data, and not upon the person.

Cookies

The infrastructure of the web is based upon the Hypertext Transfer Protocol (HTTP). HTTP determines the rules of communication between the web client (usually the web browser) and a web server. This way of communicating is designed as a stateless, one-way traffic system: the web browser asks for information and the server supplies it. Some modern web applications, though, require that the server knows who is asking for information, in order to generate the illusion of a fixed connection between the client and the server (Nikiforakis et al. 2013, 541).

When a person surfs the Internet, data is collected about that person, such as his IP address. Part of this data is used to make Internet browsing more efficient and convenient for the user. For example, it makes auto-login and preferred personal settings possible. It also provides website owners with information about their visitors, which they can use to improve the website or their services. Web tracking, furthermore, provides Internet companies with valuable data which they can sell to advertisement companies. In this way web tracking helps web companies stay in business.

The idea of web tracking comes from programmer Lou Montulli. He is famous for his work on web browsers, and in 1994 he was working for the Internet company Netscape Communications when he introduced the idea of cookies in the context of an Internet browser (Nikiforakis et al. 2013, 541). The cookie mechanism allows a web server to store a small amount of data on the computers of visiting users, which is then sent back to the web server upon subsequent requests. Cookies were quickly embraced by browser vendors and web developers (Nikiforakis et al. 2013, 542).

So, when a person visits a particular website, a small piece of data is stored on that person's computer. That data is called a cookie. Cookies do not contain any software, nor do they hold personal information about a person by themselves. Mostly, cookies are simply a string of text that works as a unique identifier or as a kind of label. Every time a person loads a particular website, the host of that website looks for a possible cookie. When a cookie is found, the website knows that the person has visited the website before. It can then act accordingly and, for example, load the preferred settings for that person. When a different computer is used or when cookies are deleted, the host will see the visitor as if he has arrived for the first time.
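As a minimal sketch of this mechanism, the Python example below hands out an identifier via a Set-Cookie header on the first request; the browser then returns it in the Cookie header on every subsequent request. The cookie name "visitor_id" and the identifier format are chosen purely for illustration and do not correspond to any particular tracker.

```python
# Minimal illustrative cookie-setting server (not how any particular tracker works).
# The cookie name "visitor_id" and the identifier format are invented for this sketch.
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

class CookieDemo(BaseHTTPRequestHandler):
    def do_GET(self):
        cookie_header = self.headers.get("Cookie", "")
        self.send_response(200)
        if "visitor_id=" in cookie_header:
            # The browser sent the identifier back: this is a returning visitor.
            body = f"Welcome back! Your cookie: {cookie_header}"
        else:
            # First visit: hand out a unique identifier via Set-Cookie.
            visitor_id = uuid.uuid4().hex
            self.send_header("Set-Cookie", f"visitor_id={visitor_id}; Path=/")
            body = "First visit: a visitor_id cookie has been set."
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CookieDemo).serve_forever()
```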

Although cookies are made to be a reliable instrument for websites to remember information or to record the user's browsing activity, it is also possible to use this data in a different way. Web trackers are able to store all sorts of private data: data like search histories, page views, location and the hardware used. Internet users are willingly or unwillingly sacrificing their privacy by granting companies access to their private information. Although in most cases this kind of information will not harm or damage a person, it still breaches the private sphere. As mentioned in the previous chapter, not everything that is publicly available has to be made public. Being public and making something public are two different things. This problem becomes bigger when companies are not only using one web tracker to establish user profiles, but create a large network with which they are able to record and track everything a person does online. Moreover, there are also companies that combine collected data with other publicly available information.6 Companies have always strived to know as much as possible about their customers, because they can use this information to improve their product or profit. The online advertisement industry is a lucrative business, and the Internet is a place where gathering information about potential customers has never been so cheap and easy.

6 The nine companies that know more about you than Google or Facebook. Quartz. 20 June 2014.

Cookies are mostly harmless, except for third-party cookies. These cookies are not set by the website itself, but by web banner advertising companies. Third-party cookies are considered dangerous because they collect the same information that regular cookies do, such as browsing habits and frequently visited websites, but then pass this information on to other companies.
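The toy simulation below illustrates the underlying problem rather than any real browser internals: because the same third-party cookie is sent along whenever a page embeds content from the same ad network, visits to otherwise unrelated sites end up linked in a single profile. All site names and the cookie value are invented.

```python
# Illustrative simulation (not real browser code): how one third-party cookie
# lets an ad network link visits on unrelated sites into a single profile.
ad_network_profiles = {}   # cookie value -> list of (site, page) visits

def visit(site, page, third_party_cookie):
    # The page embeds an ad from the network, so the browser sends the
    # network's cookie along with the request for that ad.
    ad_network_profiles.setdefault(third_party_cookie, []).append((site, page))

cookie = "abc123"  # same identifier, stored once by the ad network
visit("news-site.example", "/politics", cookie)
visit("shop.example", "/shoes", cookie)
visit("health-forum.example", "/anxiety", cookie)

print(ad_network_profiles[cookie])
# [('news-site.example', '/politics'), ('shop.example', '/shoes'),
#  ('health-forum.example', '/anxiety')]
```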

As mentioned in chapter two, a key point of privacy is having control over who has access to one's personal data. Third-party cookies are a clear example of modern technological developments that make control over personal information more difficult.

Web tracking

Online web tracking takes place on multiple levels. Sometimes the tracking practice is clearly visible, for instance when someone accepts the usage of a web cookie. But sometimes the monitoring and tracking is unknown or unclear. People are then unaware of the tracking and of its consequences. A recent study has shown that, although awareness is rising, most people are not aware of web tracking (Krishnamurthy, Naryshkin, and Wills 2011, 1).

Cookies are the most well-known form of online tracking, but there are other ways for companies to track browsing behavior on websites. There are, for example, Flash cookies, also known as 'locally shared objects'. These are pieces of information that Adobe Flash can store on a user's computer. They are designed to save data such as video volume preferences or a user's scores in online games. Flash cookies have caused controversy because they cannot be deleted through the browser's normal cookie controls, while regular web cookies can (Berry 2013, 37).

Another way of tracking uses so-called 'server logs'. When a person loads a page on a website, their computer makes a request to that website's server. The server logs the type of request that was made and stores information such as the IP address, the date, the time and the kind of browser. Server logs form the basis for web analytics and can only be seen by the owners of the particular website (Zimmer 2010, 509).
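To illustrate how much a single log entry already reveals, the sketch below parses one line in the common 'combined' log format used by many web servers. The log line itself, including the IP address and user agent, is made up for this example.

```python
# Parse one (fabricated) web server log line in the common "combined" format.
import re

log_line = ('93.184.216.34 - - [01/Apr/2014:10:05:12 +0200] '
            '"GET /news/article.html HTTP/1.1" 200 5120 '
            '"http://www.example.com/" '
            '"Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0"')

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

match = pattern.match(log_line)
if match:
    entry = match.groupdict()
    # The server owner now knows who asked for what, when, coming from where,
    # and which browser and operating system were used.
    print(entry["ip"], entry["time"], entry["request"], entry["user_agent"])
```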

3.3 Web tracking data

After explaining how a tracker works, this next part examines the different kinds of information a tracker is able to track. There are a lot of different web trackers; the web tracking tool Ghostery alone has identified over 1,900 different web trackers.7 These different trackers are all capable of collecting different kinds of data, and how long a tracker saves the collected data can also vary. It is beyond the scope of this research to investigate each of the 1,900 trackers. Instead, this research focuses on some of the frequently found web trackers in the top fifty websites of Germany, France, Italy, the Netherlands, Spain and the United Kingdom. The web trackers that appear often in this research are also more challenging for the privacy of web users.

The original purpose of a web tracker is measuring web traffic. Web trackers track and save things like page views, ad views, clickstream data and search queries. With this information, web developers and web companies can improve their website and measure how many visitors a website has had. Another obvious thing a web tracker is able to record is the visitor's IP address. An IP address provides a lot of other meaningful data. IP stands for Internet Protocol; basically, an IP address is like a phone number for computers and is therefore essential for communication between computers. There are multiple things to extract from an IP address. It is, for instance, possible to derive the user's approximate location from an IP address. Another thing web trackers are able to see is the kind of hardware or software a visitor is using.
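As a small sketch of how a tracker could derive location and software details from data it already has, the example below looks up an IP address in a local MaxMind GeoLite2 database and inspects a user agent string. The geoip2 package and a downloaded GeoLite2-City.mmdb file are assumed to be available; the IP address and user agent string are invented.

```python
# Hedged sketch: derive location and browser/OS from data a tracker already sees.
# Assumes the geoip2 package and a local GeoLite2-City.mmdb database file.
import geoip2.database

ip_address = "93.184.216.34"          # example IP, invented for this sketch
user_agent = "Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0"

reader = geoip2.database.Reader("GeoLite2-City.mmdb")
response = reader.city(ip_address)
print(response.country.name, response.city.name)   # approximate location
reader.close()

# The user agent string reveals the operating system and browser version.
print("Windows" in user_agent, "Firefox/28.0" in user_agent)
```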


Another aspect which influences the possibilities of a tracker is how long the gathered information is saved. The longer the information is saved, the more it can be used to analyze the online behavior of a person. Some trackers save the tracked data for about three months, like the 'ScorecardResearch Beacon' tracker. Other trackers save the data for as long as three years, like the AppNexus tracker. Unfortunately, it is not always possible to find out how long data is saved.

A next interesting aspect concerns the question whether the collected data is shared with other companies. If a web tracking company shares collected data with another company, the web user loses control of his or her personal information. As described in chapter 2.3, maintaining control of one's personal information is an important aspect of maintaining one's privacy.

As described, there are numerous things a web tracker can track. This research focuses on nine different data components in order to structure the analysis of the web tracking companies. The nine data components are:

- IP addresses
- Page views
- Kind of software
- Kind of hardware
- Email addresses
- Search histories
- Clickstream data
- Location data
- Ad views


Chapter 4: Method

Web historiography

This is a research about Internet websites, which uses archived websites in order to investigate the history of web tracking. Because of the growth of more advanced websites (i.e. web 2.0 platforms) it has become less and less clear where the structure of a website starts and where it ends. Websites are increasingly composed of dynamically generated third-party objects and functionalities such as embedded content, social plugins and advertisements (Mayer and Mitchell 2012, 413). Through these integrations, websites and platforms mutually shape each other and establish relations with external actors. This is an important development concerning website archiving (Brugger 2012, 757). The definition of Anne Helmond is used to define a website: she considers a website to be more than the 'visible' website and regards the interaction with the environment, the website's ecosystem, as the proper and total website (Helmond 2013, 4).

While in website archiving the website is often considered the main unit to be archived, it is not just the website, but the total website ecosystem, that should be archived. The Internet Archive collects the entire webpage's source code, which makes more meaningful research possible.

4.1 Dataset

The first step in this empirical research is establishing a representative and workable data set. This research focuses on the top fifty websites of six different European countries. The selected countries are: France, Germany, Italy, the Netherlands, the United Kingdom and Spain. Together these six countries are used to represent Europe; they make up 66% of the total European population.8

The websites with the most visitors are more useful for this research on the use of trackers than websites with just a few visitors, because they represent a larger part of total Internet usage. The top fifty websites per country are provided by the Internet company Alexa. However, this does not mean that the lists are one hundred percent accurate. Alexa's traffic estimates are based on data from Alexa's global traffic panel, which the company claims is a sample of all Internet users. This panel consists of millions of Internet users using one of 25,000 different Internet browser extensions.9 Alexa is an Internet company that provides commercial web traffic data. It is located in California and owned by Amazon.

The global traffic rank, as Alexa calls its lists, measures how a website is doing relative to all other sites on the web over the past three months. This rank is calculated using a combination of the estimated average daily unique visitors to the site and the estimated number of page views on the site. The site with the highest combination of unique visitors and page views is ranked number one. Because this research wants to examine the differences between European countries, it uses the Alexa Traffic Rank by Country. Furthermore, Alexa only looks at the domain level of websites. For example, www.cnn.com, edition.cnn.com and cnnnews.com are all seen as one website: they are all considered part of the same domain, cnn.com.10

The way Alexa creates or measures Internet traffic has led to some criticism. The Alexa software automatically installs itself, without notifying the user, and gathers information about individual surfing behavior, also without notifying the user. The Alexa software can therefore be described as 'spyware'. So, ironically, in order to acquire the six top fifty website lists it was necessary to use an Internet company that works with web trackers.

The data set was constructed by consulting the archived Alexa country top fifty lists of most visited websites on the 1st of April 2014. The six lists were used in combination with the Internet Archive's Wayback Machine in order to collect archived websites for the years 2006 and 2010. The full list is available in appendix 1. In short, the total data set consists of 50 different URLs per country. That makes 300 URLs per year, and a total of 900 URLs for this entire research. In the beginning of chapter 4 the URL list is examined and further explained.

9 About Us. 2014. Alexa. 26 June 2014. <http://www.alexa.com/about>.
10 About Us. 2014. Alexa. 26 June 2014. <https://alexa.zendesk.com/hc/en-us>.


4.2 Method Tools

The Browser

To create the most accurate tracking data, this research used the Internet browser Firefox, version 28.0, with only the Ghostery extension tracker tool installed. Furthermore, all web tracking data was collected from the same location and over the same Wi-Fi connection: the Eduroam Wi-Fi connection of the University of Amsterdam. A last precaution concerns the web browser itself: Firefox was reinstalled every time the collection of tracking data for a different country started. All these technical measures were taken to reduce the number of variables that could influence the collected data. When a browser has multiple extensions installed, chances are that they influence each other. When, for example, the extensions 'Disconnect' and 'Ghostery' are activated at the same time, the results are different than when only Ghostery is enabled. This research therefore only uses a browser with Ghostery installed.

Ghostery

Ghostery is a free privacy browser extension for multiple Internet browsers. The owner is an advertising and privacy technology company called Evidon. The extension enables users to see which trackers are embedded in the websites they visit. This research uses Ghostery for Firefox version 5.1.2 in combination with Firefox version 28.0.

The Ghostery extension scans the source of a page for trackers. This means that it looks for scripts, pixels and patterns of trackers and matches them against its database of over 1,900 known trackers.11 Ghostery uses simple string matching, which is matching a number of characters in a code string, and regex, regular expressions in computer language, as its method to detect web trackers.12 It displays the trackers found and gives the user the option to block them. Ghostery has created five tracker categories: advertising, analytics, beacons, privacy and widgets.
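A minimal sketch of this detection approach is shown below. It matches a page's HTML source against a tiny, hand-made dictionary of patterns; the pattern list and the sample page are invented for illustration and are far simpler than Ghostery's actual library.

```python
# Toy tracker detector: string/regex matching against a small pattern library.
# The pattern list and the sample HTML are invented; real libraries are much larger.
import re

TRACKER_PATTERNS = {
    "Google Analytics": r"google-analytics\.com/ga\.js|googletagmanager\.com",
    "DoubleClick":      r"doubleclick\.net",
    "Facebook Connect": r"connect\.facebook\.net",
}

def detect_trackers(html_source):
    """Return the names of known trackers whose pattern occurs in the page source."""
    return [name for name, pattern in TRACKER_PATTERNS.items()
            if re.search(pattern, html_source)]

sample_page = """
<html><head>
  <script src="https://www.google-analytics.com/ga.js"></script>
  <script src="https://ad.doubleclick.net/adj/site/banner"></script>
</head><body>...</body></html>
"""

print(detect_trackers(sample_page))  # ['Google Analytics', 'DoubleClick']
```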


Advertising

This tracker enables online advertising companies to deliver personalized, targeted advertisements. The purpose of an advertising web tracker is usually to provide advertisement companies with information about web users, which they can use to create targeted advertisements. Common advertising web trackers look for clickstream data, visited pages and all other data components that can help to discover the interests of web users. An example of a widely used advertising tracker is DoubleClick.

Analytics

This tracker provides research analytics for website publishers. Its main purpose is to collect web traffic data such as page views, search histories and location information. The tracker is meant to collect Internet traffic data for research purposes and for improving sites with the goal of increasing conversion to sales. It creates statistics that web developers can use to improve web applications. Google Analytics is one of the most used analytics trackers.

Beacons

The sole goal of these trackers is to follow the user and track user behavior. Web beacons are small objects embedded in a website. A common kind of web beacon is the 'pixel gif': a simple image approximately the size of one pixel. When a web page containing this small image is loaded, the browser makes a request to a server for that image. This 'server request' allows companies to know that someone has loaded that particular page.

This system has been abused by spammers, who identify active email accounts by sending emails that include pixel trackers. This is the reason why email services sometimes ask whether you trust the sender before they display images. Web beacons are less useful to website owners who already have access to server logs, but they are useful to advertisers displaying their ads on someone else's website, or to services without server log access. Advertisers often embed web beacons in their adverts to get an idea of how often an advert appears.
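As an illustration of the mechanism described above, the sketch below shows a minimal, hypothetical pixel endpoint written with the Flask library. It is not the code of any tracking company discussed in this thesis, but it shows how a single image request exposes the visited page and the visitor to the tracker's server.

```python
from flask import Flask, Response, request

app = Flask(__name__)

# A 1x1 transparent GIF: the "pixel" embedded in pages or emails.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff!"
         b"\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00"
         b"\x02\x02D\x01\x00;")

@app.route("/pixel.gif")
def pixel():
    # The mere fact that this image is requested tells the tracker that the page
    # (or email) embedding it was loaded, by whom, and from which page.
    app.logger.info("page=%s ip=%s agent=%s",
                    request.referrer, request.remote_addr,
                    request.headers.get("User-Agent"))
    return Response(PIXEL, mimetype="image/gif")
```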


Privacy

The main purpose of this tracker is to provide web users with privacy notices. This tracker category is relatively new and not commonly found on the web. This research, for example, did not encounter a single privacy tracker in 2006 and found only a couple of privacy trackers in 2010 and 2014. The Ghostery library contains only ten different privacy trackers, one of which is made by Ghostery itself. This tracker, called the 'Ghostery Privacy Notice', is used to provide web users and businesses with privacy-related information.13

Widgets

A widget tracker is any tracker that provides some form of functionality, such as a social media plugin or a comment form. Most widget trackers are related to a social media platform. The widget tracker 'Twitter Button', for example, connects Twitter users with Twitter when they visit a website on which this widget is installed, making it easy to share content, interests and activities with other Twitter followers. The strength of these trackers is that they can be used to link personal profile information with web tracking behavior.

The Digital Method Initiative TrackerTracker tool

Another tool used in this thesis is the Digital Method Initiative TrackerTracker tool (DMI tool), created at the University of Amsterdam. It is built on top of the Ghostery library and therefore uses the trackers found in that tracker tool: it can only track what Ghostery is able to track, but it has some specific advantages. The strength of this tool is that it can investigate a list of up to 100 different URLs at a time, and it can also track subpages of the entered URLs. Another valuable advantage is that the tool offers four different output formats for the collected data: an exhaustive CSV file, a CSV file per host, a GEXF file or a simple HTML output.14

13 Company Overview. 2014. Ghostery Enterprise. Web. 26 June 2014. <https://www.ghosteryenterprise.com/company-overview/>.


This research is mainly interested in the GEXF file, because these files, in combination with the visualization tool Gephi15, make it possible to turn large datasets and URL lists into clear and understandable graphs, which help to get a better view and understanding of the web tracking activities occurring on the web.
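As an indication of how such a GEXF export can also be inspected outside Gephi, the sketch below loads it with the networkx library and lists the nodes with the most connections, which in a bipartite website-tracker network correspond to the most widely embedded trackers. The file name is a placeholder, and networkx is not the tool used in this thesis; the sketch only illustrates the file format's usability.

```python
import networkx as nx

# Placeholder file name; in practice this is the GEXF export of the TrackerTracker tool.
graph = nx.read_gexf("trackertracker_output.gexf")

# Counting a node's connections shows how widely a tracker is embedded
# across the websites in the dataset.
degree_ranking = sorted(graph.degree, key=lambda pair: pair[1], reverse=True)
for node, degree in degree_ranking[:10]:
    print(node, degree)
```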


4.3 The Internet Archive’s Wayback Machine

To go back in time on the Internet, this research uses the Internet Archive. This archive was founded in 1996 and has saved more than 415 billion web pages since then.16 In 2001 the Internet Archive established the Wayback Machine, an interface and content navigation system that enables users to navigate through the archived websites. This tool makes it possible to do proper historical web research (Rogers 66).

A crucial aspect of the Internet Archive is the way the webpages are saved. Instead of mere snapshots of websites, the entire source code of a website is saved, which makes it possible to look for traces of web trackers. Because most web trackers are embedded in the source code, they can be found in the archived webpages of the Wayback Machine. Figure 1 displays the archived source code of www.volkskrant.nl of 9 April and shows a trace of the tracker 'Google Analytics'. The trackers are in this way hardwired into the source code of a website.

Figure 1: A small part of the source code of an archived webpage, saved by the Internet Archive's Wayback Machine.
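A minimal sketch of this kind of check is given below: it asks the Wayback Machine's public availability API for a capture near 1 April 2006 and then searches the archived source for a Google Analytics trace. The URL, timestamp and search string are illustrative, and the requests library is assumed to be available.

```python
import re
import requests

def archived_snapshot(url, timestamp):
    """Ask the Wayback Machine for the capture closest to the given timestamp (YYYYMMDD)."""
    api = "http://archive.org/wayback/available"
    data = requests.get(api, params={"url": url, "timestamp": timestamp}).json()
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None

snapshot_url = archived_snapshot("www.volkskrant.nl", "20060401")
if snapshot_url:
    source = requests.get(snapshot_url).text
    # Look for a trace of Google Analytics hardwired into the archived source.
    print(bool(re.search(r"google-analytics\.com", source)))
```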

When using the Internet Archive's Wayback Machine for the gathering of trackers, there are a number of things one must take into account. First of all, although the Internet Archive is an immense archive, it still covers only a small part of the total World Wide Web. Not all pages are saved by the Internet Archive, and therefore there are some gaps in the data set of this research.

Secondly, not every URL saved by the Wayback Machine is available. Some URLs are blocked by the owner of the web domain through the Robot Exclusion Standard: a robots.txt file placed at the top level of a website, used by web developers to keep web crawlers and other web robots from accessing all or part of a website. This also blocks the grab made by the Wayback Machine.17 There are also archived URLs that return a constant 404 not found error or other error messages. Errors like these are bound to occur in an archive as massive as the Internet Archive.
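The exclusion mechanism itself can be checked with Python's standard library, as in the minimal sketch below; the Twitter URL is only an example of a domain that blocks crawlers in this way.

```python
from urllib import robotparser

# Example URL; sites that exclude crawlers this way also block the Wayback Machine's grabs.
parser = robotparser.RobotFileParser()
parser.set_url("http://twitter.com/robots.txt")
parser.read()
print(parser.can_fetch("*", "http://twitter.com/"))  # False if crawling is disallowed
```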

This research uses archived website captures for each of the years 2006, 2010 and 2014, which makes it possible to detect developments over time. When selecting the dataset, only website captures are used that are as close as possible to the 1st of April, with a margin of one month (29 days).

The period within which an archived capture must fall is therefore March 3 until April 30 of each particular year. This means that if the Internet Archive has saved a website but there is no capture within this margin, it is treated as if there was no capture at all. Websites without a capture within the set time range, or websites that did not exist in the particular year, are marked with the code NG, which stands for 'no grab'. URLs that are blocked by the Robot Exclusion Standard are marked with the code NDR, which stands for 'no grab due to Robot Exclusion'. Every other error, such as a 404 not found error, is marked with a simple x. If a URL is archived properly but no tracker was found, it receives the code 'zero'. An example of this working method is visible in table 1. The results of this research are presented in the next chapter.

Table 1: Example of a working sheet for this empirical research

URL | Trackers in 2014 | Trackers in 2010 | Trackers in 2006
http://www.blogspot.com.es | 4: DoubleClick (Advertising), Google Adsense (Advertising), Google FriendConnect (Widgets), Google+ Platform (Widgets) | 0 (zero) | 0 (NG)
http://twitter.com | 1: Google Analytics (Analytics) | 0 (NDR) | 0 (NDR)
http://live.com | 4: Adobe Test & Target (Beacons), BlueKai (Beacons), Media Optimizer (Beacons), Proficientz (Analytics) | 2: Omniture (Beacons), ScoreCard Research Beacon (Beacons) | 2: Adobe Test & Target (Beacons), Proficientz (Analytics)
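The 29-day selection rule described above can be expressed as a small check on Wayback Machine capture timestamps, as the sketch below illustrates; the function name and the example timestamps are illustrative.

```python
from datetime import datetime, timedelta

def within_margin(wayback_timestamp, year, margin_days=29):
    """Check whether a capture (e.g. '20100409120000') lies within the margin around 1 April."""
    capture = datetime.strptime(wayback_timestamp[:8], "%Y%m%d")
    target = datetime(year, 4, 1)
    return abs(capture - target) <= timedelta(days=margin_days)

print(within_margin("20100409120000", 2010))  # True: 9 April 2010 falls within the margin
print(within_margin("20100510120000", 2010))  # False: 10 May 2010 falls outside it
```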


Chapter 5: Findings

This chapter presents the results of the empirical research. It analyzes the various findings in search of an answer to the second sub-question: how has the population of web trackers changed over the last eight years in Western Europe?

The first part of this chapter describes the general findings. It presents the main web tracking developments since 2006 and observes some general differences between the six European countries examined in this research. The second part examines the differences between the five tracker categories provided by Ghostery. The third part reveals how a network of web trackers is formed within the top fifty websites of Europe. The final part investigates what kind of data the major tracking companies are gathering and reviews a number of major web tracking companies: well-known companies such as Google and Yahoo, but also lesser known companies such as AppNexus and Rubicon.

5.1 General Findings

As described in chapter 4, this research examines the tracking activities of the top fifty websites of six European countries in the years 2006, 2010 and 2014. Each of the three years should contain 300 different URLs, but because the Wayback Machine is not always able to archive webpages, some URLs are missing. The full list of URLs is included in appendix 2. A first important step in this empirical research is to check how many URLs are properly archived by the Wayback Machine. The next important aspect is to check to what extent the URLs contain trackers. The three pie charts in figure 2 display the findings on these two aspects.

Figure 2: Three pie charts that present the dataset of 2006 (n = 300), 2010 (n = 300) and 2014 (n = 300), based upon the Ghostery extension tool.


The first thing that becomes visible in figure 2 is that the number of URLs containing a tracker has clearly increased over the years. In April 2014, 84% of the used URLs had at least one tracker; in 2010 this was 55% and in 2006 only 25%.

This is partly caused by the fact that the 2006 URL list has far more faults than the lists of the other two years. In 2006, 38% of the 300 used URLs were not properly researchable: 13% was unavailable because of the Robot Exclusion Standard (NDR) and 16% was not researchable because no URL was archived by the Wayback Machine (NG). In 2010, 11% was unavailable due to the Robot Exclusion Standard; almost all of these concerned the same URLs as in 2006. Some popular websites, such as the social media platforms Facebook.com and Twitter.com, are unsearchable because they apply the Robot Exclusion Standard.

The difference in the percentage of URLs with a tracker becomes smaller when all URLs that are not researchable are eliminated. Figure 3 shows the new tracker ratio after eliminating the unworkable URLs; n represents the number of workable URLs. These three pie charts show a different pattern, but it remains clear that web tracking practices have increased since 2006. In 2014 only 15 percent of the 297 examined URLs did not contain a web tracker.

Figure 3: Three pie charts that present the corrected dataset: 2006 (n = 187): 40% tracker, 60% no tracker; 2010 (n = 227): 73% tracker, 27% no tracker; 2014 (n = 297): 85% tracker, 15% no tracker.
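The corrected ratios follow directly from removing the unworkable URLs before computing percentages, as the minimal sketch below shows. The 2014 split used in the example is hypothetical, chosen only to be consistent with the reported 85 percent over 297 workable URLs.

```python
def corrected_share(with_tracker, without_tracker):
    """Share of URLs with a tracker, computed over workable URLs only (NG, NDR and errors removed)."""
    workable = with_tracker + without_tracker
    return round(100 * with_tracker / workable)

# Hypothetical 2014 split over the 297 workable URLs reported above.
print(corrected_share(252, 45))  # -> 85
```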


The first conclusion I can draw from the findings of this empirical research is that over the last eight years the number of websites containing a web tracker has clearly grown. In 2014, 85 percent of the 297 researchable websites contained at least one web tracker; in 2006 this was only 40 percent. This shows that web trackers have become more common on the web and that the web tracker has become a substantial part of the Internet infrastructure.
