
Big Data and the Automation of Scientific Method


Tessa Ligtenberg


Table of Contents

Introduction
Chapter 1: Big Data and Data-Intensive Science
1.1 - Big Data: Definitions, Methods, and Practices
Definitions of data
Definitions of “big data”
Big data methods and practices
1.2 - Data-intensive science and automation
1.3 - The influence of automation on scientific reasoning
The use of theory
Induction
Causal reasoning
Conclusion
Chapter 2: Machine Learning and Scientific Reasoning
2.1 - Approaches in the Philosophy of AI
2.2 - Machine Learning and Data Mining
2.3 - Machine learning and scientific reasoning
Simulations of science
The role of theory
Induction
Conclusion
Chapter 3: The Role of Induction, Theory, and Laws in the Scientific Practice
3.1 – Case Studies
Biomedical Science: EXPOsOMICS
Social Science: Mapping Social Networks
Physical Science: The Large Hadron Collider (LHC)
Humanities: Digital Collections and Image Processing
3.2 – Conclusions: three points about theory, induction, and laws
Theory-free Science
Induction and the Scientific Method
The Role of Laws
Conclusion
Conclusion and Discussion


Introduction

With the invention of the computer, methods in science have started to develop and change. The invention of the calculator enabled scientists to make more complex calculations. With further developments in computer science, it became possible to use simulations and databases. In current science, the impact of computational methods is large. The emergence of ‘big data’ has given scientists new methods, such as ways to collect large amounts of data and to analyse these data with the use of software. Big data is a societal development with a large impact, in marketing, health, and security.

Governments and companies use big data to control traffic or to refine advertisements. In science, these developments are most noticeable in biomedical science and genetics, and in social science. This is because in these fields large amounts of data can be easily collected, for example from social media, and because the data generated require a lot of analysis, for example in the case of sequenced DNA. These new methods have advantages. For example, the large amount of data could make conclusions more credible, because there are more test subjects. This means that conclusions are easier to generalise.

In my thesis, I explore the influence of these big data methods on scientific reasoning. Before I explain how I tackle this question, I pick apart the different parts of the question. How are big data methods applied? Big data methods lead to a larger role for computerised methods, and less direct involvement from scientists. This can happen in different ways, for example in the collection of data and in the analysis of data. This means that aspects of science become automated. We could even imagine a completely automated science, in which the conclusions come out of a machine, ready to be applied. An example of this is automated diagnosis: the doctor enters symptoms into a computer program, which then gives a diagnosis based on those symptoms. In many fields of science, automation has not developed this far. Moreover, automated diagnosis does not provide us with scientific conclusions; it does not provide us with new knowledge about the diseases. In this thesis, I discuss ways to automate science, and why this may be easy or hard to do, depending on the data or on the conclusions that the researchers want to draw.

By ‘scientific reasoning’ I mean how scientists reach their conclusions from the available data on phenomena. To reach these conclusions, scientists use many different methods and tools, which may influence the conclusions. For example, the application of mathematics to physics, in Newtonian mechanics, changed the type of conclusions in physics. Instead of conclusions about the properties of things, the conclusions became mathematically structured. There is a lot of literature about how scientists reach their conclusions, for example on falsification, which describes how scientists should formulate their hypotheses and test them in order to come to their conclusions. Falsification is discussed in the context of discovery: it deals with the process by which scientists come to their conclusions. I discuss this in addition to the context of justification: how scientists should justify their conclusions. In this thesis, I discuss how the application of certain computerised methods leads to changes in scientific reasoning. The way scientists reach and formulate their conclusions changes with the use of these methods. The three most important issues I touch upon are the role of theory, induction, and causal reasoning. These issues are strongly interrelated.

Many different methods are used to automate science, but in my investigation I focus on methods that are applied to data mining. These methods are used to gather information from large datasets, and include methods that can find correlations, make classifications, and cluster data. I focus on these methods, because the consequences of these methods are much debated, and there is an established discussion on machine learning from which I will draw. Science journalist and tech entrepreneur Anderson (2008), for example, writes about these methods of data mining:

“There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” (Anderson 2008, p. 108)

The methods of data analysis, according to Anderson, lead to the ‘end of theory’. We no longer need scientific theories or hypotheses: the search for correlations is enough. Many authors disagree with this view, and argue that theory is still necessary (for example Leonelli 2012, Pietsch 2014). Because of this debate about theory, I focus on data mining methods.

To reflect on these methods, I draw not only from current discussions surrounding theory, but also from the philosophical discussion on machine learning and artificial intelligence (AI). Machine learning methods are used in data mining, and therefore this discussion may give important insights. This analogy is another reason to focus on data mining methods. I focus mainly on the discussion of inductivism in this literature, and on the rise of inductive reasoning as a consequence of these methods. The analogy between the methods used in big data practices and these AI methods will prove useful to the discussion of these methods in science.

The evaluation of the conclusions from this discussion, such as on the role of inductivism, is done with case studies from biomedical science, social science, physics, and the humanities. This is because scientific practice is changing, and the issues that were at stake in the discussion of AI methods may have changed. I want to give a broad overview of fields of science. The conclusions to this evaluation are threefold. First, the idea that we can do without theory is naïve, and we should abandon it. Second, science is not exclusively inductive in the use of big data methods. Instead, it uses much more hybrid methods. Third, the goals of science are not only aimed at finding predictive laws. Rather, scientists search for causal connections, correlations, and groups or clusters.

The structure of my thesis is as follows: in the first chapter, I introduce the practice of big data and how the methods from big data are used in science. I also discuss the belief that science will become theory-free and the discussion surrounding it. In chapter 2, I move to the discussions surrounding AI and machine learning. I discuss approaches in the philosophy of AI, the method of machine learning, and how this is applied in science. I also discuss how this could lead to a more prominent use of induction. In the third chapter, I evaluate this claim about inductivism, along with claims about theory and the role of scientific laws. I conclude with the three points described above.


Chapter 1: Big Data and Data-Intensive Science

The goal of this chapter is to analyse the concerns with automation within practices that make use of big data methods, such as health, government, and social media. These methods are also applied in related sciences, such as medical biology and communication sciences. This requires clarification of the role of big data methods in these types of practices and of the concept of automation. In order to do this, this chapter consists of three steps: first, I elaborate on big data and its methods. A definition of big data in terms of methods and practices rather than size and qualities of datasets will be most useful for our purposes. Second, I discuss how the methods are used in several sciences such as biology and physics, and provide some possible concepts of automation within science. In the third step, I discuss the influence of this automation on scientific reasoning. My main focus will be a concern that is a consequence of automation in data-intensive science: the so-called ‘end of theory’. I discuss what this change is supposed to mean and evaluate whether it is a good description of changes within science. After this, I discuss some other possible consequences of the use of big data methods for scientific reasoning, such as the role of induction and causal reasoning. This paves the way for the discussion of machine learning methods in chapter 2.

1.1 - Big Data: Definitions, Methods, and Practices

‘Big data’1 is a buzzword that is often heard nowadays, in all kinds of practices such as health, government, and social media. It is also bringing about a change in scientific methods. Section 1.2 focuses on the role of big data methods in science, and on what role these methods play in automating scientific reasoning. In this section, I give a general discussion of big data. I discuss some possible definitions in terms of the size of datasets, and definitions in terms of practices and methods. A definition in terms of practices and methods serves my purpose of investigating automation best. I briefly touch on non-scientific practices of big data. I start with this discussion of big data because the methods in it are applied to the sciences. It also provides a good philosophical starting point, since a lot has been written on the subject.

1 I will spell big data with lowercase letters in my thesis, because it has many different applications and is used in many different contexts.

Definitions of data

For definitions of big data, it is useful to know what data are and what they mean. I do not discuss this extensively. Readers interested in a philosophical discussion of possible definitions of data are directed to Floridi (2008) and Lyon (2016). These authors provide a general overview of possible definitions of data. I only make a few remarks about a definition of data.

The received view is that data are recorded facts, which are simply given. When my credit company registers my transactions, it collects facts about my purchases. Philosopher of information Floridi (2008, p. 2) calls this the epistemic interpretation of data, and criticises it on two accounts. First, this view does not account for processes such as data compression or cryptography. Data compression, for example, reduces the number of data needed to represent some unencoded data. These changes in data cannot occur if data are simply facts. Second, ‘facts’ and ‘data’ are equally complex concepts, so defining data as facts does not help. Even when we put these concerns aside, there are some problems from a scientific perspective. Philosopher of biology Leonelli (2015, p. 814) points out that scientific data are constructed, for example in experiments. This means that data do not reveal anything by themselves. What counts as data, according to Leonelli, depends on whether or not it is used as evidence for a certain knowledge claim at a certain time, and on whether or not it can be circulated2.

2 This last characteristic is very important for Leonelli, because data is not scientific unless everyone could have access to it.

The informational interpretation of data could represent an alternative (Floridi 2008, p. 3). In this interpretation, data are pieces of information about something else, rather than recorded facts. My personal data are information about me as an individual. From all the data about me, researchers would be able to form a picture of me. This also fits better with Leonelli’s concerns, since the role of the data depends on how they can be used to provide information about something else. However, Leonelli also emphasises that data can be circulated, and that data can become disassociated from their context.

Floridi (2008, p. 3) has two objections to the informational interpretation. First, information itself is often stated in terms of data: information is meaningful and truthful data. Second, there are data that are not information, such as music or films that are stored on a certain device. Pictures or messages are often talked about in terms of data. In response to the first objection, we could say that from a scientific perspective we are not interested in a definition, but rather simply in a criterion for what data to use. Distinguishing data that are informative (or truthful and meaningful) from data that are not informative is enough for scientific purposes. There are some possible issues with this position. First, data are not impartial or infallible. We should be critical of how scientists have collected the data and how the data are applied in research. Second, the distinction between informative and non-informative data is not very evident in science. Scholars still study data that are ‘not informational’ according to Floridi. In the humanities, scholars still try to extract meaning from ‘non-informational data’ such as books and films. Again, it is important to look at how researchers use the data in order to get to information, and at what this information entails.

Information scientist Borgman (2012) describes how data can be collected in different communities: “An investigator may be part of multiple, overlapping communities of interest, each of which may have different notions of what are data and different data practices.” (Borgman 2012, p. 1061). Data are often collected with a specific purpose in mind. As Leonelli writes, the role of data depends on how they are used, and data are mostly used to support knowledge claims. These knowledge claims can take many shapes, because interpreting a book or film is different from claiming to know how a certain medicine affects a disease. In the first case, the claim is not necessarily causal, but rather about what the message or significance of a work is. In the second case, the claim is causal, since we assume that the medicine interacts with the disease in a certain way.

In the use of data, there is the underlying idea that data can be used to get information about something. Moving forward, I will keep this informational interpretation in mind, while also keeping in mind that it is important to look at the specific purposes for which data is collected, and how the data are used to get to information.

Definitions of “big data”

We should move on from this discussion of data to the notion of big data. What does this mean, and why and when have data become ‘big’? Big data has been applied in many practices. An example of big data is to use information from navigation systems to guide traffic, so that it runs smoothly. Authors in Critical Data Studies (CDS) try to analyse the concept of big data. The problem of big data was originally one of big datasets, which could not be stored on a regular computer (Crawford, Miltner & Gray 2014, p. 1664). Some of these technological difficulties persist: we are still looking for ways to store the data that we gather, such as cloud computing. However, the size of the dataset is not the distinctive quality of big data, since many datasets that we consider big data, such as a collection of social media messages, are smaller than datasets that we do not consider big data, such as large-scale surveys (boyd & Crawford 2012, p. 663).

We could look at specific qualities of the dataset, instead of looking at its size. A standard way to define big data, used in official documents, is according to the ‘three V’s’: volume, velocity, and variety. Big data is not only big in volume, but it is also collected at a fast rate, almost in real time, and it is very varied (Canali 2016, p. 3). Sometimes two other ‘V’s’ are added: veracity and value (Marr 2014). Veracity refers to the idea that big data is trustworthy because of its size. Value refers to the idea that data can be made valuable by companies or governments. However, philosophers of information and technology criticise these concepts as relational (Canali 2016, p. 3; Floridi 2012, p. 435-6). This means that we need further specifications as to what they are relative to. For example, we do not know what qualifies as a ‘large volume’, because it is not compared to a smaller volume.

Social geographer Kitchin (2013, 2014) adds four other qualities of big datasets to the three V’s: big data is exhaustive of entire populations, very detailed in resolution and indexical of individuals. It is also relational, in that it can join sets from different fields, and flexible, which means that it can be extended and scaled up or down (see also Figure 1). Kitchin and McArdle (2016) apply these qualities to different big datasets within science, and conclude that the most important ones are exhaustivity and velocity. They also state that not all big datasets have all of the seven qualities outlined by Kitchin, and emphasise that we should be mindful of different kinds of big datasets. Philosopher of science and technology Pietsch (2015b, p. 141) replaces exhaustivity with representation: the dataset should contain all the relevant configurations of the examined phenomenon.

Figure 1: Qualities of big and small data (Kitchin & McArdle 2016)

These qualities are only qualities of the datasets. However, big data is also a practice, one that uses certain technologies and makes certain analyses. These analyses and technologies have a societal impact and are applied within societal practices. A definition of big data from Critical Data Studies (CDS) by boyd and Crawford is:

“We define Big Data as a cultural, technological, and scholarly phenomenon that rests on the interplay of:

- Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets.

- Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims.

- Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.” (boyd & Crawford 2012, p. 663, emphasis in original)

Similarly, Floridi (2012) states that the problem of big data is not one of large datasets, but one of finding small patterns within the datasets. So in talking about big data in general, we should not ignore that it makes use of certain methods and technologies, and that there is the idea that big data can solve our problems.

This definition in terms of methods and practices is also most useful for my purposes, since it emphasises that big data tries to find patterns in datasets. Within studies of data-intensive science, the focus is often on methods (for example Leonelli 2014). The focus of this thesis is also on the use of these technologies and methods within different sciences. These technologies can help us understand how reasoning becomes automated in big data practices, and data-intensive science. In section 1.2, I will dive deeper into this automation of science by big data methods. In the remainder of 1.1, I briefly discuss methods that are used in big data and to what kind of practices these methods are applied. This helps to get a picture of what big data practices look like.

Big data methods and practices

In this section, I discuss the methods that are applied in big data practices and some problems with these methods. In big data practices, data is collected, stored, and analysed, often using automated methods. Data collection happens through social media logs or other programs that can collect and track digital activity, but also through government records, highly specialised instruments, and scientists employing the public to help collect data (Borgman 2012). Corporations, governments, and scientists can also collect data manually and individually, but this is time-consuming and accounts for only a small part of the data used in most big data practices. Big data practices mostly make use of data collection that can be automated, through the use of digital resources or specialised instruments.

These large sets of data have to be further analysed and processed. As Borgman (2012, p. 1066) points out, the labour necessary to process data is not equivalent to the labour used to collect the data. In big data practices, the analysis of data is often automated, but some data is easier to analyse than other data. Data generated by one highly specialised instrument, for example a mass spectrometer, which analyses chemical components, is relatively easy to process using automated methods, because researchers understand the context in which this data was generated. Social media data, which is collected from the archives of different companies, is harder to analyse, because the context is unclear. Researchers have to be mindful of this. An example of the risks is the Google Flu research, in which the results turned out to be indicative of searches about the flu rather than of actual flu spreading (Halford & Savage 2017). We should keep this in mind while discussing the methods used in data analysis.

The extraction of useful knowledge from datasets is also called data mining. Information management scientists Gandomi and Haider (2014) describe methods and algorithms used in data mining. The easiest way to do data mining is with data that have numerical values or data that arrive in the form of text. These can be processed by algorithms extracting certain words or certain statistical values. The analysis of images, audio, and video data is more difficult (Gandomi & Haider 2014). An example of the former is analysing Twitter data, which has both numerical values (in the form of time posted or length) and text (the tweet itself, relevant hashtags). To this, algorithms can be applied which search for words that express a sentiment, such as ‘happy’. An example of the latter could be analysing satellite photographs, or videos of crowd movement. These are more difficult to analyse, because often the data need to be transcribed (in the case of audio) or there are a lot of possible starting points for an analysis (in the case of video there are both audio and images). The difficulty in these cases lies mostly in converting this data into data that can be more easily analysed, and the algorithms and programs for converting this data are often not yet sophisticated enough. In chapter 2, I discuss the application of machine learning algorithms more in depth.
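To make this concrete, the following is a minimal sketch of the kind of keyword search for sentiment words mentioned above. The word lists and example messages are invented for illustration; real text-mining systems use far richer lexicons or learned models.

```python
# Minimal sketch of keyword-based sentiment tagging for short texts such as tweets.
# The word lists and example messages are illustrative only, not from any real study.

POSITIVE = {"happy", "great", "love", "wonderful"}
NEGATIVE = {"sad", "terrible", "hate", "awful"}

def sentiment_score(text: str) -> int:
    """Count positive minus negative keyword occurrences in a message."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "So happy with the new update, love it",
    "Terrible service today, really sad",
]
for t in tweets:
    print(sentiment_score(t), t)
```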

For large datasets, traditional statistical analyses are not sufficient (as described by computer scientist Scarpa 2014). There is no need for sampling, for example, because the goal is to establish patterns on the basis of all the data. The data are supposed to be exhaustive of the whole population. There is also the need to be mindful of the patterns that are established. In large datasets it is easy to establish patterns that are not really there. The statistical concept of significance should therefore be reviewed (see also Gandomi & Haider 2014). In data mining, new computational methods are used in addition to more traditional statistical methods. These methods often fall into the realm of machine learning, which I discuss in chapter 2. These techniques are used not only for predicting values based on probability, but also for clustering data according to certain qualities. Computational methods are replacing analysis done by humans (Scarpa 2014). Therefore, these are the methods I focus on when it comes to the automation of scientific reasoning.
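As an illustration of what such a computational method looks like in practice, here is a minimal clustering sketch: instead of testing a pre-specified hypothesis, the algorithm groups observations by similarity. The data are randomly generated for illustration, and the choice of scikit-learn's k-means is mine, not a method taken from the literature discussed above.

```python
# Minimal sketch of clustering as a data-mining method: grouping data points
# by similarity rather than testing a pre-specified hypothesis.
# The data are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two invented "populations" of observations with two measured qualities each.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
data = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)
```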

The next section discusses the use of big data methods in science. However, the methods are applied to many other practices. Applications of big data are found in social media, health, security, traffic control, and sales. Big data helps companies to better advertise to their customers, can be used in border control, and can help set a diagnosis for patients. All in all, a large number of areas have put their faith in big data, and it is believed that it can make these practices more efficient and generate new knowledge. This is what boyd and Crawford (2012) mean by ‘mythology’. In these practices too, certain aspects of reasoning are automated. The impact on these practices is, however, beyond the scope of this thesis.

In the following, I understand big data as a collection of methods and practices, applied to large datasets. I use this conception because I am interested in the impact of these methods on scientific reasoning. The size and qualities of datasets can also have an impact on scientific research, such as the influence of large sample sizes, but this is not the main focus of my thesis. There is a lot of variety in both the methods and practices and in the size and qualities of the datasets. Not all datasets have all the desired qualities, and not all practices use the same methods. This is why I refer to big data as a collection of methods and practices, rather than as one practice.

1.2 - Data-intensive science and automation

In this section, I introduce big data methods in science. I describe the emergence of big data methods within science, and the fields in which they are now applied. Science which makes use of these methods and large datasets is also called ‘data-intensive science’. In this section, I hope to come to a clearer concept of what the automation of scientific reasoning, with the use of big data methods, entails. This concept of automation can help us see the consequences of automation in section 1.3, and help us compare it to the automation of science described in chapter 2. Automation takes different shapes, depending on the field of science to which it is applied.

According to Pietsch (2015b), one of the distinctive features of data-intensive science is “the automation of the entire scientific process, from data capture to data processing to modeling.” (Pietsch 2015b, p. 141, emphasis in original). A similar description can be found in Leonelli (2012), who describes automation as a possible quality of data-intensive science. Leonelli criticises this on account of science still needing theory, which makes automation implausible. However, this is a problem that is discussed in 1.3. The idea of automation in Leonelli is narrower than in Pietsch, because it does not span the whole scientific process, even though Pietsch also shows that automation still requires a lot of theory.

The question of theory has already come up. The use of theory is opposed to automation, because it presupposes that we need scientists who can interpret and regulate the process. For example, Leonelli (2014, p. 4) points out that scientists do a lot of work in curating databases. This role of theory is extensively discussed in section 1.3. It does add a difficulty to our discussion, since the automation of science and its consequences for scientific reasoning are hard to discuss separately. It seems that automation is defined in terms of scientific reasoning: when scientists are no longer doing the reasoning, the scientific process is automated. This can be separated into three intertwined questions: (i) what are some of the ways reasoning is taken over from scientists, (ii) is this possible, and (iii) how does reasoning change when it is done by computers instead of scientists? In this section, I focus on the first question and partially on the second. I discuss some features of different fields of data-intensive science, and look at how we may conceive of automation in these fields.

Large datasets have been used in science for quite some time, as authors in the history and philosophy of science show. Early modern research in natural sciences such as biology was for the most part aimed at the collection of empirical facts (Müller-Wille & Charmantier 2012). This led to the creation of techniques of data management, for example by Linnaeus. As for the social sciences, as early as the 19th century there were collections of data about citizens (Beer 2016). Social scientists took it upon themselves to analyse these collections of data and draw conclusions from them. These techniques were not automated, and the organisation and analysis had to be done by the scientists themselves. The collection of data was also not automated, and was done by either scientists or government officials.

Because of these earlier uses of large datasets, Leonelli (2014, p. 8) argues that the emergence of big data within the biological sciences is not as revolutionary as it is often believed to be. In biology and the natural sciences, there are already methods in place for managing data. Also, biology is still concerned with causal explanation, and not merely with correlations. However, Leonelli (2014, p. 8) does identify two shifts in data-intensive science. The first is the new prominence attached to data. Data is now seen as valuable in many ways: socially, economically, and scientifically. The second is the emergence of new technologies and skills to handle and analyse data. As stated in section 1.1, the conception of big data that Leonelli uses is mostly in terms of technologies and practices. With the rise of these new technologies, there is also a rise of automation within biological research. An example of the use of automation techniques in biology is the study of ‘omics’, which identifies genetic factors that may lead to certain diseases (Canali 2016). The study of omics is used by many philosophers of biology as an example of big data research. In the study of omics, algorithms are used to search for connections between genetic makeup and certain health effects. However, both Canali and Leonelli point out that this requires the effort of scientists as well: in collecting and entering the data into databases, in curating these databases (Leonelli 2014, p. 4), and in using the correlations for further research and selecting results based on theoretical considerations (Canali 2016, p. 2). Much of the data is not yet ready for analysis, and needs to be curated and labelled so that algorithms can process it. The question is thus whether biology can become completely automated. Leonelli (2014, p. 10) believes that the impact of big data within biology is limited, and that there will be no full automation, but does discuss the possibility of a bigger impact in the economic and social sciences.

At the same time, we could argue that the use of big data methods in social sciences is not very revolutionary. Sociologists White and Breckenridge (2014) point out that social sciences have used observational methods for a long time, and the use of big data could be seen as only an extension of these earlier methods. What changes is the fact that it is easier to test theories, because of the larger sample sizes. Automation is more prominent in social sciences than in for example genetics. Data can be collected automatically, for example by looking at credit card information (Pietsch 2015b) or social media data (boyd & Crawford 2012). Because of the use of social media and the digitalisation of many activities, such as banking, the automatic collection and analysis of data can be more easily integrated into social science.

As I discussed in 1.1, we should keep in mind that data that can be easily collected is not always ready for analysis, or could be quite complex to analyse (Borgman 2012). Sociologists Halford and Savage (2017) also give reasons to be sceptical of big data applications in social science, as a lot of the data used in social science is captured from situations with a specific context, which means that not all conclusions can be generalised. They give an alternative view of how big data methods can be used in social science, in which authors draw from many different sources and research data to provide background for an encompassing theory. However, the data these researchers use is often still collected automatically and analysed with the use of digital methods.

Gray, a computer scientist at Microsoft (whose lecture is transcribed in Hey et al. 2009), is less sceptical, and argues that the use of big data in science has led to a new, fourth paradigm, in addition to three earlier paradigms of experimental science, theoretical science, and computer simulations. According to Gray, big data methods are in fact revolutionary. Gray draws mostly from the physical sciences. For example, he describes the search for particles at the Large Hadron Collider (LHC), in which data is collected, selected, and analysed mostly with the help of computer programs and algorithms. A discussion of the LHC project is also found in 3.1. Gray believes that science has changed fundamentally with the emergence of big data: “The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline.” (Hey et al. 2009, p. xix). For Gray, a new feature of data-intensive science is the fact that data is captured by instruments and processed by software, and is only later delivered to scientists. This is similar to the idea of automation that is provided by Pietsch: the collection and analysis of data is automated, though the scientists still need to work to interpret the data. However, in physics the collection of data requires building some extremely specialised instruments, such as the LHC. As such, it takes a different shape than within social sciences, where the data are collected through specialised, but already existing software.

In the humanities, big data technology is used as well, especially when trying to analyse large bodies of text. This is applied to fields such as machine translation (Pietsch 2015b, p. 144). The data is sometimes also collected using automated methods, for example by taking samples of text from Twitter. Other times, however, the data has to be digitalised first. An example of this is a database of ancient texts, or the digitalisation of certain works in order to compare style or word choice.

We can see that a very important component of data-intensive science is indeed the automation of scientific processes, even though authors are sceptical of its impact. This automation takes different shapes within different fields. In some sciences, such as the social sciences, it is easier to collect or generate data with the use of automated methods. In other fields, such as biology, this is much more difficult. The analysis of data is done with automated methods in most of the fields, for example by trying to establish correlations between data points with the use of algorithms. This type of analysis will be the main focus in section 1.3. This type of analysis often leaves it to the scientists to design the research question and to interpret the results. Even if the scientists have designed the data collection and analysis, the process can still be automated (Pietsch 2015a, 2015b). The theory then comes in the design of the programs, not in running the programs themselves. I discuss these different roles for theory at length in section 1.3.

1.3 - The influence of automation on scientific reasoning

The goal of this section is to discuss the debates within big data on automation of scientific reasoning. The main focus is the debate on the ‘end of theory’ (Anderson 2008), but I also discuss related concerns, such as induction and causal reasoning. As discussed in 1.2, one of the main forms of automation in data-intensive science is the establishment of correlations using algorithms. The influence of this type of automation will therefore be the main focus. As I discussed in 1.2, reasoning is automated when it is no longer done by humans. In this section, I concentrate on the question whether or not it is possible to fully automate reasoning, and how reasoning changes when it is no longer done by humans. This implies there is already a difference between how computers reason and how humans reason. In tackling the question of how scientific reasoning changes, we should pay attention to the ways in which reasoning is automated. We should also pay attention to how we use ‘scientific reasoning’, which is often used to denote the logic behind scientific discoveries. How do we get from certain types of scientific data to knowledge or a theory? Within data-intensive science, there are various discussions that concern this subject, such as discussions on the role of theory or induction.

The use of theory

An important debate within data-intensive science concentrates on the question whether there is still a need for theory. The position that there is no longer a need for theory within data-intensive science is presented as a consequence of automation. The idea is that if we can find correlations with automated methods, we no longer need theory to explain them. There is also a discussion on whether or not automation is completely possible, since it always requires some sort of theory.

Anderson (2008) argues that data-intensive science will mean the ‘end of theory’. As I discussed in the introduction, Anderson believes that the patterns and correlations established in data analysis are enough, and that we do not need further theory or interpretation. Anderson’s example of working in this way is the method used by Google. Google uses algorithms to establish which pages are the best in response to a certain search query. It also uses algorithms to establish which ads will work best on certain pages. This does not need any further theorising; all we need to know is that it works. Examples in science are a bit harder to find. Anderson gives one example of gene sequencing, where researchers sequence ecosystems in order to find correlations which may indicate certain species. Without knowing anything about the species, they know that the species are supposed to be there. A similar position to this ‘end of theory’ is that of agnostic science, as presented by mathematicians Napoletana, Panza & Struppa (2010).

These types of claims are what Kitchin (2014) calls ‘empiricism’. Empiricism aims to come to impartial conclusions on the basis of data, without the use of hypotheses, models, or theory. Kitchin contrasts this with ‘data-driven science’, which he describes as follows: “As such, the epistemological strategy adopted within data-driven science is to use guided knowledge discovery techniques to identify potential questions (hypotheses) worthy of further examination and testing.” (Kitchin 2014, p. 6). There are two ways in which theory enters data-driven science here: first, with the use of ‘guided knowledge discovery techniques’, where the question and method are already informed by theory. Second, the hypotheses that are generated using data-driven science require further investigation, which is informed by these hypotheses and additional theory.

The empiricist position of the ‘end of theory’ is controversial within data-intensive science. However, most authors agree that science does not become theory-free. I discuss some of the arguments below. Data-driven science, as presented by Kitchin, seems to be the better alternative. I start with the arguments concerning the prominence and feasibility of theory-free empiricist science, before moving on to normative arguments.

As discussed in section 1.2, Leonelli (2012) criticises the idea of a lack of theory within data-intensive science by showing that there is still a need for theory to make the methods work. The conclusions within data-driven science need to be made by scientists. This calls for a more nuanced picture of theory-ladenness within data-intensive science. Pietsch (2015a) attempts to provide such an account for data-intensive science. He distinguishes external and internal theory-ladenness. External theory-ladenness concerns framing the problem, which deals with theoretical assumptions about the relevance and stability of the factors and correlations identified in the data. We assume that a correlation found within the data is a stable connection under the circumstances that we investigate, and that it is not a coincidence. This external theory-ladenness is also important to avoid problems such as the Google Flu issue (see 1.1).

Internal theory-ladenness concerns the use of hypotheses about certain connections. We hypothesize that there will be a correlation between A and B, and then investigate this. According to Pietsch, we do not have internal theory-ladenness in data-intensive science. He compares data-intensive science to exploratory experimentation, where a researcher tries all the different configurations that he or she considers relevant.

Philosopher of biology Ratti (2014) makes a similar point, taking data-driven science as the starting point for hypothesis-driven research. Ratti distinguishes between mining studies that may function as exploratory experiments (in the same way that Pietsch describes), and studies aimed at eliminating hypotheses. The mining studies provide very general connections that serve to guide future research. These are related to the empiricist, theory-free studies. The latter, ‘hypothesis-driven’ studies take a ‘hybrid’ approach, in which a large number of hypotheses is tested using data. The examples Ratti uses are from genomics, in which thousands of possible genes are hypothesised as causes for medical conditions. The influence of one gene on a medical condition is one possible hypothesis, which can be eliminated using large datasets. The correlations found lead to some hypotheses that are not eliminated, and these can be further tested in other research.
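As a rough illustration of this eliminative, ‘hybrid’ strategy, the sketch below screens a large number of candidate hypotheses (invented ‘genes’) against data and discards those whose correlation with a trait could easily be due to chance. The data, the number of genes, and the Bonferroni threshold are all illustrative assumptions, not taken from Ratti's examples.

```python
# Schematic illustration of screening many candidate hypotheses against data
# and eliminating most of them. Data and names are invented for illustration.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_samples, n_genes = 200, 1000
expression = rng.normal(size=(n_samples, n_genes))           # invented expression levels
trait = 0.8 * expression[:, 0] + rng.normal(size=n_samples)  # only "gene 0" actually matters

alpha = 0.05 / n_genes  # Bonferroni-corrected threshold for testing 1000 hypotheses at once
surviving = [gene for gene in range(n_genes)
             if pearsonr(expression[:, gene], trait)[1] < alpha]
print("hypotheses not eliminated:", surviving)  # typically only gene 0 survives
```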

From these arguments we can conclude that theory-free science has a limited presence. The purest form of theory-free science would be the mere establishing of connections without any further use of theory, but this is hardly present within science. The closest thing to it may be the exploratory data mining that both Ratti and Pietsch describe. This is closer to Kitchin’s idea of data-driven science, in which there are two ways theory can come in. First, there is the construction of the data mining itself, which has external theory-ladenness. We have to assume that the correlations we find hold under stable circumstances, and that we have enough data to test all the possible correlations. Pietsch assumes that in exploratory data mining we are looking for causal connections. This is not necessarily so, since defenders of theory-free science have argued that there is no need for a causal connection. Also, there are nowadays technologies which can cluster data themselves, without the need for scientists to cluster them, as described in 2.2. The data mining itself is then not heavily constructed, so it is possible to have even less theory than Pietsch presupposes. The second point at which theory can come in is the further investigation of the conclusions of exploratory data mining. From these conclusions, researchers start further studies, for example experiments or more informed analyses of data. This role of further investigations is also discussed in chapter 3. The prominence and feasibility of theory-free science or empiricism is thus very limited within data-intensive science; data-driven science is the prominent model.

I should remark that all the examples used in this discussion come from the fields of biology and genomics, which are only a small part of data-intensive science. However, these discussions are relevant to other fields as well and may be generalised. Some fields require their data to be selected using certain theoretical assumptions (such as physics) and some fields need theory to select the important qualities of data (such as the social sciences). For a more extensive discussion of other fields, see 3.1.

The arguments I have discussed so far are descriptive arguments. We should also look at the more normative arguments, which may underlie these descriptive arguments. Theory-free science leads to a loss in understanding, because it is unclear why and how certain correlations are found. Understanding is an important goal of science. A lack of understanding will probably not stop policymakers or companies from making changes that can be effective, but we should question whether science should move in the same direction. Anderson (2008) and other defenders of theory-free science claim that science can only advance using data analysis in this way, because of the complex phenomena we study.

Pietsch (2015b) remarks that data-intensive science may make use of another form of explanation, exactly because of the complexity of the phenomena. He identifies two ways of explanation:

“(…) (i) to explain by giving an argument that derives what is to be explained from a number of general laws or rules thereby relating a phenomenon to other phenomena and achieving unification (…) ; (ii) to explain by citing the causal factors that can account for a certain event, where these factors are difference-makers and can be identified by eliminative induction.” (Pietsch 2015b, p. 168)

Explanation in data-intensive science is mostly of the second kind, in which a certain event is explained in terms of preceding factors. The role of eliminative induction in this will be discussed in the next part of this section. These types of explanation are more prominent in sciences that deal with complex phenomena, such as the social sciences, psychology, medicine, and history (Pietsch 2015b, p. 169). What makes this type of explanation scientific is not complete understanding or explanation in terms of general laws. This type of explanation does, however, still require some theoretical concepts and interpretation. If we want to explain why someone commits a criminal act by pointing to their socioeconomic background, we assume that the socioeconomic factors form a specific category, and that these factors are causally relevant. So while it is true that certain complex phenomena cannot be explained using overarching laws, explanation still requires theory. If we want the goal of science to be explanation and understanding, we cannot use theory-free science.

In conclusion, the ‘end of theory’ position is a naïve position about data-intensive science and scientific research. I discuss why this position is naïve more extensively in 3.2. The closest thing to theory-free research is exploratory data mining studies, but these still require some theory in their construction and lead to more theory-informed research. Also, the role of theory is essential to goals of science such as understanding and explanation. This is why science cannot be completely theory-free. There are still changes in scientific reasoning, also concerning theory, which will be discussed in the next part of this section, which is about induction.


Induction

The changes within scientific reasoning following the rise of data-intensive science are for a large part related to induction. Inductive reasoning is seen as an important quality of data-intensive science (Leonelli 2012), which helps in guiding further research. In 2.3, I also discuss the role of induction in machine learning methods, and in chapter 3, I discuss the role of induction in data-intensive science extensively.

Induction is making a generalisation after many observations. For example, after observing a lot of ravens, one could conclude “all ravens are black”. This conclusion could change if one observes a raven that is not black. A bit more complex are cases where one wants to draw a conclusion with probability involved. In these cases, one also investigates a lot of different instances, but puts the conclusion in terms of probability. These conclusions are often of the form “given X, the probability that Y is …”, for example “given that this person smokes, the probability that this person will develop lung cancer is…”. This type of reasoning is also applied within data-intensive science. The data can be seen as a large amount of observations we can draw conclusions from and find correlations in. It is related to the discussion about theory, because there is the belief that induction can be done mostly without a theory to guide it, and that the conclusions are relatively theory-free. However, many inductive conclusions do require theory (Leonelli 2012, Kitchin 2014). The conclusions may require less theory, because they are not used to confirm or disconfirm a hypothesis. So induction can require less theory than for example falsification, but it does require interpretation and input from the researcher.
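As a toy illustration of such a probabilistic, inductive conclusion, the sketch below estimates a conditional probability as a relative frequency in a set of observations. The records are invented purely for illustration and carry no empirical content.

```python
# Estimating a conditional probability from observations, as a simple form of
# inductive conclusion. The records are invented for illustration.
records = [
    {"smokes": True,  "disease": True},
    {"smokes": True,  "disease": False},
    {"smokes": True,  "disease": True},
    {"smokes": False, "disease": False},
    {"smokes": False, "disease": False},
    {"smokes": False, "disease": True},
]

smokers = [r for r in records if r["smokes"]]
p_disease_given_smoking = sum(r["disease"] for r in smokers) / len(smokers)
print(f"P(disease | smokes) is approximately {p_disease_given_smoking:.2f}")
```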

Related is the method of eliminative induction, in which a phenomenon is investigated under different variations (Pietsch 2015b). Pietsch (2014) distinguishes enumerative and eliminative induction. In enumerative induction, instances of the same phenomenon are all seen as confirmation of a conclusion. One looks at similar instances and draws a conclusion from them. For example, observing a black raven is confirmation of the truth of the statement ‘all ravens are black’. This can be problematic, because one only investigates cases that are similar, and not cases in which there is a certain parameter variation. In eliminative induction, by contrast, many different instances are investigated, in order to identify which factors are relevant. Since Pietsch is interested in causal reasoning, these factors are also seen as causally relevant (Pietsch 2014). Because of the variations that are tested, the factors can be identified as causally relevant (Russo 2009). Eliminative induction is used in data-driven science (Pietsch 2015b, Ratti 2014), because in it scientists also check many different possible factors and configurations, and identify relevant factors. Pietsch (2014) states that eliminative induction is used in exploratory experimentation, which he compares to data mining studies (2015a).
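A toy sketch of the eliminative idea: given observed configurations that vary one factor at a time, a factor is kept as a difference-maker only if changing it, with everything else fixed, changes the outcome. The factors, configurations, and outcomes below are invented; real data-intensive studies of course work with far larger and messier variation.

```python
# Toy illustration of eliminative induction (method of difference): a factor is
# identified as a difference-maker if two observed configurations that differ
# only in that factor differ in outcome. Configurations are invented.
observations = [  # ({factor: value}, outcome)
    ({"a": True,  "b": True,  "c": True},  True),
    ({"a": False, "b": True,  "c": True},  False),  # only "a" changed: outcome changed
    ({"a": True,  "b": False, "c": True},  True),   # only "b" changed: outcome unchanged
    ({"a": True,  "b": True,  "c": False}, True),   # only "c" changed: outcome unchanged
]

def difference_makers(observations):
    factors = observations[0][0].keys()
    relevant = set()
    for f in factors:
        for cfg1, out1 in observations:
            for cfg2, out2 in observations:
                differs_only_in_f = all(
                    (cfg1[g] == cfg2[g]) == (g != f) for g in factors
                )
                if differs_only_in_f and out1 != out2:
                    relevant.add(f)
    return relevant

print(difference_makers(observations))  # {'a'}
```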

So eliminative induction seems to be more prominent in data-intensive science, because in large datasets many different connections are under study and relevant factors are identified. This also depends on the size and qualities of the dataset. Sometimes not all possible configurations are present, and sometimes data are selected precisely because they come from similar instances. There is the assumption that the possibly relevant factors are measured within the dataset, and that there are no other factors, or ‘hidden’ or latent variables. Also, the connections that are established within data-intensive science are not necessarily causal. What this means for causal reasoning has to be investigated. The consequences of data-driven science for induction will be further investigated in chapter 2. The discussions within the philosophy of machine learning can give a more nuanced picture of the role of induction within data-intensive science. In chapter 3, I discuss the role of inductive reasoning within data-intensive science as well.


Causal reasoning

Another branch of the discussion about scientific reasoning in data-intensive science is about causal reasoning (Canali 2015, Pietsch 2015b). Authors such as Pietsch believe that we can find causal relations within large datasets. This can only happen if we use some theory, to distinguish causes and effects. Causal reasoning is therefore not a part of theory-free science.

We can wonder whether causal reasoning changes with the rise of data-intensive science. After all, the ways we find causal connections do not differ radically from how experiments are conducted. We still look for factors that give a certain effect, by examining the various circumstances in which it takes place (Russo 2009). The assumption that one factor is prior to another in a causal chain is still at work in experimental practices. This reasoning also applies when we try to interpret the results of a mining study. In these mining studies, finding relevant factors can be done on a larger scale, due to the size of datasets and the methods that we use for analysis. This changes our conclusions: we can identify more factors and therefore find causes in more complex phenomena (Pietsch 2014, 2015b). Our predictions can become more accurate, because we have more data. In this sense, scientific reasoning does change. However, there is also a danger of interpreting weak correlations as causes. Because of the large datasets, correlations are found more easily, and could be interpreted as causal even if they are very weak. The role of causal reasoning and explanation is also discussed in section 3.2, where I also discuss the form of the conclusions of causal reasoning. Is this simply pointing out causal factors, or should it be in the form of predictive laws?
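The sketch below illustrates this danger in the simplest possible way: when many variables are tested against the same outcome, some of them will correlate ‘significantly’ by chance alone, even though all of the data here are pure noise. The sample sizes and threshold are arbitrary choices for the illustration.

```python
# Illustration of why large, many-variable datasets invite spurious findings:
# even pure noise yields some variables with nominally "significant" correlations.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_samples, n_vars = 1000, 200
noise = rng.normal(size=(n_samples, n_vars))
target = rng.normal(size=n_samples)  # unrelated to every column by construction

spurious = [i for i in range(n_vars) if pearsonr(noise[:, i], target)[1] < 0.05]
print(f"{len(spurious)} of {n_vars} unrelated variables correlate 'significantly' at p < 0.05")
```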

Conclusion

Big data plays an important role nowadays. In this thesis, I understand big data in terms of methods and practices, rather than in terms of the size and quality of the dataset, because I am interested in the consequences of these methods for scientific reasoning. The use of these methods is very present in data-intensive science. How these methods are used in data-intensive science differs from field to field. In some fields both data collection and analysis are automated (genetics); in other fields only parts of data collection and analysis are automated (humanities).

I focused my discussion of the consequences of automation on the automated search for correlations. According to some authors, the consequence of this automated search for correlations is the emergence of a ‘theory-free’ science, but I have argued that this is neither feasible nor desirable. Science still requires a lot of theory to draw conclusions from data, because the data have to be organised, and the programs have to be programmed to recognise certain features. We also want to use science to explain things, not only to predict them. This is why theory-free science is not a consequence of the automation of scientific reasoning. In the next chapter, I further explore the consequences of automation for induction, by drawing from discussions on machine learning and artificial intelligence.


Chapter 2: Machine Learning and Scientific Reasoning

The goal of this chapter is to draw on the discussions from the philosophy of machine learning and Artificial Intelligence (AI), in order to see how the conclusions from these discussions can be applied to the discussion on the use of big data methods in data-intensive science, as discussed in chapter 1. AI and machine learning methods are applied to big data practices, which is why I discuss machine learning and data mining. The philosophy of AI and machine learning has been around longer than the philosophy of big data, since discussions about big data and its methods have only gotten attention in the last decade. It could therefore be useful to look at the philosophy of AI and machine learning, see what we can learn from it and compare it to the discussions about big data and data-intensive science. In the discussion of AI, there are different beliefs about what the goal of artificial intelligence is. I go into the different positions by philosophers of science Paul Thagard and Donald Gillies in section 2.1. In section 2.2, I discuss the technology of machine learning and how it is applied in scientific practice. I conclude with the changes in scientific reasoning that follow from data mining in section 2.3. These changes are mostly related to induction and similar forms of reasoning.

2.1 - Approaches in the Philosophy of AI

In this section, I discuss different perspectives on Artificial Intelligence. AI aims at computer programs that perform certain tasks by themselves, and machine learning technology is one of its main tools. In section 2.2 I discuss the application of machine learning technologies specifically. It is useful to first discuss different ways to model AI and how these perspectives influence the use of AI technology in science. I discuss two possible approaches to AI.

In AI, there is a division between what Thagard (1988, p. 3) calls 'neats' and 'scruffies'. The 'neats' are interested in building AI as a system of logic. Their starting point is the use of logical and mathematical principles for solving practical problems. Gillies (1996, p. 19) calls this the Turing tradition, and he traces it back to the development of the Turing machine; it mostly makes use of deductive logic. The 'scruffies' take their inspiration from psychology. They look at what kind of inferences people make and then try to simulate these in an AI program. These human inferences could be based on logic, but other principles are also used, such as inductive reasoning and inference to the best explanation. Both of these approaches have their applications within science, but the role of the technologies differs.

'Neat' and 'scruffy' AI methods are used differently within scientific practice. Thagard (1988) describes two 'scruffy' types of AI methods: simulations of human reasoning, which are used to replicate scientific conclusions, and the construction of Artificial Neural Networks (ANNs).

An example of a simulation of human reasoning is the simulation of the discovery of Kepler's laws by giving a program the same information that Kepler had (Gillies 1996, p. 20). Thagard (1990) calls this 'explanation-based learning' (p. 265). In these types of simulations the whole process of reasoning is automated. The focus is not on data collection and analysis, but on reasoning from certain premises and on the use of concepts by computer programs. In these cases, the program is given a certain concept, or tries to extract a concept from just one instance. For example, the program can form the concept 'to kidnap' by trying to describe a situation in which someone is abducted for money. Thagard believes that programs could also learn to extract concepts from certain instances, combine these concepts, and come up with scientific explanations. He describes a program that can associate concepts with each other, and shows how this program comes to associate 'sound' with 'wave' through other concepts such as 'instrument' and 'stringed instrument' (Thagard 1988, p. 22). Every concept is also associated with some rules, such as how waves move. Thagard (1988) uses this description to picture human reasoning in a computational way. Gillies points out that these simulations of reasoning have not been successful in producing new conclusions, but only in replicating earlier ones. This means that the explanation-based method is not very useful for research, but it can be used as an analogy for human reasoning.

There is, however, a rise in the use of technology inspired by scruffy AI in modern-day science, in the form of Artificial Neural Networks (ANNs) (for an explanation by computer scientists, see LeCun, Bengio & Hinton 2015). Thagard (1990) calls these types of methods 'connectionist'. Artificial Neural Networks consist of layers of small units or 'neurons'. The units of the first layer process the input and pass their output to another layer of units, which processes it in turn. Finally, there is an output layer which, for example, gives a classification. Figure 2 (Thagard 1990, p. 267) gives a very simple example of a neural network. Depending on the activation of the units 'mane', 'udder', and 'tail', hidden units take on certain values, which lead to the output of a certain animal. The network can become more effective through backpropagation, which adjusts the weights of the different connections. For example, the unit 'mane' could be treated as irrelevant to the output for 'horse'; if this leads to a wrong classification, backpropagation allows the program to adjust this connection. Such networks can be used to analyse things like pictures, because one layer can identify the different parts of the picture, another layer can identify the different shapes, and this can be refined over several layers.

Figure 2 - An example of a very simple neural network (Thagard 1990)
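To make the mechanism of Figure 2 concrete, the following is a minimal sketch of such a network in Python. The toy data, layer size, and learning rate are my own illustrative assumptions rather than values from Thagard (1990); the point is only to show inputs flowing through a hidden layer and backpropagation adjusting the connection weights.

```python
# Minimal sketch of a small feed-forward network with backpropagation.
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: [mane, udder, tail] -> one-hot [horse, cow]
X = np.array([[1, 0, 1],   # horse: mane and tail, no udder
              [0, 1, 1],   # cow: udder and tail, no mane
              [1, 0, 1],
              [0, 1, 1]], dtype=float)
Y = np.array([[1, 0],
              [0, 1],
              [1, 0],
              [0, 1]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of four units, random initial connection weights
W1 = rng.normal(scale=0.5, size=(3, 4))
W2 = rng.normal(scale=0.5, size=(4, 2))

for epoch in range(5000):
    # Forward pass: input layer -> hidden layer -> output layer
    H = sigmoid(X @ W1)
    O = sigmoid(H @ W2)

    # Backpropagation: push the output error back through the layers
    # and adjust the connection weights accordingly
    error_out = (O - Y) * O * (1 - O)
    error_hidden = (error_out @ W2.T) * H * (1 - H)
    W2 -= 0.5 * H.T @ error_out
    W1 -= 0.5 * X.T @ error_hidden

# After training, an animal with a mane and a tail activates 'horse'
print(sigmoid(sigmoid(np.array([[1.0, 0.0, 1.0]]) @ W1) @ W2))
```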

ANNs are now used successfully in different branches of science. There are studies by psychologists which make use of ANNs in order to simulate human reasoning (Keshavan & Sudarsan 2017) and draw conclusions on how psychosis works. This is an application of the simulation of human reasoning, but it is not explanation-based. ANNs can also be used to conduct sophisticated data analysis. Examples of the use of ANNs in research can be found in economics (Namazi,

More prominent in current science, and especially current data-intensive science, is technology in the tradition of 'neat' AI. Here, scientists use algorithms to analyse data, with logical principles and rules that can be given by mathematical formulae. This differs from the 'scruffy' methods discussed above, in the sense that the 'scruffy' methods do not use mathematical formulae, but rather the association of concepts and the different 'neurons' in an ANN. The conclusions that 'neat' AI methods produce take the form of clearly identifiable rules or mathematical results, such as correlations or probabilities.

An example of 'neat' AI given by Gillies is a study into the structure of proteins, in which a program analyses the structure of certain proteins and comes up with rules for finding the structure. This is done by analysing a large number of coded protein structures. The program works differently from how humans would, by trying different classifications and seeing how well they work. It presents a system of rules of the form "if at point x there is protein y to be found, the structure will be z". These rules are based on qualities of the data, not on concepts formed by the program, and the reasoning is not distributed over different units. The AI in this case is not used to recreate human reasoning, but is based on principles of logic. These types of methods are used as an addition to science, not as a simulation of what scientists do (Gillies 1996, p. 156). Automation here means the automation of specific tasks, such as classification or clustering. I will also discuss this type of automation in machine learning technology, in section 2.2.
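As an illustration of this style of automation, the sketch below fits a small decision tree to invented, symbolically coded 'protein' examples and prints the result as explicit if-then rules. The encoding and labels are hypothetical placeholders, not the representation used in the study Gillies describes.

```python
# Hedged sketch of 'neat'-style rule induction with a decision tree.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: numeric codes for the residues at positions x-1, x, x+1 (invented)
X = [[1, 2, 3],
     [2, 2, 3],
     [1, 1, 1],
     [3, 1, 1],
     [3, 3, 2],
     [2, 3, 2]]
# Structure label at position x (invented): 'helix', 'sheet', or 'coil'
y = ["helix", "helix", "sheet", "sheet", "coil", "coil"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The learned model can be printed as rules of the form
# "if at point x there is residue y, the structure will be z"
print(export_text(tree, feature_names=["pos x-1", "pos x", "pos x+1"]))
```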

2.2 - Machine Learning and Data Mining

A specific technology used in AI, and one which can help us to understand the examples and methods discussed in 2.1, is machine learning. Machine learning is a way to make computer programs adaptable, so that the programs find their own solutions to problems (Alpaydin 2010, p. 15). Machine learning techniques are prominent in big data analysis, for example in classifying and clustering data. The technologies that Thagard and Gillies describe in their analyses are also machine learning technologies, so to understand the consequences they describe, it is useful to know more about the technology itself. In this section, I explain two strands of machine learning, supervised and unsupervised learning, and their relation to big data practices. I also discuss their use within data-intensive science.

In the 1970s, the idea arose that instead of writing algorithms so that computer programs could solve one specific problem, algorithms could be used so that computer programs learn how to solve problems (Kubat 2015, p. xi). This would be more effective and could provide us with new solutions. This learning can happen by way of examples, just as people learn certain things. When you ask me what my mother looks like, I could give you a description of the features of her face, but it would be much more effective to show you a few photos. From these you extract the features you need in order to recognize her. The same happens with machine learning.

Another example of how this can work uses a set of pies that Johnny likes and a set of pies that Johnny does not like (Kubat 2015, p. 2). This is called the 'training set'. We want to be able to predict whether Johnny will like a new pie. A machine learning program can try a first classification rule, for example 'if the pie is round, Johnny will like it'. It can then add further conditions, such as the pie being round and white, or square and dark. Finally, the program has a set of rules that classifies (most of) the pies. A testing set of other pies is then used to see whether the classification works. This type of machine learning is called 'supervised learning': the output values and the training set are already provided by the researcher. In the example of Johnny's pies, the output values 'Johnny likes' and 'Johnny does not like' and a training set of pies were given in advance.

In unsupervised learning, on the other hand, the values are not provided (Alpaydin 2010, p. 11). An example of an unsupervised method is clustering, in which the machine learning program forms clusters based on the values of certain properties. It can, for example, group customers of a store according to age or average amount spent. The researcher does not provide age groups, such as 18-25; the program clusters the groups by itself. This is done by calculating cluster centres such that every data point lies as close as possible to the centre of its cluster. For example, the amounts of money spent by store customers can be divided into two clusters, so that within each cluster all points are closest to that cluster's centre.
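A minimal sketch of this clustering idea, using the store-customer example: the spending figures are invented, and a standard k-means implementation assigns each customer to the cluster whose centre it is closest to.

```python
# Illustrative k-means clustering of invented customer spending data.
import numpy as np
from sklearn.cluster import KMeans

# Average amount spent per customer (one feature per row)
spending = np.array([[12.0], [15.5], [18.0], [14.2],
                     [96.0], [104.5], [88.7], [110.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(spending)

print(kmeans.labels_)           # which cluster each customer fell into
print(kmeans.cluster_centers_)  # the two centres the algorithm found
```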

Machine learning techniques are also used in data mining, whose applications I discussed in chapter 1 (Alpaydin 2010, p. 2). Machine learning is closely related to big data. Supervised machine learning techniques can be applied to large datasets in order to organise the data into different categories; training sets of data are required for successful supervised learning. An application of supervised learning is, for example, diagnostic software. A doctor enters certain symptoms into a system, and a disease is matched to the symptoms. The program 'knows' which symptoms correspond to a certain disease because it has been given examples of different diseases. In science, supervised learning is also used for classification, which can reveal certain properties of the dataset. Gillies (1996, p. 50) uses the example of trying to find the secondary structure, or shape, of proteins based on the primary structure, the sequence of residues. With a handful of examples, the program can find the secondary structure of proteins based on the primary structure. What is automated in this case is a classification task that is too complex for the scientists to do themselves. Supervised machine learning can thus be used to find patterns in available training sets, and these patterns and classifications can be further investigated. In the example of the proteins, scientists can go on to investigate why the rules that the program found are successful.

Unsupervised learning can also be applied to find patterns in large datasets. This relates to the discussion of exploratory studies in chapter 1: it can be used to find patterns or correlations that may then be used for further research. With unsupervised learning, there is more automation than with supervised learning. It is not classification that is automated, but the clustering of properties and the finding of patterns. It can be applied, for example, to social media data, which can be clustered based on the words that appear in the messages.
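As a hedged illustration of this kind of exploratory use, the sketch below clusters a handful of invented messages purely by the words they contain; no labels or categories are provided in advance.

```python
# Illustrative unsupervised clustering of short texts by their words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

messages = [
    "great match tonight, what a goal",
    "that goal in the final minute was incredible",
    "new phone battery drains far too fast",
    "my phone keeps crashing after the update",
]

# Represent each message by the weighted words it contains
vectors = TfidfVectorizer().fit_transform(messages)

# Group the messages into two clusters without providing any labels
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: sport messages versus phone messages
```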

So automation with machine learning can happen in different ways. The most prominent is the automation of specific tasks, mostly in the tradition of neat AI. A complete simulation of human reasoning in the tradition of scruffy AI is not yet possible, but scruffy AI technologies such as Artificial Neural Networks are increasingly used in scientific practice. The automation of specific tasks can consist in classification, clustering, or the spotting of patterns. Classification is done with supervised learning, whereas clustering and the finding of patterns are done by way of unsupervised learning; ANNs are used for classification tasks such as image recognition. Section 2.3 focuses on the consequences for scientific reasoning of specific automated tasks, because these are the most widely used.


2.3 - Machine learning and scientific reasoning

In this section, I explore the consequences of automation with machine learning methods for scientific reasoning. As discussed in section 2.2, automation with machine learning methods is mostly the automation of specific tasks, and Gillies discusses the consequences of these specific tasks for induction. Before treating this at length, I first consider Thagard's idea of a simulation of science in the tradition of scruffy AI, and what it may tell us about scientific reasoning. Furthermore, I discuss the role of theory, analogous to my discussion of theory in data-intensive science in 1.3.

Simulations of science

As I have discussed in 2.1, Thagard (1988) believes that 'explanation-based' machine learning can be used to learn about scientific reasoning. According to Thagard, hypotheses are constructed using abduction (1988, p. 83), or 'inference to the best explanation': the program looks for the best explanation of a certain state of affairs, and a new theory or a new concept can be introduced as that explanation. Theories can then be evaluated in terms of how much they explain (consilience) and how many concepts they need (simplicity). Finally, the evaluation also considers similar theories in other fields (analogy) (Thagard 1988, p. 98). Thagard's aim is therefore mostly descriptive, though he does make some normative claims concerning the nature of science.
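Purely as a toy illustration of these evaluation criteria, one could score rival hypotheses as in the sketch below. The hypotheses, 'facts', and equal weighting are invented for the example and make no claim about how Thagard's own programs implement consilience, simplicity, and analogy.

```python
# Toy scoring of rival hypotheses in the spirit of consilience,
# simplicity, and analogy; all inputs and weights are invented.
def score(explains: set, concepts: set, has_analogy: bool) -> int:
    consilience = len(explains)          # how much the theory explains
    simplicity = -len(concepts)          # penalise extra concepts
    analogy = 1 if has_analogy else 0    # bonus for an analogous theory elsewhere
    return consilience + simplicity + analogy

wave_theory = score(explains={"pitch", "echo", "interference"},
                    concepts={"wave"}, has_analogy=True)
particle_theory = score(explains={"pitch", "echo"},
                        concepts={"particle", "medium"}, has_analogy=False)
print(wave_theory > particle_theory)  # the more consilient, simpler theory wins
```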

However, this use of machine learning does not by itself lead to a change in scientific reasoning. Machine learning methods that are not explanation-based can lead to such a change, because large amounts of different data can be analysed at once and many different conclusions can be drawn. This does not depend on the neat or scruffy approach to AI, because Artificial Neural Networks can also be used for data analysis. In this use, the methods are an addition to scientific reasoning. Thagard claims that we can get a clearer picture of scientific reasoning by using simulations of the scientific process; machine learning then becomes a method for the philosophy of science. For a similar claim, see Korb (2004).

Williamson (2004) and Bensusan (1999, 2000) make a more moderate claim, without identifying the philosophy of science with machine learning. They believe that both the philosophy of science and explanation-based machine learning study what they call 'inductive strategies' (Bensusan 2000, p. 1), which is why machine learning may be used as a method for the philosophy of science. This presupposes that human reasoning resembles the reasoning in machine learning. Bensusan (2000, p. 4) does not exclude the possibility that the induction done by machines differs from the induction done by humans; in that case, it is more useful to study machine learning as a method used in science than as a simulation of the scientific process. However, explanation-based learning is not widely used in science, so it does not reveal many new scientific practices. It is worth noting that this literature concentrates on simulations of explanation and scientific reasoning, rather than on the use of simulations as a tool within science. There is a thriving literature on simulations as a method and a tool in science: Gray (whose lecture can be found in Hey et al. 2009) mentions the use of simulations as a third paradigm preceding big data. Certain biological, physical, and social systems can be simulated in order to learn about them. For example, an ecosystem could be simulated to show what happens to wildlife when there is a shortage of food. This differs from simulations of the scientific process, in which computer programs are used to mimic scientists.
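For contrast, the following is a minimal sketch of a simulation used as a tool within science, along the lines of the ecosystem example: a toy population model in which the food supply is suddenly reduced. All numbers are invented for illustration and do not correspond to any model in the literature cited here.

```python
# Toy simulation of a herbivore population whose food supply is cut
# halfway through; all parameters are invented for illustration.
food_capacity = 1000.0   # how many animals the food supply can sustain
population = 200.0
growth_rate = 0.3

history = []
for year in range(40):
    if year == 20:
        food_capacity *= 0.4          # sudden shortage of food
    # Logistic growth towards the current food capacity
    population += growth_rate * population * (1 - population / food_capacity)
    history.append(round(population))

print(history)  # population climbs, then falls towards the reduced capacity
```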

Explanation-based technologies are not widely used in science, and are also not used much as simulations of human reasoning. What they can reveal is therefore limited. Machine learning technologies that are not explanation-based can bring about changes in scientific method and
