
The influence of interpretable machine learning on

human accuracy

A study on the increased accuracy of a LIME-explanator on a classification

test

Rens Sturm

Master Thesis

MSc Marketing Intelligence


The influence of interpretable machine learning on

human accuracy

A study on the increased accuracy of a LIME-explanator on a classification

test

By: Rens Sturm
University of Groningen
Faculty of Economics and Business (FEB)
Department: Marketing

Master: Marketing Intelligence

June 2020

First supervisor: K. Dehmamy
Second supervisor: J. Wierenga

Saffierstraat 22, 9743LH Groningen 06-83773819

r.d.sturm@student.rug.nl


Management Summary

The field of machine learning is growing at an unprecedented rate, increasing its applications in everyday life. A prominent new role of machine learning is the automation or support of decision making for businesses, courts and governments, helping them to make faster and better decisions. The rapid expansion of machine learning decision making has caused unease among academics, consumer groups and legal experts. While the reasons for unease vary, a major one is that machine learning models make or support decisions without providing an explanation or supporting arguments. Together with the fact that machine learning models are, like all systems, fallible, this has led to an increasing demand for interpretable machine learning models.

This research tests whether an interpretation mechanism included in the machine learning model increases trust in the model, and whether an explanatory mechanism makes the decision-maker more accurate. To do this, we trained a neural network, a type of machine learning model, on a dataset containing passengers of the Titanic, which sank in 1912. After training, the model was made interpretable using Local Interpretable Model-agnostic Explanations (LIME). The interpretable model was used by 145 participants to estimate whether certain individuals had survived the Titanic disaster.

It was found that an explanation mechanism significantly and positively influences the accuracy of the participants in estimating survival (B = 0.1037, p = 0.003). Second, an explanatory mechanism also increases the trust of the participant in the model (B = 0.8955, p = 8.8e-08). We found women to be slightly better at predicting survival than men, although the explanation for this may lie in the methodology rather than in gender itself. We found no correlation between expertise on the Titanic and accuracy, nor between experience with machine learning models and accuracy.

Managers can use these insights in critical areas where a boost in accuracy yields large benefits. Companies that already use machine learning, for recommendations or automated decision making, can use explanatory mechanisms to increase the trust of the consumer in the model. Further research is needed to generalize interpretable machine learning to areas other than decision support and classification problems.


Preface

The writing of a thesis is a rite of passage undergone by all students. For me the journey was thoroughly enjoyable, as I was able to immerse myself in the field of machine learning and write about it. Combining the new field of machine learning with old history like the Titanic was especially satisfying. I am happy that I was able to sneak in a little history at last. Throughout the writing I was helped by a few people I would like to thank.

First of all my supervisor Keyvan Dehmamy. While help from your supervisor can be expected, Keyvan went truly above and beyond my expectations, something I am very grateful for. Second I would like to thank my fellow student Darius Lamochi, who helped me from the start. Lastly I want to thank my friends and family who helped me, especially Daan Romp, who was kind enough to check my thesis for spelling.

I hope you will enjoy reading my thesis.

Rens Sturm


Table of contents

Chapter 1: Introduction ...6

1.1: Relevancy of the problem ...6

1.2: Problem analysis ...6

1.3: Academic and practical relevance...7

1.4: Originality of the research question ...7

1.5: Outline thesis ...8

Chapter 2: Theoretical framework ...9

2.1: Growth in appliance of machine learning ...9

2.2: Interpretable and non-interpretable models ... 10

2.3: Techniques to make uninterpretable models interpretable ... 12

2.4: Appliance of interpretable black box modelling in decision support... 15

2.5: Conclusion ... 16

Chapter 3: Research design & Methodology... 19

3.1: Introduction to neural networks ... 19

3.2: Model development & dataset description ... 19

3.3: Research method ... 22

3.4: Data collection ... 23

3.5: Plan of analysis ... 24

3.6: Conclusion ... 24

Chapter 4: Data analysis ... 26

4.1: Sample analysis. ... 26

4.2: Reliability, validity, representativity ... 26

4.3: Hypotheses and statistical tests ... 27

4.4: Interpretation ... 30

4.5: Conclusion ... 31

Chapter 5: Discussion, limitations and recommendations ... 32

5.1: Reflective discussion on the results ... 32

5.2: Limitations ... 34

5.3: Academic and managerial conclusions... 35

5.4: Conclusion ... 35


Chapter 1: Introduction

In this chapter the research problem is introduced and its theoretical and managerial relevance discussed. We will show that the research problem is urgent, important and original.

1.1: Relevancy of the problem

Machine learning, in which algorithms improve with experience (Langley & Simon, 1995), has achieved a wide level of application. It is used in, amongst others, defense, research, production, air and traffic control, portfolio selection and decision support (Zuo, 2019). The market for machine learning is predicted to grow 186% annually until 2024 (Zion Market Research, 2017). An example that shows the potential of machine learning decision support is a study in which US police chiefs used machine learning to predict whether officers were at risk of overreacting against civilians. The final decision to intervene was left up to the chief, but by using machine learning support accuracy increased by 12% (Carton et al., 2016). A Bank of England survey showed that two-thirds of finance companies in the UK use machine learning for decision support (Bank of England, 2019), making machine learning highly relevant in the 21st century.

Increased use of machine learning has evoked feelings of unease in consumers (CIO Summits, 2019). Reasons include that machine learning models often give no explanation and do not decide perfectly (Guidotti et al., 2018). In response to this growing unease, the European Union has passed legislation that gives consumers a "right to explanation" after having been subjected to a decision by an automated process (Regulation, 2016). Consequently, if a human agent uses a machine learning model as input for a decision, the model must be explainable. Models for which an explanation can be given are defined as interpretable or white box; models that cannot be explained are defined as uninterpretable or black box. Thus, due to the growing concern and unease, the need for interpretable machine learning models has become urgent.

1.2: Problem analysis


A further complication is the absence of consensus about the definition of interpretability. For some academics and managers mechanical knowledge of the model is sufficient; others want to know which input was decisive in the decision-making process (Guidotti et al., 2018). Second, there is an ongoing discussion on how to measure interpretability (Doshi-Velez & Kim, 2017). Some researchers have stipulated that increased human accuracy is an important metric to keep in mind, and in some cases even the most important one (Doshi-Velez & Kim, 2017).

1.3: Academic and practical relevance

The challenge of making machine learning interpretable is almost as old as the field of machine learning itself. In 1983, the first methods for explaining why a model predicted a certain output were proposed, laying the foundation for later innovations (Swartout, 1983). Since then, research has mainly focused on explaining machine learning models. Removing part of the model, changing the input to see how the output changes, or calculating certain values have all been proposed in order to make uninterpretable black box models understandable (Guidotti et al., 2018). Absent is research on the effect of interpretation methods on the human decision maker and his or her accuracy. Considering how widely machine learning is used to support human decisions, the topic of interpretable machine learning and human accuracy provides an original and relevant research question.

Establishing a correlation between interpretable machine learning and human accuracy would help establish increased human accuracy as an important metric for interpretable machine learning and help settle the discussion on how to measure interpretability.

Due to the EU legislation, companies have to use interpretable machine learning for their

procedures, making it all the more relevant. Since companies use machine learning during decision support (Bank of England, 2019) it is relevant to know whether interpretability helps in this process. If accuracy increases it could be used by managers to make better decisions in critical situations, helping the organization reach its goals.

1.4: Originality of the research question

In this report we will look at the research question: "What is the influence of an explanation of the output of a model on the accuracy of the human agent?" Answering this question will add to the existing literature. While increased human accuracy has been proposed as a metric on which to judge machine learning models, it has (to the best of our knowledge) not yet been tested in an experimental set-up.

1.5: Outline thesis

This thesis builds on earlier research and experiments. In chapter two we will discuss the theoretical framework of the thesis and draw a number of hypotheses from the existing literature on what seems likely, but is not proven. In the third chapter we will explain how we have set up an experiment to test these hypotheses.


Chapter 2: Theoretical framework

During this research we will analyze the role of machine learning recommendations on behavioral decision making. This research uses insights of other researchers as a foundation, framework and stepping stone. In this chapter we will briefly describe relevant literature about machine learning, interpretable and non-interpretable machine learning and decision making. Existing literature is used to formulate hypotheses that will be used for the conceptual framework at the end of this chapter. These hypotheses will be tested in the next chapter.

2.1: Growth in appliance of machine learning

To understand the relevance of interpretable machine learning we have to discuss the context of machine learning. In this paragraph we will look at applications of machine learning methods, discuss why it is popular, and distinguish between situations where a human agent is the final decision-maker and situations where the machine learning algorithm decides autonomously. This distinction will demarcate the theoretical reach of the research.

In the past few years, machine learning has greatly improved in performance. In 1997 IBM's Deep Blue defeated world champion Kasparov in chess, a structured game with rigid rules.


Proposed explanations of this superior performance point at the limited human cognitive capacity for processing new information (Shahid, Rappon & Berta, 2019). Machine learning models can handle large quantities of complicated information, while humans find handling many data points difficult. Humans often work with small samples of personal experience, while machine learning models work with larger samples (Bode, 1998). Humans also have a limited capacity to mentally calculate interactions, something that neural networks in particular have less of a problem with (Shahid, Rappon & Berta, 2019). Machines can work through large numbers of scenarios, while humans have a tendency to lock in early (Nickerson, 1998). The basic explanation seems to be that the computational power of computers is greater than human processing power.

Daniel Kahneman, Nobel laureate and professor of human decision making, has famously stated that he thinks simple algorithms consistently outperform human decision making due to their lack of sensitivity to noise (Kahneman, Rosenfield, Ghandi & Blaser, 2016). A literature overview done in 1979 showed a consistent superiority of algorithmic decision making over human decision making (Dawes, 1979). This is consistent with the theories stated above about machine learning superiority. However, critics point at the fact that historical data may not be representative. Before the 2008 crisis, house prices steadily went up, causing algorithms to assume that this would continue forever; prices then crashed and decreased sharply at an unprecedented rate (Trejos, 2007). Other critics point out that humans are better than machine learning models at some tasks (Thiel & Masters, 2015): young children have no difficulty in making a distinction between dogs and cats, while machines do.

Thus, machine learning is getting better at many tasks humans have previously performed.

Automation and decision support are two ways machine learning can add value. Models are better at handling vast quantities of information and thinking them through, whilst humans outperform computers in other areas. Given these benefits we predict that humans will be more correct and precise in their predictions (human accuracy). Therefore, we conclude that when machine learning is used, and a human agent can understand its output, human accuracy increases.

2.2: Interpretable and non-interpretable models


In the introduction we stated that the definition of interpretability is open for discussion (Miller, 2019). In this research we define the interpretability of the model as the degree to which a human can understand the cause of a decision (Biran & Cotton, 2017; Miller, 2019). Guidotti et al. (2018) defined two components of uninterpretable models: 1) opaqueness and 2) number of parts. A model is opaque when its internal workings cannot be observed. Neural networks are mostly uninterpretable because their internal workings are unobservable and because they contain many parts. Models that provide no explanation (also called black boxes) lead users to see the model as less trustworthy (Ribeiro et al., 2016). Providing an explanation increases the acceptance of movie recommendations (Herlocker, Konstan & Riedl, 2000). In a 2003 study, participants rated a recommendation model as much less trustworthy when it gave no explanation, and used it significantly less (Dzindolet et al., 2003). Users report feelings of violated privacy when black box models make recommendations (Van Doorn & Hoekstra, 2013), and a feeling of unfairness (Zhang & Bareinboim, 2018).

Second, a black box model might not work as well as one thinks. It happens that a model 'cheats' by focusing on an artifact in the data. For example, a complicated neural network could accurately predict whether a tank belonged to the United States Army or not by recognizing whether the photo contained clouds (Guidotti et al., 2018): own tanks were photographed in good weather, enemy tanks in bad weather. Another model could recognize whether an animal was a wolf or a husky by spotting snow in the picture (Guidotti et al., 2018). The reason for this 'cheating' is simple: the model does not know these features are 'off limits'. If the model could give an explanation, these kinds of mistakes could be spotted and fixed before they cause damage. Statistics on these kinds of errors are missing, but the risk is there.

Lastly, black box models that are trained on historical data can inherit undesirable patterns. A model trained in 2016 to predict the risk of criminal recidivism showed a large racial bias against people of color (Guidotti et al., 2018). Also in 2016, Amazon's decision not to offer same-day delivery in minority neighborhoods was largely influenced by a predictive machine learning model (Letzter, 2020). These are outcomes that companies want to avoid, since the human rights declaration specifies that treatment should not depend on race ("Universal Declaration of Human Rights", 2020).

While uninterpretable models carry disadvantages, so do interpretable models. A major disadvantage is that interpretable models are generally less accurate than their black box counterparts.


While there have been few opponents of explainable models, they have been used little (Ribeiro et al., 2016). Many agree that explainability is good to have, but few are willing to give up accuracy and a competitive advantage in order to obtain it. The 2016 EU legislation has made it mandatory to provide an explanation for automated decisions. What constitutes an explanation, however, is open for interpretation. This ruling has made the need for accurate explainable models, and for a consensus on what an explanation is and how it can be measured, all the more urgent.

In conclusion, uninterpretable models carry some major disadvantages but are still preferred in practice due to their increased accuracy. The ruling of the EU has made it, at least in the European Union and its member states, necessary by law to use explainable models in scenarios of decision making. Due to the opaqueness and multitude of parts we conclude that the more complex a machine learning model is, the less interpretable a human agent will find the model.

2.3: Techniques to make uninterpretable models interpretable

Previously, the need for accurate interpretable models was specified. Popular solutions try to explain the model so a human can understand it, rather than simplifying the model, so that accuracy and interpretability are both high. We will look at popular methods and explore the influence of explanation mechanisms on trust and understanding. Later, this relation will be used in the conceptual framework.

In the previous paragraphs the distinction between black box and white box models was made. There is a third option: explainable black box models. These models are opaque and contain many parts, but as an additional step they give an explanation of what they did to enhance interpretability (Guidotti et al., 2016). These kinds of explanations do not come at the expense of accuracy.

Black box models can be explained in many ways. We will look at three popular methods: ablation, simplification and saliency.

A basic method of understanding black box models is ablation, where parts of the model are removed to see how the accuracy changes (Xu, Crammer & Schuurmans, 2016). The important parts of the machine learning model can thus be identified. A major downside of ablation is that it is computationally expensive and does not scale well (Sundararajan, 2017).
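To make the idea concrete, here is a minimal ablation-style sketch in R; it uses a logistic regression on a built-in dataset as a stand-in for a black box model, so the dataset and variables are illustrative rather than those used later in this thesis.

```r
# Ablation sketch: refit the model without one predictor and compare accuracy
# to see how much that part of the model contributed.
data(mtcars)
mtcars$am <- factor(mtcars$am)

full    <- glm(am ~ mpg + wt + hp, data = mtcars, family = binomial)
ablated <- glm(am ~ mpg + wt,      data = mtcars, family = binomial)  # 'hp' removed

accuracy <- function(model) {
  predicted <- predict(model, type = "response") > 0.5
  mean(predicted == (mtcars$am == "1"))
}

c(full = accuracy(full), without_hp = accuracy(ablated))
```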

Simplification models try to recreate a simpler version of the original model (Yang et al., 2018).


Lastly, saliency designs increase and decrease parts of the input to see when the model reaches certain tipping points (Sundararajan, 2017; Guidotti et al., 2016). A popular method is the Integrated Gradients method, which scales a pixel from 0% strength to 100% and accumulates the model's gradient along this path to determine how much that pixel contributes to the prediction. By highlighting only the pixels that are necessary for the model to recognize the picture, it becomes clearer to a human why the model classifies the picture as it does. An example is shown in figures 1 and 2 below.

Figure 1 and 2: a picture of a turtle analyzed with Integrated Gradients (Sundararajan, 2017).
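For reference, the attribution that Integrated Gradients assigns to an input feature i can be written as follows, where x is the input, x' a baseline input (for images typically an all-black picture) and F the model (Sundararajan, 2017):

$$\mathrm{IG}_i(x) = (x_i - x'_i)\int_{0}^{1}\frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\,d\alpha$$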

A second popular saliency method, which we will use in our research, is the Local Interpretable Model-agnostic Explanations (LIME) approach (Guidotti et al., 2016). The LIME-technique changes the input of the model marginally and observes how the output of the model differs, thus deducing the influence of the input variables. An advantage of the LIME-technique is that it can be used with multiple machine learning methods; the method is so-called model-agnostic and is therefore flexible in use.
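As an illustration, a minimal sketch of how such a LIME explanation could be produced in R with the lime package; the objects titanic_train (the training data frame), nn_model (the fitted black box model) and new_passenger (one passenger to explain) are placeholders, not the exact objects used in this thesis.

```r
library(lime)

# Build an explainer around the training data and the fitted black box model.
explainer <- lime(titanic_train, nn_model)

# Explain one prediction: LIME perturbs the passenger's features, queries the
# model, and fits a simple local model that weighs each input variable.
explanation <- explain(new_passenger, explainer,
                       n_labels = 1,    # explain the predicted class
                       n_features = 4)  # report all four input variables

plot_features(explanation)  # bar chart of features supporting or contradicting the prediction
```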

Saliency methods do not explain the inner workings of the machine learning model (Guidotti et al., 2016); they merely explain the relationship between the input of the model and the output (Ribeiro, Singh & Guestrin, 2016). This means that some aspects remain unobservable, and thus their problems remain.


Providing an explanation thus allows the human to understand why the decision was made, which increases interpretability. It should be noted, however, that a 2019 study found that giving a good explanation does indeed increase trust and understanding, but that giving a bad explanation decreases trust (Papenmeier, Englebienne & Seifert, 2019). The sample size of this study (N = 327) was large enough to be reliable; had the sample been small, the contradictory finding could have been a false negative. It should also be noted that this finding has, to the best of our knowledge, not yet been replicated by the authors themselves or by third parties, so a false negative remains possible. If the study is replicable, then a difference in study setup or an underlying nuance is the likely cause.

Why is an interpretable model more trustworthy than an uninterpretable one? A very basic explanation can be found in evolutionary psychology: the unknown carries risk, and we are risk-averse (Zhang, Brennan & Lo, 2014). If we delve deeper, we find additional theories: providing an explanation gives the participant the option to assess the fitness of the machine for that particular case. In a 2016 study, researchers found that participants rate models as more trustworthy, even when they make mistakes, if the human has the possibility to change the model (Ribeiro, Singh & Guestrin, 2016). It should be noted that the sample size for that particular study was rather small (N = 100). Other studies do find that explanations increase trustworthiness if the human has the possibility to deviate from the final recommendation (Kim et al., 2016; Gkatzia et al., 2016; Biran and McKeown, 2017). A second basic explanation is that an explanation provides the human actor with the possibility to examine the internal logic. It is well established that computer programs do not follow 'common sense' and make 'silly mistakes' (Ribeiro, Singh & Guestrin, 2016).

Providing an explanation thus seems to increase interpretability and trustworthiness, even though there are some contrary findings. When an explanation is given, it becomes clearer why the model gave a certain prediction, although the internal workings of the machine and the interactions between its parts are still not explained. Hence we draw the following hypotheses:

H1: When a machine learning model contains an explanatory mechanism, the human agent increases in accuracy.

H2: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more trustworthy


2.4: Appliance of interpretable black box modelling in decision support

Next we will discuss the potential of interpretable black box modelling. We will describe the expected relation between interpretable models and human accuracy.

As we have seen before, machine learning can often predict better than humans. Judges often have to decide whether a suspected criminal can await trial at home (often after posting bail) or must stay in jail because he or she is a flight risk or expected to commit crimes while at home. A 2017 study developed an algorithm that could decide better, and could thus reduce jail rates by 42% without increasing crime; human judges were too overwhelmed by 'noise', unrelated information (Kleinberg et al., 2017). The difference opened up the question: if a machine learning model judges criminals better than human judges, shouldn't we leave the decision up to the bots? Opponents stipulate that this would be unwise and even unethical (Bostrom, Eliezer, 2011): algorithms do not understand values like fairness and liberty, and we should not subject ourselves to computers we do not understand. Even the researchers themselves were hesitant to replace all human judges with computers, since their study had not been replicated and errors could remain (Kleinberg et al., 2017). However, they were optimistic. Kleinberg et al. do not specify what their recommendation is for scenarios where a model (a) does not have sufficient data to work with, (b) does not understand important variables (like certain values), or (c) where humans are better at part of the job. Hence there are limits to when machine learning is better than a human. In situations where neither the human nor the machine learning model is superior in all the features used for prediction, there is an opportunity for synergy. A third strand of the argument proposes a middle way: do not replace humans with algorithms but augment them (Rubin, 2019). This way the human and the machine both work at what they are best at. In 2017 Doshi-Velez and Kim proposed that we should judge decision-aiding models on whether they help the decision-maker or not, since this is close to the goal of the model. Using this middle way would allow the human agent to control for variables the model does not understand or does not work well with. It would also simplify who is responsible if the decision does not pan out. These benefits support our first hypothesis that interpretable machine learning increases accuracy in the human agent.


While the claim has been made that age has a negative effect on the ability to deal with new technologies, like machine learning, a study on older decision-makers found no empirical evidence to back this claim up (Taylor, 1975). Other studies found elderly participants to be more risk-averse than younger participants (Cauffman et al., 2010); this effect was consistent both in cases where it was advisable to be risk-averse and in cases where it was not in their interest (Cauffman et al., 2010). Older people, however, have been found to be more rigid in their thinking due to a decline in fluid cognitive ability, i.e. the ability to reason and solve problems without relying on previous experience (de Bruin et al., 2010). Based on these findings we can conclude that if one classifies machine learning as a new technology, it is likely that older participants find it harder to work with. This conclusion has yet to be tested.

Hence we draw the following hypothesis: H4: the age of the human agent influences the causal link between machine learning support and human agent accuracy negatively.

The influence of gender on decision-making, especially in managerial positions, has been a controversial topic for centuries. While women are, on average, underrepresented in politics ("Women in Parliaments", 2020) and in business executive positions ("Female Business Leaders", 2020), there does not seem to be a convincing biological factor affecting decision making. Women do seem to be more risk-averse than men, regardless of the level of ambiguity (Powell & Ansic, 1997). Second, women score higher on the personality trait of agreeableness (Chapman et al., 2007). This may mean that women are more likely to follow the advice of the machine learning model. Hence we draw the following hypothesis: H5: the gender of the human agent influences the causal link between machine learning support and human agent accuracy.

Age and experience are somewhat correlated, but not the same thing. A sixty-year-old manager using a machine learning model for the first time is old, but not experienced with machine learning models. While it seems common knowledge that extended experience leads to superior performance, this may not be the case. In a study among physicians, researchers found no correlation between age and performance (Ericsson, 2006). The same researchers found cases where experience leads to worse performance due to overconfidence (Ericsson, 2004). Other studies have also found no link between experience and performance (Ericsson & Lehmann, 1996). Thus we expect to find no link between experience and decision accuracy.

2.5: Conclusion


The interpretability of the model influences accuracy. For the human agent to work well with the machine learning model, he or she must trust the model to be accurate and to capture the whole question. The level of trust positively influences the accuracy of the human agent. Adding an explanatory mechanism influences two relationships in the conceptual framework. First of all, it softens the impact of the complexity of the model on the interpretability of the model: it makes the relationship less negative. An interpretable model causes the human agent to be more accurate because he or she can fill in the gaps the algorithm does not understand; adding an explanatory mechanism moderates this effect. Second, the explanatory mechanism increases trust: by understanding the model better, the human agent can judge the model more accurately, which increases trust. For an overview, see figure 3.

Figure 3: conceptual framework

The conceptual framework crystallizes five hypotheses:

H1: When a machine learning model contains an explanatory mechanism, the human agent increases in accuracy.

H2: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more trustworthy


H3: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more interpretable.

H4: the age of the human agent influences the causal link between machine learning support and human agent accuracy negatively.

H5: the gender of the human agent influences the causal link between machine learning support and human agent accuracy.


Chapter 3: Research design & Methodology

To research whether interpretability influences accuracy we have set up an experiment since it has (to the best of our knowledge) not been researched before. The experiment will take place in 3 steps:

1. The development of a neural network that can predict the survival of the passengers
2. A survey in which the availability of an explanatory mechanism is manipulated
3. Analysis of the data

Before elaborating on the design of the experiment, we give a brief summary of the nature of a neural network, the type of machine learning algorithm chosen for the experiment. This is not explained in the literature overview since it does not take part in the conceptual framework, but a basic understanding helps in understanding the research design. Readers already familiar with neural networks can skip this part.

3.1: Introduction to neural networks

A neural network is a type of machine learning model. The task of a neural network is to translate the input/independent variables (IVs) into an output/dependent variable (DV) (Goodfellow et al., 2017). The neural network does this by sending the input to a hidden layer (see figure 4) that increases or decreases the input values by certain amounts. The new values are then forwarded to another hidden layer, or to the output layer. In the beginning the model does not know exactly by how much to increase or decrease the values; this is learned during the training stage of the model. The designers of the model give the neural network a long list of IVs together with the DVs. The model adjusts itself a little each time it gets the question wrong, which is called back-propagation (Goodfellow et al., 2017). This is repeated until the model cannot improve any further. When the model is trained adequately, it is tested using a new list of questions for which the answers are not provided.

3.2: Model development & dataset description


The neural network is trained on a Kaggle dataset. After development we made the model interpretable by adding a LIME-explanator.

In order to develop a good model we need good data. The dataset used is provided by Kaggle, an online learning community for data science and machine learning (TechCrunch, 2020). The Titanic dataset comes from historical records, has been used by scholars, for example by Chatterjee (2018), and is verified by a non-profit Titanic remembrance group (Titanic Survivors, 2020). The training dataset contains 891 observations of 12 variables.

Due to the completeness of the dataset no observations were removed. In order to predict survival, a few variables were removed: the ticket code, name, cabin name, age, fare and embarking location showed little correspondence with survival or were incomplete. Only the independent variables class, sex (gender), the presence of siblings or a spouse, and the presence of a partner or child were retained, together with the dependent variable, whether the individual survived. The presence of siblings/spouses and of partners/children was transformed into binary variables (present or not present) in order to keep the model simple for the participants as well. See table 1 for the data types used after the transformation of the data.

Variable | Explanation | Data type
Class | The class of the cabin in which the passenger stayed during his/her voyage on the Titanic | Factor with 3 levels: First class, Second class, Third class
Gender | The gender of the passenger | Factor with 2 levels: Male, Female
SibSp | The presence of a sibling or spouse on the ship at the moment of sinking | Factor with 2 levels: present or not present
ParCh | The presence of a partner or a child on the ship | Factor with 2 levels: present or not present
Survived | Whether the passenger survived; this is the dependent variable of the machine learning model | Factor with 2 levels: did survive or did not survive

Table 1: variables of the neural network and their data types
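A sketch of the transformation described above, assuming the Kaggle training file train.csv has been downloaded and uses its standard column names (Pclass, Sex, SibSp, Parch, Survived); the exact code used is in appendix two.

```r
titanic <- read.csv("train.csv", stringsAsFactors = FALSE)

# Keep only the variables from table 1 and recode them as factors.
model_data <- data.frame(
  Class    = factor(titanic$Pclass, levels = c(1, 2, 3),
                    labels = c("First class", "Second class", "Third class")),
  Gender   = factor(titanic$Sex, levels = c("male", "female"),
                    labels = c("Male", "Female")),
  SibSp    = factor(ifelse(titanic$SibSp > 0, "present", "not present")),
  ParCh    = factor(ifelse(titanic$Parch > 0, "present", "not present")),
  Survived = factor(titanic$Survived, levels = c(0, 1),
                    labels = c("did not survive", "did survive"))
)
```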


The trained model showed an in-sample training accuracy of 82.24 percent and an out-of-sample accuracy of 81.25 percent. This was rounded down to 80% accuracy when the model was presented in the survey. The input layer contained four nodes, one for each input variable. These values were forwarded to the three hidden layers.

The hidden layers used 300 nodes, together with a 'uniform' weight initialization. The weight initialization helps the model converge faster and therefore reach optimal accuracy sooner. The 'uniform' initialization draws all initial weights from the same uniform distribution (Thimm & Fiesler, 1997) and makes no strong assumptions about the data, so it is hard to get a badly initialized model with it. Seeing that the accuracy of the neural network was fairly high, it was decided that the initializer worked satisfactorily. As activation function for the hidden layers the 'relu' function was chosen: if a node receives a value of less than zero, it passes forward a value of zero, otherwise it passes the value on unchanged. The 'relu' activation is known to be simple and not too taxing for the computer system running it (Alrefaei & Andradóttir, 2005), and it is widely used in other scientific papers (Jiang et al., 2018).

Lastly, the output layer transformed the values of the hidden layers into a single value: the probability of that passenger surviving. The output layer again used a 'uniform' initialization and a 'sigmoid' activation, which translates the value into a probability that can be used by the human agent. The model was compiled using the Adam optimizer (Kingma & Ba, 2014), and the loss function was binary cross-entropy.


3.3: Research method

To test the hypotheses we conducted a survey. First of all, participants received a general introduction together with some base statistics about the survival rate on the Titanic. The general accuracy of the machine learning model was also presented, so that participants could make a fully informed decision.

Second, the participants were randomly divided between two groups: the control group simply received a machine learning recommendation, and the treatment group received a machine learning recommendation with a LIME-explanatory mechanism. The treatment group received an additional paragraph explaining the LIME-mechanism and how to interpret it. Each group got to see eight historical passengers together with their features and a prediction of the neural network on whether they had survived the Titanic disaster or not. The participants then had to decide whether they thought these passengers had survived. These questions could not be skipped and there was no 'don't know' option, in order to make sure we also collected data on fringe cases where there is no clear answer. For an example of a question for the control group, see figure 5.

Figure 5: example question control group

For an example of a question of the treatment group, see figure 6.

Figure 6: example question treatment group


The eight passenger profiles were constructed in the manner of a conjoint analysis: we balanced the attributes of the passengers, like class and children onboard, to make sure no correlations could be found between them. The second group received, along with the neural network prediction, an interpretation of the model from the LIME-explanatory mechanism. After each prediction, participants of both groups were asked to judge the interpretability of the model.

The survey closed with some general demographic questions that participants were allowed to skip if they preferred. These general questions contained an attention check, which tests whether participants are actually reading the questions or mindlessly filling in the blanks. Those failing the attention check were excluded from the results. No reward was given (or offered) to those who completed the survey. For the full survey see appendix three.

3.4: Data collection

During the survey information was gathered on five variables which correspond with the conceptual model. In table two these five are described together with how they were asked and how they were measured.

Variable | Survey question | Measurement
Human accuracy | "Has the passenger survived the Titanic?" | Percentage correct
Interpretability | "Do you understand why the neural network has given this prediction?" | Likert scale
Trust | "How much would you trust this model to make decisions in real life?" | Likert scale
Explanatory mechanism | Whether there is an explanation (1) or not (0) | NA
Expertise | "Have you read or watched any non-fiction books/movies about the Titanic except for the movie Titanic made in 1997, starring Leonardo DiCaprio?" | Yes/no

Table 2: variables of the conceptual framework and their measurement method


The target was 150 respondents; we managed to collect 122 usable responses, which limited our analysis somewhat. Second, the sample needs to be representative, with people coming from all layers of the population. Age, gender and education were recorded in order to monitor this. Some overrepresentation of particular groups is expected, since even randomized data shows patterns.

Participants were recruited from the personal network of the author, as well as his acquaintances. A secondary source of participants was Facebook and Reddit groups for survey collection. There was no monetary incentive for participating in the research.

3.5: Plan of analysis

After executing the experiment we will analyze the data. For the first three hypotheses we compare the treatment sample to the control sample using a multiple regression; for the influence of consumer characteristics we also use a multiple regression. In order to calculate the accuracy of the participants we first calculate a test score by comparing the given answers to the correct answers. For an overview see table 3.

Hypothesis | Statistical method
H1: When a machine learning model contains an explanatory mechanism the human agent increases in accuracy. | Multiple regression
H2: When a machine learning model contains an explanatory mechanism the human agent sees the machine learning model as more trustworthy. | Multiple regression
H3: When a machine learning model contains an explanatory mechanism the human agent sees the machine learning model as more interpretable. | Multiple regression
H4: The age of the human agent influences the causal link between machine learning support and human agent accuracy negatively. | Multiple regression
H5: The gender of the human agent influences the causal link between machine learning support and human agent accuracy. | Multiple regression

Table 3: hypotheses and statistical methods
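A sketch of the regressions listed above, assuming a data frame survey with one row per participant and hypothetical column names; the tables in chapter 4 are summaries of models of this form.

```r
# H1: accuracy regressed on the treatment dummy plus consumer characteristics.
fit_accuracy <- lm(accuracy ~ lime + age + gender + titanic_knowledge + ml_experience,
                   data = survey)
summary(fit_accuracy)

# H2 and H3 use the same specification with trust or interpretability as the outcome.
fit_trust <- lm(trust ~ lime + age + gender + titanic_knowledge + ml_experience,
                data = survey)
summary(fit_trust)
```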

3.6: Conclusion


Chapter 4: Data analysis

In this chapter we will analyze the data collected from the experiment described in the previous chapter to confirm or reject the hypotheses stated before. For this we will use various statistical methods. We will explain how the analysis was executed for maximum transparency, and we will end with an interpretation of the results. For the full code used for the analysis, please refer to appendix two.

4.1: Sample analysis.

Data collection started on the 5th of May and ended on the 15th of May, collecting 145 responses. The data was exported to a comma-separated file (CSV). After rejecting participants who failed the attention check (including those who quit halfway), 122 respondents remained.

To calculate accuracy, each estimation was compared with the correct answer and deemed TRUE (correct) or FALSE (incorrect). The mean of these answers, where TRUE = 1 and FALSE = 0, gave us a score. The highest score was 87.5%, the lowest 12.5%. Of the participants, 55 (45%) were female and 67 (55%) were male. The average age was 29 years. Most (107 participants) are Dutch, with 7 Germans, 3 Belgians and 5 participants of other nationalities. 59% (73 participants) stated that they had never read or watched anything about the Titanic, excluding the popular 1997 movie; 41% (49 participants) had.
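As a small sketch of this scoring step, assuming a matrix answers of participant estimates (one row per participant, one column per passenger) and a vector truth with the correct outcomes; both names are hypothetical.

```r
# Per-participant accuracy: the share of the eight estimates that match the truth.
score <- apply(answers, 1, function(a) mean(a == truth))
summary(score)  # in the collected data the highest score was 87.5%, the lowest 12.5%
```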

There were some technical problems. Due to an editing error, a question for the control group (which did not receive a LIME-explanator) referred to the LIME model, which was not there. One participant complained about this. A second participant found the question vague; she asked: "Do you mean whether I understand the model or whether I agree with it?" No other participants complained about this, so the confusion was probably limited. Lastly, we collected fewer responses than expected.

4.2: Reliability, validity, representativity

For a good analysis the sample of the population who did the experiment needs to be representative, valid and reliable (Fischer & Julsing, 2019). As we saw before, the gender division in the sample is roughly equal; there are a few more men than women, but not enough to significantly skew the sample.


The age distribution is skewed towards younger participants (see figure 7). Since it is not unexpected that this group will use machine learning models for decision support, the skew does not badly distort the sample, but it remains undesirable. When we look at ethnicity we see another distortion: the vast majority of participants is Dutch, which means not all ethnicities are well represented. This is a limitation of the study.

Figure 7: age distribution of the sample

Because the real answers were not shown to the participants, no maturation or learning effect could take place, ensuring internal validity (Nair, 2009). The chance that the findings are an outlier, and that more testing would lead to less extreme scores (also called regression to the mean), is a possibility, though an unlikely one given the high significance described in the next paragraph. Participants were divided at random, ensuring no systematic group bias occurred. Lastly, the drop-out rate of the participants is worrying. It cannot be established whether the twenty-three participants who dropped out did so because of a systematic flaw or at random. This may prove to be a threat to the validity of the research and needs to be studied in any replication study.

4.3: Hypotheses and statistical tests

The first hypothesis states that participants with a LIME-explanator are more accurate (when a machine learning model contains an explanatory mechanism the human agent increases in accuracy). We performed a multiple regression, controlling for consumer characteristics, and found that the LIME group was significantly more accurate (B = 0.1037, p = 0.0003); see table 4.


Variable | Estimate | Std. Error | t value | Pr(>|t|)
(Intercept) | 6.215e-01 | 4.109e-02 | 15.125 | < 2e-16 ***
LIME | 1.037e-01 | 2.793e-02 | 3.712 | 0.0003 ***
Age | -7.766e-05 | 1.104e-03 | -0.070 | 0.9440
Gender | -6.903e-02 | 2.832e-02 | -2.438 | 0.0167 *
Knowledge on the Titanic | -7.027e-02 | 3.505e-02 | -2.005 | 0.0473 *
Experience with machine learning | -2.572e-03 | 3.940e-02 | -0.065 | 0.9481

Table 4: the influence of the LIME-explanatory mechanism on accuracy when controlling for consumer characteristics

The second hypothesis states that participants with a LIME-explanator trust the model more (when a machine learning model contains an explanatory mechanism the human agent sees the machine learning model as more trustworthy). Again we performed a multiple regression to control for consumer characteristics. We found that the treatment group (M = 3.81, SD = 0.661) sees the machine learning model as significantly more trustworthy (B = 0.8955, p = 8.8e-08) than the control group (M = 2.78, SD = 0.917). Hence we can conclude that having a LIME-explanatory mechanism increases trust significantly. There is no relation between the age of the participant and the trust they place in the machine learning model; we concluded this after performing a Pearson correlation (r(120) = -0.93, p = 0.352).

Variable | Estimate | Standard Error | T-value | P-value
Intercept | 2.5885 | 0.2307 | 11.222 | < 2e-16
LIME | 0.8955 | 0.1568 | 5.711 | 8.8e-08
Age | 0.0021 | 0.0062 | 0.344 | 0.7317
Gender | 0.2834 | 0.1589 | 1.783 | 0.0772
Knowledge on the Titanic | -0.2883 | 0.1967 | -1.466 | 0.1455
Experience with machine learning | 0.4098 | 0.2212 | 1.853 | 0.0664

Table 5: the influence of the LIME-explanatory mechanism on trust when controlling for consumer characteristics


The third hypothesis states that participants with a LIME-explanator see the model as more interpretable (when a machine learning model contains an explanatory mechanism the human agent sees the machine learning model as more interpretable). After performing a multiple regression we find that the treatment group (M = 3.63, SD = 0.484) sees the machine learning model as significantly more interpretable (B = 0.3966, p = 0.0003) than the control group (M = 3.25, SD = 0.568). Thus we can conclude that having a LIME-explanatory mechanism increases the interpretability of the machine learning model significantly.

Variable | Estimate | Standard Error | T-value | P-value
Intercept | 3.1764 | 0.1563 | 20.305 | < 2e-16
LIME | 0.3966 | 0.1062 | 3.733 | 0.0003
Age | 0.0033 | 0.0042 | 0.785 | 0.4343
Gender | -0.0014 | 0.1077 | -0.013 | 0.9895
Knowledge on the Titanic | -0.1866 | 0.1333 | -1.400 | 0.1642
Experience with machine learning | 0.1250 | 0.1499 | 0.834 | 0.4059

Table 6: the influence of LIME-explanatory mechanism on interpretability when controlling for consumer characteristics

Lastly we looked at the influence of the participants' characteristics (the age of the human agent influences the causal link between machine learning support and human agent accuracy negatively, and the gender of the human agent influences the causal link between machine learning support and human agent accuracy). We used a multiple linear regression to predict the accuracy of the participant from their age, gender, whether they had any expertise on the Titanic and whether they had ever worked with machine learning methods before. The overall model was significant, F(4, 117) = 2.882, p = 0.028, with an R2 of 0.09; after controlling for the number of variables in the model we found an adjusted R2 of 0.0568. We find that men are slightly worse at predicting (β = -0.06, p = 0.033) than women. Having read or watched non-fiction about the Titanic decreased accuracy (β = -0.08, p = 0.042). Neither age (β = -0.001, p = 0.475) nor experience with machine learning (β = 0.04, p = 0.345) had a significant effect on accuracy.


Variable | Estimate | Standard Error | T-value | P-value
Intercept | 0.6768 | 0.0403 | 16.781 | < 2e-16
Age | -0.0008 | 0.0011 | -0.717 | 0.4748
Gender | -0.0643 | 0.0298 | -2.159 | 0.0329
Knowledge on the Titanic | -0.0759 | 0.0369 | -2.059 | 0.0417
Experience with machine learning | 0.0378 | 0.0399 | 0.949 | 0.3448

Table 7: Estimation of participant characteristics on the level of accuracy

With an ANOVA test we examined whether participants rate passengers of their own gender as more likely to survive than passengers of the other gender. We found that while female participants rate it more likely (F(1, 120) = 5.882, p = 0.017) that female passengers survived, male participants did not rate it more likely (F(1, 120) = 1.019, p = 0.315) that male passengers survived. Reasons for this will be explained in the discussion.
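A sketch of such a one-way ANOVA, assuming a data frame ratings with one row per participant, the participant's gender, and the share of female passengers that participant predicted to survive; the column names are hypothetical.

```r
# Do female and male participants differ in how likely they rate female passengers to survive?
fit_gender <- aov(predicted_female_survival ~ participant_gender, data = ratings)
summary(fit_gender)  # reported in the text as F(1, 120) = 5.882, p = 0.017
```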

4.4: Interpretation

After having done the statistical tests, and having accepted or rejected the hypotheses formulated earlier, we can draw some conclusions about the influence of the LIME-explanatory mechanism on the decision-making process.

First of all we can conclude that having a LIME-explanator has a significant positive effect on the interpretability of the neural network. Participants who got a LIME-explanation rated the machine learning model as significantly more interpretable than participants who did not.

Second, with all other variables held constant, we can also conclude that a LIME-explanator has a positive effect on trust in the model. In the literature overview (chapter two, paragraph three) we already gathered some explanations of why this would happen, and after collecting the data and applying statistical methods we can conclude that providing a LIME-explanation mechanism has a significant and positive effect on the trust one places in the machine learning model.


Third, we can conclude with a fairly high degree of confidence that having a LIME-explanator does make participants more accurate than participants without a LIME-explanator. It should be noted that the control group, even though it did not receive a LIME-explanator, did receive the same machine learning model with the same level of accuracy.

Lastly, we concluded that women are slightly more accurate than men. We also found that women are more optimistic than men in predicting survival for their own gender. In the next chapter we will describe why we do not think this is a conclusion we can generalize to other cases, and we will therefore refrain from interpreting these results further.

4.5: Conclusion

After having gathered and analyzed the data we have concluded that providing a LIME-explanator positively influences the trust a participant places in the machine learning model. We also concluded that the LIME-explanator increases interpretability significantly and that it makes the participants more accurate.


Chapter 5: Discussion, limitations and recommendations

In the previous chapters hypotheses were formulated and tested. We have accepted some of the hypotheses and rejected others. In this chapter we will discuss the findings, the limitations of the research performed and conclusions that we can draw from this research, for managers as well as academics.

5.1: Reflective discussion on the results

During this research we have investigated the influence of an explanation on trust as well as accuracy. As we saw in the previous chapter we can conclude that the amount of trust an average participant has in the machine learning model to predict accurately increases significantly if an explanation is provided. Participants also become more accurate in predicting the survival rates of passengers when they receive an explanation. The exact cause of the increase in accuracy remains uncertain. The increase in accuracy can be explained in two different ways, which both circle back to the literature overview of chapter two.

1) The Human Replacement-explanation: a possible explanation is that an increase in trust causes participants to rely more on the machine learning model than on their own intuition. Since algorithms are in general more accurate than human judgement (Dawes, 1979), this could cause the increase in accuracy.

2) The different-viewpoint-explanation: participants that received the LIME-explanatory mechanism viewed their model as significantly more interpretable. They understood the reason why the model predicted a certain outcome and due to this understanding had an opportunity to overrule the model. If the neural network predicted that Paul Chevre would survive because he traveled first-class, the participant may have an extra piece of information. Perhaps he or she has visited the grave of Paul Chevre, or has seen a documentary where he offered up his seat to a child. Without the LIME-explanation it would be unclear what information has been taken into account whilst making the prediction.

In the literature overview we condensed a few reasons why it might be preferable to have a human make the end-decision: he or she can be held accountable, cannot 'cheat', and understands the difference between a proxy-goal and the end-goal (Doshi-Velez, Kim, 2017).


If the different-viewpoint-explanation is true, companies should instead let their most experienced employees work with the machine learning models, since they are the reason for the superior performance. The clash between these two ideas shows the importance of further research in the field of interpretable machine learning. From the gathered data one might get the impression that knowledge about the Titanic has a negative influence on accuracy, since the coefficient was negative (-0.076) and significant. It should be kept in mind, however, that having read a book or seen a documentary about the Titanic hardly provides the detailed knowledge needed for this survey. On the other hand, the research showed that human agents do become more accurate when using an interpretable machine learning model. This is confirmatory evidence for the previously mentioned strand of literature stating that humans can benefit from working together with machine learning models (Zuo, 2019; Doshi-Velez, Kim, 2017). More research is needed to provide a definitive answer on where this increased accuracy comes from.

A second finding to discuss is that female participants rate female passengers on the Titanic as more likely to survive than male participants do, while male participants do not rate male passengers as more likely to survive than female participants do. There are two psychological causes that can explain this phenomenon: the availability bias and anchoring.

The availability bias is the mental tendency to equate the availability of a memory with the probability of the event happening (Kidd et al., 1983). A popular example is that we estimate the chance of death by a terrorist attack to be higher than death by a car accident, even though the latter is a thousand times more likely ("What do Americans fear?", 2020); a terrorist attack is very memorable and will therefore be very available. The availability bias can explain why women see other women as more likely to survive. In the popular 1997 movie about the Titanic by James Cameron, the male protagonist dies and the female counterpart survives. Since the movie is more popular among women, it is likely that they are more affected by this availability bias than men (Todd, 2014).


Anchoring, the second cause, may have played a role if the first passenger shown set an anchor that carried over in the decision process. If so, the effect will disappear if a replication study uses a more neutral first question.

5.2: Limitations

During the research process certain decisions were made which limit the findings and the generalizability of the conclusions drawn in the previous chapter. For clarity we have grouped these limitations into two categories: limitations regarding research design and other limitations.

First of all, participants were asked a closed question with only two options: yes or no. These types of decisions happen regularly in real life (Fischhoff, 1996; Nutt, 1993), though more open decisions also occur and are not represented in the survey. The generalization of the influence of explanatory mechanisms to open-ended decisions is uncertain at best.

Second, the research had a high drop-out rate: 23 of the 145 participants dropped out, almost 16%. It is likely that this drop-out is random, since the control group and the treatment group are of the same size, but we cannot know for sure and the drop-out rate should be kept in mind if the experiment is replicated. The high drop-out rate also caused a low response rate of only 122 respondents, well below the set target of 150 respondents. While the hypotheses could still be tested with a high level of significance, a replication with a larger sample size is preferable.

While the size of the sample is one concern, its lack of diversity is another. As stated in chapter four, the sample is mainly Dutch and young and does not represent the population as a whole very well. Differences in age might therefore not be adequately captured.

Third is the generalizability of the conclusions to other types of machine learning models or explanation mechanisms. During the research a neural network was chosen because it is opaque and complicated; had a decision tree, another form of machine learning, been used as the model, the results could have differed. The neural network used was also simple in setup. As more complicated neural networks capture deeper patterns, it remains an open question whether LIME-explanators can reflect this increased complexity in their explanations. It is hard to make a prediction about this, hence more research is needed.
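To make the procedure under discussion concrete, the sketch below shows how a LIME explanation can be generated for a single passenger using the Python `lime` package and a simple scikit-learn neural network. It is an illustrative approximation only, not the code used in this thesis; the file name, feature columns, and network size are assumptions.

```python
# Minimal sketch (not the thesis's actual code): explaining one Titanic
# prediction of a simple feed-forward network with the Python "lime" package.
# The file name and column names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from lime.lime_tabular import LimeTabularExplainer

# Hypothetical pre-processed Titanic data: numeric features only.
df = pd.read_csv("titanic_clean.csv")            # assumed file name
feature_names = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
X = df[feature_names].to_numpy()
y = df["Survived"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deliberately simple network, mirroring the simple setup described above.
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)

# LIME fits a local, interpretable surrogate around one prediction.
explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["Did not survive", "Survived"],
    mode="classification",
)
explanation = explainer.explain_instance(
    X_test[0], model.predict_proba, num_features=6
)
print(explanation.as_list())   # feature contributions for this passenger
```

The printed list of feature contributions is the kind of per-passenger explanation participants in the treatment group saw alongside the model's prediction.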


5.3: Academic and managerial conclusions

Interpretable machine learning offers opportunities for companies in ways we have yet to fully understand. Based on this research we can conclude that it is advisable for companies that already use machine learning to make their algorithms interpretable for their employees and customers. Interpretable machine learning models make employees more capable decision-makers and help prevent mistakes that the models themselves simply miss. Second, making machine learning models more interpretable creates more trust between the customer and the company than an uninterpretable model, which can evoke negative feelings. Lastly, new EU regulation requires companies to provide an explanation to their customers upon request anyway; complying with this legislation is without doubt a high priority for most firms.

For academics, more research is needed on the influence of different machine learning methods and explanatory mechanisms on human accuracy. This research used one machine learning method (a neural network) and only one explanatory mechanism; other combinations of methods and explanatory mechanisms might well deviate from the findings reported here, so the conclusions cannot simply be generalized. Second, the effectiveness of decision support by interpretable machine learning models is relatively untested in more unpredictable environments where not all the facts, nor all the important variables, are known. During the experiment the number of outcomes was restricted (survived/did not survive) and the relevant variables on which to base a decision were known. Many real-world decisions are far less structured yet carry great importance for everyday life. While their importance is great, little research has been done on them, which provides a rare opportunity.
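As an illustration of how part of such a follow-up comparison could be set up, the hedged sketch below exploits the model-agnostic nature of LIME: because the explainer only needs a probability function, the same explanation pipeline can wrap both a neural network and a decision tree. It only varies the underlying model, not the explanatory mechanism, and all names and the data file are again illustrative assumptions rather than the thesis's actual materials.

```python
# Minimal sketch: reusing one LIME explainer across two model families to
# compare explanations. The data file and columns are assumed, as before.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from lime.lime_tabular import LimeTabularExplainer

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
df = pd.read_csv("titanic_clean.csv")            # assumed file name
X, y = df[features].to_numpy(), df["Survived"].to_numpy()

explainer = LimeTabularExplainer(X, feature_names=features,
                                 class_names=["Did not survive", "Survived"],
                                 mode="classification")

# Same explanation pipeline, two different underlying models.
for model in (MLPClassifier(max_iter=1000), DecisionTreeClassifier(max_depth=4)):
    model.fit(X, y)
    exp = explainer.explain_instance(X[0], model.predict_proba, num_features=6)
    print(type(model).__name__, exp.as_list())
```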

5.4: Conclusion


References

Al-Qaheri, H., & Hasan, M. (2010). An end-user decision support system for portfolio selection: A goal programming approach with an application to Kuwait Stock Exchange (KSE). International Journal of Computer Information Systems and Industrial Management Applications.

Alrefaei, M., & Andradóttir, S. (2005). Discrete stochastic optimization using variants of the stochastic ruler method. Naval Research Logistics (NRL), 52(4), 344-360. https://doi.org/10.1002/nav.20080

Aronson, E., Wilson, T., Fehr, B., & Sommers, S. (2017). Social psychology (9th ed.). Pearson.

Kim, B., Shah, J., & Doshi-Velez, F. (2015). Mind the gap: A generative approach to interpretable feature selection and extraction. NIPS.

Bank of England. (2019). Machine learning in UK financial services. London: Bank of England. Retrieved from https://www.bankofengland.co.uk/-/media/boe/files/report/2019/machine-learning-in-uk-financial-services.pdf

Biran, O., & Cotton, C. (2017). Explanation and justification in machine learning: A survey. Workshop on Explainable Artificial Intelligence (XAI), pp. 8-13.

Bode, J. (1998). Decision support with neural networks in the management of research and development: Concepts and application to cost estimation. Information & Management, 34(1), 33-40. doi: 10.1016/s0378-7206(98)00043-3

Bostrom, N., & Yudkowsky, E. (2011). The ethics of artificial intelligence. In Cambridge Handbook of Artificial Intelligence. Cambridge University Press.

Yang, C., Rangarajan, A., & Ranka, S. (2018, June). Global model interpretation via recursive partitioning. In IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 1563-1570.

Cauffman, E., Shulman, E., Steinberg, L., Claus, E., Banich, M., Graham, S., & Woolard, J. (2010). Age differences in affective decision making as indexed by performance on the Iowa Gambling Task. Developmental Psychology, 46(1), 193-207. https://doi.org/10.1037/a0016128

(37)

37

Chatterjee, T. (2018). Prediction of Survivors in Titanic Dataset: A Comparative Study using Machine Learning Algorithms. International Journal Of Emerging Research In Management And Technology, 6(6), 1. https://doi.org/10.23956/ijermt.v6i6.236

Cialdini, R. (2014). Influence: Science and practice (6th ed.). Harlow, Essex: Pearson.

Cio Summits. (2019). What consumers really think about AI (p. 2). Cio Summits.

COMPLEXITY | meaning in the Cambridge English Dictionary. (2020). Retrieved 19 February 2020, from https://dictionary.cambridge.org/dictionary/english/complexity

Computer Crushes the Competition on 'Jeopardy!'. (2020). Retrieved 26 February 2020, from https://www.cbsnews.com/news/computer-crushes-the-competition-on-jeopardy/

Gkatzia, D., Lemon, O., & Rieser, V. (2016). Natural language generation enhances human decision-making with uncertain information. In ACL.

DARPA Announces $2 Billion Campaign to Develop Next Wave of AI Technologies. (2020). Retrieved 20 February 2020, from https://www.darpa.mil/news-events/2018-09-07

Dataman, D. (2020). Explain Your Model with LIME. Medium. Retrieved 15 May 2020, from https://medium.com/analytics-vidhya/explain-your-model-with-lime-5a1a5867b423.

Dawes, R. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34(7), 571-582. doi: 10.1037/0003-066x.34.7.571

de Bruin, W., Parker, A., & Fischhoff, B. (2010). Explaining adult age differences in decision-making competence. Journal of Behavioral Decision Making, 25(4), 352-360. https://doi.org/10.1002/bdm.712

Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 1-13.

Encyclopedia Titanica. 2020. Titanic Survivors. [online] Available at: <https://www.encyclopedia-titanica.org/titanic-survivors/> [Accessed 5 May 2020].

Ericsson, A. (2006). The Cambridge handbook of expertise and expert performance (2nd ed.). Cambridge University Press.

Ericsson, K. (2004). Deliberate Practice and the Acquisition and Maintenance of Expert Performance in Medicine and Related Domains. Academic Medicine, 79(Supplement), S70-S81.

(38)

38

Ericsson, K., & Lehmann, A. (1996). Expert and exceptional performance: Evidence of maximal adaptation to task constraints. Annual Review of Psychology, 47(1), 273-305. https://doi.org/10.1146/annurev.psych.47.1.273

Female Business Leaders: Global Statistics. Catalyst. (2020). Retrieved 4 June 2020, from https://www.catalyst.org/research/women-in-management/.

Fischer, T., & Julsing, M. (2019). Onderzoek doen !. Groningen: Noordhoff Uitgevers.

Fischhoff, B. (1996). The Real World: What Good Is It?. Organizational Behavior And Human Decision Processes, 65(3), 232-248. https://doi.org/10.1006/obhd.1996.0024

Fong, R., & Vedaldi, A. (2017). Interpretable explanations of black boxes by meaningful perturbation. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV).

Goodfellow, I., Bengio, Y., & Courville, A. (2017). Deep learning. The MIT Press.

Greenemeier, L. (2020). 20 Years after Deep Blue: How AI Has Advanced Since Conquering Chess. [online] Scientific American. Available at: https://www.scientificamerican.com/article/20-years-after-deep-blue-how-ai-has-advanced-since-conquering-chess/ [Accessed 26 Feb. 2020].

Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi, D. (2018). A survey of methods for explaining black box models. ACM Computing Surveys, 51(5), 1-42. doi: 10.1145/3236009

Halford, G., Baker, R., McCredden, J., & Bain, J. (2005). How Many Variables Can Humans Process?. Psychological Science, 16(1), 70-76. doi: 10.1111/j.0956-7976.2005.00782.x

Hao, K. (2019). We analyzed 16,625 papers to figure out where AI is headed next. Retrieved 4 March 2020, from https://www.technologyreview.com/s/612768/we-analyzed-16625-papers-to-figure-out-where-ai-is-headed-next/

Herlocker, J. L., Konstan, J. A., & Riedl, J. (2000). Explaining collaborative filtering recommendations. Conference on Computer Supported Cooperative Work (CSCW).

Hinson, J., Jameson, T., & Whitney, P. (2003). Impulsive decision making and working memory. Journal Of Experimental Psychology: Learning, Memory, And Cognition, 29(2), 298-306. doi: 10.1037/0278-7393.29.2.298


Hu, X., Niu, P., Wang, J., & Zhang, X. (2019). A Dynamic Rectified Linear Activation Units. IEEE Access, 7, 180409-180416. https://doi.org/10.1109/access.2019.2959036


Jiang, X., Pang, Y., Li, X., Pan, J., & Xie, Y. (2018). Deep neural networks with Elastic Rectified Linear Units for object recognition. Neurocomputing, 275, 1132-1139. https://doi.org/10.1016/j.neucom.2017.09.056

Kidd, J., Kahneman, D., Slovic, P., & Tversky, A. (1983). Judgement under uncertainty: Heuristics and biases. The Journal of the Operational Research Society, 34(3), 254. https://doi.org/10.2307/2581328

Kingma, D.P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv. https://arxiv.org/abs/1412.6980

Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., & Mullainathan, S. (2017). Human Decisions and Machine Predictions*. The Quarterly Journal Of Economics. doi: 10.1093/qje/qjx032

Xu, L., Crammer, K., & Schuurmans, D. (2006). Robust support vector machine training via convex outlier ablation. AAAI.

Langley, P., & Simon, H. (1995). Applications of machine learning and rule induction. Communications Of The ACM, 38(11), 54-64. doi: 10.1145/219717.219768

Letzter, R. (2020). Amazon just showed us that 'unbiased' algorithms can be inadvertently racist. Retrieved 26 February 2020, from https://www.businessinsider.com/how-algorithms-can-be-racist-2016-4?international=true&r=US&IR=T

Ancona, M., Öztireli, C., & Gross, M. (2019). Explaining deep neural networks with a polynomial time algorithm for Shapley values approximation. In ICML.

Bilgic, M., & Mooney, R. J. (2005). Explaining recommendations: Satisfaction vs. promotion. Workshop on the Next Stage of Recommender Systems Research.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?" Explaining the predictions of any classifier. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.


Dzindolet, M. T., Peterson, S. A., Pomranky, R. A., Pierce, L. G., & Beck, H. P. (2003). The role of trust in automation reliance. International Journal of Human-Computer Studies.

Matthias, A. (2004). The responsibility gap: Ascribing responsibility for the actions of learning automata. Ethics and Information Technology, 6(3), 175-183. doi: 10.1007/s10676-004-3422-1

Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1-38. doi: 10.1016/j.artint.2018.07.007

Mueller, J., & Massaron, L. (2016). Machine learning for dummies. For Dummies.

Nair, S. (2009). Marketing research. Himalaya Pub. House.

Netzer, O., Lemaire, A., & Herzenstein, M. (2016). When words sweat: Identifying signals for loan default in the text of loan applications. SSRN Electronic Journal. doi: 10.2139/ssrn.2865327

Nickerson, R. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2), 175-220. doi: 10.1037/1089-2680.2.2.175

Nutt, P. (1993). The Identification of Solution Ideas During Organizational Decision Making. Management Science, 39(9), 1071-1085. https://doi.org/10.1287/mnsc.39.9.1071

Official Journal of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (2016).

Biran, O., & McKeown, K. (2017). Human-centric justification of machine learning predictions. In IJCAI, Melbourne, Australia.

Papenmeier, Englebienne, & Seifert. (2019). How model accuracy and explanation fidelity influence user trust in AI. arXiv.

Powell, M., & Ansic, D. (1997). Gender differences in risk behaviour in financial decision-making: An experimental analysis. Journal of Economic Psychology, 18(6), 605-628. https://doi.org/10.1016/s0167-4870(97)00026-3
