
An objective user evaluation of explanations of Machine Learning based advice in a diabetes context

Elisabeth Nieuwburg


An objective user evaluation of explanations of Machine Learning based advice in a diabetes context

Elisabeth G. I. Nieuwburg

July, 2019

Supervisors

Jasper van der Waa, MSc (TNO, Delft University of Technology)
Dr. Anita Cremers (TNO)


Acknowledgements

At the end of an exciting research internship at TNO Soesterberg, I am very happy to present the written result. I like to look at this master thesis as the applied insights of a social scientist in the world of eXplainable Artificial Intelligence (XAI) and I hope it will be an inspiring read for researchers and interested readers both within and outside the field. However, this report is not the only result of my time at TNO. I learned a lot about (X)AI, research in a non-academic environment, the importance of interdisciplinary studies, and importantly: about my own qualities and ambitions.

Learning is not something I do alone, so I want to thank everyone that helped me shape my ideas and supported me throughout this year. First of all, Rachelle Leerling, for the many insights into what life with diabetes type 1 is like. It is thanks to her comprehensive and lively examples that the use case for this study is so relevant and realistic. I would like to thank Jonathan Barnhoorn and José Kerstholt for their helpful and clarifying methodological advice, Tim van den Broek for an interesting talk about intelligent decision support for diabetes patients and Mark Neerincx for bringing this study to a next level with his extensive knowledge and experience and his ever enthusiastic ideas. I thank my second supervisor Jaap Murre for taking the time and effort to evaluate a research project that largely took place outside the UvA, and my first supervisor Anita Cremers for her useful advice, for keeping it simple whenever I lost the overview and for always being so positive. Also, I would never have had such a good experience at TNO without my fellow graduate interns, who were always there for help or interesting discussions and made sure there were enough fun breaks to continue with fresh energy. A last and special thanks goes to Jasper van der Waa. I cannot imagine a daily supervisor more involved and supportive but at the same time also trusting enough to leave decisions in my hands. We came from very different backgrounds and I think that our joint knowledge combined with his engaged supervision made this project into something that I am very proud of.


Abstract

Intelligent systems based on Machine Learning (ML) are becoming increasingly complex and non-transparent. Researchers in eXplainable Artificial Intelligence (XAI) generate explanations for ML output to increase users’ understanding, trust and performance. Surprisingly, many studies do not test these effects in user evaluations. Studies that do mostly use subjective measures. This study aimed to design an objective measure for XAI user effects based on three key components: well-defined dependent variables, a specified context and implicit measurement. This method was applied in two experiments in the context of an intelligent advice system for diabetes patients.

The effects of rule-based and example-based explanations on understanding (Experiment I) and performance estimation and persuasion (Experiment II) were measured. In a computer task, non-diabetic participants interacted with a simulated advice system for insulin administration. In Experiment I, participants receiving rule-based explanations correctly predicted the relevant input factor for the advice more often than people in the example-based or no explanation group. Participants’ prediction of the system advice itself did not differ significantly between groups. In Experiment II, explanations did not improve users’ ability to estimate the system accuracy. However, participants receiving explanations followed the system advice more often than controls.

The results from the experiments suggest that rule-based explanations are more suitable to improve user understanding than example-based explanations. Also, explanations may have persuasive power, but more research is needed to examine how they can improve performance estimation. It was concluded that this research method is a good starting point for more objective user research in XAI.

keywords: eXplainable Artificial Intelligence (XAI), Machine learning (ML), Diabetes mellitus, explanation, objective measurement


Table of Contents

Acknowledgements i

Abstract iii

List of Abbreviations ix

1 Introduction 1

1.1 eXplainable Artificial Intelligence (XAI) . . . 2

1.1.1 Why XAI? . . . 3

1.1.2 XAI methods . . . 4

1.1.3 Rule-based and example-based explanations . . . 5

1.2 User evaluations in XAI . . . 7

1.2.1 Considerations on current user studies . . . 8

1.2.2 A classification of research methods from the social sciences . . . 9

1.3 Use case: Personalised advice for Diabetes mellitus type 1 . . . 10

1.3.1 Diabetes mellitus type 1 . . . 10

1.3.2 Applying machine learning in diabetes management . . . 11

1.3.3 Scenario: personalised insulin advice . . . 12

1.4 Research aims and hypotheses . . . 12

1.4.1 Research questions . . . 14

1.4.2 Hypotheses . . . 15

2 Research design 17

2.1 Current state of user evaluations in XAI . . . 17

2.1.1 Defining dependent variables . . . 18

2.1.2 Specifying a user context . . . 20

2.1.3 Designing implicit measures . . . 22

2.2 Research design: a possible approach . . . 24

2.2.1 Study design . . . 24

2.2.2 Operationalisation of the dependent variables . . . 26

2.2.3 Task design . . . 27


3 Methods 31

3.1 Participants . . . 31

3.2 Inclusion criteria . . . 31

3.3 Procedure . . . 32

3.4 Experimental design . . . 34

3.4.1 General design . . . 34

3.4.2 Instructions . . . 35

3.4.3 Stimuli . . . 35

3.4.4 Explanation types . . . 37

3.4.5 Experiment I . . . 38

3.4.6 Experiment II . . . 40

3.5 Subjective measures . . . 42

3.6 Data analysis . . . 43

3.6.1 Experiment I . . . 43

3.6.2 Experiment II . . . 44

3.6.3 Statistical assumptions and outlier handling . . . 44

4 Results 47

4.1 Experiment I . . . 47

4.1.1 Objective measures of understanding . . . 47

4.1.2 Subjective measures of understanding . . . 50

4.1.3 Relation between subjective and objective measures . . . 51

4.2 Experiment II . . . 52

4.2.1 Objective measures of performance estimation . . . 52

4.2.2 Objective measures of persuasion . . . 53

4.2.3 Subjective measures of performance estimation . . . 54

4.2.4 Relation between subjective and objective measures . . . 55

4.3 Usability of the advice system . . . 56

4.3.1 Usability ratings . . . 56

4.3.2 Open questions . . . 58

4.4 Biases toward or against the system advice . . . 58

4.5 Qualitative evaluation . . . 60

5 Discussion 63

5.1 The user effects of rule-based and example-based explanations . . . 63

5.1.1 Understanding . . . 63

5.1.2 Performance estimation and persuasion . . . 65

5.2 Evaluation of the research design . . . 66


5.2.2 Strengths of the design . . . 69

5.2.3 Relation between objective and subjective measures . . . 70

5.3 Conclusions . . . 71

5.4 Directions for future research . . . 72

References 75

Appendices 83

A Prequestionnaire 83

B Postquestionnaire Experiment I 91

C Postquestionnaire Experiment II 105

D Motivation for the stimulus design 119

E Stimuli 121


List of Abbreviations

AI Artificial Intelligence

DMT1 Diabetes mellitus type 1

EE Example-based explanation

ML Machine Learning

NE No explanation

RE Rule-based explanation


1. Introduction

“My drawing No. 1 was like this:

I showed my masterpiece to the grown-ups and asked them if my drawing frightened them.

They answered: ‘Why should anyone be frightened by a hat?’ My drawing did not represent a hat. It was supposed to be a boa constrictor digesting an elephant. So I made another drawing of the inside of the boa constrictor to enable the grown-ups to understand. They always need explanations. My drawing No. 2 looked like this:”

– Antoine de Saint-Exupéry, The Little Prince

Intelligent decision support systems are used more and more in our present-day society. With the rise of Machine Learning (ML), computer systems based on simple rules evolved into self-learning machines, able to make complex decisions without their tasks being explicitly programmed by humans. Despite the numerous opportunities created by these developments, there is a downside to the use of these complex systems: their reasoning becomes non-transparent to users. In other words, we cannot see what is going on ‘inside’. Especially for applications in safety-critical environments such as healthcare, finance and defence, it is crucial that humans understand and know when to trust decisions made by intelligent systems (Adadi & Berrada, 2018; Guidotti, Monreale, Ruggieri, Turini, et al., 2018). The research field eXplainable Artificial Intelligence (XAI) studies how the decisions and predictions made by intelligent systems can be made more understandable for their human users by complementing system output with explanations. In this way, systems are not only self-learning but also self-explaining.

To test whether XAI methods have the intended effects (e.g. better understanding or task performance) in the user, it is important that the explanations generated by these methods are evaluated in human subjects (Doshi-Velez & Kim, 2017; Hoffman, Mueller, Klein, & Litman, 2018; Mohseni, Zarei, & Ragan, 2018). Surprisingly, a large part of the research in XAI does not include human evaluations. Researchers that do evaluate explanations in users often adopt surveys and interviews as research measures. Even though these subjective measures give a good account of subjects’ conscious appraisal of explanations, objective measures provide us with information about the behaviour of the subject without interference of the subject’s own reflections. Therefore, we propose an objective research method to study specific user effects of different explanations.

Importantly, the effectiveness of explanations for intelligent systems naturally depends on the user and context (Gregor & Benbasat, 1999; Mittal & Paris, 1995). In this study, we focus on a healthcare-related scenario in which ML and XAI have high potential: personalised insulin advice for diabetes mellitus type 1 (DMT1). Patients with DMT1 administer the hormone insulin via injections to regulate their blood glucose levels. Insulin requirements vary between individuals and situations. Optimal doses for specific situations may be successfully learned by an ML algorithm. However, ML advice will more likely be accepted by patients when complemented with explanations that make the decisions of the system more transparent. Therefore, explanations of personalised advice on insulin administration serve as the use case in the present study.

The remainder of this introduction covers the research area XAI and our focus within this field (Section 1.1), the possible contribution from the social sciences in improving user studies in XAI (Section 1.2) and a discussion of intelligent decision support for DMT1 (Section 1.3). We conclude this introduction with a preview of our approach in designing an objective measure for XAI user effects, discussed in depth in Chapter 2, and by formulating our research questions and hypotheses in Section 1.4.

1.1 eXplainable Artificial Intelligence (XAI)

The recently revived research area of XAI focusses on how to obtain and provide users with an understandable account of the reasoning behind decisions made by systems based on Artificial Intelligence (AI). With the increasing complexity of intelligent systems, it is a growing challenge to design methods that produce explanations for system decisions that are both faithful to the original model, and effective from the perspective of the user (Adadi & Berrada, 2018; Herman, 2017). This section discusses the relevance of research in XAI for applications in society. Also, we provide a short overview of different XAI techniques and we highlight our focus among these approaches.

1.1.1 Why XAI?

When the American Defense Advanced Research Projects Agency (DARPA) launched its XAI research programme in 2017, the aims of XAI research were formulated in terms of understanding, appropriate trust and effective management (Gunning, 2017). These three objectives are highly interrelated and together they form a good account of the relevance of XAI for society.

The first concept, understanding of the reasoning behind intelligent system output, is essential for decision support systems to be used in everyday life. A poor understanding of the way an intelligent system came to a decision makes it difficult for users to judge whether there is a valid basis for the decision. For example, patients and medical experts receiving clinical advice from an intelligent system may not take the advice seriously unless a plausible explanation is given. Similarly, recommendations from a movie recommender system are more likely to be followed if people understand why a specific movie would fit their personal taste. In safety-critical environments, a lack of explanations may even lead to ethical concerns, as shown by various instances from the media where non-transparent systems made undesirable choices. Well-known examples include risk assessment tools biased against black prisoners (Angwin, Larson, Mattu, & Kirchner, 2016) and deadly accidents as a result of the decisions of self-driving cars (Levin & Wong, 2018; Yadron & Tynan, 2016). A better understanding of the decisions of self-learning machines could prevent such complications when applying ML algorithms in society.

Besides the growing societal urge to understand intelligent systems, explanations have become legally required as well. A right to receive explanations for system decisions was included in the General Data Protection Regulation (GDPR), the renewed framework for data protection laws recently adopted by the European Parliament (The European Parliament and the Council of the European Union, 2016). It states that when personal data is used in automated decision making, data subjects should receive ‘meaningful information about the logic involved’ (p. 41). The possibility that developments in AI may become restricted by law underlines the importance of XAI even more.

A good understanding of the logic behind automated decisions also promotes the second aim listed by DARPA: appropriate trust in users. Intelligent systems are never 100% accurate, often due to incomplete or non-representative data sets (Bussone, Stumpf, & O’Sullivan, 2015; Kong, Xu, & Yang, 2008). Therefore, it is important that user trust is appropriate, that is, not too high and not too low. Explanations serve to promote people’s acceptance of the system, but importantly also create the risk of over-reliance if they are too persuasive (Bussone et al., 2015). If people are able to accurately judge a system’s capacity and learn to recognise its weaknesses, there is a higher chance of successful use than when the system makes mistakes that appear to be resulting from a ‘black box’.

This brings us to the third concept described above: effective management. The desired effect of an artificially intelligent application is to increase effective task performance in cooperation with humans. In our use case, this translates to an improvement in diabetes management in patients with the help of an intelligent advice system. However, this cooperation is only effective if users have an accurate performance estimation: an awareness of when to trust, and when not to trust the system. Explanations that enable the user to understand the logic behind the choices of a computer system improve this awareness and therefore effective management of intelligent systems.

1.1.2 XAI methods

The simplest way to provide insight into the decisions made by an intelligent system is to design a model that is inherently transparent and interpretable. For example, the underlying logic of symbolic models based on decision trees or classification rules is relatively comprehensible for end users (Freitas, 2014). However, easily interpretable models are often not fit for highly complex problems: there is a trade-off between interpretability and accuracy (Adadi & Berrada, 2018; Herman, 2017). Therefore, a large body of research in XAI is focussed on the development of methods to obtain explanations for output generated by more complex, sub-symbolic models, such as neural networks and deep learning algorithms.

XAI methods providing explanations for complex ML algorithms can be classified in two different ways. Firstly, a distinction exists between global and local explanations (Mohseni et al., 2018). Whereas the objective of global explanation methods is to explain the workings of the complete model, local explanations only explain a single instance of the model output. For example, a global explanation could contain information on the type of data the model uses and the calculations its output is based on. A local explanation explains the logic behind a single advice or decision.

A second distinction lies between model-specific and model-agnostic XAI methods (Adadi & Berrada, 2018). Techniques that provide model-specific explanations are limited to the type of model that they are designed for. For example, a method designed to explain deep neural networks by integrating semantic information into the model (Dong, Su, Zhu, & Zhang, 2017) cannot be applied to explain the outcomes of a support vector machine. Because it is more efficient to apply a single method to a diverse range of AI models, there is a growing interest in model-agnostic XAI methods. Model-agnostic techniques treat the underlying model as a ‘black box’ and construct explanations based on relations between model input and output (see Figure 1.1).


Figure 1.1. A schematic representation of model-agnostic XAI methods. XAI methods (displayed in grey, dashed lines) can be seen as a layer around the ML model (displayed in black), generating explanations based on ML model input and output and treating the original model as a Black Box. Note. ML = Machine Learning; XAI = eXplainable Artificial Intelligence.

It is important to realise that the input and output of ML models may differ. For example, some ML models may be handling image data, while others are designed for tabular data. Output may consist of classifications, data clusters, learned rules or rankings.

The focus of the present study was on local, model-agnostic explanations for ML. Our use case concerned tabular input data (environmental and physical conditions of a diabetes patient) and classification output (advice on the optimal insulin dose). Within this scope, there are numerous techniques to provide explanations and the body of literature on XAI methods is rapidly growing. For a complete summary of different types of XAI methods for black-box models, we refer the reader to extensive reviews by Adadi and Berrada (2018) and Guidotti, Monreale, Ruggieri, Turini, et al. (2018).
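To make the local, model-agnostic setting more concrete, the Python sketch below shows one very simple way such an explainer could operate: it only queries the model’s prediction function and reports which input features, when perturbed, change the output. The advice function, feature names and perturbation values are hypothetical illustrations, not the system or explanation algorithms used in this thesis.

# A minimal sketch of a local, model-agnostic explainer: it only calls the
# model's prediction function and checks which features, when perturbed,
# flip the output. The advice function and feature names are hypothetical.

def predict_insulin_advice(features):
    # Stand-in 'black box': advises a lower dose when the temperature is high
    # or recent exercise was intense. Any classifier could be substituted here.
    if features["temperature_c"] > 28 or features["exercise_minutes"] > 45:
        return "lower dose"
    return "normal dose"

def locally_relevant_features(predict, instance, perturbations):
    # Return the original prediction and the features whose perturbation
    # changes it, using nothing but model input and output (model-agnostic).
    original = predict(instance)
    relevant = []
    for name, new_value in perturbations.items():
        perturbed = dict(instance, **{name: new_value})
        if predict(perturbed) != original:
            relevant.append(name)
    return original, relevant

instance = {"temperature_c": 31, "exercise_minutes": 10, "carbs_g": 60}
perturbations = {"temperature_c": 20, "exercise_minutes": 60, "carbs_g": 30}
print(locally_relevant_features(predict_insulin_advice, instance, perturbations))
# ('lower dose', ['temperature_c'])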

1.1.3 Rule-based and example-based explanations

We concentrated on two exemplary classes of model-agnostic XAI methods to illustrate our case: rule-based and example-based explanations. Both methods provide local explanations for ML classifications based on the relation between input features and output. However, the information presented to the user is rather different. Whereas rule-based explanation methods aim to extract conditional rules on which the output is based, example-based explanations provide the user with example data points similar to the current output. By data points we mean occurrences of system output with their corresponding set of input features and their values. Possible linguistic formulations of rule-based and example-based explanations are presented in the first two rows of Table 1.1.


Table 1.1

Possible formulations of rule-based and example-based explanations and their contrastive counterparts

Rule-based: “The model predicted output E, because input feature A has value X.”

Example-based: “The model predicted output E. In a similar, previous case, features A and B had values X and Y and the model also predicted output E.”

Contrastive rule-based: “The model predicted output E and not output F, because input feature A has value X and not value Y. If input feature A had value Y, the model would have predicted output F.”

Contrastive example-based: “The model predicted output E and not output F. In a similar, previous case, features A and B had values X and Y and the model also predicted output E. In another similar, previous case, features A and B had values X and Z and the model predicted output F.”

Note. The contrastive parts of the explanations are displayed in bold.

As the examples of explanations show, these two methods are feature-based: they generate an explanation based on the most important input features for a specific output. In reality, a large number of features may be involved in the generation of a certain output. However, whether a feature value is relevant for the user depends on what he or she wants to know. For example, if the user inquires why output E was generated instead of output F, the most important features may differ from a situation in which the user asks for the contrast between output E and output G. Therefore, we focus on contrastive, or why-not explanations: explanations for the occurrence of a certain event (the fact) in contrast to a different event that did not occur (the foil) (Lipton, 1990). Literature from human-computer interaction shows that why-not rules have repeatedly been used to explain computer output (e.g. Lim, Dey, & Avrahami, 2009; Myers, Weitzman, Ko, & Chau, 2006; Vermeulen, Vanderhulst, Luyten, & Coninx, 2010).

Besides the technical advantage of reducing the number of relevant features for an explanation, contrastive explanations are also closely modelled to common explanation strategies used by humans (Hilton, 1991; Miller, 2018). For example, one may explain the event of a girl eating a pear (the fact) by stating “she is hungry.” This explanation assumes a contrast with not eating anything (the foil). An alternative foil may be eating an apple. This foil leads to a different explanation, for example: “because she is allergic to apples.” Although humans often infer the foil from context and therefore leave it implicit, this example illustrates that contrastive explanations are a typical way of explaining in humans.

An important reason to focus on rule-based and example-based explanation techniques is the fact that they are suitable to generate contrastive, and therefore human-like explanations. Recently, researchers in XAI started developing techniques for the generation of contrastive rule-based (van der Waa, Robeer, van Diggelen, Brinkhuis, & Neerincx, 2018) and contrastive example-based (Adhikari, Tax, Satta, & Fath, 2018) explanations for ML, both with promising results. As can be seen in the last two rows of Table 1.1, contrastive rule-based explanations state the exact feature value(s) or thresholds that discern between one output and another, while contrastive example-based explanations present similar data points with identical and different outcomes. A visualisation of the differences between the two methods in generating contrastive explanations is provided in Figure 1.2.

Figure 1.2. A visualisation of contrastive rule-based and example-based explanations. Rule-based explanations explain the threshold for selecting output E rather than output F. Example-based explanations present the example data points closest to the threshold between output E and output F.
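To illustrate the difference between the two formats, the sketch below simply fills in the contrastive templates from Table 1.1 for a single tabular prediction in the insulin scenario. The function names and example feature values are hypothetical; this is template formatting only, not the contrastive generation methods of van der Waa et al. (2018) or Adhikari et al. (2018).

# Illustrative formatting of the contrastive templates from Table 1.1.
# Function names and feature values are hypothetical.

def contrastive_rule_explanation(fact, foil, feature, value, foil_value):
    # Contrastive rule-based: states the feature value that separates fact from foil.
    return (f"The model predicted {fact} and not {foil}, because {feature} has "
            f"value {value} and not value {foil_value}. If {feature} had value "
            f"{foil_value}, the model would have predicted {foil}.")

def contrastive_example_explanation(fact, foil, fact_example, foil_example):
    # Contrastive example-based: presents similar previous cases with identical
    # and with different outcomes.
    return (f"The model predicted {fact} and not {foil}. In a similar, previous "
            f"case ({fact_example}), the model also predicted {fact}. In another "
            f"similar, previous case ({foil_example}), the model predicted {foil}.")

print(contrastive_rule_explanation(
    "'lower dose'", "'normal dose'", "environmental temperature",
    "31 degrees", "25 degrees"))
print(contrastive_example_explanation(
    "'lower dose'", "'normal dose'",
    "temperature 30 degrees, light exercise",
    "temperature 26 degrees, light exercise"))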

1.2 User evaluations in XAI

Although the research area of XAI revolves around providing explanations for human users, only a small part of the XAI techniques developed was actually evaluated in human subjects (Adadi & Berrada, 2018; Doshi-Velez & Kim, 2017; Miller, 2018). Recently, researchers in XAI have started to recognise the importance of user evaluations, but the area is still in its infancy and the methods used to evaluate the effects of explanations can be improved. This section highlights the problems with current user evaluations within XAI and introduces suggestions for improvement based on research practice in the social sciences. In Chapter 2 we provide an in-depth review of the current state of user studies in XAI and the shortcomings that we identified, as well as a description of how we translated the proposed improvements into a concrete research design.


1.2.1 Considerations on current user studies

The topic of evaluating the effectiveness of XAI methods has recently gained attention in the literature. Reviews by Doshi-Velez and Kim (2017), Hoffman et al. (2018) and Mohseni et al. (2018) reveal a rising interest in a more systematic approach for the evaluation of XAI techniques. The research described in these reviews and the results from our own literature study (see Chapter 2) lead to a number of important considerations on user evaluations in XAI.

First of all, Doshi-Velez and Kim (2017) identified three stages of evaluation in XAI. The first step is to conduct functionally-grounded evaluations using computational measures. This is a suitable method if, for example, a model is not yet fully developed or human experiments are not feasible for other reasons. The second stage is to conduct human experiments with a simplified version of the application: human-grounded evaluations. The final round of evaluations is application-grounded: human experiments with the full-fledged version of the application. The authors noted that at present, functionally-grounded evaluations are most commonly used in XAI while human-grounded and application-grounded experiments are rare.

Secondly, all three reviews emphasised the importance of carefully specifying the objectives of an XAI method and matching the evaluation measure to these claims. Studies differ much in their descriptions of dependent variables. Even though concepts such as comprehension of the system, trust in the system and human-system task performance are frequently mentioned explanation goals in the literature, different authors use different terms and definitions for their dependent variables. Therefore, it is important to clarify the user effects to be measured beforehand.

Furthermore, even though the effectiveness of explanations is largely dependent on contextual factors such as the user profile, the problem to be solved and the situation (Mittal & Paris, 1995) and different contexts require different levels of detail in the explanations (Doshi-Velez & Kim, 2017), the specification of the user context of the evaluations was rarely prioritised. Researchers often selected a use case based on the data set available, or came up with a fictitious story that was easy to understand, without taking the influence of context on the results into account. Especially in human-grounded and application-grounded evaluations, an ill-specified context domain may pose problems for the eventual applicability of the explanation method.

Finally, we observed that the majority of the studies discussed in these reviews relied on measures of self-report and subjects’ conscious evaluation of explanations. Because the use of a wider range of research methods including objective measures could contribute to a more solid research practice, the following section introduces insights on user research from the social sciences that can be used to improve user evaluations in XAI.


1.2.2 A classification of research methods from the social sciences

We argue that if XAI techniques are to be applied in real-world systems, human evaluations are an essential component of the development cycle. Fortunately, as we saw in Section 1.2.1, the XAI community is starting to recognise this importance. However, more attention should be devoted to the design of the evaluation methodology and to the question whether a given method is suitable for the research objectives.

Oftentimes, choices in method design restrict the conclusions that can be drawn from a study. For example, measuring trust by asking participants how trustworthy they consider a system differs essentially from measuring trust by testing whether participants follow or ignore system advice during a task. These two research methods lead to different conclusions: in the former case, we may draw conclusions about people’s own feeling of trust, while in the latter case, we examine what people’s behaviour tells us about their trust levels.

In the social sciences, there is a distinction between objective and subjective research methods. Whereas subjective methods rely on self-report and include interviews, surveys and ratings to test people’s opinions or judgements regarding a certain matter, objective methods are designed to test people’s behaviour without interference of their own assumptions, beliefs and interpretations. This distinction is illustrated by the example above: asking about people’s trust levels is a subjective measure, whereas testing people’s tendency to accept advice is objective.

A similar yet slightly different distinction exists between explicit and implicit measurement. In explicit user research, participants are aware of what they are tested on. For example, they may be told beforehand that a study is about the understanding of the logic behind a system, which helps participants focus on the goal of the task: getting to understand the system. Implicit methods, however, are used to test a certain dependent variable that the user does not know of (Vandeberg, Smit, & Murre, 2015). For example, users may think that the goal of the study is to evaluate the layout of a system, while they are actually tested on their comprehension of the mechanism behind it.

As pointed out above, the majority of the user evaluations in XAI are conducted using subjective measures. Surveys, interviews and user ratings are popular methods to test different effects of XAI techniques in human users. Studies that do use more objective methods, such as knowledge tests or quizzes, rarely apply implicit measures. In the social sciences, the benefits of behavioural methods have been recognised for a long time. Humans are not always consciously aware of why and how they perform certain behaviour (Nisbett & Wilson, 1977) and multiple studies have shown that people’s judgements and estimations of their own behaviour are not in line with their actual behaviour (e.g. Epley & Dunning, 2006; John & Robins, 1994; West & Brown, 1975; Wilson & Gilbert, 2003). This also applies to people’s often inaccurate estimation of their own understanding of concepts, which has been termed the illusion of explanatory depth (Rozenblit & Keil, 2002). Investigating people’s responses during a task under specific experimental manipulations has proven to be a more reliable measure in human research, because the confounding influence of subjective appraisal is avoided (Baumeister, Vohs, & Funder, 2007; Vazire & Mehl, 2008).

The importance of input from the social sciences in XAI was recently discussed by Miller (2018), who argued that research on the way humans explain concepts to each other is a valuable starting point for the design of explanations for AI. Also Mohseni et al. (2018) noted the importance of cross-disciplinary research for XAI and discussed valuable contributions from the areas of human-computer interaction and psychology. However, insights from the social sciences have not yet been applied to the design of methods for the evaluation of XAI. This is a contribution we aimed to make by proposing an objective and implicit research design for the measurement of specific effects of XAI explanations in the context of DMT1 management.

1.3 Use case: Personalised advice for Diabetes mellitus type 1

As pointed out in Section 1.2, the effectiveness of explanations is highly dependent on the context of the ML application (Doshi-Velez & Kim, 2017; Gregor & Benbasat, 1999; Mittal & Paris, 1995). Explanation types fit for an intelligent system involved in a military operation may differ from those useful to classify different types of organisms. Also, different types of users may require different explanations: laymen do not require the same amount of detail as experts and explanations presented to adults should differ from those presented to children.

Therefore, it is important to specify the context of the application before designing an evaluation method and to work from a specific scenario while constructing an experiment (Neerincx & Lindenberg, 2008). In this study, we focus on personalised healthcare: a context in which ML is highly promising, but in which explanations are essential for applications to be widely adopted in society. The scenario that serves as the use case in this study is personalised, intelligent insulin advice for DMT1 patients. In this section, we describe DMT1, the potential of ML for DMT1 management and the specific scenario used as context for the present study.

1.3.1 Diabetes mellitus type 1

DMT1 is a chronic autoimmune disorder in which glucose homeostasis is disturbed. According to the International Diabetes Federation (2017), 28,200 children and adolescents in Europe are diagnosed with DMT1 each year. In people with this condition, the peptide hormone insulin, responsible for the uptake of glucose in body cells, is not produced due to loss of beta cells in the pancreas. As a result, patients suffering from DMT1 need to regulate their blood glucose values by administering insulin before food intake via subcutaneous injections or an insulin pump.

Adapting insulin administration to personal needs can be a challenging task, since blood glucose levels are influenced by many environmental and internal factors with effects that differ for every patient (American Diabetes Association, 2004). DMT1 patients learn to calculate the dose of insulin needed based on the amount of carbohydrates in a meal, but the optimal dose may be influenced by circumstances such as environmental temperature, tension and physical exercise. Even for experienced patients, it may be hard to find the dose of insulin that stabilises their blood glucose level (Reddy, Rilstone, Cooper, & Oliver, 2016). Careful DMT1 management is very important, as high blood glucose levels due to low doses of insulin may cause complications in the longer term, while low blood glucose levels increase the short-term risk of hypoglycaemia (a complication involving impaired brain functioning, resulting in, for example, tremor, dizziness and impaired cognitive functions). Therefore, personalised intelligent advice systems could be a promising tool in DMT1 management.

1.3.2 Applying machine learning in diabetes management

As recent reviews by Contreras and Vehi (2018) and Kavakiotis et al. (2017) show, research on AI applications for diabetes is rapidly expanding. ML algorithms in particular have potential in the generation of personalised advice, because of their ability to extract information and make predictions based on a specific data set. With the rise of tools for self-measurement of health and activity, it is becoming easier to record personal data on various physical factors. Therefore, a personal application that provides on-demand advice on the administration of insulin is not an unlikely scenario in the near future.

However, as also noted by Buch, Varughese, and Maruthappu (2018) in a commentary on the application of AI in diabetes management and diagnosis, introducing AI-based tools for diabetes care may be accompanied by either a lack of trust or even over-reliance in patients, both due to the non-transparent nature of AI systems. Since patients’ well-being and possible health consequences are involved, it is important that the application of ML in diabetes management advice is paired with explanations that improve understanding and appropriate trust (see Section 1.1.1), thereby promoting effective diabetes management. Therefore, the use of explanations to optimise patients’ performance in maintaining constant blood glucose levels in cooperation with the advice system is central to the use case in this study.


1.3.3 Scenario: personalised insulin advice

In the scenario that this study was based on, a DMT1 patient experiences problems finding the optimal insulin dose for a meal. To assist in determining the dose that will result in a stable blood glucose level, the patient has an intelligent application that provides advice on insulin intake before a meal. Based on different internal and external factors in the current situation of the patient, the system may advise to increase or decrease the insulin dose that the patient usually administers. Any advice to change the insulin dose is accompanied by a contrastive, or why-not explanation that highlights which factor(s) caused the advice to deviate from a normal dose. For example, the advice system could advise a lower dose than normal, with environmental temperature as the most important factor for this advice.

It is important to note that the intelligent advice system in this study was fictitious. This scenario was based on the case of insulin administration problems in DMT1 patients and the potential of ML and XAI in this respect. However, our focus was on the effectiveness of explanations in this scenario, rather than on the development of an actual advice system.

1.4 Research aims and hypotheses

The aim of the present study was twofold. Firstly, we intended to design an objective measure for the effects of different explanations for ML advice. In Section 1.2.1, we saw that carefully defining the intended user effects of the explanations (the dependent variables) is essential in selecting and designing a method to measure them. Also, we discussed the importance of specifying the context and the targeted user of the application before designing a user study. In Section 1.2.2 we explained that a good way to avoid interference of people’s personal thoughts and opinions is to apply implicit measures. To contribute to a more structured approach in XAI evaluations including objective measurement, we identified well-defined dependent variables, a specified context and implicit measurement as the key components for objective user evaluations in XAI (see Figure 1.3). Our aim was to design a research method based on these guidelines.



Figure 1.3. Three key components of objective user evaluations in XAI. Note. XAI = eXplainable Artificial Intelligence.

Our second objective was to apply this research method in two behavioural experiments in the context of the DMT1 scenario described in Section 1.3.3. A first experiment compared the effects of rule-based and example-based explanations on users’ global understanding of the advice system. A second experiment measured the effects of the two explanation types on users’ ability to correctly estimate the performance of the system. We termed these two dependent variables understanding and performance estimation. Below, the two dependent variables are specified in more detail.

Understanding: the extent to which the user understands the overall reasoning behind the decisions made by the intelligent system. In the present study, this refers to an understanding of the relations between input values and system advice.

Performance estimation: the extent to which the user can correctly estimate the performance of an intelligent system that is not 100% accurate. In this study, this translates into the ability to recognise when to accept and when to reject its advice.

Besides these two main dependent variables, the design of the second experiment also gave us insight into an interesting third concept: persuasion. The participants’ task in the second experiment was to decide whether to accept or to reject the advice from an intelligent system. Therefore, we did not only analyse whether they correctly accepted the system advice, but also whether they were likely to accept the system advice regardless of its accuracy. This third dependent variable was defined as follows.


Persuasion: the extent to which the user is convinced to accept the intelligent system output. In the present study, this is represented by users’ tendency to follow the system advice, regardless of whether the advice is correct or incorrect.
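As a minimal sketch of how these three dependent variables could be scored objectively from logged task behaviour, consider the Python snippet below. The trial logs and scoring rules are illustrative assumptions, not the exact operationalisation described in Chapters 2 and 3.

# Illustrative scoring of the three dependent variables from hypothetical logs.

def understanding_score(predicted_advice, system_advice):
    # Experiment I style: proportion of trials in which the participant
    # correctly predicted the advice the system would give.
    return sum(p == s for p, s in zip(predicted_advice, system_advice)) / len(system_advice)

def performance_estimation_score(accepted, system_correct):
    # Experiment II style: proportion of trials in which correct advice was
    # accepted or incorrect advice was rejected.
    return sum(a == c for a, c in zip(accepted, system_correct)) / len(accepted)

def persuasion_score(accepted):
    # Experiment II style: proportion of trials in which the advice was
    # followed, regardless of whether it was correct.
    return sum(accepted) / len(accepted)

# Hypothetical logs for one participant.
predicted_advice = ["lower", "normal", "lower", "lower"]
system_advice    = ["lower", "lower", "lower", "normal"]
accepted         = [True, True, False, True]
system_correct   = [True, False, True, True]

print(understanding_score(predicted_advice, system_advice))    # 0.5
print(performance_estimation_score(accepted, system_correct))  # 0.5
print(persuasion_score(accepted))                               # 0.75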

1.4.1 Research questions

In accordance with our research aims, we formulated two main research questions, each consisting of multiple subquestions. The first research question was of a methodological nature and concerned the suitability of our research design (Research Question 1a) and how it compares to subjective measures of the same concepts (Research Question 1b).

Research Question 1a: Is a research method that entails well-defined dependent variables, a specified context and implicit measurement suitable for the objective measurement of user effects of XAI explanations?

Research Question 1b: How do the results from the designed objective measure of user understanding and performance estimation relate to the results obtained with subjective measures?

The two experiments that we conducted using the research method that we designed also served as a first objective study to test the effectiveness of two exemplary XAI methods. The second research question therefore addressed the user effects of contrastive rule-based and example-based explanations. The three dependent variables (understanding, performance estimation and persuasion) were subject to three different subquestions (Research Questions 2a, 2b and 2c, respectively).

Research Question 2a: What are the effects of contrastive rule-based and example-based explanations of personalised insulin advice on user understanding?

Research Question 2b: What are the effects of contrastive rule-based and example-based explanations of personalised insulin advice on user performance estimation?

Research Question 2c: What are the effects of contrastive rule-based and example-based explanations of personalised insulin advice on user persuasion?

The design of the method addressed in Research Question 1 is described and motivated in Chapter 2. The second research question was investigated in two experiments, with Experiment I examining Research Question 2a and Experiment II addressing Research Questions 2b and 2c. The methods and results of these experiments are presented in Chapters 3 and 4.


1.4.2 Hypotheses

We expected our objective research method to be an important contribution to the field of XAI, because current user evaluations are often poorly designed or mainly use subjective measures. Whereas objective, behavioural tasks have been common practice in the social sciences for decades, there is room for improvement in XAI user research. Therefore, our hypothesis for Research Question 1a was as follows.

Hypothesis 1a: A research method that entails well-defined dependent variables, a specified context and implicit measurement is suitable for the objective measurement of user effects of XAI explanations.

With respect to the relation between our objective research method and subjective measures of the same concepts, we expected to find an inconsistency between the results of subjective and objective measures. Subjective measures do not fully reflect internal processes, because humans are not aware of everything that happens within their own mind (Baumeister et al., 2007; Nisbett & Wilson, 1977; Vazire & Mehl, 2008). Some people may overestimate their own understanding of the system’s workings and its performance (Rozenblit & Keil, 2002), while others may have better understanding and performance estimation than they think. Therefore, we hypothesised that there was no correspondence between subjective and objective measures.

Hypothesis 1b: Results obtained using objective measures differ from those obtained with subjective measures.

Regarding the first experiment, participants’ understanding of the logic behind the system advice was expected to be promoted by the two types of contrastive explanations, as compared to advice without any explanations. Explanations make the opaque system more interpretable, thereby giving the participants better insight into its mechanisms. Also, the contrastive nature of the explanations is easily understandable for human users (see Section 1.1.3). When comparing rule-based to example-based explanations, we expected rule-based explanations to improve users’ understanding of the system’s reasoning more than example-based explanations. Rule-based explanations explicitly state the feature value that would have led to a different advice on insulin administration, whereas example-based explanations provide example situations rather than exact threshold values (Adadi & Berrada, 2018). Therefore, rule-based explanations should allow users to learn the system’s reasoning more easily than example-based explanations.

Hypothesis 2a: Contrastive rule-based and example-based explanations positively affect user understanding, as compared to no explanations. This effect is larger for rule-based explanations.


For the second experiment, we expected both contrastive explanation types to have a positive effect on people’s ability to judge when the system is correct or incorrect. Explanations provide insight into the (possibly incorrect) information the advice system bases its decisions on. For example, the system may often be inaccurate when certain input features are involved. In addition, comparison with a contrasting event provides information on the decision boundary learned by the system, which is relevant information in judging the system’s performance in new situations.

Similar to our hypotheses for user understanding, rule-based explanations were expected to be more effective in improving participants’ performance estimation than example-based explanations. Performance estimation depends on identifying the incorrect decision boundaries that the system has learned and rule-based explanations reveal a direct connection between the rule and the incorrect output. Because example-based explanations do not state explicit rules and require multiple instances to identify the incorrect rule, rule-based explanations were hypothesised to be more effective in improving performance estimation.

Hypothesis 2b: Contrastive rule-based and example-based explanations positively affect user performance estimation, as compared to no explanations. This effect is larger for rule-based explanations.

Although we expected rule-based explanations to be more effective than example-based explanations in improving both user understanding and performance estimation, example-based explanations were expected to be more persuasive. Humans commonly apply reasoning based on previous cases to solve problems, a phenomenon referred to as case-based reasoning (Aamodt & Plaza, 1994). Also, examples are a powerful persuasive conversation tool (Lischinsky, 2008). Research showed that examples have a larger influence on people’s perception of certain information than general statements involving quantities, percentages or abstract information, despite these being more informative (Brosius & Bathelt, 1994; Gibson & Zillmann, 1994). Therefore, presentation of previous example cases was expected to persuade users to follow the system advice more than conditional rules.

Also, because explanations in general often have an important argumentative function in human conversations and serve to convince the listener (Antaki & Leudar, 1992; Miller, 2018; Moulin, Irandoust, Bélanger, & Desbordes, 2002), both contrastive explanation types were expected to be more persuasive than advice without explanations.

Hypothesis 2c: Contrastive rule-based and example-based explanations positively affect user persuasion, as compared to no explanations. This effect is larger for example-based explanations.


2. Research design

This chapter elaborates on the importance and implementation of objective user evaluations in eXplainable Artificial Intelligence (XAI) and the approach that we adopted in our research design. Section 2.1 covers a review of current user evaluations in XAI, structured according to the key components of objective evaluations in XAI that we identified in Section 1.4. In Section 2.2 we describe and motivate the choices that we made in the design of the objective method used in this study and we discuss how this method can be adapted and generalised to evaluate different XAI methods and AI systems in human users.

2.1 Current state of user evaluations in XAI

Despite acknowledging the importance of explanations for users of artificially intelligent (AI) systems, many researchers introducing new XAI techniques solely include computational (or, in the terms of Doshi-Velez and Kim, 2017: functionally-grounded) evaluations to test their effectiveness. Surprisingly, numerous recent publications and prepublications in XAI do not include any user evaluations. From interpretable models (e.g. Letham, Rudin, McCormick, & Madigan, 2015; Xu et al., 2015) to black-box explanations (e.g. Dong et al., 2017; Guidotti, Monreale, Ruggieri, Pedreschi, et al., 2018), from model-agnostic techniques (e.g. Casalicchio, Molnar, & Bischl, 2018; Mishra, Sturm, & Dixon, 2017) to model-specific methods (e.g. Li, Song, & Wu, 2019; Santoro et al., 2017) and from local (e.g. Montavon, Lapuschkin, Binder, Samek, & Müller, 2017; Smilkov, Thorat, Kim, Viégas, & Wattenberg, 2017) to global explanations (e.g. Valenzuela-Escárcega, Nagesh, & Surdeanu, 2018; Yang, Rangarajan, & Ranka, 2018): the user effects of many new XAI techniques are unknown because of a lack of evaluations in human subjects.

Although we recognise that it is not every researcher’s direct aim to study the effects on end users and not every research lab has the resources to conduct extended human experiments, we argue that human evaluations should be an inherent part of the eventual design loop of XAI techniques.

Focussing on the designs of studies that did include human evaluations of the effects of AI explanations, we found that user experience of explanations for AI was usually evaluated using subjective measurement. Measures such as surveys, ratings and interviews were used to evaluate user satisfaction (Bilgic & Mooney, 2005; Ehsan, Harrison, Chan, & Riedl, 2018; Lim & Dey, 2009), goodness of the explanation (Hendricks et al., 2016), acceptance of the system output (e.g. Herlocker, Konstan, & Riedl, 2000; Ye & Johnson, 1995) and trust in the system (e.g. Berkovsky, Taib, & Conway, 2017; Holliday, Wilson, & Stumpf, 2016; Nothdurft, Heinroth, & Minker, 2013).

Subjective measurements provide valuable insights into how subjects evaluate different explanations for AI output. In some cases, subjective measures are the most suitable method. For example, researchers interested in the evaluation of explanation methods by data science or ML experts, rather than naive participants, depended on the expert judgements of these specialists (e.g. Dey & Newberger, 2009; Krause, Perer, & Ng, 2016). Also, subjective measures are convenient tools to measure the desired effects of the explanations in a simple and quick way. For example, the easiest way to measure trust is to ask whether people trust the system and a simple check for the persuasiveness of a system is to ask people how likely they would be to follow a system advice. However, as mentioned in Chapter 1, to investigate how people truly respond to a system, it is important to also evaluate their interaction without interference of the subjects’ own beliefs and expectations. Therefore, we should take up the challenge to construct objective measurements for XAI user research.

In Chapter 1, we identified three focal points for objective evaluations in XAI: well-defined dependent variables, a specified context and the use of implicit measures (see Figure 1.3). In research methods currently used to evaluate the effectiveness of XAI techniques, these components are prioritised to different extents. Here, we review current user evaluations in XAI in light of these key components.

2.1.1 Defining dependent variables

Although explanations for intelligent system output may be designed with different objectives in mind, we observed that the goals researchers aim to achieve with their explanations are not always clearly specified. The XAI community is starting to recognise the importance of defining the proposed effects of the explanations beforehand and matching the research measure to these effects (Doshi-Velez & Kim, 2017; Hoffman et al., 2018; Mohseni et al., 2018). However, the description of the dependent variables was ambiguous in several publications and moreover, the presentation of these concepts to the participants who had to judge them was not always clear.

A confusing concept in the literature was ‘satisfaction’. For example, Ehsan et al. (2018) asked users to rate satisfaction with the expressions of an intelligent agent, but the authors did not describe what exactly satisfaction entailed. Also, they only addressed this concept in one user rating, without considering possible subcomponents of satisfaction (e.g. emotional satisfaction, satisfaction about the language or layout used, satisfaction with the agent’s trustworthiness, et cetera). A single rating of satisfaction was also used by Lim and Dey (2009) in a study on explanations for context-aware applications, leaving the exact nature of users’ (dis)satisfactory experience unclear. In a study on movie recommender systems, Bilgic and Mooney (2005) used a completely different definition of satisfaction, more related to the accuracy of the explanations. According to the authors, an explanation for a recommendation was satisfactory when it helped the users accurately assess the quality of an item.

Another troublesome concept was explanation ‘goodness’. Hendricks et al. (2016) had participants rate which of two visual explanations for an image classifier was ‘better’. It was left unclear to the reader as well as to the participants what would constitute a ‘good’ explanation. Miller (2018) suggested a number of different criteria for good explanations, for example their probability, relevance, soundness and completeness. A completely different definition of explanation goodness was provided by Hoffman et al. (2018), who described goodness as the a priori, decontextualised quality of explanations rated by XAI researchers, as opposed to satisfaction, which was defined as users’ a posteriori evaluation of explanations in context.

The large variation in different definitions of concepts such as satisfaction and goodness in the literature suggests that more specific and narrow concepts are more suitable as dependent variables in a user study. Several studies did select clear dependent variables and fit their measures to these concepts. Examples of well-defined dependent variables in current XAI evaluations include perceived trust (Madsen & Gregor, 2000), mental model accuracy (Kulesza et al., 2013) and acceptance of the system (Herlocker et al., 2000).

Nonetheless, the terminology used to describe explanation goals largely differs between studies. Therefore, multiple authors have attempted to introduce more structure and uniformity with respect to explanation goals in XAI. An overview of different classifications of dependent variables from the literature is presented in Table 2.1. The different explanation goals in this table can be grouped under four main categories:

1. Understanding-related explanation goals: aims to improve the users’ understanding of the workings of the intelligent system. Explanation goals include Justification, Education, Mental models, Transparency, Conceptualisation, Learning and Scrutability.

2. Performance-related explanation goals: aim to improve the user's performance in the task accomplished in cooperation with the intelligent system. Explanation goals include Performance, Effectiveness and Efficiency.

3. Trust-related explanation goals: aim to promote the users' trust in the performance of the intelligent system. Explanation goals include Acceptance, Trust, Reliance and Persuasiveness.

4. Experience-related explanation goals: aim to optimise the user's experience in interacting with the intelligent system. Explanation goals include User involvement, Relevance and Curiosity. Explanation goodness and Satisfaction may also be grouped under this category, depending on the definition provided.

Table 2.1
Classifications of XAI explanation goals from the literature.

Herlocker, Konstan, and Riedl (2000): Justification, User involvement, Education, Acceptance
Hoffman, Mueller, Klein, and Litman (2018): Explanation Goodness, Explanation Satisfaction, Mental models, Curiosity, Trust, Performance
Mohseni, Zarei, and Ragan (2018): Mental Model, Human-machine task Performance, Explanation Satisfaction, User Trust and Reliance
Nothdurft, Heinroth, and Minker (2013) a: Justification, Transparency, Relevance, Conceptualisation, Learning
Tintarev and Masthoff (2012): Effectiveness, Satisfaction, Transparency, Scrutability, Trust, Persuasiveness, Efficiency

a Note that whereas explanation goals and user effects are usually used interchangeably, Nothdurft et al. (2013) distinguish explanation goals from user effects such as human-computer trust.

We argue that researchers should use the terminology that best fits their objectives and the way that the dependent variable is measured. However, to enable community-wide comparisons, we should recognise how our user effects relate to those measured in other studies. The categories described above are an attempt to facilitate such comparisons.

It is also important to realise that the explanation goals described in this section are often interrelated and not mutually exclusive (Nothdurft et al., 2013). For example, feelings of trust may be influenced by a good understanding, and efficient performance may depend on a positive user experience. Therefore, relations between explanation goals should be taken into account in both research designs and interpretations of study results.

2.1.2 Specifying a user context

A second observation on XAI user evaluations was that they differed considerably with respect to the selection and motivation of the context domain of the system and its explanations. Oftentimes, the context was selected based on the data set available rather than on the proposed societal applications of the method. For example, Hendricks et al. (2016) evaluated explanations for a system that classified images of birds, because of the rich text annotations in the visual data set of North-American birds. Similarly, Ribeiro, Singh, and Guestrin (2016) conducted two experiments on a system that classified whether a document was about Christianity or about atheism and a third experiment with a system that distinguished between images of wolves and huskies. In another study, Kulesza, Burnett, Wong, and Stumpf (2015) had subjects interact with a system that classified whether a message was about hockey or about baseball. In the latter two studies, the context was determined by the freely available data in the popular 20 Newsgroups data set, rather than by a realistic real-life scenario.

Other studies embedded their experimental task within a convenient, easily understandable context story, with less consideration of its relevance and credibility. These studies were usually less dependent on existing data sets and high-performing machine learning (ML) models, but created their own basic rule-based models instead. Examples include a study by Narayanan et al. (2018) in which a supposed intelligent system recommended an alien's food and medication, and research by Nothdurft, Richter, and Minker (2014) in which participants had to complete a party planning task while being advised by a mockup intelligent system. Even though these user studies provided conceptual evidence for the users' responses to the explanations, no conclusions can be drawn about their effectiveness in realistic, real-world applications of these methods.

In contrast, some studies examining the effects of explanations for AI were designed with a clearly specified use case in mind. Krause et al. (2016) developed a model to predict the risk that patients will develop diabetes and worked from a scenario in which clinical researchers needed explanations for these predictions. This context was well-motivated and the demands of both data scientists (subjects of the evaluation) and stakeholders (the clinical researchers) were taken into account. In another study, the effects of explanations of an intelligent system were examined within the context of a European project about balance disorders (Bussone et al., 2015), with the ultimate goal to develop a decision support system for physicians working in this area. An entirely different but also applied context domain formed the background of a study by Stumpf, Skrebe, Aymer, and Hobson (2018). This study evaluated the effects of explanations for smart heating systems, with the objective to improve the systems produced by an energy management software company. The fact that these studies were conducted within specified use cases allowed the researchers to draw well-founded conclusions about the real-world effects of their explanation methods.

Contrary to the notion that testing the effects of explanations requires a well-defined context (Mittal & Paris, 1995), it has been argued that a context scenario based on real life comes with prior domain knowledge and expectations among participants, which may have confounding effects on their performance in the task. Herman (2017) refers to this phenomenon of user expectations influencing the evaluation of explanations as implicit human cognitive bias. She warns of the possibility that, in the development cycle, this cognitive bias may ultimately come at the expense of the faithfulness of the model. Several researchers therefore aimed to minimise this bias in their methods. Narayanan et al. (2018) selected their alien context to avoid any effects of domain knowledge. Likewise, Lim et al. (2009) first modelled the use case of their experiment on a typical application of context-aware decision systems: the prediction of physical activity based on sensory data such as heart rate and body temperature. However, because subjects were influenced by their knowledge about the bodily effects of physical activity, they selected an abstract domain (with letter coding for the different input features) instead.

In our opinion, specifying the user and the user context before evaluating an XAI technique in human subjects is essential. One can develop model-agnostic XAI techniques, but one cannot validate them in a context-agnostic way. Even though the trade-off between accuracy and transparency should be taken into account and it is important to reduce confounding bias in an experiment, we argue that a specified user context introduces a very valuable bias: one that will make the eventual application fit the real world as well as possible.

2.1.3 Designing implicit measures

A third consideration was that participants in current XAI user evaluations were often fully aware of what the researchers aimed to test. In other words, many of these methods were explicit. Methods used in subjective research are naturally explicit: judgements of the effectiveness of explanation goals inevitably require conscious awareness of these goals. However, even the more objective measures applied in XAI user research were usually explicit. Studies focussing on understanding-related explanation goals are often intrinsically more objective, because tests of understanding, knowledge and learning are less prone to the influence of personal judgements. Different researchers investigated the mental models that participants constructed based on explanations. Some studies examined these mental models by having participants provide a description of how the model worked (Kulesza et al., 2013; Lim et al., 2009). Others measured participants' understanding of the system's workings by asking the subjects for the reasons for a specific output (Lim et al., 2009; Ribeiro et al., 2016). Although objective, these research methods were not implicit: subjects were aware of what the researchers wanted to test and they were asked to consciously report how much they understood.

Interestingly, a handful of studies has attempted to evaluate user effects more implicitly. Mainly for the measurement of the understanding-related effects of explanations, a number of interesting and inventive behavioural methods have been used. Lim et al. (2009) assigned participants to different explanation conditions and had them interact with a simple, rule-based system in a learning phase. After this phase, participants were tested on their ability to provide the correct missing system input or output in a 'fill-in-the-blanks' test. In a similar vein, Kulesza et al. (2015) tested participants' understanding of a topic classifier by first presenting them with a small supposed training set for the classifier, either accompanied by explanations or not, and then asking them to predict the output of the classifier for new messages. Ribeiro, Singh, and Guestrin (2018) also asked their subjects to predict a complex model's output for unseen instances, first after browsing through a set of predictions without explanations and then twice more after seeing sets of predictions with two different explanations. What these three studies have in common is the division between a learning phase and a testing phase: participants were first presented with system input and output (Kulesza et al., 2015; Ribeiro et al., 2018) or freely interacted with the system themselves (Lim et al., 2009) and were subsequently tested on their understanding of the logic behind the system.

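To make the logic of such a two-phase design concrete, the sketch below shows how responses from a prediction-based ('fill-in-the-blanks'-style) test of understanding could be scored per explanation condition. It is a minimal Python illustration under our own assumptions: the Trial record, the condition labels and the scoring function are hypothetical and are not taken from any of the cited studies.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Trial:
        participant: str
        condition: str          # e.g. "rule-based", "example-based" or "none"
        predicted_output: str   # the output the participant expects the system to give
        system_output: str      # the output the system actually gave

    def understanding_scores(trials):
        """Proportion of correctly predicted system outputs per explanation condition."""
        correct = defaultdict(int)
        total = defaultdict(int)
        for t in trials:
            total[t.condition] += 1
            if t.predicted_output == t.system_output:
                correct[t.condition] += 1
        return {condition: correct[condition] / total[condition] for condition in total}

    # Fictitious example: one participant, two conditions
    trials = [
        Trial("p01", "rule-based", predicted_output="lower dose", system_output="lower dose"),
        Trial("p01", "example-based", predicted_output="same dose", system_output="lower dose"),
    ]
    print(understanding_scores(trials))  # {'rule-based': 1.0, 'example-based': 0.0}

Higher scores under an explanation condition than under a no-explanation baseline would then serve as an implicit indication of improved understanding, without participants ever being asked how much they understood.
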
Lakkaraju, Bach, and Leskovec (2016) also implicitly tested understanding of an interpretable model for disease diagnosis. They presented two groups of subjects with different interpretable models. Their task was to judge whether different statements about a patient's characteristics and the resulting diagnosis were true or false. The average time spent responding to the statements was recorded as well. The technique of having participants judge the accuracy of output based on a given input was also used by Narayanan et al. (2018). In their study, participants saw a number of input features, a system recommendation and a certain type of explanation and then indicated whether the output was consistent with the input.

Implicit measures were also used to test performance-related goals of explanations. In a study concerning a tool that provides explanations for functionalities of Microsoft Word (Myers et al., 2006), the helpfulness of the explanations was assessed by assigning participants to different explanation conditions and testing whether they could complete different text processing tasks together with the tool. The time spent on these tasks was also measured. Even though this study did not concern explanations for ML, measuring performance in a task while being assisted by different versions of a system proves to be a simple and objective way to investigate whether system explanations are useful.

Finally, an interesting design for the measurement of both performance and trust was adopted by Bussone et al. (2015) in their study on a clinical decision support system for balance disorders. In this study, seven medical experts were divided into two groups, receiving different explanations for the system's suggestions. All participants had to provide a diagnosis for eight fictional patients with balance problems, while being advised by the system. However, in 50% of the cases, the system diagnosis was incorrect. In addition to assessing performance (i.e. the number of correct diagnoses), this measure also captured how often participants followed the system diagnosis; in other words, the persuasiveness of the system and its explanations.
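
The analysis implied by such a design can be summarised in a few lines. The sketch below is our own minimal illustration, not Bussone et al.'s actual analysis, of how performance, reliance and over-reliance could be derived from the recorded trials; the field names are hypothetical.

    def performance_and_reliance(trials):
        """trials: list of dicts with the keys 'participant_diagnosis',
        'system_diagnosis' and 'correct_diagnosis'."""
        n = len(trials)
        # Performance: proportion of correct participant diagnoses
        performance = sum(t['participant_diagnosis'] == t['correct_diagnosis'] for t in trials) / n
        # Reliance/persuasiveness: proportion of trials in which the system's suggestion was followed
        followed = sum(t['participant_diagnosis'] == t['system_diagnosis'] for t in trials) / n
        # Over-reliance: how often the (incorrect) system suggestion was still followed
        wrong = [t for t in trials if t['system_diagnosis'] != t['correct_diagnosis']]
        over_reliance = (sum(t['participant_diagnosis'] == t['system_diagnosis'] for t in wrong) / len(wrong)
                         if wrong else 0.0)
        return performance, followed, over_reliance

Comparing these proportions between explanation conditions would separate appropriate trust (following correct advice) from blind persuasion (following incorrect advice).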

To conclude, even though the area of user studies in XAI is still in an early stage, a few researchers have started to develop behavioural methods to test different user effects implicitly. In the remainder of this chapter, we describe the implicit research method that we developed and the ways in which it can be adapted for future research.

2.2 Research design: a possible approach

In line with our suggestions for the improvement of user evaluations in XAI and motivated by both shortcomings and promising methodological ideas in previous research, we designed an objective measure for the user effects of rule-based and example-based explanations. Here, we describe and motivate our decisions for this design. Although some of the considerations discussed in this section may seem obvious to social scientists, this type of research is not common practice in the area of XAI. Therefore, we also discuss how this method can be used as a framework for future research in XAI. A detailed description of the research method and the exact experimental tasks is given in Chapter 3.

2.2.1 Study design

The starting point for our research design was the use case described in Section 1.3: a scenario in which diabetes mellitus type 1 (DMT1) patients have difficulty managing their blood glucose levels because of the subtle influences of internal and external factors such as tension and environmental temperature. As part of a larger project on AI for personalised health, this specific use case took shape during an interview with a 23-year-old DMT1 patient. She described that even though she had suffered from DMT1 her whole life, problems in managing her blood glucose levels occurred on a daily basis. Her current strategy was to prefer a low insulin dose to a high dose, because an overly high dose can cause hypoglycemia, which results in immediate impaired functioning. To avoid the long-term consequences of the high blood glucose levels caused by this strategy, she indicated that an intelligent application could be very helpful in giving on-demand advice on insulin administration.

Once the use case was defined, the second step was to select which types of explanations would be appropriate to make the insulin advice system more transparent to the patient. The type of explanation was the independent variable in the study. Because DMT1 patients are not ML experts, it was important to find straightforward yet faithful explanation types to accompany the system advice. Unlike complex visualisations or extended decision trees, conditional rules and typical example situations do not require any technical knowledge to understand. A second reason to select rule-based and example-based explanation types is that they both form prototypical categories of XAI techniques (Adadi & Berrada, 2018). Besides selecting suitable explanations for our DMT1 use case, we aimed to design a study relevant and applicable to a large part of the XAI research area. In line with this aim, we opted for these two exemplary XAI techniques.
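
To give an impression of what these two formats could look like in the insulin advice scenario, the sketch below formats a rule-based and an example-based explanation for a single piece of advice. The wording, the feature names (tension, temperature) and the helper functions are our own illustrative assumptions; the exact explanation texts used in the experiments are described in Chapter 3.

    def rule_based_explanation(advice, condition, consequence):
        # Conditional rule: IF <condition> THEN <consequence>
        return (f"The system advises '{advice}' because of the rule: "
                f"IF {condition} THEN {consequence}.")

    def example_based_explanation(advice, example_situation):
        # Typical example situation that resembles the current one
        return (f"The system advises '{advice}' because your current situation resembles "
                f"this typical earlier situation: {example_situation}.")

    print(rule_based_explanation(
        "take 2 units less insulin",
        condition="tension is high and the temperature is above 25 degrees",
        consequence="less insulin is needed"))
    print(example_based_explanation(
        "take 2 units less insulin",
        example_situation="a warm day on which you were tense and 2 units less worked well"))

Both formats convey the same underlying decision logic, but the rule-based version states the general condition explicitly, whereas the example-based version grounds the advice in a concrete, recognisable situation.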
