
MSc Artificial Intelligence

Intelligent Systems Track

Master Thesis

Explaining Individual Classifier Decisions

by

Luisa Zintgraf

10634452

October 2015

42 EC, March - October 2015
Daily Supervisor: Taco Cohen
Examiner: Prof. Dr. Max Welling

Assessor: Dr. Stevan Rudinac

Machine Learning Group University of Amsterdam


Abstract

Classification algorithms can learn to distinguish between different categories by finding patterns in data, even if that data is high-dimensional or very poorly understood by humans. With recent advances in Machine Learning, these algorithms become more powerful and are put into practice in many different areas.

However, we usually lack understanding of how these algorithms make decisions. In this thesis, we investigate how single predictions of classifiers can be explained so that a human can better interpret the result. The explanations are given in the form of a vector that has the same dimension as the input, with each entry reflecting the importance of that feature for the classification outcome. We build upon the work of Robnik-Šikonja and Kononenko (2008), who propose a method based on the idea that the relevance of an input feature can be evaluated by simulating the absence of that feature and observing how the output changes. We use two strategies to improve their method: we find a better way of approximating and simulating the absence of a feature, and we show how the method can be extended from a univariate to a multivariate approach. Both strategies are exemplified for image data. Further, we propose a method to validate the explanations given for individual predictions. Usually, quality assessment of explanation methods is conducted by using artificial data or by involving experts on the data. In addition to explaining classifier decisions, we show how the method can also be used to better understand deep neural networks and to visualise the role of hidden layers in the decision process. In extensive experiments on image data we show that our method behaves better than the one proposed by Robnik-Šikonja and Kononenko (2008), as well as the well-established sensitivity analysis. We illustrate how an analysis of deep convolutional networks can be carried out. Additionally, we show how the method can be put to use in a medical setting in experiments on bacteria data, together with the proposed validation method.


Acknowledgements

I want to thank my two supervisors, Max Welling and Taco Cohen, for their support throughout the last months. I had the best experience working on this thesis, and I owe much of that to you. Max, thank you for giving me the time and freedom I needed to explore this subject. I always felt that you trusted me and would listen to any idea to the end, which encouraged me a lot. Taco, thank you for your time whenever I needed it and for answering all of my questions. Talking to you often helped me focus again after trying to go in several directions at once. To my whole examination committee, including Stevan Rudinac, thank you for agreeing to read my thesis on such short notice.

A special thanks to my father who took the time to proofread my thesis, as well as to Kevin for helping me with the graphics.

Last but not least, I would like to thank my family and friends for their silent support by kindly accepting my physical and sometimes mental absence during this time. It helped me more than you can imagine.


Contents

1 Introduction
   1.1 Motivation
   1.2 Goal
   1.3 Contributions
2 Preliminaries
   2.1 The Classification Problem
   2.2 Support Vector Machines
   2.3 Artificial Neural Networks
       2.3.1 Convolutional Neural Networks
       2.3.2 Spike & Slab Neural Networks
3 Related Work
   3.1 General Feature Relevance Estimation
   3.2 Instance-Specific Feature Relevance Estimation
       3.2.1 Sensitivity Analysis
       3.2.2 Prediction Difference
   3.3 Feature Relevance for Support Vector Machines
   3.4 Neural Network Visualisation
4 Explaining Individual Classifier Decisions
   4.1 Simulating Unknown Feature Values with Conditional Sampling
       4.1.1 Conditional Sampling for Images
   4.2 From a Univariate to a Multivariate Analysis
       4.2.1 Multivariate Analysis for Images
   4.3 Validation of Explanations
5 Experiments
   5.1 MNIST
       5.1.1 Classifier
       5.1.2 Settings
       5.1.3 An Introductory Example
       5.1.4 Explaining Individual Classifier Decisions
   5.2 CIFAR10
       5.2.1 Conditional Pixel Sampling
       5.2.2 Multivariate Analysis
       5.2.3 Explaining Individual Classifier Decisions I
       5.2.4 Explaining Individual Classifier Decisions II
       5.2.5 Visualising Deep Representations
   5.3 Bacteria Abundances in the Human Gut
       5.3.1 Explaining Individual Classifier Decisions
       5.3.2 Validation
6 Conclusion
   6.1 Something to keep in mind
   6.2 Future work
A Additional Experimental Results: CIFAR10
   A.1 Window Size
   A.2 Class Overview
   A.3 Deep Visualisation
List of Figures


Chapter 1

Introduction

Figure 1.1: What do you see?

Without reading further, consider figure 1.1 for a moment. What do you see?

You can probably tell without much difficulty that there is a cat in the image, maybe even that it is a Siamese - although you most likely have never seen this particular cat before. If you thought there was a cat in the image, can you also say why you think so? Where exactly is the cat, and why do you identify it as such? Probably you also have an answer to that question, which might be a mix of the abstract representation of a cat that is in your head and the parts of the image where the typical characteristics of a cat can be seen - e.g., the snout, eyes, ears, and the tail. This is not only an example of how good humans are at visual object recognition tasks, but it also makes another, very important point: we can come up


Figure 1.2: What do you see?

with some justification for the propositions we make. And we can do this irrespective of whether the proposition is correct or not: even if we make a wrong observation, say, that there was a dog in the image, we could point at whatever we think is a dog and explain why we think so.

Next, consider figure 1.2. What do you see? If you are not coincidentally an expert on the microbiota of the human gut, you probably cannot make much sense of the graph. But even if we were to ask somebody familiar with this type of data, they would not necessarily be able to tell us that the shown profile of bacteria abundances belongs to a person with Morbus Crohn. There are some heuristics that can help interpret the data, like the diversity of bacteria, but in general it is not easy to tell whether the shown bacteria abundances are from a healthy or a sick person. When faced with non-visual, high-dimensional data, it can be very difficult for humans to identify patterns - even when it can be visualised in some way, for example in a graph. To determine whether a patient has Morbus Crohn, a doctor will therefore usually do a physical examination, including different tests, to make a diagnosis.

For a computer, there is not such a big difference between what we see in figures 1.1 and 1.2: both are basically just numbers, high-dimensional data. The image of the cat consists of around 550,000 numbers, the bacteria data of around 3,500. So teaching a computer to distinguish between cats and dogs, or healthy and sick patients, is essentially the same task. We call computer programs that can learn to answer the question "what do you see?" classifiers. By showing a classifier enough correctly labelled examples of the different classes (say, cats and dogs), it can learn to distinguish between these by recognising patterns in the data.


Figure 1.3: Scheme of a black box classifier. The classifier takes some input and returns a prediction of what it saw in the input data. Insight into the process of making this prediction is often not possible for the human observer.

How well a computer program can solve this task depends on several factors, for example the dimensionality of the data and the number of examples available to show the algorithm. Recognising a cat in figure 1.1 is far more difficult for a computer than for us, and many thousands of examples are necessary for it to learn this. On the other hand, there are problems where computers can exceed human expertise - like in the bacteria example of figure 1.2. A computer needs fewer than one hundred examples to correctly differentiate sick from healthy patients about 90% of the time.

However, contrary to humans, classifiers usually cannot explain themselves and the propositions they make. For this reason, we often refer to such classifiers as black boxes: we lack the intuition or understanding of what is happening between input and output (see figure 1.3). How exactly the predictions are generated is hard to understand, since most classifiers resemble complex mathematical functions with many fine-tuned parameters. We can only measure how good a classifier really is by testing it on new data it has not seen before and for which we know the correct class labels. The only additional information some classifiers can give us is how certain they are about their decision, for example by predicting a cat with 95% certainty, a 3% chance of it being a dog, and 2% for any other class.

In this thesis, we want to shed light into the black box and understand what a classifier bases its decisions on by generating an explanation for each prediction it makes.


1.1 Motivation

Shed light into the black box - but why? One could argue that classifiers serve their purpose regardless of whether we understand how they do it. And although classifiers have long been used solely for what they were developed for, i.e., classification, the demand for understanding how they make decisions has grown. Especially as classifiers become more powerful, with the ability to solve more complex problems, it becomes more difficult to understand what is going on. The motivation for addressing this subject in the present thesis is two-fold: it is on the one hand addressed to the scientists who develop and train these classifiers, and on the other to those who use them (scientists, end-users, practitioners).

An example of where classifiers are put into practice is the medical domain, where doctors can use them as an additional tool to treat and diagnose their patients. Classifiers can, for example, be used for the detection of diseases such as Alzheimer's or HIV in MRI scans, for cancer detection and in cancer recurrence studies, or in the development of personalised medicine and treatment, just to name a few. Using classifiers can offer great advantages for both doctors and patients, but naturally there is also a little bit of scepticism involved when incorporating computers in healthcare decisions. Therefore, having the classifier not only assign a class to the input, but additionally give an explanation for individual classifications, can offer new and interesting insights into the problem at hand and increase the confidence people have in the decisions made by the classifier. This is especially crucial when the classification outcome may have direct consequences for the treatment of a patient. Computers are supposed to assist us in such settings, and we cannot and do not want to just blindly trust their decisions. Take for example a doctor who wants to determine whether a patient has Morbus Crohn. Since the symptoms are very similar to those of other diseases, it can be difficult to make a diagnosis, and several tests are necessary, like a blood test or a colonoscopy. Adding a classifier to the tools of a doctor in this case can make the process of diagnosing easier, and might spare the patient other expensive or unpleasant tests. It will also make the doctor less reluctant to accept the computer program as a diagnostic tool, and it provides the opportunity to gain more insight into how the data can be interpreted.


Another reason for wanting to understand how classifiers work is that we might be able to improve them by getting insights into how they make decisions. When training a classifier, we usually want it to perform with high accuracy, i.e., get as many predictions right as possible for unseen data. A particular type of classifier is the neural network. These classifiers are very powerful, and in theory one could approximate any mathematical function with arbitrary accuracy by modelling it in a neural network. But what makes them so powerful also makes them hard to train in order to get the high prediction accuracy we strive for. Neural networks are highly nonlinear functions, which successively transform the input through mathematical computations. As computers become more powerful, neural networks become larger and more computations are executed between input and output. There are a lot of parameters that determine these computations, and all of them have to be fine-tuned during the training process. For example, a network developed by Krizhevsky et al. (2012), which can tell that in figure 1.1 we see a Siamese cat (and not one of the other 999 classes it was trained on), has around 60 million parameters. The training process for networks of that size involves a lot of trial and error, because they are so large that we have not yet figured out how to optimally train them. Therefore, developers are very much interested in understanding how exactly they work, and in making sense of what is going on inside the black box. If we can better understand what is happening, we can more likely improve the classifier.

1.2 Goal

We want to understand classifiers, and be able to explain their decisions. But what exactly does this mean, and how can it be formulated? A computer program cannot explain itself in natural language like we can, so we have to formalise what exactly we want. There are in principle two things we would want to ask the classifier, which are questions a human could also answer: where in the input data is the evidence for the prediction, and what does a typical instance of a class (say, a cat) look like? There has recently been a lot of interesting work on the latter question with neural networks, where images are generated that show what a typical cat (or other image class) looks like to the network. In this thesis, however, we will focus on the first question, i.e., where exactly in the input is the relevant information about the assigned class.


Usually the input to a classifier is a multi-dimensional vector, called a feature vector. We will for now assume that the output is just the class which the classifier assigns to the input, for example "cat" or "sick". We can then ask which features from the input vector were most important for the classifier's decision. We will do so by assigning each feature a relevance measure, and store those in a vector of the same size as the input. The values are real numbers, reflecting whether the particular feature contributed towards or against the assigned class, and to what extent. A contribution of zero means that the feature was irrelevant for the decision. In the cat example, we would expect a high relevance measure for the pixels that show the characteristics of the cat, like the snout or tail, and a low or negative contribution for the surroundings, like the road in the background. For images, we can visualise the relevance measures as an image of the same size, using a heat map to highlight the important regions. For non-visual data (like bacteria abundances), we have to come up with different methods for illustrating the result.

1.3 Contributions

There are two main strategies for explaining individual classifier decisions. One is the sensitivity analysis, where the relevance of a feature is given by how sensitive the classifier's prediction is to small changes in the value of that feature. This method is widely used and accepted, but we will argue strongly for a different method, proposed by Robnik-Šikonja and Kononenko (2008). It is based on the idea of evaluating a feature's relevance by looking at how the output of the classifier changes when that feature is unknown. Although less common than the sensitivity analysis, we will see that it has several advantages and can give more sensible explanations for individual decisions.

Our own contributions will build upon this work by proposing to use a more suitable way of simulating unknown feature values. Additionally, we will show how the method can be used in a multivariate analysis (i.e., testing multiple features at once instead of one by one), particularly for the case of image data. In extensive experiments we will show how these improvements lead to better and more precise explanations for individual predictions.


Further, we will propose a validation method to test whether an explanation is correct. So far, the performance of relevance estimation methods has usually been assessed either by testing them on artificial datasets where the correct explanation is known, or by involving experts in the validation process. However, to our knowledge, there exists no analytical way of measuring how accurate an explanation is. Our validation method will not only assess how good an explanation is, but will also provide further insight into how the decision was made. A practical example of how to use the validation method is given for the bacteria abundance data in the experiments section.

Lastly, we will illustrate how the method of Robnik-Šikonja and Kononenko (2008) can be used for understanding and visualising deep neural networks. We will exemplify this in an experiment and point to future work that can be done in this direction.


Chapter 2

Preliminaries

In this chapter, we will introduce the basic mathematical and conceptual foundations used throughout the thesis, although we will assume that the reader is familiar with basic linear algebra and probability theory. If not stated otherwise, the definitions in this chapter are in their essence taken from Bishop (2006), which the reader can refer to for a thorough introduction to the subject.

2.1 The Classification Problem

The general classification problem can be formulated as follows: given a dataset comprised of vectors and corresponding labels, find a function or program that can correctly map the vectors to their labels. I.e., fit a classifier to the given data that maps D-dimensional feature vectors x to one of K distinct classes c_k, k ∈ {1, . . . , K}. The data is divided into a training and a test set. The classifier then learns from the training data, i.e., feature vectors x_1, . . . , x_N and the true class labels t_1, . . . , t_N. We will assume that these labels are known for all N training instances, which is called supervised learning: the classifier learns from labelled data.

A classifier can generally be defined as a function f mapping feature vectors x to outputs t ∈ {c_1, . . . , c_K}:

    f : x \mapsto t .   (2.1)

When facing a classification problem, we first have to choose an appropriate classifier for the data, i.e., the form of the function f . Then, in the training process, the parameters of f are tuned so that it optimally maps input vectors to class labels. The performance of a classifier can be measured in different ways. The most

common way is to measure its prediction accuracy on a test set of data the classifier has not seen before,

    \text{prediction accuracy} = \frac{\text{number of correctly classified test instances}}{\text{total number of test instances}} .   (2.2)

The above defines a non-probabilistic classifier. However, for our analysis we will be using probabilistic classifiers. These map the feature vector to a target vector t ∈ [0, 1]^K where \sum_{k=1}^{K} t_k = 1. Each entry in t represents the confidence of the classifier in the class that corresponds to the index of that entry,

    f : x \mapsto t = \begin{pmatrix} p(c_1 | x) \\ \vdots \\ p(c_K | x) \end{pmatrix} .   (2.3)

The assigned class is the one with the highest class probability. If the classifier cannot output probabilities by nature, there are calibration methods to transform the results into probabilities and we will resort to such methods when necessary.

In our experiments (chapter 5), we will use support vector machines and different types of neural networks. Therefore, we will briefly explain what those classifiers are doing. However, we will not discuss how they can be trained, since for our purposes we can assume that we are given an already-trained classifier.
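To make these definitions concrete, the following minimal sketch (in Python with numpy, not code from this thesis) turns the probabilistic output of equation (2.3) into a class prediction by taking the argmax, and computes the prediction accuracy of equation (2.2) on a toy test set. The classifier predict_proba is a hypothetical stand-in for any trained probabilistic model.

import numpy as np

def predict_proba(X):
    # Hypothetical probabilistic classifier: one row of class probabilities per
    # input vector (a fixed softmax over random projections, purely illustrative).
    rng = np.random.RandomState(0)
    W = rng.randn(X.shape[1], 3)                 # 3 classes
    scores = X @ W
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)      # rows sum to 1, cf. eq. (2.3)

def accuracy(X_test, t_test):
    # Assigned class = index of the highest class probability.
    predictions = predict_proba(X_test).argmax(axis=1)
    # Prediction accuracy as in equation (2.2).
    return np.mean(predictions == t_test)

X_test = np.random.randn(100, 5)
t_test = np.random.randint(0, 3, size=100)
print(accuracy(X_test, t_test))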

2.2 Support Vector Machines

A support vector machine (SVM) is a non-probabilistic classifier, which is designed for two-class classification problems. A linear SVM defines a (D − 1)-dimensional hyperplane that optimally separates the two classes in the sense that the margin (the smallest distance) between the hyperplane (also: decision boundary) and the closest sample is maximal. Figure 2.1 illustrates this for a linear SVM and two-dimensional data. Formally, the SVM defines a linear function

    y(x) = w^\top x + b ,   (2.4)

where the hyperplane is given by y = 0. For classification, the sign of y(x) determines the assigned class:

    f : x \mapsto \begin{cases} 0 & \text{if } y(x) < 0 \\ 1 & \text{if } y(x) > 0 \end{cases} .   (2.5)


Figure 2.1: Illustration of a support vector machine for two-dimensional data. The hyperplane (in red) is one-dimensional (a line) and separates the classes so that the distance between this line and the closest data point is largest (referred to as the margin).

In the training process, the parameters w (the weights) and b (the bias) are optimised so that the margin (as described above) is maximal.

However, the data might not be linearly separable, i.e., the classes might overlap, and thus a hyperplane as described above cannot be found. There are different strategies to handle this, see Bishop (2006).

Since the support vector machine is a non-probabilistic classifier, the class scores have to be post-calibrated to produce probabilities. The parameters for this are usually learned during training. An example is the method of Platt et al. (1999), which we have used in our experiments.
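As an illustration of how such a post-calibration can look, here is a small sketch of a linear SVM class score, eqs. (2.4)-(2.5), combined with a Platt-style sigmoid. The weight vector w, bias b, and the calibration parameters A and B are made-up placeholder values that would normally come from training; this is not the implementation used in the thesis.

import numpy as np

# Parameters assumed to come from a trained linear SVM (hypothetical values).
w = np.array([0.8, -1.2, 0.5])   # weight vector
b = -0.1                         # bias

# Platt calibration parameters, normally fitted on held-out data.
A, B = -1.5, 0.0

def svm_score(x):
    # Class score y(x) = w^T x + b, eq. (2.4).
    return w @ x + b

def svm_predict(x):
    # Non-probabilistic prediction via the sign of the score, eq. (2.5).
    return 1 if svm_score(x) > 0 else 0

def svm_predict_proba(x):
    # Platt scaling: squash the raw score through a fitted sigmoid to obtain
    # p(class 1 | x); p(class 0 | x) is its complement.
    p1 = 1.0 / (1.0 + np.exp(A * svm_score(x) + B))
    return np.array([1.0 - p1, p1])

x = np.array([0.3, -0.7, 1.1])
print(svm_predict(x), svm_predict_proba(x))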

2.3 Artificial Neural Networks

An artificial neural network is a probabilistic classifier that can be used for problems with any number of classes. It defines a nonlinear function f which can in theory approximate any function with arbitrary accuracy. We will explain the basic functionality with the help of the neural network shown in figure 2.2. It consists of an input layer, one hidden layer and an output layer, through which the input is consecutively transformed by mathematical operations. In the first step, a linear combination of the input features is computed for each of the M hidden units,

    a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} , \quad (1 \le j \le M) .   (2.6)


Figure 2.2: The general architecture of a neural network.

The superscript indicates that we are in the first layer, and a weight w_{ji} connects input node i with hidden node j. The parameter w_{j0} is the bias and connects an artificially inserted node x_0 with value 1 to hidden node j. We call the a_j the activations, which are then transformed further by a nonlinear activation function h(·),

    z_j = h(a_j) .   (2.7)

The z_j are called hidden units. In figure 2.2, the input features and hidden units are visualised as nodes of the network, and the weights w_{ji} connect them pairwise. Since the weights w_j are different for each hidden unit, the values z_j differ as well and extract different information from the input. The hidden layer consists of M hidden units, which are in the next step mapped to the output layer by another linear combination,

    a_k = \sum_{j=1}^{M} w^{(2)}_{kj} z_j + w^{(2)}_{k0} , \quad (1 \le k \le K) ,   (2.8)

where the a_k are the output unit activations. These are again transformed to get the probabilistic output vector y, usually with a logistic sigmoid function \sigma(a_k) = 1/(1 + \exp(-a_k)). So in total, we can summarise the network by

    y_k = \sigma\!\left( \sum_{j=1}^{M} w^{(2)}_{kj} \, h\!\left( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \right) + w^{(2)}_{k0} \right) .   (2.9)
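The following sketch (an illustrative numpy implementation, not taken from the thesis) computes the forward pass of the two-layer network of figure 2.2, i.e., equations (2.6)-(2.9). The random weights and the choice of tanh as the hidden activation function h are assumptions made purely for the example.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2, h=np.tanh):
    # First layer: activations a_j = sum_i w_ji x_i + w_j0, eq. (2.6),
    # followed by the nonlinear activation function h, eq. (2.7).
    a = W1 @ x + b1
    z = h(a)
    # Output layer: a_k = sum_j w_kj z_j + w_k0, eq. (2.8), squashed by the
    # logistic sigmoid as in eq. (2.9).
    return sigmoid(W2 @ z + b2)

# Toy dimensions: D = 4 input features, M = 6 hidden units, K = 3 outputs.
rng = np.random.RandomState(0)
D, M, K = 4, 6, 3
W1, b1 = rng.randn(M, D), rng.randn(M)
W2, b2 = rng.randn(K, M), rng.randn(K)
y = forward(rng.randn(D), W1, b1, W2, b2)
print(y)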


Figure 2.3: Illustration of the two main mechanisms of a convolutional neural network: convolutions with weight sharing and pooling.

In figure 2.2 we have shown a two-layer, fully connected network, which is the most basic architecture for neural networks. Depending on the problem, training networks with different architectures can greatly improve performance. Deeper networks (with more than one hidden layer) are much more powerful, but it is also much harder to fine-tune the growing number of parameters. With different architectures, we can introduce some sparseness into the network or enforce characteristics the network should fulfil in order to solve a problem more efficiently. We will use two types of neural networks in our experiments that both introduce some sparseness into the network through different mechanisms. One is an example of a special-architecture network: the convolutional neural network. Such networks are designed especially for visual object recognition tasks. In contrast, spike and slab neural networks are fully connected, but effectively become sparse through regularisation during training. Both network architectures will be briefly discussed here.

2.3.1 Convolutional Neural Networks

Convolutional neural networks (convnets) (LeCun et al., 1998) are useful for image recognition tasks and have shown superior performance in this domain compared to regular neural networks. They have fewer parameters than fully connected networks, and are built so that they satisfy desirable properties for image classification. A fully connected network could eventually learn the same function, but it would be much harder to train such a network.


Convnets possess certain invariance properties that are desirable for visual object recognition tasks, achieved through the specific network architecture. An example is translation invariance: the location of a cat in the image should not have any effect on the classification score. Further, the classifier should also, at least to some degree, be invariant to scaling, rotation and small deformations, all of which can be enforced by a convnet.

A convolutional layer has a different structure than a regular hidden layer of a neural network. It consists of feature maps, which in turn consist of hidden units. Each such unit only takes as input a subregion of the input image (or of the feature map in the previous layer), and all units in a feature map process their input patch in the same way. This is referred to as weight sharing: each unit's value is calculated as in equations (2.6) and (2.7), but all units in a feature map share the same weight vector w. For example, a feature map could consist of units that detect edges in the original image. Each unit in that feature map would then look for edges in the subregion of the input image it looks at. A convolutional layer consists of several feature maps, each of which looks for different things in the input image. Another important mechanism in convnets is pooling, which happens in a sub-sampling layer that comes right after the convolutional layer (see figure 2.3). For each feature map in the convolutional layer, one (smaller) map exists in the sub-sampling layer. The units in that layer take patches from the feature map and sub-sample their values, for example by averaging or taking the maximum.
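A heavily simplified sketch of these two mechanisms, assuming a single grey-scale image and a single hand-picked 3×3 filter (real convnets learn many filters across many feature maps): every unit of the feature map applies the same shared weights to its input patch, and the pooling step then takes the maximum over small patches of that feature map.

import numpy as np

def conv2d(image, kernel):
    # Valid convolution with one shared kernel: every unit in the resulting
    # feature map applies the same weights to its own input patch.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Sub-sampling: each unit takes the maximum over a small patch.
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size
    fm = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    return fm.max(axis=(1, 3))

image = np.random.rand(8, 8)
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])   # crude vertical-edge detector
print(max_pool(conv2d(image, edge_filter)).shape)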

2.3.2 Spike & Slab Neural Networks

Neural networks, and especially deep networks, tend to overfit easily when there are few training samples and, relative to that, a large number of features. I.e., they do not generalise well since they model noise that is present in the training data. In our experiments (chapter 5), we will use a spike and slab neural network (SSNN), developed by Louizos (2015). These types of networks impose regularisation on the network parameters by learning which of them can be removed for better performance. To this end, a spike and slab distribution is learned over the input features and the activations of the hidden layers. In principle, this distribution tells the network which parameters to use and which to omit. Instead of actually removing nodes from the network, the input features and the activations of the hidden layers are multiplied with scalar values given (deterministically) by the learned distribution, which reflect the parameter's importance for classification. This will lead to


some input features being close to zero and thus more or less ignored, with the classifier focusing on the other features. The same happens with the hidden activations, which makes the network sparse. For a detailed exemplification and other variants of spike and slab neural networks, see Louizos (2015).


Chapter 3

Related Work

When trying to understand how classifiers make predictions, we want to find the features that are important for a decision. There are various methods for relevance estimation, which come with different definitions of feature importance. We can differentiate between general and instance-specific methods:

• General relevance estimation assesses the features according to what the classifier focuses on in every decision. Therefore, this method returns only one relevance vector, which depicts the importance of the features across all decisions.

• Instance-specific relevance estimation assesses the features according to what the classifier focuses on in a particular decision. So for each instance, there is a relevance vector describing the influence of individual input features on the prediction. These vectors can vary a lot between different instances, especially when there is high intra-class heterogeneity.

In this thesis we will focus on the latter, but to draw a line between the two cases, we will first very briefly look at two common examples of general methods in section 3.1: feature selection and feature ranking methods. Then, in section 3.2, we will focus on instance-specific methods by first introducing the widely used sensitivity analysis in 3.2.1. The starting point for our own contributions will be the work of Robnik-Šikonja and Kononenko (2008), which we will present in section 3.2.2. Finally, in sections 3.3 and 3.4, we will present work that has been done for specific classifiers, the support


vector machine and neural networks since they are two popular classifiers and we are going to use both in our experiments.

3.1 General Feature Relevance Estimation

General relevance estimation methods find the features that are important across all decisions. Such an analysis is interesting for several reasons. Knowing which features are important in a general sense will tell us what the classifier looks at to get the information that is sufficient to make a decision. If we think back to the bacteria data (figure 1.2), a general method would be able to tell us which bacteria typically have an influence on the health of a person. Especially when working with high-dimensional data, being able to filter out the important versus the unimportant features can make the problem clearer and more comprehensible to humans. Another benefit of a general method is that training the classifier without the irrelevant features can boost its performance, since some noise is removed.

Both feature selection and feature ranking techniques are widely used and well-known general relevance estimation methods. Thus we want to make a clear distinction between those and the instance-specific methods we will deal with, so as not to confuse the different types of approaches.

Feature Selection

Feature selection techniques aim to find the optimal subset of features - optimal in the sense that the prediction accuracy is high when the classifier is trained only on this subset. The selected features have to be expressive enough to separate the classes and should not include any unnecessary information that clutters the problem. In the bacteria example, it could be possible that we only need to look at certain bacteria that are informative enough to discriminate between sick and healthy. In the cat example, however, and for image classification in general, we should probably not train the classifier on just a part of all images, because the objects of interest can be anywhere. Note that feature selection techniques also tend to remove redundant features, not only irrelevant ones. This might be beneficial for the training process, but we have to be careful not to be fooled into thinking that all removed features are unimportant for the problem.


Feature Ranking

While feature selection only divides the set of features into relevant and irrelevant, feature ranking methods specify how important the features are relative to each other, i.e., they assign a score to each input feature dimension. As Kohavi and John (1997) have shown, general importance and optimality in the sense described above do not necessarily coincide. A general feature ranking method produces a vector that reflects each feature's relevance for the classifier, irrespective of the actual value of a feature. This means the classifier, in contrast to feature selection, is trained on the complete input space, and afterwards an importance measure is determined for each feature.

Feature Selection with Spike and Slab Neural Networks

The spike and slab neural network discussed in 2.3.2 uses regularisation (i.e., it penalises overly large parameter values) on the input and weight parameters of the network. It can be thought of as a mix of a feature ranking and a feature selection method, since the input features are assigned weights that reflect their importance for classification, but they are also multiplied with these weights to effectively reduce the number of input features so as to optimise performance.

3.2 Instance-Specific Feature Relevance Estimation

What we are interested in is giving an explanation for a single prediction by attributing a relevance measure to the input features. There are two main approaches to this: measuring how sensitive the classifier is to the exact value of a feature, and how the classifier changes its prediction when a feature value is unknown. We will present both methods here.

3.2.1 Sensitivity Analysis

Sensitivity analysis has been used for general as well as instance-specific relevance estimation. It tests how sensitive a classifier is to a particular input feature by looking at how much the output changes when the feature


value changes a little bit. The sensitivity of a feature i is given by the partial derivative of the probabilistic output with respect to that feature,

    s_i = \frac{\partial p(c_k | x)}{\partial x_i} ,   (3.1)

evaluated at the specific feature value x_i (Baehrens et al., 2010). Here, x is the input vector containing features x_i, and c_k is one of the K possible classes (although we will be most interested in the influence of the input on the predicted class). Simonyan et al. (2013) use sensitivity maps in connection with deep convolutional neural networks, which we will come back to in section 3.4.

There are two main drawbacks, however, to this method. It is not applicable to data where only extreme value changes lead to a change in the output (for example when the feature values are discrete). In addition, this method is not suitable for some classifiers. In section 3.3, we will show for example that the sensitivity analysis for a linear support vector machine can give misleading results.

An advantage of the sensitivity analysis is that it is usually relatively fast to compute. For many classifiers, the partial derivatives as shown above can be formulated analytically and thus evaluated quickly. For neural networks, a single backward pass through the network is sufficient to compute the derivative in (3.1). However, if neither case applies, one has to go to greater effort to numerically approximate the sensitivity map.
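If neither an analytical derivative nor a backward pass is available, the sensitivity map of equation (3.1) can be approximated numerically. The sketch below does this with central finite differences for an arbitrary probabilistic classifier; predict_proba is a hypothetical toy classifier included only to make the example runnable.

import numpy as np

def sensitivity_map(predict_proba, x, k, eps=1e-4):
    # Approximate s_i = d p(c_k | x) / d x_i, eq. (3.1), with central finite
    # differences; one entry per input feature.
    s = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += eps
        x_minus[i] -= eps
        s[i] = (predict_proba(x_plus)[k] - predict_proba(x_minus)[k]) / (2 * eps)
    return s

# Toy probabilistic classifier (softmax over a fixed linear map), 3 features, 2 classes.
W = np.array([[1.0, -0.5, 0.2],
              [-0.3, 0.8, -0.1]])

def predict_proba(x):
    scores = W @ x
    e = np.exp(scores - scores.max())
    return e / e.sum()

x = np.array([0.5, 1.0, -0.2])
print(sensitivity_map(predict_proba, x, k=0))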

3.2.2 Prediction Difference

Robnik-Šikonja and Kononenko (2008) propose a method for instance-based relevance estimation which is based on the idea that, in order to measure how important a feature value is, we can look at how the prediction changes if this feature is unknown, i.e., the difference between p(c_k | x) and p(c_k | x\i) for a feature x_i. Here, x\i is the set of all features except feature x_i. If there is a large difference, the feature must be important, and if there is little to no difference, the particular feature value has not contributed much to the assigned class. This method is in principle independent of the classifier, but it requires probabilistic outputs.

To evaluate the class probability p(c_k | x\i) where feature x_i is unknown, the authors propose two different strategies. The first one is to literally set the value to "unknown", but only few classifiers allow this. Therefore, the authors propose a


way to simulate the absence of a feature. Since a feature value can be marginalised out like this,

    p(c_k | x\i) = \sum_{x_i} p(x_i | x\i) \, p(c_k | x\i, x_i) ,   (3.2)

they propose to approximate this by

    p(c_k | x\i) \approx \sum_{x_i} p(x_i) \, p(c_k | x\i, x_i) .   (3.3)

I.e., replace the feature value with all possible values it can take, and weigh each by the prior probability of that value, under the assumption that feature i is independent of the other features, x_i ⊥ x\i.

Once the class probability p(c_k | x\i) is estimated, it can be compared to p(c_k | x). The authors propose three different ways of evaluating the prediction difference:

• Information Difference,

    \text{infDiff}_i(c_k | x) = \log_2 p(c_k | x) - \log_2 p(c_k | x\i) ,   (3.4)

• Weight of Evidence,

    \text{WE}_i(c_k | x) = \log_2\!\left( \frac{p(c_k | x)}{1 - p(c_k | x)} \right) - \log_2\!\left( \frac{p(c_k | x\i)}{1 - p(c_k | x\i)} \right) ,   (3.5)

• Difference of Probabilities,

    \text{probDiff}_i(c_k | x) = p(c_k | x) - p(c_k | x\i) .   (3.6)

In their experiments all three measures perform similarly, but they recommend using the weight of evidence as a default choice, since it works better in a border case, namely when the attributes are conditionally independent given the class. (We have also observed in our experiments that its results are more visually appealing than those of the other measures, which is why we stick to their recommendation.) In order to avoid problems with zero-valued probabilities in equation (3.5), they use the Laplace correction p ← (pN + 1)/(N + K), where N is the number of training instances and K is the number of classes.
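A small illustrative implementation (not the authors' code) of the three measures in equations (3.4)-(3.6), given the two probabilities p(c_k | x) and p(c_k | x\i); the Laplace correction is applied before taking logarithms, and the numbers in the example call are made up.

import numpy as np

def laplace_correct(p, N, K):
    # Laplace correction p <- (p*N + 1) / (N + K) to keep probabilities away from 0 and 1.
    return (p * N + 1.0) / (N + K)

def prediction_difference(p_with, p_without, N, K):
    # p_with = p(c_k | x), p_without = p(c_k | x \ i).
    p1 = laplace_correct(p_with, N, K)
    p2 = laplace_correct(p_without, N, K)
    info_diff = np.log2(p1) - np.log2(p2)                        # eq. (3.4)
    weight_of_evidence = (np.log2(p1 / (1 - p1))
                          - np.log2(p2 / (1 - p2)))              # eq. (3.5)
    prob_diff = p_with - p_without                               # eq. (3.6)
    return info_diff, weight_of_evidence, prob_diff

# Example: the class probability drops from 0.9 to 0.6 when feature i is
# marginalised out, for a problem with N = 1000 training points and K = 2 classes.
print(prediction_difference(0.9, 0.6, N=1000, K=2))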

Several follow-up papers by Štrumbelj and Kononenko (2008) use a different approach for estimating p(c_k | x\i): retraining the classifier with that feature left out for all training instances. For us, this will be too time-consuming, as we will


be working with high-dimensional data. Also, this might not even reflect the true relevance of a feature: if an important feature is removed for all instances, the classifier will shift its focus to other features, and it is not clear how the prediction for the specific instance would change. By chance it could be that the class of interest is actually now more easily distinguishable from the other classes.

Obviously, the method of Robnik-Šikonja and Kononenko (2008) is a univariate approach: only one feature at a time is tested. As they state themselves, this is the biggest drawback of their method, since it could be necessary to change more than one feature value at a time to have an effect on the prediction. Think of the cat example: if only one pixel of the nose is changed from black to white, you would still be able to identify the cat with the same certainty as before. In a follow-up paper, Štrumbelj and Kononenko (2008) extend this approach in a multivariate way, testing every element of the power set of features. Their approach is a little different in that they look at the difference between the prior class probability p(c_k | ∅) and the probability when a group of features is known, p(c_k | {x_i}_i). But since this approach has exponential time complexity and is not useful when there are more than a handful of features, we cannot use it with high-dimensional data. Even an approximation method proposed by them in (Štrumbelj and Kononenko, 2010) is not feasible "for several hundred features or more". However, we will try to find a multivariate approach that lies somewhere between the univariate method and a full analysis of the power set in section 4.2.

3.3 Feature Relevance for Support Vector Machines

Support vector machines (section 2.2) are popular classification algorithms, since they are easy to implement and train (most scientific computing software packages have built-in SVM functions that can be used straight away). Since they are so widely used, numerous methods have been proposed for feature relevance estimation for SVMs, most of which are general ones. One of the proposed methods is to take the absolute weight vector of a linear SVM to describe feature importance. However, this is wrong and can be very misleading. Still, the method has been used in several scientific publications, even in the medical domain. Arguments against this practice are given by Gaonkar and Davatzikos (2013) and Haufe et al. (2014)


who show that the magnitude of the weight vector in fact does not reflect general feature relevance.

If we look at how the class score is computed with a linear SVM, y(x) = w^\top x + b,

we can immediately see that the sensitivity map would just be the weight vector. Also when the class score is transformed to output probabilities, the sensitivity map would just be a multiple of the weight vector. The relative relevance values between the features would not change. From this we can conclude that for the linear support vector machine a sensitivity analysis is inherently general, and might actually be misleading.

3.4 Neural Network Visualisation

What makes neural networks so powerful is at the same time what makes them hard to train and difficult to understand. They represent very complex and highly nonlinear mathematical functions with many free parameters that have to be tuned during training. Therefore, understanding what exactly goes on in the intermediate layers might help improve these networks. There has been a lot of work in recent years on the visualisation of convolutional neural networks, and some impressive results have emerged from it. To our knowledge, no such work has been done for networks trained on data other than images, but the techniques can easily be adapted.

Convolutional Neural Networks

Again, we want to make the distinction between general and instance-specific methods so we can then tie in with our method.

A successful general approach some authors have taken follows an idea proposed by Erhan et al. (2009): given a node of interest in the network, generate an input image that maximises the activation of that node. This can be a hidden or an output unit, and the resulting image gives a sense of what excites this unit the most, i.e., what it is looking for. For example, we would expect that the output unit of the class cat will be maximal for an input image that looks like the most typical cat. See Simonyan et al. (2013) and Yosinski et al. (2015) for some intriguing results.

A similar but instance-specific method is proposed by Mahendran and Vedaldi (2014) in the context of convolutional neural networks. Instead of trying to find


which input features have the largest effect on individual units, they take a deep image representation in a hidden layer and try to reconstruct the image from only this representation. Instead of looking at individual nodes, they are interested in what information is contained in whole feature maps of the network. This gives us a sense of what information from the input is retained in the feature map. Although this is an instance-specific method, it is in principle different from what we are doing: instead of trying to find the important features in the original input, it produces input images that represent the information that is present in different parts of the network structure. Therefore it can also not be used to explain the output, i.e., the decision of the classifier.

Instance-specific methods for neural networks that try to localise input features/regions that are important for the activation of nodes in the network typically do a sensitivity analysis as described in section 3.2.1. Equation (3.1) is not restricted to the probabilistic output, but can be adapted for use in connection with any node in a neural network. Simonyan et al. (2013) propose image-specific class saliency visualisation to rank the input features based on their influence on the assigned class. To this end, they compute the partial derivative of the class score S_{c_k} with respect to the input features x_i,

    \frac{\partial S_{c_k}}{\partial x_i}   (3.7)

to estimate each feature's relevance. The class scores S_{c_k} are given by the nodes in the fully connected layer which comes right before the output layer with the class probabilities. In figure 3.1, we show sensitivity maps for some images from the ImageNet database (Russakovsky et al., 2015). We have used a deep convolutional neural network taken from the caffe model zoo, which is a replication of the model of Szegedy et al. (2014). Backpropagation is carried out by the caffe framework (Jia et al., 2014).

Similar work was done by Zeiler and Fergus (2014), who use deconvolutional networks to project feature activations back to input space. But as Simonyan et al. (2013) show, this can be interpreted as a sensitivity analysis of the network's input/output relation, so we will not have to go into the details of this method. Zeiler and Fergus (2014) additionally use a different strategy to find the important input regions: occluding portions of the input image with a grey patch and visualising p(c | x\i) directly. This is somewhat similar to what Robnik-Šikonja and Kononenko


Figure 3.1: Sensitivity maps for images from the ImageNet database.

(2008) do, by making input features unknown by occluding them. And while occlusion removes structure from the image, it might also add information - a grey patch looks like something to the classifier. So it is not clear whether the class probability goes down only because information is occluded, or also because the probability of a class that commonly has grey areas in its images rises. While for humans occluding parts of the image with a grey patch is like taking that information away, for classifiers this is a little different: the information is changed, but not completely taken away.


Chapter 4

Explaining Individual Classifier Decisions

In section 3.2.2 we presented the method of evaluating the prediction difference by Robnik-Šikonja and Kononenko (2008), which will be the starting point for our contributions. Recall that their method was based on the idea that we can measure a feature's importance by observing what happens when that feature is unknown, i.e., evaluating the prediction difference between p(c_k | x) and p(c_k | x\i) for one of the possible classes c_k. To estimate p(c_k | x\i), Robnik-Šikonja and Kononenko (2008) proposed the following approximation:

    p(c_k | x\i) = \sum_{x_i} p(x_i | x\i) \, p(c_k | x\i, x_i)   (4.1)
                 \approx \sum_{x_i} p(x_i) \, p(c_k | x\i, x_i) .   (4.2)

For real-life problems, p(x_i) is usually unknown. Therefore, we would estimate (4.2) by replacing feature value x_i with all the values we have seen for that feature in the training data, and then taking the average,

    p(c_k | x\i) \approx \frac{1}{|X_{train}|} \sum_{x_i \in X_{train}} p(c_k | x\i, x_i) .   (4.3)
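Read as code, the approximation (4.3) amounts to substituting every training value of feature i and averaging the classifier outputs. The sketch below does exactly that and builds a univariate relevance vector from the resulting differences of probabilities; predict_proba, the training matrix and the test point are toy stand-ins, not part of the thesis implementation.

import numpy as np

def marginal_prediction(predict_proba, x, i, X_train):
    # Approximate p(c_k | x \ i) by replacing x_i with every value seen for
    # feature i in the training data and averaging the outputs, eq. (4.3).
    probs = []
    for value in X_train[:, i]:
        x_mod = x.copy()
        x_mod[i] = value
        probs.append(predict_proba(x_mod))
    return np.mean(probs, axis=0)

def relevance_vector(predict_proba, x, X_train, k):
    # Univariate relevance: difference of probabilities, one entry per feature.
    p_full = predict_proba(x)[k]
    return np.array([p_full - marginal_prediction(predict_proba, x, i, X_train)[k]
                     for i in range(x.size)])

# Toy usage with a linear softmax classifier and random data.
rng = np.random.RandomState(0)
W = rng.randn(3, 4)
predict_proba = lambda v: np.exp(W @ v) / np.exp(W @ v).sum()
X_train = rng.randn(50, 4)
x = rng.randn(4)
print(relevance_vector(predict_proba, x, X_train, k=0))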

In the next section, we will argue why this approximation might not be suitable for some problems, and propose to approximate p(x_i | x\i) differently for these cases.

In section 4.2, we will discuss how we can go from a univariate to a multivariate approach, even when the data is high-dimensional. For both these improvements we will illustrate how to apply them to image data. Then, in section 4.3, we will propose a method to validate instance-based relevance estimation methods when a visual analysis is not possible and our understanding of the data is limited. Finally, we will show how the method of Robnik-Šikonja and Kononenko (2008) can be used with deep neural networks.

4.1 Simulating Unknown Feature Values with Conditional Sampling

When sampling from the marginal distribution p(x_i) in equation (4.2) and completely ignoring the other features x\i, we are making an approximation to truly marginalising that feature out. Instead of evaluating what the class probability is when the feature is completely unknown, we are looking at what happens on average when that feature can take any of its possible values. For some problems, this is a valid approximation: for example, when the features are not, or at least not strongly, dependent on each other. But obviously this does not always hold. If we evaluate p(c_k | x\i, x_i) for feature vectors that are not even possible in the problem domain, this will likely obscure the result. Therefore, we propose to approximate p(x_i | x\i) when necessary, making use of the known properties of the data by conditioning at least on some other features. Fitting a distribution over all feature values is often not feasible with complex and high-dimensional data, so we can instead condition feature x_i on some subset {x_j}_j ⊂ x\i for which we can assume that the features depend on each other. To make this idea a bit clearer, we will discuss it for the case of image data. For other kinds of data, we might have to come up with individually fitted strategies.

4.1.1 Conditional Sampling for Images

In natural images, the value of a pixel does not depend so much on its location, but much more on the pixels around it. The probability of a red pixel suddenly appearing in a clear-blue sky is rather low. So p(x_i) is not a very accurate approximation of p(x_i | x\i). We can get a much better approximation based on two main assumptions we can make about image data:

• A pixel’s value depends mostly on the pixels in some neighbourhood around it, and not so much on the pixels far away.


• A pixel's value does not depend on its location in the image (meaning its coordinates, not its relative location to other parts of the image).

Under these assumptions we will fit a probability distribution over patches of pixels. Let these patches be of size k × k × 3 (assuming RGB images; for grey-scale images the third dimension can be dropped). Using the given data, we can fit a probability distribution P({x_j}_{j=1}^{3k^2}) of our choice over these patches. Then, if we want to sample a feature value x_i, we can condition it on the pixels in a small neighbourhood around that feature, using the distribution P({x_j}_{j=1}^{3k^2}). A multivariate Gaussian distribution is probably the most straightforward choice for this distribution. Since we will be using it in some of our experiments (section 5.2), we briefly outline how we can fit this kind of distribution to the data, and how we can sample from it.

Example: Multivariate Normal Distribution

A multivariate Gaussian over some patch of pixels {x_j}_{j=1}^{3k^2} is given by

    P({x_j}_{j=1}^{3k^2}) \sim \mathcal{N}({x_j}_{j=1}^{3k^2} \mid \mu, \Sigma) ,   (4.4)

where \mu is the mean vector and \Sigma is the covariance matrix. To estimate these parameters, we take T samples of k × k × 3 patches from the training data and collect them in a matrix M of size (3k^2) × T. The mean over the T samples gives \mu, and their covariance matrix gives \Sigma.

When the parameters of the distribution are determined, we can sample pixel values given some neighbourhood around them. Suppose we are given a pixel x_i and some patch of 3k^2 - 1 pixels around it, written into a vector x_j. Without loss of generality, we can assume that they are sorted in a one-dimensional vector like so:

    \begin{pmatrix} x_i \\ x_j \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}   (4.5)

and that the mean vector \mu and covariance matrix \Sigma are partitioned accordingly, i.e.,

    \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}   (4.6)

and

    \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} .   (4.7)

Then the conditional distribution of x_1 given x_2 is again a multivariate normal distribution,

    p(x_1 | x_2) \sim \mathcal{N}(x_1 \mid \hat{\mu}, \hat{\Sigma})   (4.8)

with

    \hat{\mu} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)   (4.9)

and

    \hat{\Sigma} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} .   (4.10)

For a derivation see Bishop (2006).
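The sketch below (illustrative only, using random toy "images" in place of real training data and grey-scale patches with k = 3) fits the Gaussian of equation (4.4) to extracted patches and then samples the centre pixel of a patch conditioned on its surrounding pixels via equations (4.9) and (4.10).

import numpy as np

def extract_patches(images, k):
    # Collect all k x k patches (flattened) from a set of grey-scale images.
    patches = []
    for img in images:
        H, W = img.shape
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                patches.append(img[i:i + k, j:j + k].ravel())
    return np.array(patches)            # shape (T, k*k)

def fit_gaussian(patches):
    # Eq. (4.4): mean vector and covariance matrix over the patch vectors.
    return patches.mean(axis=0), np.cov(patches, rowvar=False)

def conditional_sample(mu, Sigma, idx, rest_values, rng):
    # Sample the pixel at position `idx` of the patch, conditioned on the
    # remaining pixel values, using eqs. (4.9) and (4.10).
    rest = [i for i in range(mu.size) if i != idx]
    S11 = Sigma[idx, idx]
    S12 = Sigma[np.ix_([idx], rest)]
    S22 = Sigma[np.ix_(rest, rest)]
    mu_hat = (mu[idx] + S12 @ np.linalg.solve(S22, rest_values - mu[rest])).item()
    var_hat = (S11 - S12 @ np.linalg.solve(S22, S12.T)).item()
    return rng.normal(mu_hat, np.sqrt(max(var_hat, 0.0)))

rng = np.random.RandomState(0)
images = [rng.rand(16, 16) for _ in range(5)]   # toy "training images"
mu, Sigma = fit_gaussian(extract_patches(images, k=3))
centre = 4                                      # index of the middle pixel in a 3x3 patch
patch = images[0][:3, :3].ravel()
rest_values = np.delete(patch, centre)
print(conditional_sample(mu, Sigma, centre, rest_values, rng))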

4.2 From a Univariate to a Multivariate Analysis

The method Robnik-Šikonja and Kononenko (2008) propose is a univariate approach: each feature value's influence is computed by simulating its absence while keeping the other feature values fixed. We would expect that, especially with high-dimensional data, removing one feature does not have a large influence on the class score, and that the classifier should be relatively stable under this manipulation. Therefore, we would like to take a multivariate approach: simulating the absence of several features at once and observing the change in prediction. Removing the whole head of a cat should have a significant influence on the class score, as opposed to just removing one pixel from its nose. Ideally, we would test every element of the power set of features, as Štrumbelj and Kononenko (2008) suggest, but this is clearly infeasible for high-dimensional data. Therefore, we have to decide which elements from the power set to use, and for which features it makes sense to analyse them together. There are two straightforward methods to do this without using information about the data:

• Randomly pick elements from the power set, possibly defining how large the sets should be. For example, we could say that we want to pick 1 to 10 features at once, and then uniformly pick from all the possible sets. By sampling many such subsets and marginalising the features out together, we will get a better explanation than by just a univariate test.

• Marginalise out features together that are correlated. We can do this without much difficulty for pairs of features, and just take the top K correlated feature pairs.


Figure 4.1: Illustration of how a whole pixel patch (in red) can be simulated as unknown to the classifier. The red patch can be marginalised out by replacing it with samples conditioned on the surrounding pixels. This way, the information that was in the red square is removed. In this example, we would remove the information about the cat's ear and see how this affects the class probability assigned to "cat".

Both methods are not very sophisticated, but might still do better than just a univariate approach. We will use these methods in the experiments section, with the bacteria data (section 5.3).

However, we can come up with better strategies by incorporating knowledge (1) about the classifiers, (2) about the data, and (3) from feature ranking methods. Take the spike and slab neural network from section 2.3.2. It performs a kind of mix of feature selection and ranking: during training, the classifier learns which features to focus on in every decision. We can therefore restrict our multivariate analysis to the input features that have a large weight, which can diminish the effective dimensionality of the feature space considerably. We can also use knowledge about the data itself, since we might have some understanding of the dependencies amongst features. One example again is image data: as in the previous section, we can make some assumptions about how pixels are correlated.

4.2.1 Multivariate Analysis for Images

When doing a multivariate analysis for image data, we can again utilise the fact that nearby pixels are more correlated than pixels that are far away from each other. Marginalising out a whole patch of adjacent pixels will have a much larger effect than changing unconnected pixels. We will implement this in a sliding-window fashion: assume we want to make a multivariate analysis using patches of size k × k × 3 (again, assuming we have RGB images). Starting in the upper left corner, we marginalise out


a patch there. We then slide the window to the right, marginalising out the next patch, until we reach the lower right corner. The patches overlap, so for each pixel we take the average of the relevances it receives from all the patches it was part of.

We can combine this with the conditional sampling method of section 4.1.1 by conditioning the whole patch on a larger patch around it. The method stays the same, except that we now sample not one but several features at once. The procedure is illustrated in figure 4.1.
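A minimal sketch of this sliding-window procedure for a grey-scale image is given below. The helper marginal_prediction_patch, which should return p(c_k | x) with the given window marginalised out (for instance via the conditional sampling above), is a hypothetical placeholder, as are the window size and the dummy classifier in the usage example; each pixel's relevance is the average prediction difference over all windows that contained it.

import numpy as np

def sliding_window_relevance(image, p_full, marginal_prediction_patch, k=8):
    # image: 2D grey-scale image; p_full: p(c_k | x) for the intact image.
    # Returns a per-pixel relevance map: the average prediction difference
    # over all k x k windows that contained the pixel.
    H, W = image.shape
    relevance = np.zeros((H, W))
    counts = np.zeros((H, W))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            # Prediction with the window (i:i+k, j:j+k) marginalised out.
            p_without = marginal_prediction_patch(image, i, j, k)
            diff = p_full - p_without            # e.g. difference of probabilities
            relevance[i:i + k, j:j + k] += diff
            counts[i:i + k, j:j + k] += 1
    return relevance / np.maximum(counts, 1)

# Toy usage with a dummy marginaliser that simply returns a constant probability.
dummy = lambda img, i, j, k: 0.5
heatmap = sliding_window_relevance(np.random.rand(16, 16), p_full=0.9,
                                   marginal_prediction_patch=dummy)
print(heatmap.shape)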

4.3 Validation of Explanations

When doing instance-based relevance estimation, it is not always straightforward to assess how well a method does, or even to compare different methods. Therefore, papers on relevance estimation often test their methods on artificial datasets, where the data-generating mechanism is known and therefore also which features are most discriminative and important. Another approach is to look at the results and reason about them directly, either when the data is well understood (like visual data) or by relying on the assessment of experts (as in some medical settings). But humans may be biased, and it can be difficult to objectively validate a method, or to compare methods that all give somewhat reasonable results. For problem settings where human expertise is outperformed by classifiers, it is even impossible for us to validate the results.

We thus need an analytic way of validating the method when the ground truth is unknown. Here, we propose a novel method of doing so, since we have not found any such method in an extensive literature search; there seem to be only assessment strategies for evaluating general methods. For example, retraining the classifier with only the most important features can be used to assess a general method, but not to evaluate instance-specific relevance vectors.

An instance-based relevance estimation method returns a $D$-dimensional vector $r$ which gives a contribution measure $r_i$ ($1 \le i \le D$) for each feature $x_i$ of the input vector $x$. To evaluate whether the method really does find the most important features, we first sort the features $i$ according to the relevance vector $r \in \mathbb{R}^D$, in ascending order. The features with negative influence on the class score come first, at some point come the irrelevant features with zero relevance, and the most important ones come last. We denote the resulting list as $R = \{\pi_{\mathrm{asc}}(x_i)\}_i$, which is given by a permutation $\pi_{\mathrm{asc}}$ of all the features. Now, we evaluate $p(c|x)$ by successively marginalising out a growing number of features, starting with the least important ones. That is, we map

$$d \mapsto p(c_k \mid x_{\setminus \{\pi_{\mathrm{asc}}(x_i)\}_{i=1}^{d}}) \,, \qquad (4.11)$$
$$d \in \{1, \ldots, D\} \,, \qquad (4.12)$$

where $D$ is the total number of features. We can do this for any class $c$, but usually the class with the highest confidence score will be most interesting, i.e., the class that the classifier assigned to the input. What we observe is how the class probability $p(c_k \mid x_{\setminus \{x_i\}_{i=1}^{d}})$ changes with $d$. If it declines quickly, very discriminative features are thrown out early on and have been assigned too low a relevance. If the class probability declines slowly with growing $d$, the most important and discriminative features have been assigned the largest relevance. It could even be that at the beginning, when marginalising out features that actually speak against the class of interest, the confidence of the classifier in that class rises. We will use this validation method in our experiments in chapter 5. As we will see there, this method is not only valuable for the evaluation and comparison of feature relevance estimation methods. It can be incorporated into the explanation as well, and give additional and very valuable insights into the problem. It can help determine how many features make a difference, and give the user a feeling for where to make a cut-off when saying "this many features are enough to make a correct prediction".

4.4 Neural Network Visualisation

When trying to understand neural networks and how they make decisions, it is not only interesting to analyse the input-output relation, but also to look at what is going on inside the hidden layers of the network. We can use the idea of Robnik-Šikonja and Kononenko (2008) to visualise the role that hidden nodes of a neural network play in making the decision.

Let $z_i^{(k)}$ and $z_j^{(l)}$ be any two distinct nodes of the network in two different layers $F^{(k)}$ and $F^{(l)}$ respectively, so that layer $F^{(k)}$ comes first in the network. In order to estimate the influence $z_i^{(k)}$ has on the value of $z_j^{(l)}$ for a specific input vector, we can proceed as follows. The idea is to investigate what happens with the value of $z_j^{(l)}$ if the value $z_i^{(k)}$ is unknown, but everything in the network that does not depend on that value stays fixed. To this end, we first propagate the given input through the network until we reach layer $F^{(k)}$. We write $f^{(k)}$ for the vector that holds all features from this layer, and whose values are estimated by the forward pass. Since the layer $F^{(l)}$ comes later in the network, we can now write the value of $z_j^{(l)}$ as a function $g(\cdot)$ of that vector, $y = g(f^{(k)})$. To see how $z_i^{(k)}$ influences $z_j^{(l)}$, we want to evaluate the difference between $g(z_j^{(l)} \mid f^{(k)})$ and $g(z_j^{(l)} \mid f^{(k)} \setminus z_i^{(k)})$. We will refer to this as the activation difference (as opposed to the prediction difference, which we have used for the input-output relation).

The function $g(\cdot)$ is now not necessarily a probability density function any more, so we have to adjust equation (3.2) to estimate $g(z_j^{(l)} \mid f^{(k)} \setminus z_i^{(k)})$. We re-write the equation as an expectation:

$$g(z_j^{(l)} \mid f^{(k)} \setminus z_i^{(k)}) = \mathbb{E}_{p(z_i^{(k)} \mid f^{(k)} \setminus z_i^{(k)})} \left[ g(z_j^{(l)} \mid f^{(k)}) \right] \qquad (4.13)$$
$$= \sum_{z_i^{(k)}} p(z_i^{(k)} \mid f^{(k)} \setminus z_i^{(k)}) \; g(z_j^{(l)} \mid f^{(k)}) \,, \qquad (4.14)$$

so that it stays the same if $g(\cdot)$ is a probabilistic output function, but we can also evaluate it for other nodes in the network with non-probabilistic $g(\cdot)$.

How to evaluate the activation difference between $g(z_j^{(l)} \mid f^{(k)})$ and $g(z_j^{(l)} \mid f^{(k)} \setminus z_i^{(k)})$ now depends on the nature of $g(\cdot)$. Robnik-Šikonja and Kononenko (2008) propose three different ways of doing this for probabilities (equations (3.4)-(3.6)), but these are not applicable to general functions. The most naive way of evaluating the change when $z_i^{(k)}$ is unknown is to just take the difference,

$$\mathrm{activDiff}_{z_i^{(k)}}(z_j^{(l)} \mid f^{(k)}) = g(z_j^{(l)} \mid f^{(k)}) - g(z_j^{(l)} \mid f^{(k)} \setminus z_i^{(k)}) \,. \qquad (4.15)$$

If more information about the activation function is given, a more appropriate measure can be chosen.
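As a rough sketch, the activation difference for a single pair of hidden units could be estimated as follows, assuming the network is split into `forward_to_layer` (input up to layer F^(k)) and `forward_from_layer` (layer F^(k) onwards), and a sampler `sample_unit` that approximates p(z_i | f^(k) \ z_i); these helper names are assumptions for the example, not part of the original method.

```python
import numpy as np

def activation_difference(x, unit_i, unit_j, forward_to_layer,
                          forward_from_layer, sample_unit, num_samples=10):
    """Activation difference (equation (4.15)) of unit j in a later layer
    when unit i in an earlier layer is treated as unknown."""
    f_k = forward_to_layer(x)                  # activations f^(k) from the forward pass
    g_full = forward_from_layer(f_k)[unit_j]   # g(z_j | f^(k))

    # approximate the expectation in equations (4.13)/(4.14) by sampling values for z_i
    g_removed = 0.0
    for _ in range(num_samples):
        f_perturbed = f_k.copy()
        f_perturbed[unit_i] = sample_unit(f_k, unit_i)
        g_removed += forward_from_layer(f_perturbed)[unit_j]
    g_removed /= num_samples

    return g_full - g_removed
```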

If we want to do a multivariate approach, we can replace $z_i^{(k)}$ with a set of features that are in the same layer, and proceed just as described above. This might be interesting, for example, for convolutional layers, where we can do a multivariate analysis in a sliding window approach for individual feature maps, just like we described it for an input image in section 4.2.1.

There are different choices for $z_i^{(k)}$ and $z_j^{(l)}$ which can give interesting insight into the mechanism of the network. We are going to explore two choices in our experiments (section 5.2.5) to get a sense of what is possible and what we can learn from such an approach. An extensive evaluation of the method, however, is outside the scope of this thesis. Let us thus briefly discuss the two cases we want to show later in our experiments.

• Relevance of an input image on the hidden feature maps of a convolutional neural network. By looking at how the input features influence nodes inside the network instead of the probabilistic output, we can get a sense of what these nodes are specialised on. In convolutional networks, we have feature maps in the convolutional layers which are some deep representation of the input. Instead of visualising how the input features influence a single feature of this map, we can aim for learning what activates the entire feature map. To this end we will, for each unit $z_i^{(k)}$ in a feature map $F^{(k)}$, estimate a relevance vector $\mathrm{rel}_{z_i^{(k)}}$ of the size of the input with the above method (i.e., evaluating the activation difference instead of the prediction difference). Then we sum up all these relevance vectors,

$$\sum_{z_i^{(k)} \in F^{(k)}} \mathrm{rel}_{z_i^{(k)}} \,, \qquad (4.16)$$

to be able to visualise which part of the input image influences the entire feature map (a small sketch of this summation follows after the list).

• Influence of hidden feature maps on the output. Looking at how a single feature map in the network influences the prediction of the output might give yet more information about the role of that map in the whole context of the network. We can evaluate this as described above, i.e., propagate the input image up to that feature map, then treat the feature map as the input to the rest of the network, and apply the method as before: for each unit in the feature map, we compute its influence on the output and show the result as an image the same size as the feature map. This can be done in a univariate approach, or in a multivariate (sliding window) approach.
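The summation in equation (4.16) from the first case is straightforward once a per-unit relevance vector can be computed; a minimal sketch under that assumption follows, where the helper `input_relevance_for_unit` is illustrative (it could, for instance, apply the sliding window procedure with the activation difference of that unit as the quantity of interest).

```python
import numpy as np

def feature_map_relevance(x, feature_map_units, input_relevance_for_unit):
    """Sum, over all units z_i in a feature map F^(k), the input-sized
    relevance vectors obtained from their activation differences (eq. 4.16)."""
    total = np.zeros(x.shape[:2])        # one relevance value per input pixel
    for unit in feature_map_units:
        total += input_relevance_for_unit(x, unit)
    return total
```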


Chapter 5

Experiments

In this chapter, we want to evaluate and compare instance-specific relevance estimation methods. We begin with a rather easy classification task, namely the MNIST database. It is easy in the sense that the problem is relatively easy to solve for common classifiers, but also in the sense that we can understand and solve it easily ourselves. If we can reason about the data, we can also evaluate the relevance estimation methods discussed here and get a sense of what they are doing. On this dataset, we compare the sensitivity analysis and the prediction difference of Robnik-Šikonja and Kononenko (2008) to get a feeling for how they work.

We will then make a more rigorous analysis of the methods on the CIFAR10 dataset, which consists of colour images from ten different classes. In addition to the comparison with the sensitivity analysis, we will assess how the marginal sampling of Robnik-Šikonja and Kononenko (2008) and the conditional sampling proposed in section 4.1.1 for evaluating $p(c_k|x_{\setminus i})$ compare. Using the prediction difference, we will be able to get a good sense of how the convolutional network we use makes decisions on this specific dataset. Furthermore, we will exemplify how the method can be adapted to peek into the inner workings of a convolutional network, as we have proposed in section 4.4.

Finally, we turn to a medical example to see how the method works on non-visual data. Since we cannot evaluate the methods by interpreting the results ourselves, we utilise the validation method proposed in section 4.3, and present how it can also be incorporated into the explanation itself in order to understand even better how the classifier makes its decision.


Figure 5.1: Examples from the MNIST database

5.1 MNIST

MNIST (LeCun et al., 1998) is a large database of labelled 28 × 28 pixel grey scale images of handwritten digits. Instead of using the whole dataset (digits 0-9), we will restrict ourselves to only the digits 5 and 8.¹ This restriction will make the results very clear and comprehensible, so that we can get a good idea of how the methods work before we go on to more complex problems.

5.1.1 Classifier

For these experiments, we will be using two classifiers. One is a linear support vector machine (section 2.2), trained with a standard package in Python (sklearn; Pedregosa et al., 2011). The library also takes care of the probabilistic outputs, using Platt scaling (Platt et al., 1999). Its accuracy on the test set is 0.96 (on the smaller dataset of only fives and eights).
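For reference, a comparable setup in scikit-learn might look like the sketch below; the thesis's exact training script is not reproduced here, so the data is a random stand-in and the hyperparameters are defaults.

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in data in place of the flattened MNIST images of fives and eights.
X_train = np.random.rand(100, 784)
y_train = np.random.randint(0, 2, 100)

# Linear SVM; probability=True enables probabilistic outputs via Platt scaling.
clf = SVC(kernel="linear", probability=True)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_train[:1])   # class probabilities for one image
```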

The other classifier we will be using is a spike and slab neural network (SSNN) as described in section 2.3.2. The network was trained with one hidden layer of 500 units (fully connected), and has an accuracy of 0.98. Recall that a spike and slab neural network performs a mix of feature selection and feature ranking (section 3.1). For classification, each input feature is multiplied with a scalar value reflecting how important it is for classification in general. Figure 5.2 shows these scalar values, visualised as an image the same size as the input. We can see that the outside regions of the image (depicted in black) are ignored by the classifier, which focuses more on the middle section. This makes sense, since the digits are usually seen in the centre of the image. Further, we see the strongest focus on a diagonal bar, so in each decision the classifier looks particularly at that region. Since it was trained on images of fives and eights, presumably the classifier takes the presence of this bar as an indicator for an eight, and its absence as an indicator for a five.

¹ The full database has 50,000 training, 10,000 validation and 10,000 test instances.


Figure 5.2: Visualisation of the learned spikes on the input of the neural network. During a feed forward pass, the input pixels are multiplied with these values which effectively leads to feature selection: the classifier focusses on the white regions. This shows that the network learned to concentrate on the middle region of the image, and especially on a diagonal bar (the lightest region).

This is already an interesting result, but it only characterises the classifier as a whole (recall section 3.1). Instead, we want to analyse the decisions of the classifier for a particular input image. For comparing our methods to the sensitivity analysis, we computed the gradients for this network using the Python library Theano (Bastien et al., 2012; Bergstra et al., 2010); a minimal sketch of this is shown below.
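A sketch of how such gradients can be obtained in Theano is shown here; the single softmax layer merely stands in for the actual spike and slab network, which is not reproduced.

```python
import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX

# A toy stand-in network (a single softmax layer) just to make the sketch
# concrete; the spike and slab network used in the thesis is not reproduced here.
x = T.vector('x')
W = theano.shared(np.random.randn(784, 2).astype(floatX), name='W')
b = theano.shared(np.zeros(2, dtype=floatX), name='b')
p_y = T.nnet.softmax((T.dot(x, W) + b).dimshuffle('x', 0))[0]   # class probabilities

class_idx = T.iscalar('class_idx')
# Sensitivity analysis: gradient of the class probability w.r.t. the input pixels.
grad = T.grad(p_y[class_idx], x)
sensitivity = theano.function([x, class_idx], grad)

saliency = sensitivity(np.random.rand(784).astype(floatX), 1)   # 784-dim gradient map
```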

While the purpose of showing the spike and slab neural network results is to get a good basic understanding of how the methods work, using an SVM will show how the results differ between shallow and deep classifier architectures.

5.1.2 Settings

Since we want to use this example only to get familiar with the method, we do not explore different settings. Instead, we present the settings that we found led to the most visually appealing results.

• Multivariate Approach: We have used a multivariate analysis, with a sliding window of a maximum size of 4 × 4 over the image. The windows are overlapping, and for each pixel we take the average relevance from all the windows it was in. We start with a univariate approach (window size 1 × 1) and
