
Concept-based Explanations for Natural Language Processing Models



Concept-based Explanations for Natural Language Processing Models

De Smet Lennert

Promoter: Prof. dr. Yvan Saeys

Supervisor: Arne Gevaert

A thesis presented for the degree of

Master of Science in Mathematics




This thesis could not have been completed without the help of many people, so a word of thanks is most definitely in order. First of all, I want to sincerely thank both my promoter prof. dr. Yvan Saeys and my supervisor Arne Gevaert for the support, direction and timely advice given throughout the writing of this work. Not only did they allow me to explore the field of interpretable machine learning, but they did so while granting a sufficient degree of independence and still correcting course when necessary.

Furthermore, my parents have to be thanked especially, as they have always supported the choices I wanted to make. It is they who allowed me to study what I love in my own way, and so I could not have hoped for this work and myself to be as they are today without them. Of course, I have to thank my friends, siblings and other family members for lifting my spirits when necessary, helping to bring about the completion of this work. On a final note, I want to acknowledge the more subtle, but no less important, influence of all the fellow students I have met over the last couple of years. They have provided me with numerous opportunities for discussion and reflection that have steered me towards a multitude of interesting mathematics and ideas, which have surely had an impact on my approach to constructing this work.


Contents

Dedication
Declaration
Acknowledgements
1 Introduction
  1.1 Machine learning
    1.1.1 Unsupervised learning
    1.1.2 Supervised learning
    1.1.3 Natural Language Processing
    1.1.4 Neural networks
    1.1.5 Convolutional and Recurrent neural networks
    1.1.6 Motivation
  1.2 Interpretability
    1.2.1 Desiderata
    1.2.2 Context of explanations
    1.2.3 Local versus global
    1.2.4 Evaluation
2 Explanation generation
  2.1 Local methods
    2.1.1 Layer-Wise Relevance Propagation
    2.1.2 DeepLIFT
  2.2 Testing with Concept Activation Vectors (TCAV)
  2.3 Automatic Concept-based Explanations (ACE)
3 Relevance-driven ACE (R-ACE)
  3.1 R-ACE using DeepLIFT
  3.2 Extensions of DeepLIFT
    3.2.1 DeepLIFT on recurrent layers
    3.2.2 Relevance through pooling layers
4 Evaluation methods
  4.1 Absolute similarity
  4.2 Relative similarity
  4.3 Significance testing
5 Datasets, models and parameters
  5.1 Synthetic benchmark
    5.1.1 Construction
    5.1.2 RNN architecture and performance
  5.2 IMDB movie reviews sentiment analysis
    5.2.1 Network and performance
  5.3 Twitter sentiment analysis
    5.3.1 RNN architecture and performance
    5.3.2 CNN architecture and performance
6 Results
  6.1 Synthetic benchmark
    6.1.1 ACE results: RNN
    6.1.2 ACE results: CNN
    6.1.3 R-ACE results: RNN
    6.1.4 R-ACE results: CNN
  6.2 IMDB
    6.2.1 ACE results
    6.2.2 R-ACE results
  6.3 Twitter
    6.3.1 ACE results: RNN
    6.3.2 ACE results: CNN
    6.3.3 R-ACE results: RNN
    6.3.4 R-ACE results: CNN
Conclusion and future work
A Samenvatting (Dutch summary)


1 Introduction

In the present day, the use of machine learning and artificial intelligence algorithms is widespread and has infiltrated almost every conceivable domain. This strong development, carried out over the course of the last few decades, has enjoyed an explosion of widespread attention in the last 10 years due to the additional success of neural networks. The origins of this mathematical construct can be traced back to the 1950s with the introduction of the perceptron [31], and many theoretical extensions [10, 15] were proposed over the following decades. The reason for the only recent explosion of applications is twofold. On the one hand there is the availability of the huge datasets necessary to train such neural networks, and on the other hand there is the computational power offered by the use of GPUs to perform that training. This combination led to the many advances in fields like image recognition [25], natural language processing [8] and general automation [34] seen today.

However, as the computational requirements might already imply, interpreting the decisions made by neural networks is far from easy due to their highly complex and non-linear nature. This black-box treatment can in turn result in general distrust in the decisions made, especially in sensitive use cases like medical or socio-economic analyses. Apart from this, the need for methods to explain neural networks is further underlined by recent legal developments [41] enforcing the presence of at least a basic explanation of any prediction made by a machine learning algorithm. While many efforts [22, 30] have already tried to satisfy this need, there is still a lot to be done and this work will try to make its own contribution.

The next couple of sections of this chapter will give a brief, but thorough introduction to machine learning, neural networks and explanation methods thereof.

1.1 Machine learning

Machine learning is the field of applied mathematics in which statistical models are constructed on a given dataset to perform inference or make decisions related to a certain task. These models are obtained without explicit programming and thus extract knowledge from their training dataset in an automatic way, with the goal of generalising this knowledge to unseen samples from similar distributions. In other words, the ultimate achievement for a machine learning model is to learn what really matters from its given dataset with respect to its given task, such that it is capable of performing that same task on similar data that did not occur in the training data.

Historically, the first distinction that is generally made in the field of machine learning is that between supervised and unsupervised learning. The assigned task to be learned is where both of these fundamentally differ.

1.1.1 Unsupervised learning

With unsupervised learning, only a dataset is given and the goal is to find general structure in that dataset. Most of the algorithms in this category focus on finding groups or clusters of similar samples, effectively discovering different underlying building blocks of the data. If a new sample is then presented, the model is able to classify it into one of the discovered clusters. An example of this would be a set of vectors belonging to 2 disks in $\mathbb{R}^2$, where a good unsupervised algorithm would assign a colour to each vector according to which of the two disks it belongs to. The expression of similarity in this case could be the Euclidean metric, since vectors in the same disk will generally be closer, or more similar, to each other than they are to vectors from the other disk. An example of precisely this setting is given by Figure 1.1.

Figure 1.1: Example of the application of the k-means clustering method to points sampled from 2 disks.

One of the most well-known examples of an unsupervised learning algorithm is k-means clustering [13]. Since this algorithm will be used later on, and to obtain a better idea of how such an algorithm works, it will be discussed in detail. Let $X = \{x_1, \dots, x_n\} \subseteq \mathbb{R}^m$ be a set of real vectors and $k$ a natural number representing the number of clusters that are to be found. The procedure begins by randomly selecting $k$ elements $\mu_j$ of the set $X$, representing the centers of every cluster in the initial iteration. The cluster number assigned to every element $x_i \in X$ is then given by

$$\arg\min_{k} \|x_i - \mu_k\|.$$

So, given the initial cluster centers $\mu_j$, every element of $X$ is assigned a certain cluster depending on the closest center. However, these clusters are probably not very representative of the real structure, because the initial cluster centers were chosen randomly. To solve this, multiple iterations follow, using different cluster centers every time. In detail, the next $k$ cluster centers are defined as

$$\mu_j' = \frac{1}{n_j} \sum_{i=1}^{n} \gamma_{ij}\, x_i, \quad \text{with} \quad \gamma_{ij} = \begin{cases} 1 & \text{if } \arg\min_{k} \|x_i - \mu_k\| = j \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad n_j = \sum_{i=1}^{n} \gamma_{ij}.$$

In other words, the next cluster centers are the averages of all the vectors currently assigned to the different clusters. This iteration continues until a certain convergence criterion is met, for example when the distance between the previous and next cluster centers falls below a certain threshold value. This is exactly how the results of Figure 1.1 were obtained, using the value k = 2. An important remark is that the value of $k$ is, a priori, unknown unless more is already known about the underlying structure of the points $X$. In practice, this is one of the parameters that has to be optimised by applying different values and evaluating the clustering performance for each of them.
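To make the iteration concrete, the procedure described above can be written down in a few lines of NumPy. This is a minimal sketch of the standard k-means loop, not the exact implementation used later in this work; the toy data standing in for the two disks of Figure 1.1 is generated on the spot.

```python
import numpy as np

def k_means(X, k, tol=1e-6, max_iter=100, rng=np.random.default_rng(0)):
    """Minimal k-means: X is an (n, m) array, k the number of clusters."""
    # Initialise the centers by picking k random elements of X.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every x_i to the closest center (the arg min over ||x_i - mu_k||).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # New centers are the averages of the vectors assigned to each cluster
        # (empty clusters are not handled in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centers barely move (the convergence criterion).
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

# Two noisy disks in R^2, as in Figure 1.1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
               rng.normal([3, 3], 0.5, size=(100, 2))])
labels, centers = k_means(X, k=2)
```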

1.1.2 Supervised learning

Supervised learning, on the other hand, does not have to discover structure in unlabelled data. Instead, in addition to every sample a label is already available, and the goal for the supervised algorithm is to predict the labels of such samples. For example, say the training set consists of images of either a cat or a dog; a good supervised method would then be able to say which one of the two animals is in a certain image. This is also an example of a classification problem, as the image belongs to either one of two classes, i.e. the label can only take a discrete number of values. Alternatively, the data might consist of bank accounts with various details like the balance, and a possible task could be to predict the monthly expenditure of that account, which is a continuous value. Such an application would be called a regression problem.

No matter which kind of problem is given, supervised algorithms model the relation between samples and their labels in one way or another such that the labels of new samples can be predicted using this model.

While the next section will give a thorough introduction to neural networks, which can be used in the supervised learning setting, the simpler algorithm of a Support Vector Machine or SVM [14] will first be discussed to set the stage. In this context, let $X = \{x_1, \dots, x_n\} \subseteq \mathbb{R}^m$ again be the dataset at hand, this time with given labels $Y = \{y_1, \dots, y_n\}$, one for every $x_i$. Assume the labels are either $-1$ or $1$, resulting in a binary classification problem. An SVM will try to find a hyperplane in $\mathbb{R}^m$ that separates all elements of $X$ as well as possible with respect to their labels. Moreover, it is desired for this hyperplane to have the largest possible margin, or distance between itself and the closest points of both classes. To simplify the discussion, assume that a hyperplane exists that separates $X$ with respect to $Y$ perfectly. This can be formulated formally by assuming the existence of $w \in \mathbb{R}^m$ and $b \in \mathbb{R}$ such that

$$w \cdot x_i + b \geq 1 \quad \text{if } y_i = 1$$

and

$$w \cdot x_i + b \leq -1 \quad \text{if } y_i = -1,$$

or in one expression,

$$y_i(w \cdot x_i + b) - 1 \geq 0. \qquad (1.1)$$

Let $H_1$ and $H_2$ denote the hyperplanes given by the equations $w \cdot x + b + 1 = 0$ and $w \cdot x + b - 1 = 0$ respectively, and consider a point $x^-$ on $H_1$ and a point $x^+$ on $H_2$. The margin $M$ can then be determined as $M = |x^- - x^+|$. Furthermore, since the vector $w$ is perpendicular to the separating hyperplane, it is known that $x^- = x^+ + \lambda w$ for some $\lambda \in \mathbb{R}$. Together with the fact that $x^-$ and $x^+$ satisfy the equations $w \cdot x^- + b = -1$ and $w \cdot x^+ + b = 1$, it can be deduced that $|\lambda| = \frac{2}{w^T w}$. Combining all of this gives

$$M = |x^- - x^+| = |\lambda| \, \|w\| = \frac{2}{\sqrt{w^T w}}.$$

Now the problem of finding a separating hyperplane with maximal margin $M$ is mathematically formulated as a constrained optimisation problem. Indeed, the quantity $M$ is to be maximised subject to the constraints given by equation (1.1). This can in turn be solved using, for example, Lagrange multipliers [4].
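In practice, this optimisation problem is rarely solved by hand. As a brief, hedged illustration, scikit-learn's SVC with a linear kernel solves the corresponding dual problem; the margin $2/\sqrt{w^T w}$ can then be read off from the fitted coefficients. The dataset below is a toy example, not one of the datasets used later in this work.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels -1 and 1.
rng = np.random.default_rng(0)
X_pos = rng.normal([2, 2], 0.3, size=(50, 2))
X_neg = rng.normal([-2, -2], 0.3, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

# A linear SVM; a large C approximates the hard-margin, perfectly separable case.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Margin M = 2 / sqrt(w^T w), as derived above.
margin = 2 / np.sqrt(w @ w)
print("w =", w, "b =", b, "margin =", margin)
```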

1.1.3 Natural Language Processing

One specific domain in which the various methods of machine learning, be it supervised or unsupervised, can be applied is that of Natural Language Processing or NLP. In general, this envelops all the possible applications having an input consisting of words. This can be in the form of single words, sentences or complete bodies of text. An immediate problem in dealing with such an input is that there is no straightforward way to use it in

most of the existing machine learning methods. Indeed, most machine learning algorithms depend on samples being vectors in some real vector space. Therefore, the first step in solving many NLP problems is finding a linguistic embedding, mapping words or lexemes present in the given dataset to real-valued vectors.

One such embedding is the GloVe embedding [29], of which multiple variants exist. It is constructed in a probabilistic way, estimating the probabilities of finding a certain word in the context of another based on the corpus of text that is given. This knowledge is then, in turn, integrated into a couple of interesting characteristics of the embedding itself. The first of these is that words with a high probability of being found together are mapped to vectors with a low distance between one another. Secondly, there is the presence of a linear semantic substructure, i.e. the difference of two word vectors gives a direction capturing the semantic difference between both words. An example of this is given by the words 'good' and 'bad', whose difference will give an approximate direction between what is rather positive and what is more negative. The same approximate direction is then expected to be given by similarly differing word pairs like 'beneficial' and 'malevolent'. It is these properties of the GloVe embedding that will be used extensively later on. No further detailed explanation of the precise construction will be given to avoid too large a tangent, but note that GloVe is in itself a method following the unsupervised paradigm, where the structure to be found is a vector representation conserving semantics as a linear substructure. There are multiple other and more recent embeddings available, like BERT [8]; however, these do not exhibit the same semantic substructure. Instead, they allow for more complex dependencies to be mirrored in their embedding of, for example, a complete sentence. GloVe, on the other hand, is a purely word-based embedding and does not implement any grammatical dependencies in its representation of a given input. Constructing a linguistic embedding is just one of the important aspects of natural language processing.
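The linear semantic substructure can be probed directly once a set of pre-trained GloVe vectors is loaded. The sketch below assumes a plain-text GloVe file (e.g. glove.6B.100d.txt) has been downloaded beforehand; the file name and the chosen word pairs are illustrative assumptions, not prescribed by the thesis.

```python
import numpy as np

def load_glove(path):
    """Parse a plain-text GloVe file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

glove = load_glove("glove.6B.100d.txt")  # assumed to be available locally

# Two 'semantic difference' directions; if the linear substructure holds,
# these directions should be roughly aligned (high cosine similarity).
d1 = glove["good"] - glove["bad"]
d2 = glove["beneficial"] - glove["malevolent"]
print("cosine similarity of the two directions:", cosine(d1, d2))
```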

When a linguistic embedding is given and samples can be converted to real-valued vectors, this allows the further application of machine learning methods to solve various supervised or unsupervised tasks. One of these is sentiment analysis, where a set of words, sentences or other corpora of text is given together with a label indicating the degree of positive or negative sentiment a certain sample has. This is mostly either 1 for positive sentiment or 0 for negative sentiment, but it can be a range of values as well. The task for any method that is applied is then to predict the sentiment of unseen bodies of text. Of course, numerous other applications, e.g. speech recognition [12], exist as well.

1.1.4 Neural networks

The explanation methods that will be discussed later are primarily designed for neural networks following the supervised paradigm. This means a dataset of samples $X = \{x_1, \dots, x_n\}$ and corresponding labels $Y = \{y_1, \dots, y_n\}$ is given, where $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$. As mentioned before, the evolution of neural networks started with the introduction of the perceptron [31], of which a symbolic representation can be seen in Figure 1.2. The perceptron itself can be mathematically represented as a triplet $(w, b, \varphi)$, where $w \in \mathbb{R}^m$ and $b \in \mathbb{R}$ are called the weights and bias respectively, and $\varphi : \mathbb{R} \to \mathbb{R}$ is the activation function. The output of the perceptron for the input $x = (x_1, \dots, x_m)$ is then simply calculated as

$$\varphi\left(\sum_{i=1}^{m} w_i\, x_i + b\right).$$

From this formula it becomes clear why the $w_i$ are called the weights of the perceptron, as they effectively assign an importance to every incoming component $x_i$. In the case of the original perceptron, the activation function was the Heaviside function $H(x)$, turning it into a binary classifier. This shows that the activation function has to be chosen in function of the problem the perceptron is trying to solve, as the Heaviside function would, for example, not fit the criterion of a regression problem particularly well.
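As a small illustrative sketch (not code from the thesis), the perceptron output can be written directly from the triplet $(w, b, \varphi)$; with the Heaviside activation it acts as a binary classifier.

```python
import numpy as np

def heaviside(z):
    return (z >= 0).astype(float)

def perceptron(x, w, b, phi=heaviside):
    """Output phi(sum_i w_i * x_i + b) of a perceptron (w, b, phi)."""
    return phi(w @ x + b)

# Example: a perceptron computing a logical AND of two binary inputs.
w, b = np.array([1.0, 1.0]), -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron(np.array(x, dtype=float), w, b))
```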


Figure 1.2: Symbolic representation of the perceptron.

Figure 1.3: Symbolic representation of a general neural network.

A general neural network $\Phi : \mathbb{R}^{m_1} \to \mathbb{R}^{m_{l+1}}$ is constructed intuitively by stacking perceptrons both vertically and horizontally, as can be seen in the symbolic representation in Figure 1.3. Concretely, $l \in \mathbb{N}$ hidden layers $\Phi_i : \mathbb{R}^{m_i} \to \mathbb{R}^{m_{i+1}}$ are constructed by handing the input $x \in \mathbb{R}^{m_i}$ over to $m_{i+1} \in \mathbb{N}$ perceptrons $(w_{ij}, b_{ij}, \varphi_i)$. The output of such a layer is then a vector in $\mathbb{R}^{m_{i+1}}$ whose components are the outputs of the different perceptrons. One can write this output briefly by first multiplying the input $x$ with the matrix $W_i$ and adding a vector $b_i$, followed by an element-wise function $\varphi_i$, i.e. $\varphi_i(W_i \cdot x + b_i)$. Here $W_i$ and $b_i$ are constructed by assembling the separate $m_{i+1}$ weights $w_{ij}$ and biases $b_{ij}$. To obtain the complete network, the output of layer $i$ becomes the input of layer $i + 1$, i.e.

$$\Phi(x) = \Phi_l(\Phi_{l-1}(\dots \Phi_1(x) \dots)).$$

Regarding terminology, $\Phi$ is called a neural network of depth $l$ and the layer $\Phi_i$ is said to have width $m_{i+1}$. The components of the $m_{l+1}$-dimensional output of the $l$th layer of $\Phi$ are also frequently referred to as the logits of $\Phi$. Supervised learning using neural networks of depth larger than 1 is frequently referred to as deep learning. Most often, the different elements of a given layer are no longer called perceptrons but neurons, hence the name neural network.
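The composition $\Phi(x) = \Phi_l(\dots \Phi_1(x) \dots)$ translates almost literally into code. The sketch below is a hedged illustration of a forward pass with randomly initialised weights; the layer widths and the ReLU activation are arbitrary choices for demonstration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers, phi=relu):
    """Apply phi(W_i x + b_i) for every layer (W_i, b_i) in turn."""
    for W, b in layers:
        x = phi(W @ x + b)
    return x

# A depth-2 network R^4 -> R^5 -> R^3 with random parameters.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(5, 4)), rng.normal(size=5)),
          (rng.normal(size=(3, 5)), rng.normal(size=3))]
output = forward(rng.normal(size=4), layers)
print(output)
```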

Optimisation of the parameters $w^i_{jk}$ and $b^i_j$ to fit the problem at hand is most frequently done via gradient descent on a certain loss function $L(\hat{y}, y)$, where $\hat{y}$ and $y$ are the output of the neural network and the desired output respectively. This loss quantifies the error that is made, so values of $w^i_{jk}$ and $b^i_j$ that minimise this loss are desired. Gradient descent, or any variation thereof, requires the partial derivatives $\frac{\partial L}{\partial w^i_{jk}}$ and $\frac{\partial L}{\partial b^i_j}$ to compute the new values of those parameters by moving in the negative direction of those partial derivatives, i.e.

$$w^{i\prime}_{jk} = w^i_{jk} - \eta \cdot \frac{\partial L}{\partial w^i_{jk}} \quad \text{and} \quad b^{i\prime}_j = b^i_j - \eta \cdot \frac{\partial L}{\partial b^i_j}. \qquad (1.2)$$

The difficulty in this procedure is acquiring the partial derivatives, since the loss function $L$ is itself a function of the output of the neural network, so the derivatives of the network with respect to its parameters will present themselves somewhere down the line. Luckily this can be solved by cleverly applying the chain rule, yielding a recursive relation that propagates the error from the output all the way back through the network. Because of this reverse behaviour, the application of gradient descent to the optimisation of neural networks is also often referred to as backpropagation.

Note that the loss function described here only depends on one output $\hat{y}$ originating from one input $x$. In practice, a sum of such losses will be taken over a subsample of the available training samples to speed up the learning process. This does not change the validity of any of the above, as the derivative is a linear operator. Another point of interest is the, up until now unmentioned, $\eta$ in formulas (1.2), the learning rate. If $\eta$ is large, the parameter change will be large as well. This might speed up the convergence of gradient descent, with the caveat that the minimum might be missed. On the other hand, if $\eta$ is small, the convergence will be a lot slower but there is a better guarantee that the minimum will be reached. To better close in on the minimum and still speed up the training, an adaptive learning rate is often used, starting with a larger value and reducing it as training proceeds.
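To make formulas (1.2) concrete, the sketch below performs a number of gradient-descent steps on a single neuron with a squared-error loss, computing the partial derivatives by the chain rule explicitly. It is a minimal illustration, not the training code used for the networks in later chapters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                    # mini-batch of inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3        # targets from a known linear rule

w, b, eta = np.zeros(3), 0.0, 0.1               # parameters and learning rate
for step in range(200):
    y_hat = X @ w + b                           # forward pass (identity activation)
    err = y_hat - y
    loss = np.mean(err ** 2)                    # L(y_hat, y) averaged over the batch
    grad_w = 2 * X.T @ err / len(X)             # dL/dw via the chain rule
    grad_b = 2 * err.mean()                     # dL/db
    w, b = w - eta * grad_w, b - eta * grad_b   # the update rule of formulas (1.2)

print("learned parameters:", w, b, "final loss:", loss)
```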

1.1.5 Convolutional and Recurrent neural networks

While normal neural networks as introduced earlier already perform very well on a multitude of tasks, there are numerous extensions of the standard paradigm tailored towards more specific problems. Two of the most frequently occurring ones are the convolutional [10] and recurrent [32] neural networks, or CNNs and RNNs respectively.

Convolutional networks introduce so-called convolutional operations into the normal neural network framework. Convolutions were first proposed for applications within the field of image recognition, where the input is often $(N \times M)$- or $(N \times M \times C)$-dimensional, meaning every sample can be represented in a meaningful way as a 2- or 3-dimensional tensor. For example, a grayscale image is $(N \times M)$-dimensional, whereas a colour image is $(N \times M \times 3)$-dimensional due to the presence of multiple colour channels. A convolution can in these cases intuitively be seen as a certain region of a given image on which certain weights are applied or, in other words, it acts as a filter. In general, a convolution takes part of any given input, which is then used as input for a regular neuron with its own parameters and activation function. Figure 1.4 shows a 2-dimensional example of this, which will be used to further explain the different parameters of a convolutional operation.

First of all, the dimensions of the region that is selected by this convolution are called the kernel. In practice, multiple different convolutions with different weights move across the image with certain steps called the stride of the convolution. Figure 1.4 thus has a kernel of $(3 \times 3)$ and a stride of $(1 \times 1)$, i.e. the $(3 \times 3)$ kernel first moves horizontally with step size 1 until it reaches the end of the image and then starts the same horizontal movement, but starting one step lower vertically, and so on. In the precise case of Figure 1.4, the weights $w_i$ would be given by

$$\begin{pmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ w_7 & w_8 & w_9 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix},$$

but it remains important to note that every $w_i$ is a parameter that is determined during the training phase of the network at hand. Apart from a given kernel and stride, a convolutional layer should be given a number of convolutions or filters to apply to the input, together with an activation function.

Figure 1.4: Example of a 2-dimensional convolution operation.
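As a hedged sketch, the single convolution of Figure 1.4 (a $3 \times 3$ kernel moved with stride 1 over a 2-dimensional input) can be written as two nested loops; real implementations use optimised library routines instead.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2-D convolution (no padding) of a single-channel image."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # weighted sum over the receptive field
    return out

# The kernel from Figure 1.4 applied to a random binary image.
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
image = (np.random.default_rng(0).random((7, 7)) > 0.5).astype(float)
print(conv2d(image, kernel))
```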

The intuition behind this is that every convolution has a certain receptive field, corresponding to the kernel dimensions. If multiple convolutional layers are put one after another, the effective receptive field grows in size as smaller fields are again combined by subsequent convolutions. One component of the output of such a collection of convolutional layers contains the combined information of a certain number of original features. Having that output as the input of a normal layer means the normal neurons no longer have the original features at their disposal, but a combination of those from different regions of the input. This in turn introduces the benefit of obtaining a representation of the original data that is more invariant to shifts in that data. In the regular case, if the input of a normal neural network is an image of a certain object, a slightly translated image of that same object would change the input dramatically. Since the representation of that image formed by a couple of convolutional layers depends on combinations of features, it will not change as much if a small translation of the same image is handed to those layers.

Apart from convolutions, pooling operations were introduced to deal with the often high-dimensional output of a convolutional layer, as one might expect this to be substantial when utilising a wide range of filters. These are similar to a convolution in the sense that they also assemble the information of a certain region. The difference is that pooling operations only use fixed weights and functions, instead of free ones that are to be optimised during network training. Examples are max-pooling or average-pooling operations, where the weights are uniformly 1 and the activation functions are the maximum or average operator respectively. In practice these operations thus mostly follow the application of a convolutional layer to reduce the dimensionality of its output.
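A max-pooling operation can be sketched in the same style; it is identical to a convolution sweep except that the weights are fixed and the aggregation is a maximum instead of a weighted sum (again a hedched illustration, not library code).

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Max-pooling over strided windows of a 2-D feature map."""
    oh = (feature_map.shape[0] - size) // stride + 1
    ow = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()             # fixed aggregation, no trainable weights
    return out

print(max_pool2d(np.arange(16, dtype=float).reshape(4, 4)))
```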

Recurrent neural networks introduce a completely different kind of operation and are applicable in situations where the input generally has a time-dependent component, e.g. a daily series of weather data from various weather stations across a country. While a normal network only allows connections between consecutive layers, recurrent networks permit connections back to neurons in previous layers. When a time-dependent or sequential input is handed to such a network, the return connections allow for the sharing of information from previous states. This is frequently used in the context of natural language processing as well, where the input can be a phrase or body of text. In order to obtain more context for a given word in a sentence, information on previously seen words is of the utmost importance. Figure 1.5 shows a symbolic representation of a recurrent layer. In effect, it consists of $t$ repetitions of the same cell $A$, where every subsequent $i$th cell has as input the activation $h_{i-1}$ of the previous $(i-1)$th cell and the $i$th element of an input sequence $x$. Here $t$ stands for the length of the input sequence.


Figure 1.5: Symbolic representation of an unrolled recurrent layer. [28]

Apart from allowing just these return connections, there is an array of special configurations of neurons with different activation functions trying to model some form of memory, keeping track of which previously observed states should really be preserved and which should not. One of the most prominent of these is the Long Short-Term Memory or LSTM [15, 28] configuration, which will be used in the recurrent network architectures of this work. Such a layer is built up from a number of LSTM cells, similar to how a regular layer is built up from individual neurons. Every such cell can, in itself, be seen as a small neural network, as it consists of several connected layers of neurons. Figure 1.6 illustrates how such a cell fits into the recurrent framework. Note that apart from the cell activation $h_t$, another arrow points to the next LSTM cell. This extra information is called the internal cell state $C_t$ of the $t$th LSTM cell.

Figure 1.6: Illustration of how an LSTM cell fits into the recurrent framework. [28]

The inner workings of a normal LSTM cell are governed by a set of 4 equations, corresponding to the 4 neural layers present in a single cell. These layers are represented in Figure 1.6 by the yellow rectangles. Figure 1.7 gives the first of these equations, often referred to as the 'forget' equation. This weighs and decides which part of the previous state $C_{t-1}$ should be forgotten, depending on the new input $[h_{t-1}, x_t]$, by multiplying $C_{t-1}$ with $f_t$. An output of $f_t = 0$ would indicate throwing away all of the information in $C_{t-1}$, while a value of $f_t = 1$ implies keeping the entire previous state. The notation $[h_{t-1}, x_t]$ denotes the concatenation of the activation $h_{t-1}$ of the previous cell and the current input $x_t$.

Figure 1.7: The 'forget' equation and corresponding neuron in an LSTM cell. [28]

Next, it has to be decided which information to add to the internal state. This step consists of 2 parts. One layer with a tanh activation function provides a new candidate value $\tilde{C}_t$, and another layer with a sigmoidal activation function, called the 'input gate', determines how much of this potential state, $i_t \cdot \tilde{C}_t$, is passed on and added to the next internal state. This step is illustrated in Figure 1.8.

Figure 1.8: Determining potential information to add to the cell state. [28]

Combining the above, the new cell state can be determined by the equation seen in Figure 1.9. It is precisely this value that will be carried over to the next cell and will be used to compute the ultimate activation of the current cell.


Figure 1.9: Updating to the new state $C_t$ based on previous calculations. [28]

Finally, the output of the current cell will be based on the internal state that was produced in the previous steps. $C_t$ is put through a tanh function to normalise its values to lie between -1 and 1, after which a final sigmoidal filter is applied to remove those parts of the output that are not deemed interesting. So apart from the raw internal state $C_t$, a filtered version $h_t$ is carried over to the next LSTM cell. Figure 1.10 visualises this last step.

Figure 1.10: Calculating the final output $h_t$ of an intermediate LSTM cell. [28]

As mentioned before, every yellow rectangle represents a complete neural layer of a certain dimensionality or width. The widths of all the internal layers of a cell thus have to be equal, and this then also corresponds to what will be called the width of the entire recurrent layer. All intermediate values like $C_t$ or $h_t$ are vectors, and all functions are applied point-wise. As a final remark, it has to be mentioned that this addition to the deep learning framework is compatible with the earlier discussed optimisation methods. In other words, all of the free parameters like $W_f$ or $b_C$ can be optimised using backpropagation.
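The four gate equations described above can be collected into a single function. The sketch below is a minimal NumPy rendering of one LSTM step, with the sigmoid and tanh applied point-wise; the weight matrices $W_f, W_i, W_C, W_o$ and biases $b_f, b_i, b_C, b_o$ follow the naming of the figures but are randomly initialised here purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM cell step: returns the new activation h_t and cell state C_t."""
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])            # the concatenation [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # 'forget' gate: which part of C_{t-1} to keep
    i_t = sigmoid(W_i @ z + b_i)                 # 'input' gate: how much new information to add
    C_tilde = np.tanh(W_C @ z + b_C)             # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde           # new internal cell state
    o_t = sigmoid(W_o @ z + b_o)                 # output gate
    h_t = o_t * np.tanh(C_t)                     # filtered activation passed to the next cell
    return h_t, C_t

# Randomly initialised cell of width 8 processing a sequence of 3-dimensional inputs.
rng, width, n_in = np.random.default_rng(0), 8, 3
params = [p for _ in range(4) for p in (rng.normal(size=(width, width + n_in)) * 0.1,
                                        np.zeros(width))]
h, C = np.zeros(width), np.zeros(width)
for x_t in rng.normal(size=(5, n_in)):
    h, C = lstm_step(x_t, h, C, params)
print(h)
```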

1.1.6 Motivation

All things considered, neural networks are just a class of real-valued functions. Apart from the fact that a practical and generally well-performing approach to their optimisation is readily available, one could ask why exactly this class has gained so much interest.


The specific construction of a neural network lends itself very well to proving results in an approximation-theoretic context. The most well-known result is the Universal Approximation Theorem [16].

Theorem (Universal Approximation Theorem). Every continuous function $f$ defined on a compact subset $K \subset \mathbb{R}^n$ can be uniformly approximated arbitrarily well by a neural network of depth one with an arbitrary non-linear activation function $\varphi$, i.e. $(\forall f \in C(K, \mathbb{R}^n))(\forall \varepsilon > 0)(\exists N \in \mathbb{N})$ such that

$$\sup_{x \in K} |F(x) - f(x)| < \varepsilon, \quad \text{with } F \text{ defined as} \quad F(x) = \sum_{i=1}^{N} v_i\, \varphi(w_i \cdot x + b_i),$$

for certain $v_i$, $w_i$ and $b_i$. Equivalently, it can be said that the class of functions of the same form as $F$ is dense in $C(K, \mathbb{R}^n)$.

This only means that for any continuous function defined on a compact subset, there exists a combination of parameters of a neural network of depth one that approximates that function arbitrarily well, not that the above-mentioned optimisation algorithm actually finds those exact parameters. While it is practically impossible to guarantee such parameters can always be found, numerous practical analyses indicate that the optimisation methods in use do in fact work well.

Note that a network with just one hidden layer already has this strong approximation property, so it might seem, at first, that it does not make any sense to even consider networks with more than 1 hidden layer. However, results for conventional [6] and recurrent [19] neural networks have shown that the use of more than 1 hidden layer can reduce the number of neurons, and hence memory, necessary to approximate a function arbitrarily well by roughly an exponential amount. Other results [5] were able to derive a fundamental lower bound on the memory required to store the weights of a deep neural network whilst still guaranteeing the uniform approximation. It is precisely the combination of practical applicability, backed by the aptitude of neural networks to perform well in more theoretical contexts, that drove their success over the last few decades.

1.2 Interpretability

Interpretable machine learning or IML is the subfield of machine learning that focuses on the generation of explanations of, or interactions with, a certain machine learning process. Furthermore, the produced explanations and interactions should be humanly understandable. The requirement to be inherently humanly interpretable leads to numerous issues when trying to formally define any notions within IML. Due to the increased interest in this field over the last couple of years, multiple efforts [21, 40] were undertaken in developing a more structured taxonomy.

1.2.1 Desiderata

Problems already arise with the use of the words interpretation or interaction, as these are very subjective and difficult-to-formalise concepts. One possible perspective [21] is to classify an interpretation by the goal it desires to achieve. The most pressing desideratum of any explanation would be to increase the trust put into a machine learning model by humans, as this is the most common cause of tension in any practical application. Connected to this is the important question of ethical and unbiased decisions. In many more socially integrated applications it is absolutely necessary to guarantee that the model does not base its decisions on biased information that can be present in its training dataset, or to at least be able to identify such information. A completely different perspective would be that of informativeness, i.e. a machine learning model is used to provide information about its specific application. As the raw computational power of modern computing devices increases dramatically over time [24, 35], applying machine learning algorithms to problems with unknown mechanics could in turn contribute to a more rapid increase in knowledge. Indeed, those algorithms extract their knowledge from enormous datasets and, given that they perform well, a closer analysis of why they do so could yield interesting new directions of research previously unknown to humans. Other feasible desiderata can be given as motivation for more forms of interpretation, but this work will only consider the generation of explanations that contribute to either the trust in or the informativeness of a model.

In general it has to be mentioned that the need for explanations originates from some form of incompleteness of the model itself, its task or its application formalisation. Indeed, not all problems require a form of interpretability, examples of such systems being climate control or postal sorting. [9] argues that the reason for this is either that no impactful consequence is linked to a faulty output, or that the model and its applications are sufficiently verified such that a lack of trust is no longer an issue. A case where incompleteness is present and does lead to a necessity of interpretation can be found in the human quest for knowledge itself. Gaining knowledge is one of the many goals possibly set by a human, yet there is no complete way to state what knowledge abstractly is. This results in having to be satisfied with potential explanations of whatever the knowledge in question is, which are then, in turn, converted to the knowledge that was sought in the first place. The gap of incompleteness thus gets filled with a potential interpretation, given by an agent, algorithm or any other explanation generator, in a comprehensible form such that it is deemed adequate with respect to the inquirer.

1.2.2 Context of explanations

As the complete process of a machine learning algorithm consists of multiple key parts, and each of these can be a possible subject to explain, it is vital for any interpretability method to state the exact aspect of such an algorithm that it tries to explain. A fairly recent study [40] has taken it upon itself to analyse publications linked to IML over the last couple of years in order to propose a practical division of a machine learning process, together with a description of what an interpretation or interaction in that context might mean. Figure 1.11 shows a symbolic representation of the proposed division. This work will focus on the interpretability of the predictions P of a given trained model M on a dataset D. In other words, methods that output an explanation as to why predictions have the values they have will be the primary subject. While every single one of the remaining components is equally interesting to study, having properly functioning methods to explain the predictions of a model is the most direct way to increase the general trust put into that model, since it is the predictions that are most frequently served to humans of various backgrounds in a multitude of applications. Apart from that, explaining decisions might also be the most natural way to improve trust or informativeness, since providing a solid argumentation for a given decision, in any form, is arguably the most common way humans would explain their actions to one another.

1.2.3 Local versus global

Another important distinction when discussing interpretability is the difference between local and global explanations. The construction of a local explanation or interaction depends on a single dataset sample. A typical example of this would be the so-called saliency maps often used in image recognition, which assign scores to the individual pixels of an image, showing which pixels had the most influence on the given model prediction. There are numerous methods to assess the importance of input features or combinations thereof, and explicit examples of these will be covered in the next chapter. Figure 1.12 depicts the output of such a saliency method.

Global explanations or interactions draw their conclusions by gathering information across multiple samples.

While such methods avoid potential bias, they are more difficult to construct, as there is the need for a certain operation that gathers information from these samples. Apart from that, it is commonplace in practice for questions to be asked about single predictions, e.g. a bank customer questioning why he personally did not get his loan approved. Global explanations might in that case not be able to answer that question, as specific information could be lost during this gathering process. There is thus always a trade-off between discovering the more general patterns given by global methods and the specificity of local methods.


Figure 1.11: Schematic distinction of different machine learning aspects. [40]

Figure 1.12: Example of a saliency map which visually illustrates pixel importance. Left: The original image. Middle: Evidence on the image towards recognising a dog. Right: Evidence on the image towards recognition of a cat. [36]


1.2.4 Evaluation

Last of all, there is the still hotly debated matter of explanation evaluation. Again, due to the inherent demand of human comprehensibility, the question of how to evaluate the interpretations and interactions acquired from any method in an objective manner is very problematic. [9] gives, among other things, a proposed taxonomy of interpretability evaluation, similar to what [40] does for the different explanations themselves. On the one hand, an obvious possibility would be to let real humans assess the performance of explanations. An important distinction can be made in this case between the use of domain experts or more lay individuals. When asking experts to appraise the interpretability, one can speak of application-grounded evaluation. Here the experts can perform a thorough analysis employing their very specific domain knowledge. A possible disadvantage might be that, if the application domain is itself incomplete, new insights can be wrongly assessed as being faulty. Another problem is that expert appraisal involves a lot of time and effort, which is further amplified by the smaller population from which experts can be drawn.

When considering human evaluators that are not necessarily experts in the domain of the application, which will be referred to as human-grounded evaluation, there is the advantage of a larger population to draw from. However, the specificity of their assessments will not be as profound as would be the case when only considering domain experts. Moreover, if the application itself is complex in nature, a simplified version of the real task or its components will have to be given for intelligibility's sake.

The use of humans can in itself, however, lead to problems of objectivity. While the above discussed methods definitely perform well in their appraisal of how humanly understandable an explanation is, they can most surely not be guaranteed to be completely objective. For that, a third category of functionally-grounded evaluations arises, where a formal expression or proxy of explanation quality is used. By avoiding the use of human individuals, the main benefits are a reduction in the time and effort required to perform the evaluation and a shift to a more objective metric. Where these metrics fall short is the comprehensibility part. Unless the adopted definition or proxy of interpretability is tested on how understandable it actually is for humans, one cannot assume anything in this department.

In the following chapter, a functionally-grounded method of evaluating performance that has been shown to have some characteristics of being humanly understandable will be used. It is one of the primary goals to keep all discussions as objective as possible, hence only functional methods will be used.


2 Explanation generation

As mentioned before, there has been an ever-growing need for humanly interpretable explanations of decisions made by computer algorithms, in particular those made by applications of machine learning. Especially the case of deep learning has spurred an expanding inquiry into exposing the knowledge on which the decisions of those applications are based.

In order to satisfy that inquiry, two main problems arise. First of all, there is the need for a global analysis of the knowledge of the application at hand. There are already numerous methods for discovering regions of interest of a single given input, e.g. saliency methods, but for a more general picture of the explanation problem there is a definite need for methods that generate explanations pertaining to the more global level.

Secondly, there is the need for human interpretability of any generated explanation. When thinking about what an explanation of a certain decision should contain, the notion of concepts immediately becomes necessary. Indeed, when asking someone why they think there is a car in front of them, the human explanation could be the presence of wheels, headlights, asphalt, etc. These are the high-level concepts present in our visual field that are given a certain degree of importance towards deciding what object is actually in front of us. In a similar way, there is thus the need for any explanation generating method to be at least capable of reasoning on a similar level of abstraction.

In what follows, the special case of deep learning applications will be at the forefront of the discussion as this provides one of the most challenging problems in explanation generation to date. In light of this restriction, a recall of notation might prove beneficial.

Let $\Phi : \mathbb{R}^n \to \mathbb{R}^m$ be an already trained neural network with $l$ hidden layers $\Phi_i$ of widths $m_2, \dots, m_{l+1} = m$, with weights $w^k_{ji}$, biases $b^k_i$ and activation functions $\varphi_i$. An implicit assumption made throughout is that the task for which $\Phi$ is trained is a classification problem with $m$ possible classes. This assumption is purely due to the focus of this work being mainly on classification, not because it is incompatible with a regression setting.

2.1 Local methods

Before continuing to look at possible solutions to the problems specified above, let us first give a proper example of a recently proposed local interpretation method, DeepLIFT [38]. DeepLIFT is an example of a method based on the paradigm of Layer-Wise Relevance Propagation or LRP [3]. Before diving into the inner workings of DeepLIFT, a brief overview of LRP will be given.

2.1.1 Layer-Wise Relevance Propagation

Given an input $x = (x_1, \dots, x_n)$ of $\Phi$ with output $\Phi(x) = y = (y_1, \dots, y_m)$, the goal will be to quantify the contribution of component $x_i$ towards a certain prediction $y_j$, i.e. to assign relevance scores $R_i$ to every $x_i$ such that

$$y_j = \sum_{i=1}^{n} R_i.$$

Then $R_i > 0$ would indicate that feature $x_i$ has a positive influence towards classifying $x$ as class $j$, while $R_i < 0$ would imply the contrary. Of most interest, however, would be the differential contributions $\Delta R_i$ with respect to some input $x_0$ such that $\Phi(x_0)_j = 0$, corresponding to a state of maximal uncertainty with respect to the prediction of class $j$. Alternatively, a state of maximal uncertainty would be $\Phi(x_0) = \frac{1}{m}$, where $m$ represents the number of classes. This is easily reduced to the previous situation by consistently subtracting $\frac{1}{m}$. In practice it is often difficult to find such a root point, and DeepLIFT will give an example where such a specific point is not necessary. However, in order to keep the intuitive and qualitative description of positive scores relating to positive influence, it will be assumed that such a point $x_0$ exists and is readily available.

As the name suggests, LRP will make use of the layer-wise construction of neural networks to propagate relevance scores throughout the network, not very dissimilar from how backpropagation works in the optimisation phase. To that end, suppose relevance scores $R^{k+1}_j$ are given for every $j$th neuron in layer $k+1$ of $\Phi$ with respect to the input $x$. A certain set of rules will then be given to recursively obtain the scores $R^k_i$ for every $i$th neuron in layer $k$ while respecting the set of equations

$$\Phi(x)_j = \dots = \sum_{i=1}^{m_{k+1}} R^{k+1}_i = \sum_{i=1}^{m_k} R^k_i = \dots = \sum_{i=1}^{n} R_i,$$

also known as the preservation of relevance. The values $R_i$ correspond to the scores obtained by applying the recursive rules to the relevances $R^1_j$ of the first hidden layer. Defining the explicit recursive rules to propagate the relevances is left to more specific implementations of LRP, like DeepLIFT. In general, however, such a rule is assumed to be of the form

$$R^k_i = \sum_{j\,:\,\text{neuron } i \text{ is input for neuron } j} R^{k,k+1}_{i \leftarrow j},$$

where $R^{k,k+1}_{i \leftarrow j}$ abstractly represents the quantity of relevance flowing from neuron $j$ in layer $k+1$ to neuron $i$ in layer $k$. An example of this could be

$$R^{k,k+1}_{i \leftarrow j} = R^{k+1}_j \cdot \frac{\Phi_k(x)_i \cdot w^k_{ij}}{\sum_{h=1}^{m_k} \Phi_k(x)_h \cdot w^k_{hj}}.$$

Of course, one has to be cautious about situations where the denominator is 0, but this mainly serves as a simple example of how such a rule could look in practice.
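As a hedged sketch of how such a rule operates in code, the example above can be implemented for a single dense layer: the relevance of each output neuron $j$ is redistributed to its inputs $i$ in proportion to the contribution $\Phi_k(x)_i \cdot w^k_{ij}$. A small stabilising constant is added to the denominator to avoid division by zero; this is a common practical choice, not part of the formula above.

```python
import numpy as np

def lrp_dense(activations, W, relevance_out, eps=1e-9):
    """Redistribute relevance from layer k+1 back to layer k.

    activations:   (m_k,) activations Phi_k(x) of layer k
    W:             (m_k, m_{k+1}) weights w^k_{ij} connecting layer k to k+1
    relevance_out: (m_{k+1},) relevances R^{k+1}_j of layer k+1
    """
    contributions = activations[:, None] * W            # Phi_k(x)_i * w_ij
    denom = contributions.sum(axis=0)                   # sum_h Phi_k(x)_h * w_hj
    denom = denom + eps * np.sign(denom + (denom == 0)) # guard against zero denominators
    return (contributions / denom) @ relevance_out      # R^k_i = sum_j R_{i<-j}

# Toy example: 4 input neurons, 2 output neurons.
rng = np.random.default_rng(0)
a, W, R_out = rng.random(4), rng.normal(size=(4, 2)), np.array([1.0, 0.5])
R_in = lrp_dense(a, W, R_out)
print(R_in, "sum preserved:", np.isclose(R_in.sum(), R_out.sum()))
```

Note how the print statement checks the preservation of relevance: the total relevance entering the layer equals the total relevance leaving it, up to the stabilising constant.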

The LRP framework actually applies to every real-valued classifier $f : \mathbb{R}^n \to \mathbb{R}^m$ whose output depends on consecutive stages of computation. An example fitting this description that is not a neural network would be a Bag-of-Words or BoW model [44]. Indeed, such a model can be separated into 3 distinct stages, with the first being the computation of local features, after which the second stage performs some unsupervised algorithm to obtain representatives of these features. In the third stage, statistics of the local features are computed with respect to their representatives. Finally, a classifier using these features is applied and, depending on the sort of classifier, this will be compatible with a similar LRP framework. A more detailed explanation of how this would formally extend is also given in [3].

2.1.2 DeepLIFT

Multiple explicit LRP methods [26, 36] make use of partial derivatives to locally measure the influence of neuron activations with respect to their input neurons. Relevance propagation rules from neuron $i$ to neuron $j$ in layers $k$ and $k+1$ respectively are then roughly constructed by weighing the relevances $R^{k+1}_j$ by these measurements. DeepLIFT [38], on the other hand, avoids using derivatives altogether, albeit by using a finite version of what a derivative conceptually stands for. While gradients are a completely valid candidate in this context, they can sometimes lose information in the region of their domain where they are zero. Such regions can still carry useful information, which is otherwise lost if the derivative returns 0.

Instead, DeepLIFT measures everything in terms of differences from a reference input $x_0$. This reference can be seen as a state of maximal uncertainty, such that all the observed differences, and hence relevances, can be intuitively interpreted as the positive or negative influence attributed to features or neurons with respect to that which is considered to be absolutely uncertain. Given such a reference input $x_0$, define $\Delta x$ as $x - x_0$. A multiplier $m_i$ is then defined to be of the form

$$m_i = \frac{R_i}{\Delta x_i}, \qquad (2.1)$$

with $R_i$ a relevance score. Again, since everything will be measured with respect to the reference $x_0$, $m_i$ is actually a finite version of the derivative of the score $R_i$ if this were to represent a function. Indeed, both are obtained by calculating a difference in function values divided by the input difference, where in the one case those differences are infinitesimal and in the other finite.

To simplify notation, the discussion will be restricted to a single neuron $(w, b, \varphi)$ with input $x \in \mathbb{R}^n$ and output $y = \varphi(w \cdot x + b)$. This restriction does not hamper the generalisation of DeepLIFT in any way, as the flow of relevance throughout the network moves between neurons anyway. The goal will be to give an explicit formula to construct a multiplier $m_i$, such that the relevance $R_{x_i}$ can be derived by inverting (2.1) to $R_{x_i} = m_i \cdot \Delta x_i$. The output $y$ and corresponding relevance $R$ will be separated into their positive and negative parts to obtain

$$\Delta y = \Delta y^+ + \Delta y^- \quad \text{and} \quad R = R^+ + R^-, \quad \text{with} \quad \Delta y = \varphi(w \cdot x + b) - \varphi(w \cdot x_0 + b).$$

The output $y$ of a neuron $(w, b, \varphi)$ can itself be decomposed into a linear part, where multiplication and addition happen on the input $x$, and a non-linear part given by applying the non-linear function $\varphi$. DeepLIFT provides 2 different propagation rules to facilitate both the linear and the non-linear case.

Linear rule

Let $z = w \cdot x + b$ represent the linear part of a neuron operation, so $\Delta z = w \cdot \Delta x$. This gives positive and negative parts

$$\Delta z^+ = \sum_{i=1}^{n} \mathbb{1}\{w_i \Delta x_i > 0\}\, w_i \Delta x_i \quad \text{and} \quad \Delta z^- = \sum_{i=1}^{n} \mathbb{1}\{w_i \Delta x_i < 0\}\, w_i \Delta x_i,$$

which in turn lead to a choice of relevances $R_{x_i}$ of $x_i$ and multipliers $m_{z_i}$ attributed to the intermediate result $z$, with positive and negative parts being

$$R^+_{x_i} = \mathbb{1}\{w_i \Delta x_i > 0\}\, w_i \Delta x_i \quad \text{and} \quad m^+_{z_i} = \mathbb{1}\{w_i \Delta x_i > 0\}\, w_i,$$

and analogously for the negative parts $R^-_{x_i}$ and $m^-_{z_i}$, with the condition $w_i \Delta x_i < 0$. In other words, if the multiplier $m_z$ of $z$ is given by previous propagations, then the relevance $R_{x_i}$ of the input $x_i$ is computed as

$$R_{x_i} = R^+_{x_i} + R^-_{x_i} = \Delta x_i \cdot m^+_{z_i} + \Delta x_i \cdot m^-_{z_i}.$$

Rescale rule

In this case, let $y = \varphi(z)$ with $\varphi$ a non-linear, real-valued function. Since $\varphi$ only takes a single input, a straightforward choice of relevance $R_z$ of $z$ is

$$R_z = \Delta y = \varphi(z) - \varphi(z_0), \quad \text{with } z_0 = w \cdot x_0 + b,$$

which trivially leads to the multiplier $m_y = \frac{\Delta y}{\Delta z}$. If the positive and negative parts of $\Delta y$ are defined to be proportional to the positive and negative parts of the input $\Delta z$,

$$\Delta y^+ = \frac{\Delta y}{\Delta z} \cdot \Delta z^+ \quad \text{and} \quad \Delta y^- = \frac{\Delta y}{\Delta z} \cdot \Delta z^-,$$

then, given the multiplier $m_y$ of $y$, the relevance $R_z$ of $z$ is again easily found as

$$R_z = m_y \cdot \Delta z.$$

Recursion and relevance

Finally, in the spirit of keeping multipliers similar to real derivatives, it is assumed that multipliers adhere to a chain rule as well. This can be illustrated with the multipliers $m_z$ and $m_y$ obtained above, corresponding to a finite version of the derivative of $z$ with respect to $x$ and of $y$ with respect to $z$ respectively. The chain rule would then state that the multiplier $m_i$ of $y$ with respect to $x_i$ is given by

$$m_i = m_y \cdot m_{z_i}.$$

With this relation it is possible to propagate the multipliers backwards through the network in exactly the same fashion as was done for the backpropagation algorithm. The initial multiplier of the output $y_i$, which is supposed to be redistributed layer by layer, can be initialised by choosing $m_{y_i} = 1$, which corresponds to an initial relevance of $R_{y_i} = \Delta y_i$. Recursively using both of the above explicit propagation rules within the LRP framework will yield the desired result of relevance scores $R_i$ for every component $x_i$ of the input $x$.
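The two rules and the chain rule for multipliers can be illustrated on a single neuron. The sketch below is a hedged, simplified rendering: it handles one neuron rather than a full network and collapses the positive/negative separation, so that the Linear rule reduces to the multiplier $m_{z_i} = w_i$ combined with the Rescale rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deeplift_single_neuron(x, x0, w, b, phi=sigmoid, eps=1e-9):
    """Relevances R_{x_i} of one neuron y = phi(w.x + b) w.r.t. a reference x0."""
    dx = x - x0                                  # Delta x
    z, z0 = w @ x + b, w @ x0 + b
    dz, dy = z - z0, phi(z) - phi(z0)            # Delta z and Delta y
    m_y = dy / (dz + eps)                        # Rescale rule: multiplier of y w.r.t. z
    m_z = w                                      # Linear rule (no +/- split): multiplier of z w.r.t. x_i
    m = m_y * m_z                                # chain rule for multipliers
    return m * dx                                # R_{x_i} = m_i * Delta x_i

x = np.array([1.0, -2.0, 0.5])
x0 = np.zeros_like(x)                            # reference input ("maximal uncertainty")
w, b = np.array([0.4, 0.3, -0.8]), 0.1
R = deeplift_single_neuron(x, x0, w, b)
print(R, "sum:", R.sum(), "Delta y:", sigmoid(w @ x + b) - sigmoid(w @ x0 + b))
```

The printed sum of relevances matches $\Delta y$, which is exactly the finite analogue of relevance preservation that DeepLIFT aims for.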

2.2 Testing with Concept Activation Vectors (TCAV)

In the quest to solve the problems described earlier, a possible partial solution is given by [20]. It allows testing whether a certain given concept is of any interest to the application at hand by vectorising the concept in some real vector space and calculating directional derivatives with respect to this vectorised concept to measure its importance.

Allow the proposed procedure to be clarified further with a practical example. Consider the case of object recognition where one of the classes is that of a zebra. A trained and at least decently performing convolutional neural network $\Phi : \mathbb{R}^n \to \mathbb{R}^m$ is given, where $n$ is the dimension of the input space and $m$ the number of possible object classes.

When subjectively asking what concept would influence the decision to classify an object as a zebra, that of 'striped' could come to mind; say there is a certain set of example samples of 'striped' readily available. Figure 2.1 gives a small example of such a set. It is generally accepted that the abstraction level of CNNs grows with increasing depth of the network [17]. Since 'striped' is a rather low-level concept, let $\Phi_l : \mathbb{R}^{m_l} \to \mathbb{R}^{m_{l+1}}$ be a relatively early hidden layer of $\Phi$. To not make the notation any more convoluted, let the notation $\Phi_l(x)$ with $x \in \mathbb{R}^n$ be shorthand for $\Phi_l(\Phi_{l-1}(\dots \Phi_1(x) \dots))$. The vectorisation process now goes as follows.

Figure 2.1: The left and middle images are visual examples of the striped concept of various origins. Both can be abstractly or directly linked to the right example of the class zebra.

Let $P_C$ be the set of example samples of the concept $C = $ 'striped' and $P_N$ a so-called negative set consisting of, for example, random input samples. Consider the activation sets

$$\Phi_l(P_C) = \{\Phi_l(x) \mid x \in P_C\} \quad \text{and} \quad \Phi_l(P_N) = \{\Phi_l(x) \mid x \in P_N\}.$$

Both of these are subsets of $\mathbb{R}^{m_{l+1}}$, and thus a binary linear classifier, e.g. an SVM, can be trained to distinguish between them. It is this classifier, represented by the vector $v^l_C \in \mathbb{R}^{m_{l+1}}$ perpendicular to the decision boundary, that will be seen as the vectorisation of the given concept. Any such vectorisation will be called a Concept Activation Vector of the concept $C$, or CAV for short. The reason for this choice is that such a vector points in the direction of what the linear classifier considers to be the part of $\mathbb{R}^{m_{l+1}}$ relating to the concept $C$ in the form of the activation set $\Phi_l(P_C)$. This also clarifies the use of directional derivatives later on.

Next, let $\Phi_{l,z} : \mathbb{R}^{m_{l+1}} \to \mathbb{R}$ be the map giving the logit value of the class zebra, with its domain being the range of $\Phi_l$. As described above, the directional derivative of $\Phi_{l,z}$ with respect to $v^l_C$ is the quantity of interest, quantifying the impact of a change towards the direction of $C$. Concretely,

$$S_{l,z,C}(x) = \lim_{\varepsilon \to 0} \frac{\Phi_{l,z}(\Phi_l(x) + \varepsilon v^l_C) - \Phi_{l,z}(\Phi_l(x))}{\varepsilon} = \nabla \Phi_{l,z}(\Phi_l(x)) \cdot v^l_C$$

can be calculated for any input sample $x \in \mathbb{R}^n$.

Finally, one can now measure the sensitivity of a class to a certain concept $C$ over multiple samples, i.e. a more global conclusion can be acquired. Indeed, let $X_z$ be the set of all samples with label 'zebra' and define the TCAV-score of the concept $C$ as

$$TCAV_{Q_{l,z,C}} = \frac{|\{x \in X_z \mid S_{l,z,C}(x) > 0\}|}{|X_z|}.$$

This TCAV-score quantitatively represents the influence of concept $C$ on the class 'zebra'. If it is close to 0.5, the concept is probably of no real importance for the network's decision to classify an object as 'zebra', while if it is significantly higher or lower, it is.

A subtle point in this procedure is that the learned CAV depends on the particular choice of negative examples. In the worst case, a completely useless CAV is learned and all its derivatives are rendered worthless. To guard against such volatile scenarios, multiple CAVs can be trained for the same concept $C$ using different negative sets $P_{N,i}$. If the CAVs representing the concept $C$ are consistent and useful, they should ideally have TCAV-scores that do not vary much and that differ from what a random vector would give, which is a score of 0.5. More precisely, a two-sided t-test of the acquired scores can be performed, deciding whether their mean is significantly different from 0.5. If so, it is said that the concept $C$ influences the prediction of class $z$ significantly.
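The full procedure can be summarised in a short, hedged sketch: a linear classifier separates concept activations from negative activations, its (normalised) normal vector serves as the CAV, directional derivatives are taken against it, and a two-sided t-test compares the resulting TCAV-scores to 0.5. The activations and gradients below are randomly generated placeholders; in practice they come from the layer $\Phi_l$ of the trained network, and the linear classifier could just as well be an SVM.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from scipy.stats import ttest_1samp

def cav(concept_acts, negative_acts):
    """Train a linear classifier and return the unit normal of its decision boundary."""
    X = np.vstack([concept_acts, negative_acts])
    y = np.array([1] * len(concept_acts) + [0] * len(negative_acts))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)

def tcav_score(class_gradients, v):
    """Fraction of samples whose directional derivative along the CAV is positive."""
    return float(np.mean(class_gradients @ v > 0))

# Placeholder activations/gradients of dimension m_{l+1} = 64.
rng = np.random.default_rng(0)
concept_acts = rng.normal(1.0, 1.0, size=(50, 64))
grads = rng.normal(0.2, 1.0, size=(200, 64))     # gradient of the class logit at Phi_l(x)

# Several CAVs from different negative sets, then a two-sided t-test against 0.5.
scores = [tcav_score(grads, cav(concept_acts, rng.normal(0.0, 1.0, size=(50, 64))))
          for _ in range(10)]
t_stat, p_value = ttest_1samp(scores, 0.5)
print("TCAV scores:", np.round(scores, 3), "p-value:", p_value)
```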

2.3 Automatic Concept-based Explanations (ACE)

While TCAV is a powerful tool that allows the analysis of the influence of certain given concepts on the decision process of a deep learning application, it does not discover those concepts of interest automatically. Manually looking for concepts and gathering example images of them is very time-consuming, and considering every concept of possible interest may very well be an insurmountable task. Automation of this process is thus definitely desirable. A way to fill this gap is proposed in [11]. An overview of the general procedure, applied to the same object recognition example, follows. A pseudo-algorithmic version is given by Algorithm 1.

Algorithm 1 The ACE Procedure

 1: procedure ACE(Φ, l, S_z, P_N, R)
 2:     A ← ∅
 3:     for all x ∈ S_z do
 4:         for all r ∈ R do
 5:             L_{x,r} ← Segment(x, r)
 6:             for all x_j ∈ L_{x,r} do
 7:                 x_j ← Scale(x_j)
 8:                 a_j ← Φ_l(x_j)
 9:                 A.extend(a_j)
10:             end for
11:         end for
12:     end for
13:     C ← Cluster(A)
14:     result ← ∅
15:     for all c ∈ C do
16:         score_c ← TCAV(Φ, l, S_z, P_N, c)
17:         result.extend((c, score_c))
18:     end for
19:     return result
20: end procedure

As with TCAV, the input consists of a trained classifier Φ : R^n → R^m, the index l of a hidden layer of interest, a subset S_z ⊆ X_z of all samples with a certain label z, and a family P_N of sets of negative samples P_{N,i} used to define the CAVs. Apart from that, a list of scales R also has to be given, whose necessity will be explained shortly. The ideal output of ACE would be a list of interpretable concepts with their corresponding TCAV-scores with respect to the class with label z. In other words, when considering the object recognition setting, the target class of interest z would be that of a 'zebra' and the desired output should ideally contain some sort of representation of the 'striped' concept.


Figure 2.2: Segmentations of a zebra image at increasing scales from left to right.

In the first step, described in lines 3-5 of Algorithm 1, multi-resolution segmentations L_{x,r} based on the provided scales r ∈ R are constructed for every sample x ∈ S_z, since the concepts that are important for decision making are obviously present in some form in the samples themselves. The precise form of the scales r ∈ R depends on the utilised segmentation method, but they are mostly natural numbers giving the number of elements expected in one segment. Hence, by segmenting the samples at different scales, discovery of concepts of various levels of abstraction is made possible. In the case of the example application, the input images are segmented into a number of groups of pixels, or super-pixels, of varying size. The smaller super-pixels mostly contain texture information, while larger super-pixels can contain complete parts of objects. The 'striped' concept could then, for example, be discovered by looking at the smaller super-pixels. A visual example of such a segmentation can be seen in Figure 2.2.
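For image data, this multi-resolution segmentation could for instance be implemented with the SLIC super-pixel algorithm from scikit-image, as sketched below; the three scale values are illustrative examples, not necessarily those of the original ACE work.

from skimage.segmentation import slic

def multi_scale_segments(image, scales=(15, 50, 80)):
    # One super-pixel label map per scale; small scales capture textures such as
    # stripes, large scales capture whole object parts.
    return [slic(image, n_segments=n, compactness=20, sigma=1.0) for n in scales]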

Given a certain number of segments x_j ∈ L_{x,r} of the samples x ∈ S_z, the next goal is to group similar segments coming from different samples into clusters, which will then correspond to concepts. Since clustering in the original input space is mostly far from ideal due to the general lack of a proper clustering metric, an intermediate metric space is required. In [43] it was discovered that the Euclidean metric on the activation space of an intermediate layer of a real-valued and well-trained convolutional neural network is a satisfactory candidate. Hence all segments are scaled up to the dimension of the input and passed through the given network up until the intermediate layer of interest Φ_l : R^{m_l} → R^{m_{l+1}}. This is seen in lines 6-9 of Algorithm 1. Let the intermediate activations of the scaled segments be a_j = Φ_l(x_j) ∈ R^{m_{l+1}}. All of these activations a_j are now clustered in the latent space R^{m_{l+1}}, as is done in line 13. Depending on the utilised clustering method, additional parameters could be required by the ACE procedure. For example, if k-means clustering is used, the number of clusters k to consider has to be given. To further improve the coherency of the concepts, segments with outlier activations are removed from every cluster. Moreover, every cluster is obliged to contain a certain minimal and maximal number of elements. Discovered clusters not meeting these additional requirements are removed from the process. The collection of segments belonging to each cluster is now seen as the set of examples of the concept represented by that cluster. The examples of the 'striped' concept in Figure 2.2 could in this case be segments of an original image of a zebra that were grouped together in this way based on their similar intermediate activations.

For every one of these clusters and their examples, multiple CAVs, one per negative set P_{N,i}, are constructed as before and the TCAV-scores are computed with respect to the same intermediate layer in which the clustering was carried out. The acquired scores are then a final measure, indicating whether the concepts have any significant effect on the target class. This final step is done in lines 15-18. The set of segments representing a cluster or concept is then seen as a collection of interpretable examples of the concept itself, giving a conceptual and globally valid interpretation of the network's decision process. Figure 2.3 gives an illustration of how such an eventual output might look, where ACE was applied to the 'GoogLeNet' [39] object recognition network.
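The clustering and filtering just described (line 13 of Algorithm 1 together with the coherency requirements) could look as follows; the number of clusters, the cluster-size bounds and the fraction of retained members are hypothetical parameter values.

import numpy as np
from sklearn.cluster import KMeans

def cluster_segment_activations(activations, k=25, min_size=20, max_size=500, keep_frac=0.9):
    # Cluster the activations a_j, keep only the members closest to each centre
    # and discard clusters that are too small or too large.
    activations = np.asarray(activations)
    km = KMeans(n_clusters=k, n_init=10).fit(activations)
    concepts = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(activations[idx] - km.cluster_centers_[c], axis=1)
        idx = idx[np.argsort(dists)[: int(keep_frac * len(idx))]]
        if min_size <= len(idx) <= max_size:
            concepts.append(idx)  # indices of the segments representing one concept
    return concepts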


Figure 2.3: The output of ACE applied to a set of example images of the zebra class, using the GoogLeNet classifier. One significant concept was discovered, which can indeed be linked to the striped concept.


3 Relevance-driven ACE (R-ACE)

While ACE by itself already seems to perform reasonably well in the case of image analysis, the way in which the concepts are gathered is still rather crude. Multiple segmentations of every sample in a given set are considered, and all of the segments present in those segmentations are passed through the network to compute intermediate activations and organise them into clusters. One might imagine that many of those segments do not actually contribute any useful information to the decision process, convoluting the clustering and hence the concept discovery process. The segmentation step is applied to every sample separately; in other words, it is a local process. Instead of indiscriminately feeding a lot of possibly useless information to the network, it might prove interesting to first assess the relevance of the input features of each sample with a local interpretation method. Given such relevance scores, a threshold could be used to identify those regions of the sample of sufficiently high importance, from which the normal or a slightly altered ACE procedure could then carry on. In other words, ACE would be converted into a method that automatically globalises and conceptualises interpretations that are already locally valid.

As R-ACE will mainly try to refine the segmentation step used in ACE, a more precise description of the utilised segmentation method has to be given first. The following chapters will apply both ACE and R-ACE to a selection of problems residing in the domain of Natural Language Processing (NLP), meaning the original ACE segmentation methods applicable in image analysis are not usable here. Every input x, which will be a sequence of words x_i for i ∈ {1, . . . , n}, will be segmented as follows. Given a scale r ∈ N, construct a list of segments by letting segment j consist of the words x_{(j−1)·r+1}, . . . , x_{j·r} for j ∈ {1, . . . , ⌊n/r⌋}. The last segment, with index ⌊n/r⌋ + 1, then takes the remaining words x_{⌊n/r⌋·r+1}, . . . , x_n.
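A minimal sketch of this word-level segmentation, together with a small usage example, is given below.

def segment_words(words, r):
    # Consecutive segments of r words; the final segment collects the leftover
    # words x_{⌊n/r⌋·r+1}, ..., x_n when n is not a multiple of r.
    segments = [words[j * r:(j + 1) * r] for j in range(len(words) // r)]
    if len(words) % r:
        segments.append(words[(len(words) // r) * r:])
    return segments

# segment_words(["the", "movie", "was", "surprisingly", "good"], 2)
# -> [["the", "movie"], ["was", "surprisingly"], ["good"]]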

The following discussion provides a specific description of what differs from the normal ACE procedure in the context of this segmentation method, with possible outlooks towards more general applications such as image analysis.

3.1 R-ACE using DeepLIFT

Let Φ : R^n → R^m be a classifier with output components Φ(x)_i for a certain input x. Instead of considering every consecutive word or group of words in the input as a possible segment, a relevance propagation method, as introduced in chapter 2, will be utilised to calculate relevance scores R_i for every word x_i in the sequence x. Such a method always gives relevance scores with respect to one target class, just as TCAV uses the directional derivative with respect to a target class. As reference point x_0, an input of all zeroes could be used, as this would represent an empty input sequence. Without any words to analyse at all, such an input would be considered a state of maximal uncertainty, as neither a neural network nor a human could assign any sentiment to it. A possible reference point for image analysis applications could similarly be a plain white image.

Given such scores R_i, the next step is determining which scores indicate words of true interest according to the network Φ. Say the class i is the target class. If ∆y_i = Φ(x)_i − Φ(x_0)_i > 0, this means the evidence for x to be classified as class i is more substantial than that of x_0. Furthermore, the initial relevance, which is equal to ∆y_i, is positive. Hence words x_i with positive relevance scores R_i have a positive influence towards classifying x as class i according to Φ. On the other hand, if ∆y_i = Φ(x)_i − Φ(x_0)_i < 0, the evidence for Φ to classify x as class i is less substantial than that of x_0. In this case the initial relevance ∆y_i is negative, and negative relevance scores R_i would highlight those words that led to this lower degree of evidence. In other words, positive scores R_i would still signal words of positive influence. The reason for this additional discussion is the important fact that the chosen point of reference x_0 does generally not satisfy Φ(x_0)_i = 1/m, as was assumed in the main discussion of an LRP method in chapter 2. x_0 defines, in a way, its own baseline of what is generally considered maximally uncertain, contrary to choosing a point that would satisfy what Φ finds to be maximally uncertain. This has an effect on how to interpret the initial relevance, but the conclusion of positive scores indicating positive influence remains the same.

When this is taken into consideration and the right sign of the scores is determined, a threshold can be introduced to select those words with a high enough relevance. In this precise application, the maximum was taken over all the relevance scores that had the right sign for positive influence with respect to class i, and all the features with a score higher than half of this maximum were considered features of interest. As will be seen further down the line, this reduces the number of segments considerably and leads to a better ratio of concepts with a high TCAV-score to those with a lower score. To repeat, instead of using every word appearing in the input as a segment, only those words with sufficiently high relevance are given the segment role in the further course of ACE.
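The thresholding rule just described could be implemented as in the following sketch. The flip_sign argument is a hypothetical convenience for the case, discussed above, where negative scores indicate positive influence; it is assumed that at least one word carries positive influence.

import numpy as np

def select_relevant_words(words, relevances, flip_sign=False):
    # Keep the words whose relevance exceeds half of the maximum score with the
    # sign that indicates positive influence towards the target class.
    scores = -np.asarray(relevances) if flip_sign else np.asarray(relevances)
    threshold = 0.5 * scores.max()
    return [w for w, s in zip(words, scores) if s > threshold]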

Due to the simplicity of the utilised segmentation method, considering only single words as segments, it is straightforward to select only those words with relevance higher than the earlier mentioned threshold as segments of interest. It is straightforward because the input features exactly coincide with the elements that are assigned relevance scores. However, when a more complex segmentation method is desired, this becomes more of an issue, as a method of relevance aggregation per segment is required to indicate the importance of such a more abstract segment. Moreover, the relevance scores could even be used by the segmentation method itself as additional information. This supplementary knowledge could then allow further tailoring of said segmentation procedure towards the problem at hand. For example, say the application is object recognition and the target class of interest is that of a car. Given a grayscale image, identified with a coordinatised region I ⊆ R^2, of a car and the relevance scores R_{ij} of every pixel p_{ij} in that image, those scores would already demarcate different regions of interest and, perhaps more importantly, remove those parts of the image that are not worth consideration with respect to the given classifier. Segments could then be chosen at dynamic scales by, for example, taking the cohesive groups of pixels that pass the relevance threshold r. In other words, letting the connected components of the set

{p_{ij} | R_{ij} ≥ r} ⊆ I ⊆ R^2

correspond to segments would lead to segments of dynamic resolution. In the case of an image of a car, one such component could be a wheel or a window. This would be but one method, which is closely related to topological density-based clustering [42]. In general, any other unsupervised method could be utilised, possibly considering both the additional information of the relevance scores and other domain-specific knowledge of the problem. Again, however, clustering and segmenting samples of text is very different from doing so with image data, as there does not, for example, need to be a direct correlation between inter-word distance and semantic cohesion. Thus, while still possible, the focus will be put on verifying the validity of R-ACE utilising a simple segmentation method.
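For completeness, the connected-component idea for images could be realised with SciPy's labelling routine, as in this sketch; relevance_map is assumed to be a 2D array of pixel relevance scores.

import numpy as np
from scipy import ndimage

def relevance_components(relevance_map, threshold):
    # Connected components of the pixels whose relevance passes the threshold;
    # each boolean mask is one dynamically sized segment (e.g. a wheel or a window).
    labels, n = ndimage.label(relevance_map >= threshold)
    return [labels == i for i in range(1, n + 1)]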

3.2 Extensions of DeepLIFT

Both the rescale and the linear rule were initially only explained in the context of regular dense and convolutional layers, as these consist of the linear and non-linear parts for which the rules were constructed. As mentioned in the introductory chapter 1, there are multiple extensions of the deep learning framework that go beyond these constructs. If DeepLIFT, or any other LRP method, is to be applied to these extensions, further clarification is required. This is provided in the following sections for recurrent and pooling layers.

