MSc Artificial Intelligence
Master Thesis

Interpreting Language Models

by
Jaap Jumelet
11842539

February 14, 2020

48 ECTS
April 2019 – February 2020

Supervisors: D. Hupkes, MSc; Dr. W. H. Zuidema
Assessor: Dr. C. Monz


“Algorithms! Math! Why can’t we make what we wanna make, the way we wanna make it?! WHY?! ”


Abstract

In this thesis, we aim to gain better insights into the capacities and reasoning mechanics of neural language models. Although it has become clear in recent years that these models are highly proficient in processing language, how they are able to do this is still largely unknown. We argue that investigating a model's output is no longer sufficient to obtain a thorough and qualitative understanding. Therefore, we introduce a new interpretability method, called GCD–shap, which we employ to investigate the behaviour of language models. We apply GCD–shap to tasks pertaining to syntactic agreement and co-reference resolution and discover that the model strongly relies on a default reasoning effect to perform these tasks. By applying these techniques to analyse language models, we are able to gain unprecedented insights into the ways these models operate.


Acknowledgements

Firstly, I would like to express my sincerest gratitude to Dieuwke, for your empathetic guidance, your invaluable feedback, and the many thought-provoking discussions we have had. I am eagerly looking forward to our future collaborations! Next, I would like to thank Jelle for your thoughtful insights and your inspiring optimism. Thank you for making me feel a welcome part of your group; I thoroughly enjoyed our discussions.

Big thanks to my good friends from Incognito for keeping me sane; thank you for all the laughs, the tasty beers, and the many weird conversations we have enjoyed throughout the years. To my buddies from Breda: we have known each other longer than we have not, and I can't wait to see where our travels will bring us next. I would like to thank my family for their compassionate support; I have been truly blessed with all of you. To my big little bro: much love to you, you never fail to make me laugh, and I want to thank you for your support and your sincere interest in my studies. I want to thank my dad for being weird and awesome; thank you for inspiring me both intellectually and culturally, and for providing me support when I needed it. I want to thank my mom for being sweet and awesome; for your boundless support and interest in my thoughts, and most of all for teaching me the things that are most important in life.


Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Interpreting Language Models
  1.2 Explainable AI
  1.3 Contributions
  1.4 Outline
2 Background
  2.1 Explainable AI
    2.1.1 Interpretability
    2.1.2 Desiderata
  2.2 Attribution Methods
    2.2.1 Gradient-Based Attributions
    2.2.2 Axiomatic Attributions
    2.2.3 Overview of Attribution Methods
  2.3 Interpreting Language Models
    2.3.1 Language Models
    2.3.2 Assessing Linguistic Capacities
3 Generalised Contextual Decomposition
  3.1 Contextual Decomposition
    3.1.1 Information Flows
    3.1.2 LSTMs
    3.1.3 Decomposing Activations
    3.1.4 Multiplicative Interactions
  3.2 Relevant Interactions
  3.3 GCD–shap
    3.3.1 Combined Contributions
    3.3.2 Individual Contributions
4 Experiments
  4.1 Subject-Verb Agreement
    4.1.1 Phrase Contributions
    4.1.2 Pruning Information
  4.2 Anaphora Resolution
    4.2.1 Resolving Referents
    4.2.2 Gender Preference
    4.2.3 Pruning Information
    4.2.4 Corpus Frequency
  4.3 GCD–shap vs. GCD
5 Discussion and Conclusion
  5.1 Discussion
    5.1.1 Dimensions of Interpretability
    5.1.2 Attribution Methods
    5.1.3 Shapley Values
  5.2 Future Work
    5.2.1 Linguistic Phenomena
    5.2.2 Model Architectures
    5.2.3 Contributions of Interactions
  5.3 Conclusion
A diagnnose Library
Bibliography


Chapter 1

Introduction

Understanding language is an enterprise that has concerned humans for ages. The field of natural language processing (NLP) tackles this task from a computational perspective, by creating large-scale models that are adept at representing and processing language. NLP lies at the intersection of linguistics and computer science and is one of the major subfields of artificial intelligence (AI), an academic discipline that aims to model human intelligence. Allowing computers to process and analyse language is an indispensable step towards creating intelligent systems that are able to communicate with humans, and that are capable of reasoning about their surroundings in a comprehensible manner.

The complex intricacies of human language make this endeavour extremely challenging. The innate ability of humans to efficiently pick up language in only a few years is tremendously impressive, but this innateness stems from millions of years of evolution. How humans acquire and process language is still poorly understood, making it impossible to transfer our cognitive processes directly onto a computational system. Therefore, we need to resort to algorithms that are not necessarily rooted in human language processing, under the presumption that this will yield a sufficient approximation towards language understanding.

NLP ultimately aims to develop systems that possess a truly comprehensive understanding of language, an understanding that is able to systematically generalise to new scenarios, based on a fundamental apprehension of the core structures of language. Contemporary approaches adhere to an unsupervised paradigm, driven by enormous amounts of data, which has led to a series of breakthroughs in the field. These successes have been supported by an exponential increase in computing power, and by using deep learning models such as neural networks. However, employing neural networks comes at a great loss of transparency; while their impressive performance is commendable, it is no longer evident how these models operate. This has given rise to a new line of research that aims to uncover the internal dynamics of these models, in a similar manner to how psycholinguistics attempts to unravel the mysteries of human language processing.

1.1 Interpreting Language Models

In this thesis, we will focus our investigation on language models. These are models that assign a probability to the next token in a sentence, conditioned on a prior context. To do this proficiently, they should be able to thoroughly grasp a sentence on both a syntactic and semantic level: by keeping track of information on the subject, the topic of the sentence, the action being performed, etc. This makes these systems perfectly suited for investigating current NLP systems, as there are few tasks that require such a comprehensive understanding of language as language modelling does.

In recent years, there has been considerable interest in analysing the linguistic capacities of language models. Most of these analyses approach language modelling from a behavioural angle, investigating a model's output behaviour on a specific linguistic phenomenon. This approach has yielded a substantial understanding of what phenomena these models understand, but does not necessarily provide cues on how these phenomena are processed. This is a gap in understanding to which we aim to contribute. We will do so by introducing methods from explainable AI to the analysis toolkit for language models, in order to address the main research question of this thesis:

How do language models successfully process language?

1.2 Explainable AI

Deep learning models such as neural networks are now used in many applications employed in the public domain, such as medical imaging software, automatic translation systems, and fraud detection algorithms. Due to their ubiquitous impact on society, it is vital to thoroughly understand how these neural network models operate and what kind of biases they encode. However, neural network models – commonly referred to as black-box models – are notoriously difficult to interpret. This not only stands in the way of comprehending their powerful capacities and interacting with them, but also impedes their deployment in areas where transparency is paramount. The field of explainable AI offers techniques to increase the interpretability of deep learning models, enabling their predictions to be more explainable.

We will employ explanation methods to gain more qualitative insights into the inner workings of language models. Explaining complex models, such as neural networks, is challenging: an explanation should be simple enough for a user to comprehend, yet at the same time complex enough to be a faithful representation of the model's behaviour. For our research we have looked into the explanations of a method called Contextual Decomposition (CD, Murdoch et al., 2018). However, during the development of our research we found that CD contains several flawed assumptions, leading to explanations that are not faithful enough to the underlying model. The second research question of this thesis will hence focus on the faithfulness of model explanations:

How can we guarantee the faithfulness of an interpretability method?

1.3 Contributions

To address the main research question of this thesis, we develop GCD–shap, a new explanation method that explains the behaviour of neural networks in a clear and understandable way. We utilise GCD–shap to investigate two common linguistic phenomena: subject-verb agreement and anaphora resolution. For both these phenomena we are able to gain unprecedented insights into how a language model processes them, because GCD–shap allows us to delve much deeper into the model's processes than a behavioural approach can. We observe a default reasoning effect: for both phenomena the model picks up a preferred class that acts as a default, and only when it is provided with sufficient evidence will it consider predicting the opposite class.

Guaranteeing the faithfulness of our approach stood at the forefront during the development of GCD–shap. We will provide a clear overview of related research on explanation methods, and place our contributions in the context of these studies. We then address the flaws that we observed in the original formulation of CD, for which we propose several improvements. We adhere to an axiomatic approach: the faithfulness of our method will be guaranteed on the basis of a set of well-studied properties that should be satisfied by an explanation.

1.4 Outline

In Chapter 2, we introduce the background that is relevant for our work. In particular, we present an overview of the field of explainable AI (2.1), focusing on attribution methods (2.2). In addition, we present an overview of current interpretability studies on language models (2.3). In Chapter 3, we introduce our proposed attribution method, GCD–shap, and provide a thorough overview of the procedures that are relevant to our method. In Chapter 4, we apply GCD–shap to investigate a language model's abilities in handling subject-verb agreement and anaphora resolution. Finally, in Chapter 5, we discuss our approach and results, and provide a summary and possible directions for future work.


Chapter 2

Background

The core of this thesis focuses on applying a technique from explainable AI to language modelling. Language models have recently shown their worth as fundamental building blocks for a wide range of NLP applications. Understanding how these models are able to represent language so successfully is therefore a worthwhile endeavour that will lead to a more qualitative and linguistic understanding of the current state of the art. This section sets out an overview of related background and concepts. In Section 2.1 we present an overview of explainable AI, an inspiring field that aims to open up the black-box nature of current deep learning systems. In Section 2.2 we provide an introduction to attribution methods: a concept from explainable AI that can create interpretable explanations for the prediction of a model. In Section 2.3 we provide an overview of work on interpreting language models.

2.1 Explainable AI

The many successes of deep learning in recent years have led to a surge in the deployment of deep learning models, exerting a ubiquitous influence on society. Due to their impact on critical areas such as medicine (Obermeyer and Emanuel, 2016), fraud detection (Sudjianto et al., 2010), and medical imaging (Wu et al., 2019), being able to understand these models is of paramount importance. However, their high-dimensional, non-linear nature makes it extremely difficult to understand the rationale for a prediction by simply looking at the model itself. Therefore, we need to resort to explanations that describe a model's behaviour in more comprehensible concepts. The field of explainable AI aims to create such explanations, allowing model predictions to be better understood and justified. We present a concise overview of the field in this section, highlighting its key concepts and challenges.1

1 A more general overview and survey of the current state of this field, also known as interpretable machine learning, can be found in (Guidotti et al., 2018; Ras et al., 2018; Molnar, 2019; Murdoch et al., 2019; Samek et al., 2019).

2.1.1 Interpretability

To understand a model's behaviour, we need to resort to explanations that express a model's reasoning in interpretable concepts. Designing such explanations thus amounts to developing additional models that act as explanatory models for the more difficult model one aims to understand: the behaviour of a complex system is explained by a simpler, interpretable model. This allows us to gain insights into aspects such as:

1. Specific contributions of input features to a model decision: e.g. the negative sentiment of the sentence “This film is terrible” stems solely from the presence of “terrible”.

2. Latent biases it acquires during training: e.g. a language model that assigns a far higher probability to himself than herself in the following sentence: The doctor looked at himself/herself.

3. Complex reasoning patterns employed by a model: e.g. how a sentiment classifier deals with the function of not in This is not a bad movie.

These insights allow a user to draw qualitative conclusions about areas that surpass the model output itself. Such areas are referred to as desiderata of deep learning models: auxiliary objectives we consider to be important but that prove challenging to model formally. Interpretability allows these objectives to be evaluated or optimised alongside the main objective (Doshi-Velez and Kim, 2017; Lipton, 2018). We will briefly expand on what these desiderata are, and address the external goals that can be aided by interpretability.

2.1.2 Desiderata

One important desideratum is fairness, which concerns the implicit or explicit biases a model might employ, leading to discriminatory decision making. Identifying these biases provides a way to mitigate their effect, and might also shed light on implicit biases that are encoded in the training data itself (Hardt et al., 2016; Dressel and Farid, 2018; Gonen and Goldberg, 2019; Chang et al., 2019). This also provides a way to unmask "Clever Hans" predictors, that effectively rely on spuriously correlated features in the training data (Lapuschkin et al., 2019; McCoy et al., 2019).

If a model is employed in an environment in which it is assigned a certain degree of responsibility, it is important that it can be trusted (Ribeiro et al., 2016). A user should be able to understand a model's dynamics sufficiently to feel comfortable ceding control to it, as a user needs to be confident the model generalises well to new data it has not encountered during training. Causality is a desirable property as it allows us to infer hypotheses about the real world from data. Although most deep learning models are still confined to only learning correlations, injecting methodology from causality theory (Pearl, 2009) into them is likely to increase their robustness and interpretability. Robustness is a desirable feature in many ways, for instance to increase transferability to dynamic domains (Caruana et al., 2015) or to reinforce models against adversarial attacks (Szegedy et al., 2014). Fidelity or faithfulness concerns the degree to which an explanation accurately describes the underlying model (Guidotti et al., 2018). If an explanation is not faithful to the underlying model there is a risk that humans prefer explanations that seem most plausible, but that don't correspond to the model behaviour (Lipton, 2018). It has recently been shown that several interpretability methods can lead to such pitfalls (Adebayo et al., 2018; Kindermans et al., 2019).

2.2 Attribution Methods

Imagine we are analysing some classification task, such as the sentiment analysis of a sentence. For our analysis we are interested in understanding a prediction on some specific instance, like "This film is terrible". One way to gain insight into the model's prediction would be to obtain the individual contribution of each token to the output. In our example "film" would turn out to provide a neutral contribution, whereas "terrible" provides a very strong negative contribution. Methods that are able to express the output of a model in terms of the contributions of the input features are called attribution methods.2

2.2.1 Gradient-Based Attributions

There exist many types of attribution methods, as the concept of a “contribution” can be defined in various ways. To gain a better understanding of the intricacies of attributions, we will consider a simple linear regression example first, based on Ancona et al. (2019b).

Linear models Suppose we are interested in creating an attribution method for a sentiment classifier. The classifier receives the number of positive (x+) and negative words (x−) in a sentence as input, and outputs the sentiment as a score. We assign the following coefficients to the two sentiment features:

f(x+, x−) = 3·x+ − 2·x−   (2.1)

There are multiple ways we can create the contributions of x+ and x− to f. We will use the notation R+(x; f) to express the contribution of x+ to f, and R−(x; f) for x−.3

One way of expressing the relevance of a feature is by addressing the rate of change that is caused by that feature, i.e. the gradient. A straightforward implementation uses the coefficients of f, resulting in the partial derivatives:

R+(x; f) = ∂f(x)/∂x+ = 3    R−(x; f) = ∂f(x)/∂x− = −2   (2.2)

This states that each added positive word will increase the sentiment score by 3, and each negative word will decrease it by 2. These attributions provide a complete interpretation of the model that is independent of the input, and are therefore a global interpretation.

However, when investigating the contributions of a specific instance we can see that these attributions no longer make sense. Say we have a sentence with 0 positive words and 10 negative words: f(0, 10) = −20. Our previous global interpretation would still assign a contribution of "3" to the positive tokens, even though there are no positive tokens present at all. To address this we need local attributions, that incorporate the input into the contributions.4 A simple method to achieve this is by multiplying the gradient with the input itself, a method called gradient∗input (Shrikumar et al., 2017). For our example this would yield the following attributions:

R+(x; f) = ∂f(x)/∂x+ · x+ = 0    R−(x; f) = ∂f(x)/∂x− · x− = −20   (2.3)

2 Such methods have been referred to with various definitions, such as relevance (Robnik-Šikonja and Kononenko, 2008; Bach et al., 2015), contribution (Shrikumar et al., 2017), and saliency methods (Simonyan et al., 2014).

3 We adopt this notation from Ancona et al. (2019b). The R could be interpreted as expressing the relevance of a feature.

We can see that these attributions answer a different kind of question than our previous global interpretation. The gradient method of Eq. 2.2 tells you the expected change in output for each input feature, whereas the gradient∗input method of Eq. 2.3 describes the individual contribution of each input.

Because of the simple linearity of our model it turns out that all contributions sum up to the original output. This is a property of attribution methods called completeness, which is desirable as it leads to intuitive interpretations that guarantee a degree of faithfulness to the original output.

Non-linear models We now update our sentiment classifier by squashing the sentiment between 1 (positive) and −1 (negative), using a tanh activation:

g(x+, x−) = tanh(3·x+ − 2·x−)   (2.4)

The introduction of this non-linear function will pose difficulties for our previously established global and local interpretations. Our gradient interpretation now becomes:

R+(x; g) = ∂g(x)/∂x+ = 3·tanh′(3·x+ − 2·x−)    R−(x; g) = ∂g(x)/∂x− = −2·tanh′(3·x+ − 2·x−)   (2.5)

The attribution for a feature now depends on both features, making it no longer a global attribution method. This method still describes the rate of change, but is only accurate for infinitesimal perturbations around the original input.

Our previous example of 0 positive words and 10 negative words now yields g(0, 10) ≈ −1. The local gradient∗input attribution results in the following contributions:

R+(x; g) = ∂g(x)/∂x+ · x+ = 0    R−(x; g) = ∂g(x)/∂x− · x− ≈ −3.40·10⁻¹⁶   (2.6)

It can easily be seen that the attributions for the non-linear model g no longer satisfy the completeness property, contrary to those for the linear model f. This is caused by the saturated gradient of g, making it unable to account for the non-linear dynamics of the model. In order to reinstate completeness we will need to resort to more advanced attribution methods, which we will describe in the next section.
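The two worked examples above are easy to reproduce. The following sketch is our own illustration (not code from this thesis): it computes gradient∗input attributions for the linear classifier f and the squashed classifier g, and checks the completeness property against a 0-valued baseline.

```python
import numpy as np

# Linear sentiment model of Eq. 2.1 and its squashed variant of Eq. 2.4,
# each returning (output, gradient w.r.t. the two inputs).
def f(x):
    x_pos, x_neg = x
    return 3 * x_pos - 2 * x_neg, np.array([3.0, -2.0])

def g(x):
    x_pos, x_neg = x
    z = 3 * x_pos - 2 * x_neg
    return np.tanh(z), (1 - np.tanh(z) ** 2) * np.array([3.0, -2.0])

def grad_times_input(model, x):
    """Local gradient*input attributions (Eq. 2.3 / 2.6)."""
    _, grad = model(x)
    return grad * x

x = np.array([0.0, 10.0])            # 0 positive words, 10 negative words
baseline = np.zeros_like(x)

for model in (f, g):
    attr = grad_times_input(model, x)
    out, _ = model(x)
    out0, _ = model(baseline)
    # Completeness (Eq. 2.7) would require attr.sum() == out - out0.
    print(model.__name__, attr, attr.sum(), out - out0)

# f: attributions [0, -20] sum to -20 = f(x) - f(0): completeness holds.
# g: the saturated tanh gradient yields near-zero attributions,
#    while g(x) - g(0) is roughly -1: completeness is violated.
```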

4 Local attribution methods are also referred to as salience methods. The global, gradient-based

2.2.2 Axiomatic Attributions

It is notoriously challenging to evaluate an attribution method empirically. This challenge arises because it is hard to differentiate between errors stemming from misbehaviour of the model and errors stemming from the method itself (Sundararajan et al., 2017). There have been attempts at defining ground-truths for attributions, against which the attributions of some method are compared (Pörner et al., 2018). However, considerable assumptions about the model behaviour need to be made in order to create such ground-truths, which can lead to amplifications of human biases. After all, if we had access to the actual ground-truth of a model explanation, this would imply there is no need at all for creating attributions: the ground-truth itself provides the exact attribution we are after. Several methods have therefore opted for an axiomatic approach (Sundararajan et al., 2017; Ancona et al., 2019a). This allows for more systematic guarantees on faithfulness, based on a set of axioms: theoretical properties that an interpretability method should possess.

Axioms Several desirable axioms have been established in the literature on attribution methods.5 We present an overview of the six axioms that have been collected by Ancona et al. (2019a).

5 An axiom is often supposed to represent a self-evident property. We are aware that not all axioms presented here carry this immediate self-evidence, but adhere to the methodology that has been established in the current literature.

1. Completeness. All attributions should sum up to the model output minus the output at a baseline value:

∑_{i=0}^{N} R_i(x; f) = f(x) − f(0)   (2.7)

This is a desirable property, as it provides a degree of faithfulness to the underlying model. It also allows for an easy sanity check of whether a specific attribution has been computed correctly. The baseline is often set as a 0-valued input, but finding the right baseline configuration can be challenging.

2. Null Player. An attribution method should assign no contribution to variables on which a function does not depend:

f(x) = f(x \ {i}) ⇒ R_i(x; f) = 0   (2.8)

where x \ {i} indicates that feature i has been omitted or set to some baseline value.

3. Symmetry. When two variables play the exact same role in a model, their attributions should be the same. That is, when the output of a model is independent of the order of two variables x_1 and x_2, their attributions should be the same:

f(x_1, x_2) = f(x_2, x_1) ⇒ R_{x_1}(x; f) = R_{x_2}(x; f)   (2.9)

4. Linearity. A linear combination of two models should yield the same linear combination of attributions:

f(x) = a·f_1(x) + b·f_2(x) ⇒ R(x; f) = a·R(x; f_1) + b·R(x; f_2)   (2.10)

5. Continuity. Near-identical inputs should lead to near-identical attributions:

R(x; f) ≈ R(x + ε; f)   (2.11)

This axiom assumes f to implement a continuous function. The min function is an example of a function that poses a challenge to this axiom. Assigning all contributions to the smallest value neglects the dependence on the other value, and will lead to discontinuous attributions.

6. Implementation Invariance. If two models with different implementations represent the same function their attributions should be equivalent:

∀x. f(x) = g(x) ⇒ R(x; f) = R(x; g)   (2.12)

This axiom might seem trivial, but consider the models f(x) = x_1·(x_2·x_3) and g(x) = (x_1·x_2)·x_3. Attribution methods that propagate relevance scores in a step-by-step fashion will assign different attributions to f and g, as opposed to methods that only consider the full model gradient (Sundararajan et al., 2017).

Shapley values The only attribution method that satisfies Axioms 1 to 4 is the Shapley value, which has been proven by Shapley (1953).6 Shapley values are a concept from game theory, initially proposed as a solution for credit assignment in coalitional games.

A coalitional game takes as input a set of players, whose cooperation leads to a score. Shapley values aim to establish the contribution of each player to the final score, similar to the attribution methods we have described so far. How do we determine this contribution? We cannot simply omit one player from the input and look at the difference in the score: this would greatly undervalue the player's contribution. For example, consider a citizen's contribution to the outcome of an election. Omitting this citizen's vote from the election would be unlikely to change the outcome. If only this change in output were considered, the citizen would be assigned no contribution at all. However, repeating this procedure for all individual citizens would result in an attribution in which none of the citizens contributed to the election outcome!

6 Shapley values satisfy continuity and implementation invariance as well, but these axioms are not part of the original proof, and have only been added in the context of attribution methods (Montavon et al., 2017; Sundararajan et al., 2017).

Shapley values handle this by considering every coalition that can be formed by the players. A coalition is defined as a subset of players whose cooperation results in a combined score. A player's contribution is based on the average marginal contribution of that player to each possible coalition. A marginal contribution is defined as the outcome of the coalition including the player, minus the outcome of the coalition without the player. This is expressed as follows:

R_i(x; f) = ∑_{S ⊆ x\{i}} [ |S|! (|x| − 1 − |S|)! / |x|! ] · ( f(S ∪ {i}) − f(S) )   (2.13)

where the sum ranges over all coalitions S without player i, the factorial term expresses the relative number of coalitions of size |S|, and the final factor is the marginal contribution of i to coalition S.

How do we translate this formulation to an attribution method for a deep learning model? This turns out to be fairly straightforward. The input features of a model fulfil the same role as the players in our formulation. So instead of a group of players forming a coalition, we now have features being combined. Furthermore, it can be seen that our scoring function f expects its input to be a set of players. Most deep learning models expect their input to be of a fixed size, so simply omitting a feature is often not possible. Instead, we can reset an omitted feature to some baseline value, such as 0. There is no restriction on what this baseline can be. It should be noted, however, that the sum of all Shapley values is equal to the model output minus the output at this baseline (in accordance with completeness, Eq. 2.7). The baseline thus has an effect on the contributions, which is something that should be taken into consideration when interpreting a model (Sturmfels et al., 2020).
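To make this concrete, the following sketch (our own illustration, not code from this thesis) computes exact Shapley values by brute-force enumeration of coalitions, resetting omitted features to a 0-valued baseline, and verifies completeness on the running sentiment example.

```python
from itertools import combinations
from math import factorial, tanh

def shapley_values(model, x, baseline=None):
    """Exact Shapley values (Eq. 2.13), enumerating every coalition.
    Features outside a coalition are reset to a baseline value."""
    n = len(x)
    baseline = baseline if baseline is not None else [0.0] * n

    def eval_coalition(S):
        # Keep the features in S, reset all other features to the baseline.
        return model([x[j] if j in S else baseline[j] for j in range(n)])

    attributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):                      # coalition sizes 0 .. n-1
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - 1 - size) / factorial(n)
                contribution += weight * (eval_coalition(set(S) | {i})
                                          - eval_coalition(set(S)))
        attributions.append(contribution)
    return attributions

# The squashed sentiment classifier of Eq. 2.4, with a 0-valued baseline.
g = lambda x: tanh(3 * x[0] - 2 * x[1])
attrs = shapley_values(g, [0.0, 10.0])

# Completeness: the Shapley values sum to g(x) - g(baseline), roughly -1.
print(attrs, sum(attrs), g([0.0, 10.0]) - g([0.0, 0.0]))
```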

2.2.3 Overview of Attribution Methods

The previous two sections provided a theoretical overview of the concepts related to attribution methods. Here we present a short summary of the existing literature on these methods.

Backpropagation methods We showed that solely using the gradient of a non-linear function is not sufficient to account for the marginal contribution of a feature. To overcome this challenge, several methods have been introduced that use a modified gradient of the model.

Layer-wise Relevance Propagation (LRP, Bach et al., 2015) creates contributions by propagating a relevance signal from the output of a model back to the inputs. To calculate the contribution of one specific output class, the relevance signal is set to the output logit of that class. LRP then employs a modified partial derivative that distributes the relevance signal over all the input features. LRP does not satisfy symmetry and implementation invariance.

Integrated Gradients (IG, Sundararajan et al., 2017) creates a contribution by integrating over the gradient of the model between an input and a baseline, at discrete intervals. Because it is based on the full model gradient it satisfies implementation invariance. One of its main limitations is that it is computationally costly, as the gradient needs to be calculated for each interpolated point between the input and the baseline.
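As a rough sketch of how IG works in practice (our own illustration, reusing the running sentiment example with a hand-coded analytic gradient rather than a real network), the attributions are the input-minus-baseline difference multiplied by the average gradient along the interpolation path:

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Integrated Gradients: (x - baseline) times the average gradient along
    the straight path from baseline to input, via a Riemann midpoint sum."""
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    alphas = (np.arange(steps) + 0.5) / steps        # midpoints in (0, 1)
    grads = np.array([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Gradient of the squashed sentiment model g(x) = tanh(3*x+ - 2*x-).
def grad_g(x):
    z = 3 * x[0] - 2 * x[1]
    return (1 - np.tanh(z) ** 2) * np.array([3.0, -2.0])

attr = integrated_gradients(grad_g, x=[0.0, 10.0], baseline=[0.0, 0.0])
# Unlike plain gradient*input, the attributions now (approximately) sum to
# g(x) - g(baseline), roughly -1, restoring completeness.
print(attr, attr.sum())
```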

DeepLIFT (Shrikumar et al., 2017) is similar to LRP in using a modified backward pass, and similar to IG in basing a contribution on the difference with the output for some baseline value. This difference is propagated by comparing the intermediate activations to the activations of the baseline input. DeepLIFT does not satisfy symmetry and implementation invariance, for similar reasons as LRP. For a network that only contains ReLU activations DeepLIFT is equivalent to LRP (Ancona et al., 2018).

Shapley-based methods The attractive properties of Shapley values make them a very useful tool for attributions. However, calculating these values is NP-hard, as the number of possible coalitions scales exponentially in the number of input features. Several methods have therefore been proposed that approximate Shapley values.

KernelSHAP (Lundberg and Lee, 2017) applies the concept of Shapley values to LIME (Ribeiro et al., 2016). LIME creates feature attributions for an instance by slightly perturbing it, and fitting a linear model on the changes in the output. KernelSHAP is based on this concept, but creates perturbations based on the coalitions of Shapley values. To cope with computational constraints it samples from the full set of coalitions and assumes feature independence.

DeepSHAP (Lundberg and Lee, 2017; Chen et al., 2019) applies Shapley values to the concept of DeepLIFT. Whereas DeepLIFT propagates the absolute differences with respect to a baseline input, DeepSHAP propagates the Shapley values of smaller network components instead.

Contextual Decomposition (Murdoch et al., 2018) is related to LRP and DeepLIFT, but creates attributions in a forward fashion, from the input up to the output logits. It creates an attribution for a phrase by splitting the hidden states of a model into a sum of two parts: one that contains the information that stems from the phrase of interest, and one containing all the other information. The contributions of these two flows to intermediate components are computed using the Shapley value procedure, but by considering only two flows at a time this computation yields an approximation of the Shapley values of the full model. The main contributions of this thesis are based on this method, and it is explained in more detail in Chapter 3.

DASP (Ancona et al., 2019a) incorporates the uncertainty that arises when Shapley values are approximated. By propagating this uncertainty the authors arrive at a method that is able to provide robust approximations that converge faster than related approaches.

2.3 Interpreting Language Models

Recent successes in natural language processing have given rise to a line of research that focuses on qualitative analyses of these models. We focus in particular on the analysis of language models: statistical systems that create abstract representations of language. Various recent breakthroughs in NLP have relied on the representations of pre-trained language models. These successes highlight the importance of being able to truly grasp why these models are so adept at representing language.

This section commences with a brief introduction to language modelling (2.3.1), followed by a general overview of the interpretability analyses that have been conducted in recent years (2.3.2).

2.3.1 Language Models

Language modelling is the task of assigning a probability to some token conditioned on its context, i.e. P(w_t | c_t). This probability ranges over a fixed vocabulary and requires a thorough linguistic understanding of the context. For example, if a model were to assign a high probability to cat based on The dog chased the ..., it needs to possess the semantic knowledge that dogs are likely to chase cats and that the syntactic position of the current token allows a noun like cat to be predicted.

N-grams Before the advent of neural language models these probabilities were computed in a count-based fashion, using n-gram models (Jurafsky and Martin, 2009). An n-gram model uses a context window of length n, and bases a token probability on the relative frequency with which the token's context was succeeded by that token.

While the simplicity of n-gram models is attractive, they carry several drawbacks that make them unsuitable for modelling natural language in a large-scale setting. First of all, the fixed length of the context window makes it impossible to model long-distance dependencies that surpass the context length. One might try to circumvent this by increasing the context window, but this will lead to data sparsity, as the number of contexts grows exponentially. A second drawback is that these models are unable to incorporate semantic information: the probability of "motorbike", for instance, is completely independent of that of "motorcycle". Finally, n-gram models have issues in dealing with unseen contexts, as these will always be assigned a probability of 0, and smoothing techniques are needed to cope with this.
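For illustration, here is a minimal count-based bigram model (our own toy sketch, with add-one smoothing standing in for the smoothing techniques mentioned above; the corpus is invented):

```python
from collections import Counter, defaultdict

class BigramLM:
    """Count-based bigram LM: P(w_t | w_{t-1}) is the relative frequency with
    which w_{t-1} is followed by w_t, with add-one smoothing so that unseen
    bigrams do not receive probability 0."""

    def __init__(self, sentences):
        self.vocab = {w for s in sentences for w in s}
        self.counts = defaultdict(Counter)
        for s in sentences:
            for prev, cur in zip(s, s[1:]):
                self.counts[prev][cur] += 1

    def prob(self, word, context):
        prev = context[-1]                 # a bigram only sees the last token
        total = sum(self.counts[prev].values())
        return (self.counts[prev][word] + 1) / (total + len(self.vocab))

corpus = [["the", "dog", "chased", "the", "cat"],
          ["the", "cat", "sat"]]
lm = BigramLM(corpus)
print(lm.prob("cat", ["the", "dog", "chased", "the"]))   # P(cat | ... the)
```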

Neural Neural language models (LMs) turn out to be quite robust to the drawbacks of n-gram models, and have hence been the most popular approach to language modelling in recent years. Fundamental to these models is the use of word embeddings, which represent the meaning of a token as a point in a high-dimensional space. Semantic and syntactic information can now be captured in the representations themselves: similar tokens will result in representations that are near each other in the embedding space. This method of representation is referred to as distributional semantics, based on the distributional hypothesis of Firth (1957), who stated that "You shall know a word by the company it keeps".

Neural LMs can roughly be divided into three types of architectures. Feed-forward LMs (Bengio et al., 2003) take in the embeddings of a context window as features to a feed-forward neural network that generates a probability distribution over the corpus vocabulary. A drawback of such LMs is that the size of the context window must be fixed. This makes them unable to capture long-distance dependencies, similar to n-gram models. Recurrent LMs (Mikolov et al., 2010) cope with this by using a recurrent neural network (RNN) that processes a sentence incrementally while keeping track of an intermediate hidden state (Elman, 1990). The most commonly used RNN architecture is the Long Short-Term Memory network (LSTM, Hochreiter and Schmidhuber, 1997). These models have been proven highly successful at carrying information over a long distance and are not confined to a fixed context window. The third and most recent type of LM is based on the Transformer architecture (Vaswani et al., 2017), which utilises an attention mechanism to represent the context of a token. Long-distance information no longer needs to be carried over incrementally, as the model can attend to each token in its context directly.

Applications Language modelling is a task that has a wide range of applications. Sampling from these models allows them to act as convincing language generators, as has been successfully shown by Radford et al. (2019) using an attention-based approach (GPT-2). Even more salient is the use of language models as a pretraining step for NLP tasks such as sentiment classification, natural language inference, and reading comprehension. In this setup, a LM is first trained on a large corpus, and a task-specific model is then trained on top of these contextualised word embeddings. Examples of pre-trained models are ELMo (Peters et al., 2018), using bi-directional LSTMs, and BERT (Devlin et al., 2019), using Transformer models.

This thesis will mainly focus on LSTM-based LMs. Even though attention-based methods have led to various breakthroughs in recent years, we argue that investigating recurrent methods is still a worthwhile venture. Their incremental nature makes them better suited for comparative studies in psycholinguistic setups, and the models themselves are less dependent on large amounts of training data (Merity, 2019).
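Schematically, such a recurrent LM can be expressed in a few lines of PyTorch. The sketch below is our own illustration with arbitrary hyperparameters, not the specific pre-trained LM analysed in this thesis:

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """A minimal recurrent LM: embed tokens, run an LSTM over the prefix,
    and project each hidden state to a distribution over the vocabulary."""

    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        embeddings = self.embedding(token_ids)     # (batch, seq, emb_dim)
        hidden_states, _ = self.lstm(embeddings)   # (batch, seq, hidden_dim)
        logits = self.decoder(hidden_states)       # next-token logits per position
        return torch.log_softmax(logits, dim=-1)

# P(w_t | c_t) for a toy context of 5 token ids over a 1000-word vocabulary.
model = LSTMLanguageModel(vocab_size=1000)
context = torch.randint(0, 1000, (1, 5))
log_probs = model(context)[0, -1]                  # distribution for the next token
print(log_probs.shape)                             # torch.Size([1000])
```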

2.3.2 Assessing Linguistic Capacities

The recent successes of neural LMs have led to a line of research that aims to assess to what extent these models truly grasp linguistic phenomena. This type of analysis can be fruitful in multiple ways. On the one hand, it provides an insight into what types of generalisations a model creates for representing linguistic categories. In order, for instance, to systematically “understand” subject-verb agreement a model ought to be able to capture some abstract notion of number in its representations.

Additionally, LMs can be utilised as abstract models of human language processing; therefore, these models are sometimes referred to as "psycholinguistic subjects" (Futrell et al., 2019). This enables these models to be examined in controlled setups that are impossible with human subjects. Furthermore, to improve a model's capacities and highlight its shortcomings it is of great importance to obtain a systematic understanding of the constructions on which it fails. This knowledge can then be employed to provide the model with the supervision it needs.

Linguistic analysis of language models can be traced back to the introduction of RNNs by Elman (1990). Since then LM performance has often been linked to human language processing, showcasing its usage in psycholinguistic topics such as the poverty of the stimulus and hierarchical sentence processing (Hale, 2001; Lewis and Elman, 2002; Levy, 2008; Frank and Bod, 2011; Fossum and Levy, 2012; McCoy et al., 2018; Futrell et al., 2019, inter alia).

Subject-verb agreement Linzen et al. (2016) presented one of the first large scale investigations into a model’s capacity in handling a specific linguistic phenomenon.

They focused on subject-verb agreement, using several LSTM-based LMs. Their experimental setup consisted of sentences extracted from a Wikipedia corpus, containing varying numbers of "intervening agreement attractors". These are noun phrases of a different number than the subject, occurring between the subject and the verb. Because their number is different, the model can be tricked into predicting a verb of the wrong number. The following sentence is an example of such a construction:

The keys to the cabinet are/∗is on the table.

The performance of the model is evaluated by comparing the probabilities of the correct and wrong verb forms. Their results showed that performance drops drastically when the number of attractors increases. Only when the model is provided with explicit supervision about the subject number, instead of latently picking up this signal from raw textual data, is it able to cope with a larger number of attractors.
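This behavioural evaluation is easy to express in code. The sketch below is our own illustration: it assumes a `model` that maps a tensor of token ids to next-token log-probabilities (such as the LSTM sketch above) and a `vocab` dictionary from words to ids, both of which are hypothetical here.

```python
import torch

def agreement_accuracy(model, vocab, items):
    """Behavioural subject-verb agreement test: an item counts as correct if
    the LM assigns a higher probability to the right verb form than to the
    wrong one, given the sentence prefix."""
    correct = 0
    for prefix, right_verb, wrong_verb in items:
        ids = torch.tensor([[vocab[w] for w in prefix.split()]])
        log_probs = model(ids)[0, -1]           # distribution after the prefix
        correct += int(log_probs[vocab[right_verb]] > log_probs[vocab[wrong_verb]])
    return correct / len(items)

items = [("the keys to the cabinet", "are", "is")]   # one attractor ("cabinet")
# accuracy = agreement_accuracy(model, vocab, items)
```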

Gulordava et al. (2018) expanded on their approach and trained new LSTMs, leading to improved results on four different languages (English, Russian, Italian, and Hebrew), without any explicit supervision. They also showed that their models are capable of handling semantically nonsensical constructions, demonstrating that a syntactic signal is actively being encoded into the representations.

Lakretz et al. (2019) took a more fine-grained approach, dissecting the inner mechanics of Gulordava et al.'s LM. Their study investigated subject-verb agreement at the neuron level of the LM, on a hand-crafted corpus of subject-verb constructions. Using an ablation experiment they show that the model utilises only two specialised neurons to keep track of number: an exciting result that might pave the way to a better understanding of how LMs operate.

Phenomena The success of the models of Gulordava et al. (2018) opened a pathway for many other papers that investigated the full spectrum of linguistic phenomena. Jumelet and Hupkes (2018) investigated a LM's performance on negative polarity items, a special class of phrases that need to be licensed by some negative context, such as any in "He didn't buy any bread". They showed the model performed well on short-distance dependencies between the polarity item and its context, but performance dropped rapidly as the distance increased.

Marvin and Linzen (2018) provided a large set of artificially generated constructions related to number agreement, reflexive anaphora, and negative polarity items. Their elaborate corpus design highlighted the specific strengths and weaknesses of the LMs, which turned out to be surprisingly sensitive to particular lexical items occurring in a sentence. Wilcox et al. (2018) investigated filler-gap dependencies, constructions in which a "filler" such as what creates a dependency with a "gap", an empty syntactic position that is licensed by the filler. An example of such a construction is "I know what the lion devoured _ at sunrise". They report generally positive findings and show that the LMs learn a subset of the known restrictions on filler-gap dependencies, known as island constraints. Hu et al. (2020) re-evaluated the experimental setup on reflexive anaphora of Marvin and Linzen (2018) and Futrell et al. (2019), showing that LMs learn more about this phenomenon than they had previously been given credit for.

Schijndel et al. (2019) look into the minimal corpus and model criteria for a LM to learn these phenomena. Their experimental setup is based on the tasks of Marvin and Linzen (2018). They train a plethora of LMs, varying the size of the training corpus and the model's hidden size. Based on this large set of models they conclude that performance does not scale proportionally with training corpus and model size: the quality of the data is of far greater importance than the quantity.

Probing One aspect all the previous investigations share is their "behavioural" nature: conclusions are mostly drawn based on the output distributions of the LM. A related line of research has delved deeper into the model's hidden states themselves, uncovering encoded linguistic properties using probing tasks (Adi et al., 2017; Shi et al., 2016; Belinkov et al., 2017), also referred to as diagnostic classifiers (Veldhoen et al., 2016; Hupkes et al., 2018). Giulianelli et al. (2018) applied these classifiers to the internal LSTM gate activations, demonstrating the internal trajectories of the model in processing subject-verb agreement. Their approach goes one step further by updating the model weights based on the classifier signal, showing that this explicit supervision can greatly improve model performance.

Zhang and Bowman (2018) probe the hidden states of various NLP models (including LMs) using a POS and CCG tagging task. They show that language modelling encodes more syntactic information than machine translation does, but raise concerns that their random baselines perform surprisingly well. Hewitt and Liang (2019) examine these concerns by introducing control tasks: random mappings that act as a proxy for the faithfulness of a probed feature. They argue that if a random mapping can be learned to a similar degree as the actual mapping, performance on the actual mapping should not be interpreted as that feature being encoded.
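In its simplest form, a probe is just a linear classifier trained on hidden states. The sketch below is our own toy illustration with randomly generated stand-in data (a real setup would use the LM's actual activations and linguistic labels), including a shuffled control task in the spirit of Hewitt and Liang (2019):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(hidden_states, labels):
    """Diagnostic classifier: fit a linear probe on LM hidden states and
    report held-out accuracy for predicting a linguistic label."""
    X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Toy stand-ins: 400 hidden states of size 650 with a binary "number" label.
rng = np.random.default_rng(0)
states = rng.normal(size=(400, 650))
labels = rng.integers(0, 2, size=400)
print(probe_accuracy(states, labels))                   # probed feature
print(probe_accuracy(states, rng.permutation(labels)))  # shuffled control task
```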

Pre-trained capacities A fair share of research has focused on probing the internal representations of pre-trained LMs like ELMo and BERT. While investigating these types of models is not the focus of this thesis, the approaches presented in these papers can easily be extended to other types of LMs. Tenney et al. (2019b) and Tenney et al. (2019a) introduce edge probing tasks that probe syntactic relations between multiple spans. Relations that are examined include dependency relations, co-reference resolution, and semantic role labelling. Their setup uncovered that BERT employs language processing heuristics that resemble the steps of the traditional NLP pipeline. Hewitt and Manning (2019) use probes to uncover hierarchical tree structures in the representations of ELMo and BERT. They show that such hierarchical structures are successfully encoded by these models, contrary to a set of baselines.

Several papers have applied the behavioural experiments that were initially devised for recurrent LMs to attention-based representations such as BERT and GPT-2 (Goldberg, 2019). These studies have shown that these models create expressive representations that outperform recurrent models in many areas. However, it remains an open question to what extent attention-based approaches are more adept at representing language, compared to recurrent models. Tran et al. (2018) compare recurrent and attention-based models on a set of tasks that requires a hierarchical understanding of language. They find that the recurrent models still outperform the attention-based models on several tasks, highlighting that recurrence is still a useful property for an NLP system.

Attributions for LMs The previously presented lines of research mostly examined LMs in either a behavioural or a probing fashion. Relatively few approaches have incorporated the attribution methods that were presented in Section 2.2. Voita et al. (2019) employ LRP to analyse the contribution of the different heads in Transformer models. Using this approach they can localise the most salient components of such models, allowing them to prune less important parts. Pörner et al. (2018) flip this approach around, by using the linguistic assessment tasks described above as an evaluative tool for attribution methods. They argue that the contribution of a subject to a subject-verb agreement task can be interpreted as a proxy for the quality of an attribution method. In other words, a model that is shown to be adept at subject-verb agreement is likely to rely on information provided by the subject. A successful attribution method should thus assign a high contribution to the subject.

The intersection of attribution methods and language modelling analysis forms the core of this thesis. We posit that applying concepts from explainable AI to LMs will lead to a better qualitative understanding of their mechanics. This allows their black-box nature to be unravelled to a previously unexplored extent, no longer confined to probing or behavioural experiments. Note that our approach is not limited to LMs at all, and can easily be extended to different types of NLP systems.


Chapter 3

Generalised Contextual Decomposition

This chapter contains work that has been published in (Jumelet et al., 2019).1

We propose a new attribution method called Generalised Contextual Decomposition. It is based on a method called Contextual Decomposition, introduced by Murdoch et al. (2018). We improve on this method by generalising its applicability, and address several of its flaws. We will first address the original setup of Murdoch et al. in Section 3.1, followed by our proposed improvements in Sections 3.2 and 3.3.

3.1 Contextual Decomposition

How does information flow through neural networks? How do the features in these networks interact with each other? Is there a way we can separate all these interacting flows? These are questions that are addressed by Murdoch et al. (2018), who introduce an attribution method called Contextual Decomposition (CD).

3.1.1 Information Flows

The models that we aim to investigate are recurrent neural networks: models that process their input incrementally. At each step they read in a new input token, while keeping track of a hidden state h that is an intermediate representation of all processed input (Elman, 1990). Consider the following high-level representation of some recurrent model f:

[Diagram (3.1): the model f unrolled over three steps, mapping the inputs x_1, x_2, x_3 and the initial state h_0 to the hidden states h_1, h_2, h_3.]

At each step t the model combines the previous hidden state h_{t−1} and the current input x_t to create h_t = f(h_{t−1}, x_t). In the context of NLP, x_t is often a word embedding.

1 The contents of this chapter have been rewritten and refined since the publication of our work.

A prediction p_T at step T is generated based on the hidden state h_T, using a simple linear projection D followed by a SoftMax:

z_T = D·h_T + b_z   (3.2)
p_T = SoftMax(z_T)   (3.3)

Suppose we are interested in determining the contribution of x_1 to the prediction p_3. We could follow the path that is traversed by x_1, noting that it interacts with the hidden state h_0 at the first step. This interaction produces a new hidden state h_1, containing information stemming from both h_0 and x_1. The entangled state h_1 then interacts with the next input x_2, resulting in a hidden state h_2 that now contains information from h_0, x_1, and x_2. This procedure continues once more, resulting in the highly entangled state h_3.

Partitioning Flows To create a valid attribution for x_1, we must ensure that the information stemming from x_1 is correctly separated from information stemming from h_0, x_2, and x_3. We thus need to devise a procedure that partitions all these different flows, while preserving faithfulness to the model. CD creates this partition in an additive fashion. It does so by splitting the hidden state h_t into 1) a part β_t that is relevant with respect to some phrase (e.g. x_1), and 2) a part γ_t which contains all the other information flows that are irrelevant to the phrase (e.g. everything except x_1):

h_t = β^h_t + γ^h_t   (3.4)

In order to arrive at this partition, we will need to update our model f to output both β and γ, conditioned on the phrase of interest φ. This phrase is often defined as a single feature, such as x_1 in our example, but in theory any set of input features could be set "in focus".

β^h_t, γ^h_t = f′((β^h_{t−1}, γ^h_{t−1}), x_t; φ)   (3.5)

This additive separation in itself already provides the attribution of x_1 we were after. Due to its linear nature we can simply plug it into Eq. 3.2:

z_T = D·h_T + b_z = D(β^h_T + γ^h_T) + b_z = D·β^h_T + D·γ^h_T + b_z   (3.6)

where we write β^z_T = D·β^h_T and γ^z_T = D·γ^h_T. β^z_T can be interpreted as the contribution of x_1 to z_T, either as an absolute measure or normalised as β^z_T / z_T.

3.1.2 LSTMs

In order to derive our updated model f′, we will need to dive into the mechanics of the model itself. The architecture that we consider here is the LSTM (Hochreiter and Schmidhuber, 1997), whose setup we will briefly discuss first.2

LSTMs improve upon the simple RNNs of Elman (1990) by introducing gating mechanisms that control the flow of information. In addition, LSTMs use two types of internal states: a hidden state h and a cell state c. The cell state acts as a "conveyor belt" and is able to efficiently pass along information.

The cell state c_t is updated via two mechanisms. A multiplicative forget gate f_t selects which information of the previous cell state should be kept and which discarded. This is followed by an additive update containing new information. The additive update is the result of a gating mechanism as well, consisting of an input gate i_t and candidate values c̃_t:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t   (3.7)

where ⊙ denotes element-wise multiplication: f_t ⊙ c_{t−1} filters old information, and i_t ⊙ c̃_t adds new information.

The forget and input gates are both the result of a sigmoid operation (squashing values between 0 and 1), and the candidate values are the result of a tanh operation:

f_t = σ(W_f x_t + V_f h_{t−1} + b_f)   (3.8)
i_t = σ(W_i x_t + V_i h_{t−1} + b_i)   (3.9)
c̃_t = tanh(W_c̃ x_t + V_c̃ h_{t−1} + b_c̃)   (3.10)

W and V denote a linear transformation, and b the intercept vector.3 We can see that all three of these operations depend both on the previous hidden state h_{t−1} and the current input token x_t.

All that is left now is updating the hidden state h_t. This is done based on the updated cell state c_t and another gating mechanism, the output gate, that also depends on h_{t−1} and x_t:

o_t = σ(W_o x_t + V_o h_{t−1} + b_o)   (3.11)
h_t = o_t ⊙ tanh(c_t)   (3.12)

The hidden and cell state are passed along to the next step, after which this whole procedure is repeated. Predictions are often made based on h_t, using the procedure of Eq. 3.2 and 3.3. The initial states h_0 and c_0 can be set to a zero-valued vector, or to the end states of another sequence.
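Written out explicitly, a single LSTM step is only a handful of lines. The NumPy sketch below (our own illustration, with randomly initialised toy weights) follows Eqs. 3.7-3.12 directly:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, V, b):
    """One LSTM step following Eqs. 3.7-3.12. W, V, b are dicts holding the
    weights and intercepts of the four gates ('f', 'i', 'c', 'o')."""
    f_t = sigmoid(W["f"] @ x_t + V["f"] @ h_prev + b["f"])      # forget gate  (3.8)
    i_t = sigmoid(W["i"] @ x_t + V["i"] @ h_prev + b["i"])      # input gate   (3.9)
    c_tilde = np.tanh(W["c"] @ x_t + V["c"] @ h_prev + b["c"])  # candidates   (3.10)
    c_t = f_t * c_prev + i_t * c_tilde                          # cell update  (3.7)
    o_t = sigmoid(W["o"] @ x_t + V["o"] @ h_prev + b["o"])      # output gate  (3.11)
    h_t = o_t * np.tanh(c_t)                                    # hidden state (3.12)
    return h_t, c_t

# Toy dimensions: 4-dimensional embeddings, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(3, 4)) for g in "fico"}
V = {g: rng.normal(size=(3, 3)) for g in "fico"}
b = {g: np.zeros(3) for g in "fico"}
h, c = np.zeros(3), np.zeros(3)
for x in rng.normal(size=(5, 4)):        # process a sequence of 5 embeddings
    h, c = lstm_step(x, h, c, W, V, b)
print(h)
```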

3.1.3 Decomposing Activations

The goal of CD is to partition h_t into two parts: one part β that can be attributed to a phrase φ, and one part γ that can be attributed to all features not part of φ. Having set out the mechanics of the LSTM, we can now turn to the hurdles of creating this partition.

2 In this work, we focus on applying CD to LSTMs, but it should be noted that the technique can be extended to other kinds of neural network architectures, such as CNNs and feed-forward NNs (Singh et al., 2019).

3 These intercepts are often referred to as bias terms. Because our later experiments will focus on

How does information provided by x_t reach h_t? It can be seen that information stemming from a token x_t enters the network via the three gates and candidate values of Eq. 3.8 to 3.11. However, disentangling x_t's subsequent contribution to h_t is challenging. This challenge arises from the non-linear nature of the activation functions (σ and tanh), and the multiplicative gates. These activation functions and gates are the two main components we need to deal with, and lie at the core of CD.

We will consider the forget gate f_t first. The first step of the CD procedure aims to establish the contribution of each of the components to the output of σ. We refer to this step as factorisation: the output of a non-linear function is factorised into a sum of contributions.4 If this is done in an additive fashion that satisfies completeness, Eq. 3.8 can be rewritten as:

f_t = R_x(f_t) + R_h(f_t) + R_b(f_t)   (3.13)

where the three terms denote the contributions of W_f x_t, V_f h_{t−1}, and b_f, respectively.

The contribution of an individual feature depends on the full input to f_t and the activation function of interest σ. A term such as R_x(f_t) represents the marginal contribution of W_f x_t to the squashed sum f_t. Note that the full attribution method (i.e. CD) thus depends on smaller attribution methods for the activation functions (i.e. R_x).

Remember that we had split up our hidden state h into a relevant β and irrel-evant γ part. This aspect should be taken into consideration for the factorisation procedure. We could split up Eq. 3.13into four parts (x/β/γ/b), but CD does this slightly different. If the current token at step t is part of φ, the phrase in focus, we group its projection together with β, otherwise with γ. This is clarified as follows:

f_t = \sigma(W_f x_t + V_f h_{t-1} + b_f)
    = \sigma(W_f x_t + V_f(\beta^h_{t-1} + \gamma^h_{t-1}) + b_f)
    = \sigma(W_f x_t + V_f \beta^h_{t-1} + V_f \gamma^h_{t-1} + b_f)
    = \begin{cases}
        \sigma(\underbrace{(W_f x_t + V_f \beta^h_{t-1})}_{\beta} + \underbrace{V_f \gamma^h_{t-1}}_{\gamma} + \underbrace{b_f}_{b}) & \text{if } x_t \in \phi \\
        \sigma(\underbrace{V_f \beta^h_{t-1}}_{\beta} + \underbrace{(W_f x_t + V_f \gamma^h_{t-1})}_{\gamma} + \underbrace{b_f}_{b}) & \text{if } x_t \notin \phi
      \end{cases}
    = R_\beta(f_t) + R_\gamma(f_t) + R_b(f_t)    (3.14)
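As an illustration of this grouping step only, the following sketch computes the three pre-activation terms of Eq. 3.14 for the forget gate; the names are our own, and the Shapley factorisation that turns these terms into R_β, R_γ and R_b follows below.

def split_preactivation(x_t, beta_h_prev, gamma_h_prev, W_f, V_f, b_f, in_phrase):
    # Returns the beta, gamma and intercept terms whose sum equals the full
    # pre-activation W_f x_t + V_f h_{t-1} + b_f of the forget gate (Eq. 3.14).
    if in_phrase:  # x_t is part of the phrase in focus
        beta_term = W_f @ x_t + V_f @ beta_h_prev
        gamma_term = V_f @ gamma_h_prev
    else:
        beta_term = V_f @ beta_h_prev
        gamma_term = W_f @ x_t + V_f @ gamma_h_prev
    return beta_term, gamma_term, b_f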

⁴ Murdoch et al. refer to this step as linearisation. From our experience this can lead to confusion: we are not linearising the non-linear activation function. The resulting sum of contributions is still non-linear with respect to the input features, after all.


We can apply the same procedure to the other activation functions. Note that, similar to h_t, the cell state c_t is also split in two: c_t = \beta^c_t + \gamma^c_t.

i_t = R_\beta(i_t) + R_\gamma(i_t) + R_b(i_t)    (3.15)
\tilde{c}_t = R_\beta(\tilde{c}) + R_\gamma(\tilde{c}) + R_b(\tilde{c})    (3.16)
o_t = R_\beta(o_t) + R_\gamma(o_t) + R_b(o_t)    (3.17)
\tanh(c_t) = \tanh(\beta^c_t + \gamma^c_t) = R_\beta(\tanh(c_t)) + R_\gamma(\tanh(c_t))    (3.18)

These components will form the basis of the next step: the multiplicative gating mechanism.

Shapley values We introduced the contribution function R, which expresses the contribution of a feature to an activation function. It is defined as the Shapley value over the squashed summation, based on the formulation of Eq. 2.13:

R_i(f_t) = \sum_{S \subseteq x \setminus \{i\}} \frac{|S|!\,(|x|-1-|S|)!}{|x|!} \cdot \Big( \sigma\big(i + \textstyle\sum_{j \in S} j\big) - \sigma\big(\textstyle\sum_{j \in S} j\big) \Big)    (3.19)

where x denotes the set of input features {β, γ, b}. The absence of an input feature is handled by setting it to 0, i.e. by not taking it into account in the summation. Due to the small and constant number of input features this formulation is computationally tractable.
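To illustrate Eq. 3.19, the sketch below enumerates all coalitions over the input features and computes their exact Shapley values with respect to an element-wise σ; it is a didactic example in our own notation, not the implementation used in our experiments.

import numpy as np
from itertools import combinations
from math import factorial

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shapley_contributions(terms, act=sigmoid):
    # terms: dict mapping feature names (e.g. 'beta', 'gamma', 'b') to vectors.
    # A feature that is absent from a coalition is simply left out of the sum,
    # which amounts to setting it to 0.
    names = list(terms)
    n = len(names)
    contribs = {name: 0.0 for name in names}
    for name in names:
        others = [m for m in names if m != name]
        for size in range(len(others) + 1):
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(n - 1 - size) / factorial(n)
                base = sum((terms[m] for m in coalition), np.zeros_like(terms[name]))
                contribs[name] = contribs[name] + weight * (act(base + terms[name]) - act(base))
    return contribs

With three input features, each feature only has 2² = 4 coalitions to consider, which is why the exact computation remains cheap.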

The number of possible coalitions can be restricted to a smaller set. Murdoch et al. note that they empirically noticed improvements when the intercept term is fixed to the first position of each coalition. According to them, the factorisation of f_t then boils down to the following:

R_\beta(f_t) = \tfrac{1}{2}\big(\sigma(\beta+\gamma+b) - \sigma(\gamma+b)\big) + \tfrac{1}{2}\big(\sigma(\beta+b) - \sigma(b)\big)
R_\gamma(f_t) = \tfrac{1}{2}\big(\sigma(\gamma+\beta+b) - \sigma(\beta+b)\big) + \tfrac{1}{2}\big(\sigma(\gamma+b) - \sigma(b)\big)
R_b(f_t) = \sigma(b)    (3.20)

However, Murdoch et al. did not address that this formulation actually allows the intercept term to act as a baseline value.⁵ This is a stronger justification than their empirical observation, and saves us from having to distribute the baseline σ(0) over the resulting contributions.⁶ In fact, the function that is described by Eq. 3.20 is f_t(β, γ) = σ(β + γ + b): b is no longer considered a free parameter to the forget gate, but actually serves its role as a fixed offset term. Completeness (Eq. 2.7) is hence satisfied as follows:

\sum_{i \in \{\beta,\gamma\}} R_i(f_t) = \sigma(x) - \sigma(b)    (3.21)
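A small numerical check of this property, using the two-term formulation of Eq. 3.20 (variable names are ours, σ applied element-wise):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fixed_intercept_contributions(beta, gamma, b, act=sigmoid):
    # Eq. 3.20: contributions of beta and gamma to act(beta + gamma + b),
    # with the intercept b treated as a fixed baseline rather than a player.
    r_beta = 0.5 * (act(beta + gamma + b) - act(gamma + b)) + 0.5 * (act(beta + b) - act(b))
    r_gamma = 0.5 * (act(gamma + beta + b) - act(beta + b)) + 0.5 * (act(gamma + b) - act(b))
    r_b = act(b)
    return r_beta, r_gamma, r_b

beta, gamma, b = np.array([0.3]), np.array([-1.2]), np.array([0.5])
r_beta, r_gamma, _ = fixed_intercept_contributions(beta, gamma, b)
# Completeness (Eq. 3.21): the beta and gamma contributions sum to sigma(x) - sigma(b).
assert np.allclose(r_beta + r_gamma, sigmoid(beta + gamma + b) - sigmoid(b))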

⁵ If we would actually fix the position of the intercept term, the contribution R_b(f_t) should be defined as σ(b) − σ(0), incorporating the empty sum that follows from Eq. 3.19.

⁶ Distributing σ(0) in that case is done uniformly, so in our three-way split of β/γ/b this would result in adding σ(0)/3 = 1/6 to each contribution. This breaks the null player axiom: when one of the input terms is 0 it would still be assigned a contribution of 1/6.


[Figure 3.1: Schematic visualisation of the partition of c_t and h_t into β_t + γ_t for the interaction set {β-β, β-b}. In these schemata each interaction is marked by the partition it is added to, in accordance with Eq. 3.24 and 3.25. Panels: (A) forget gate partition, (B) input gate partition, (C) output gate partition, with h̃_t = tanh(c_t).]

3.1.4 Multiplicative Interactions

By factorising the activation functions it now becomes possible to create a cross-term of interactions between components. We will again consider the forget gate first. It consists of a multiplication between the sigmoidal f_t activation and the previous cell state c_{t-1}. Eq. 3.14 factorised f_t into three terms, and the cell state c_{t-1} has been split into β^c_{t-1} + γ^c_{t-1}. The cross-term of f_t ⊙ c_{t-1} will then contain 6 interactions:

f_t \odot c_{t-1} = \big(R_\beta(f_t) + R_\gamma(f_t) + R_b(f_t)\big) \odot \big(\beta^c_{t-1} + \gamma^c_{t-1}\big)
 = R_\beta(f_t) \odot \beta^c_{t-1} + R_\gamma(f_t) \odot \beta^c_{t-1} + R_b(f_t) \odot \beta^c_{t-1}
 + R_\beta(f_t) \odot \gamma^c_{t-1} + R_\gamma(f_t) \odot \gamma^c_{t-1} + R_b(f_t) \odot \gamma^c_{t-1}    (3.22)

Each of the terms in this summation can be interpreted as a specific interaction between flows of information: e.g. between the relevant flow and itself (R_\beta(f_t) \odot \beta^c_{t-1}), or the forget gate intercept and the irrelevant flow (R_b(f_t) \odot \gamma^c_{t-1}).

The sum of Eq. 3.22 constitutes the intermediate cell state to which new information from the input gate is added. The input gate is also expanded as a cross-term of interactions, similar to the forget gate. This results in 9 interactions, because the candidate values contain an intercept term. The cross-term has not been expanded here for the sake of clarity.

i_t \odot \tilde{c}_t = \big(R_\beta(i_t) + R_\gamma(i_t) + R_b(i_t)\big) \odot \big(R_\beta(\tilde{c}) + R_\gamma(\tilde{c}) + R_b(\tilde{c})\big)    (3.23)

We have obtained all the ingredients to handle the update step of c_t, which amounts to adding up the 15 interactions of Eq. 3.22 and 3.23. But how do we create the new split of c_t = β_t + γ_t? We do so by deciding beforehand which interactions we deem to be relevant, and which are irrelevant. What makes an interaction relevant? For now we will adhere to the definition proposed by Murdoch et al., but we will address this question in more detail in Section 3.2.

Murdoch et al. consider an interaction relevant if it occurs between a β part and itself (β-β), or between a β part and an intercept part (β-b). All other interactions are deemed irrelevant: between γ and itself (γ-γ), γ and β (γ-β), γ and the intercept (γ-b), and the intercept and itself (b-b). The relevant β split for the new cell state then amounts to the addition of the following 5 interactions. The other 10 interactions are all added to γ^c_t.⁷

\beta^c_t = \underbrace{\beta^c_{t-1} \odot R_\beta(f_t)}_{\beta\text{-}\beta} + \underbrace{\beta^c_{t-1} \odot R_b(f_t)}_{\beta\text{-}b} + \underbrace{R_\beta(i_t) \odot R_\beta(\tilde{c}_t)}_{\beta\text{-}\beta} + \underbrace{R_\beta(i_t) \odot R_b(\tilde{c}_t)}_{\beta\text{-}b} + \underbrace{R_\beta(\tilde{c}_t) \odot R_b(i_t)}_{\beta\text{-}b}    (3.24)

[Figure 3.2: A graphical overview of (G)CD, based on the LSTM design of Olah (2015). φ denotes the phrase in focus, and t ∈ φ implies the action is only performed when step t is part of φ. ⊙ denotes an individual interaction; green interactions are added to the β part and red interactions to γ. V, W, and D represent the linear projections of the LSTM itself. The interaction set denoted here corresponds to the partition of Eq. 3.26: {β-β, β-b, β-γ*, b-b∈φ} (REL).]

The hidden state is created based on the factorised output gate (Eq. 3.17) and the new β^c_t + γ^c_t split (Eq. 3.18):

\beta^h_t = \underbrace{R_\beta(\tanh(c_t)) \odot R_\beta(o_t)}_{\beta\text{-}\beta} + \underbrace{R_\beta(\tanh(c_t)) \odot R_b(o_t)}_{\beta\text{-}b}    (3.25)

We denote the partition presented here as an interaction set that contains the relevant interactions: {β-β, β-b}. A schematic overview of this partition is provided in Figure 3.1.
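To make the bookkeeping of Eq. 3.22 to 3.24 explicit, the sketch below expands the cell-state update into its 15 interactions and routes each one to β or γ according to a chosen interaction set. All names are illustrative; the factorised terms are assumed to have been computed as described above.

def decomposed_cell_update(beta_c_prev, gamma_c_prev, f_terms, i_terms, c_terms,
                           relevant=frozenset({('beta', 'beta'), ('beta', 'b')})):
    # f_terms, i_terms and c_terms are dicts with keys 'beta', 'gamma' and 'b',
    # holding the factorised contributions to f_t, i_t and c~_t (Eq. 3.14-3.16).
    def is_relevant(p, q):
        # Interactions are treated as unordered pairs, e.g. b-beta counts as beta-b.
        return (p, q) in relevant or (q, p) in relevant

    beta_c, gamma_c = 0.0, 0.0
    # Forget-gate cross-term (Eq. 3.22): 3 x 2 = 6 interactions.
    for p, f_part in f_terms.items():
        for q, c_part in {'beta': beta_c_prev, 'gamma': gamma_c_prev}.items():
            if is_relevant(p, q):
                beta_c = beta_c + f_part * c_part
            else:
                gamma_c = gamma_c + f_part * c_part
    # Input-gate cross-term (Eq. 3.23): 3 x 3 = 9 interactions.
    for p, i_part in i_terms.items():
        for q, cand_part in c_terms.items():
            if is_relevant(p, q):
                beta_c = beta_c + i_part * cand_part
            else:
                gamma_c = gamma_c + i_part * cand_part
    return beta_c, gamma_c

With the default interaction set {β-β, β-b}, exactly the five interactions of Eq. 3.24 end up in the β partition; passing a different set yields the generalised variants discussed in Section 3.2.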

⁷ Singh et al. (2019) assign the b-b interaction to β when x_t ∈ φ, and to γ otherwise. We denote this interaction as b-b∈φ.

Recap This constitutes the complete procedure of Contextual Decomposition. We briefly recapitulate the overall process.

CD receives as input the phrase of interest φ, for which the contributions will be computed. At initialisation we set γ^c_0 and γ^h_0 to the initial states c_0 and h_0, and the β states to 0-valued vectors.⁸ The first token is processed, and added to β if it is part of φ and to γ otherwise (Eq. 3.14). We then create the Shapley contributions of β and γ to the activation functions. This yields a sum of contributions, allowing us to investigate the interactions between individual flows of information. The multiplicative gates can now be reformulated as a large cross-term of interactions. For each possible interaction we define whether it is deemed relevant or irrelevant. All relevant interactions are added to the β partition, and all irrelevant interactions to the γ partition (Eq. 3.24). This is done for the three gates of the LSTM, after which the procedure is repeated for the next token. In a multi-layer LSTM, β and γ parts are not only propagated forward, but also upward, where they are added to their respective parts in the layer above them. Note that the whole procedure needs to be run separately for each set of features for which we want to compute a contribution. This can be considered one of the major downsides of CD.⁹ We provide a graphical overview of CD in Figure 3.2.
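Putting the pieces together, the overall loop can be sketched as follows; gcd_step is a hypothetical helper standing in for the split, factorise and route steps described above, and the whole snippet is a pseudocode-style illustration rather than our actual implementation.

def contextual_decomposition(tokens, phrase_positions, lstm_params, h0, c0, gcd_step):
    # Computes the relevant (beta) and irrelevant (gamma) parts of the hidden
    # state at every step, for the phrase covering `phrase_positions`.
    beta_h, gamma_h = 0.0 * h0, h0   # the beta partition starts out empty
    beta_c, gamma_c = 0.0 * c0, c0
    outputs = []
    for t, x_t in enumerate(tokens):
        in_phrase = t in phrase_positions
        # gcd_step (hypothetical): 1) split pre-activations into beta/gamma/b (Eq. 3.14),
        #                          2) factorise each gate with Shapley contributions (Eq. 3.15-3.18),
        #                          3) route interactions to beta or gamma (Eq. 3.22-3.25).
        (beta_h, gamma_h), (beta_c, gamma_c) = gcd_step(
            x_t, in_phrase, (beta_h, gamma_h), (beta_c, gamma_c), lstm_params)
        outputs.append((beta_h, gamma_h))
    return outputs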

3.2 Relevant Interactions

What makes an interaction relevant? This question has been addressed in the previous section, but only one specific interaction set has been considered: {β-β, β-b}. To obtain a better insight into how different interactions contribute to the final prediction, we experiment with various ways of defining the set of relevant interactions. This gives rise to a more flexible method that we call Generalised Contextual Decomposition (GCD).

β-γ A particular case concerns the interactions between β and γ. It would not be correct to completely attribute the information flowing from these interactions to the phrase in focus. However, disallowing any information stemming from interactions of a phrase with subsequent tokens will result in a loss of relevant information. Consider, for instance, the prediction of a verb in a number agreement task. While the correct verb form depends only on the subject, the right time for this information to surface depends on the material in between, which in the setup described in Figure 3.1 would be discarded by assigning all β-γ interactions to γ.

Distributing the contributions to a multiplicative interaction has been addressed by Arras et al. (2017), in the context of LRP. LRP is similar to CD, but creates relevance contributions using a modified backward pass through the network, instead of a modified forward pass. In LSTMs multiplicative gates are of the form z_j = σ(z_g) · z_s. Due to the backward nature of LRP, relevance is propagated back from the gated output z_j, and Arras et al. propose to assign this relevance entirely to the source term z_s, treating the gate σ(z_g) as a constant that receives no relevance.
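If we were to adopt this LRP rule directly, the relevance routing for a single gated multiplication could be sketched as below; this is our reading of the Arras et al. proposal, simplified to a single connection with illustrative names.

def lrp_gated_multiplication(relevance_out):
    # For z_j = sigmoid(z_g) * z_s, the gate z_g is treated as a constant:
    # it receives zero relevance, and the source z_s receives all relevance
    # that flows into z_j.
    relevance_source = relevance_out
    relevance_gate = 0.0
    return relevance_source, relevance_gate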

⁸ In case the contribution of the initial state is computed, this is done vice versa.

⁹ In our implementation of CD, we provide a slight speed-up by using the original states of the model up till the first step T that is part of φ. This way we do not need to run CD up to step T, as the β partition does not contain any information yet: h_{<T} = γ_{<T}.
