
A conceptual framework for interpretability methods for artificial neural networks

Abstract:

In recent years, surging interest in explainable artificial intelligence (XAI) has led to a significant proliferation in the number and variety of interpretability methods for artificial neural networks (IMANN). However, rather than being derived as general tools, most of these methods have been developed in an isolated, domain-specific manner. As such, it is often unclear how these IMANN are related or whether their innovations may have applications outside of their original domain. This lack of cohesion limits IMANN’s usefulness and may hinder the field’s progress. To overcome these issues, this paper asks “how can we understand the shared conceptual space of IMANN?”. We answer this question by drawing on theoretical XAI literature and methodological IMANN literature to develop a conceptual framework within which diverse IMANN may be situated, compared, contrasted and combined. The framework is a three-dimensional space, defined by the following axes; 1) the hypothesis-driven vs. data-driven axis, 2) the dataset-centric vs. network-centric axis, and 3) the local vs. global axis. To illustrate this conceptual space, we describe a number of “exemplar” IMANN (along with associated applications and findings) that characterise the two poles of each axis. We also discuss the validity and usefulness of each axis, alongside some methodological issues that the axes highlight. Finally, we illustrate how the axes may be combined to classify specific IMANN and to help users navigate their shared conceptual space when putting IMANN to use.

Contents

Abstract
1. Introduction
2. Prior work
2.1 Types of system
2.2 Types of explanation
2.3 Types of stakeholder
2.4 Levels of analysis
2.5 Targets of explanation
2.6 Discussion
3. Aims and approach
4. Why are ANNs opaque systems?
5. Defining the axes
6. The hypothesis-driven/data-driven axis
6.1 Diagnostic classifiers, as a hypothesis-driven exemplar
6.2 Contextual Decomposition, as a data-driven exemplar
6.3 Discussion
7. The dataset-centric/network-centric axis
7.1 Saliency-based class model visualisation, as a network-centric exemplar
7.2 Class saliency mapping, as a dataset-centric exemplar
7.3 Discussion
8. The local/global axis
8.1 Layer-wise Relevance Propagation, as a local dataset-centric exemplar
8.2 Spectral Relevance Analysis, as a global dataset-centric exemplar
8.3 Diagnostic classifiers, as a local network-centric exemplar
8.4 Representational similarity analysis, as a global network-centric exemplar
8.5 Discussion
9. Combining the axes
10. Conclusion
References
Appendix
Convolutional Neural Networks
Recurrent Neural Networks
Long Short-Term Memory networks


1. Introduction

Artificial neural networks (ANNs) have proven to be one of the most successful paradigms in machine learning (ML), setting new state-of-the-art performance standards on a variety of problems, in domains as structured and symbolic as natural language processing (NLP) and as open-ended as computer vision or control. Nonetheless, this paradigm has at least one major shortcoming: ANNs are typically characterised by a lack of transparency. It is often extremely difficult to accurately infer how ANNs solve the problems to which they are applied, what information they use to do so, how they process that information, or why they make certain decisions. ANNs’ lack of transparency calls their reliability, trustworthiness and accountability into question, limiting their usefulness to a number of ML stakeholders. Furthermore, this is a frustrating hindrance to those who are interested in uncovering the basis of learning and intelligence, whether artificially instantiated or otherwise.

Perhaps in response to the proliferation and success of opaque ML systems, the last decade has witnessed a considerable growth of interest in the field of explainable artificial intelligence (XAI). This growth has been accompanied by the development of a suite of new interpretability methods for ANNs (IMANN). Many of these methods have been designed to answer specific questions or to address particular shortcomings of ANNs and, consequently, rather than being derived as general methods, most have been developed within and are tailored towards one of the many diverse domains where ANNs are used. As such, the field lacks cohesion. Whilst much progress has been made, this represents a missed opportunity. There may be many occasions where researchers using an IMANN in one domain could benefit from the valuable insights of another, or where more powerful methods could be developed through the combination of existing tools. Such opportunities are not currently being seized.

Given the above, we believe that the field of IMANN has reached a level of maturity where it stands to benefit from the “mapping-out” of a holistic, integrative conceptual framework. This is the primary goal of this paper. Specifically, we address the question “how can we understand the space of interpretability methods for artificial neural networks?”. We aim to answer this question by drawing on insights from both existing theoretical work on XAI and methodological work on IMANN developed in a variety of domains. We combine these insights to define a three-dimensional conceptual space within which any IMANN can be situated. The space is characterised by three axes, which may be used to describe IMANN. These are; the hypothesis-driven/data-driven axis, the dataset-centric/network-centric axis and the local/global axis. To support the reader’s comprehension of the axes, we describe a number of IMANN which may be understood as exemplars of the poles of each axis. We also discuss the validity of the axes, alongside the strengths and weaknesses of referring to each axis to understand IMANN. Finally, to provide an answer to our research question, we combine the three axes to show how they may be used in tandem. We hope that this framework makes the following three contributions; 1) helping users of IMANN to understand and navigate the space of available methods, 2) providing theorists and researchers with a schema through which IMANN may be easily compared, contrasted and perhaps combined, and 3) identifying some of the shortcomings of the field as it stands.

The paper is structured as follows. Section 2 consists of a review of existing theoretical XAI work and draws out key themes, insights and practical limitations of that body of work. Section 3 describes the aims and approach of the current paper in depth. Section 4 illustrates why ANNs are opaque systems. Section 5 defines the axes which characterise our conceptual framework. Sections 6, 7 and 8 describe the three axes and their corresponding exemplar methods in depth. Section 9 discusses how they may be combined. Section 10 consists of concluding remarks.

2. Prior work

Several prior publications have sought to outline the important themes of, and codify the terminology around, XAI. By conducting a brief review of this literature, we have identified five overlapping core themes, despite a general inconsistency in terminology. These themes are system types, explanation types, stakeholder types (or explanation audiences), levels of analysis and targets of explanation. This section reviews these themes in reference to a selection of recent publications which best characterise common perspectives in the field. It concludes by identifying some limitations of this literature.

2.1 Types of system

A common theme in the literature is the attempt to define or classify the systems we wish to interpret according to an observer’s ability to understand them. Commonly used terms related to this theme include; opaque/transparent, black-box/glass-box, comprehensible, interpretable, understandability and intelligibility (amongst others). There is significant overlap between these terms and their meaning is not widely agreed upon. Given this, Doran et al. (2017) conducted a corpus-based analysis of interpretability terms in four research communities that rely on ML methods (specifically, the proceedings of the Annual Meeting of the Association for Computational Linguistics, the Annual Conference on Neural Information Processing Systems, the International Conference on Computer Vision and the Annual Conference of the Cognitive Science Society, between 2006 and 2016). Using a shallow linguistic search for explanation terms, they identified related explanation terms which were important in each community. These terms guided the formulation of three summarising concepts, which delineate different types of system understandability:

• Opaque systems - refers to systems whose inputs and outputs can be observed, but whose mechanisms mapping inputs to outputs are unknown or unobservable. Furthermore, inspecting the implementation of the system does not reveal the working of these mechanisms (e.g. the system may implement some symbolic system, which cannot be easily deciphered from its implementation). Opaque systems are often also referred to as ‘black box’ systems. Both human and ML cognitive systems exhibit some degree of opacity.

• Interpretable systems - refers to systems whose inputs and outputs can be observed, where it is also possible to observe and understand the mechanisms mapping inputs to outputs. Interpretable systems may also be called ‘glass box’ systems and are associated with system ‘transparency’.

• Comprehensible systems - refers to systems that produce external symbols (e.g. text or images) which a user may interpret to explain how inputs are mapped to outputs. Importantly, such symbols may not necessarily accurately describe the system’s operation. For instance, when a human describes their internal thought processes, they are producing symbols which make their opaque cognitive system comprehensible, but not transparent. Humans’ subjective introspective statements may fail to accurately describe the mechanisms of the cognitive systems that produce them. Neural image captioning (e.g. Vinyals et al. 2014) could be seen as a comparable example from ML.

Lipton (2016) suggests that system transparency can be understood in multiple ways. They introduced two terms illustrating this point:

• Simulatability - refers to the degree to which an observer can contemplate (or mentally simulate) the entire system at once. This indicates that being able to observe a system’s mechanisms may not be sufficient for understanding that system, as the scale or complexity of the interaction of those mechanisms may be too large to comprehend. Many systems that would be deemed interpretable under Doran et al. (2017)’s framework (e.g. sufficiently large or complex rule-based systems or decision trees) may lack simulatability and as such possess a degree of opacity.

• Decomposability - refers to the degree to which an observer can identify and explain the components of the system and how they interact. For instance, certain complex systems may be apparently transparent under Doran et al. (2017)’s definition, but may lack decomposability due to their tendency to produce emergent phenomena which cannot be understood as properties of decomposed parts.

Lipton’s (2016) notions of simulatability and decomposability illustrate the essential fuzziness of system transparency. Given this, whilst Doran et al. (2017)’s categories are extremely useful, they should not be understood as absolute, or mutually exclusive. Both opaque and interpretable systems may also be comprehensible, and many systems may not be neatly described as either opaque or interpretable. Opacity and interpretability are best understood as opposite ends of a ‘transparency spectrum’, along which particular systems with different degrees of simulatability and decomposability may lie. Furthermore, the degree to which these terms apply to a system may be subjective and user-dependent. For instance, some users may be more willing or able to interpret the symbols output by a comprehensible system, or to accept the validity of those interpretations. Systems which appear opaque might be rendered interpretable to users who have the correct tools or knowledge.

2.2 Types of explanation

Another core theme is the types of explanation available for understanding and interpreting AI systems. Both Lipton (2016) and Preece et al. (2018) delineate between transparency-based explanations and post-hoc explanations. As the name indicates, transparency-based explanations involve observing and elucidating the mechanisms of a transparent or interpretable system. In contrast, post-hoc explanations explain what a system does or why it behaves a particular way under certain conditions, but do little to elucidate the mechanisms underlying those processes. Although post-hoc explanations are less informative, truly opaque systems (if such systems exist) may only be understandable on a post-hoc basis. As such, post-hoc explanations may be extremely valuable. Post-hoc methods are often associated with comprehensibility (e.g. Lipton (2016) cites methods that analyse a system to generate visualisations or text outputs, and explanation by analogy, as examples of post-hoc explanation). A significant disadvantage of this type of explanation is that it is often heavily reliant on subjective interpretation and thus is open to bias and misinterpretation.

Hoffman et al. (2018) addressed this theme via an interdisciplinary literature review of 700+ studies (from fields including psychology, philosophy and AI) which aimed to identify and integrate perspectives on explanation. They distinguish between various forms of explanation, including contrastive explanations, which address why a system did X instead of Y, counterfactual reasoning, which addresses under what conditions the system would do Z, and mechanistic explanation, which describes the causal chains that underlie how the system did X, Y or Z. The authors also discuss the importance of understanding how types of explanations might be used to produce understanding and how this relates to their quality. For instance, they suggest that instances of explanation (e.g. a visualisation) should not be treated as stand-alone explanation objects, but instead as part of an ongoing process of explanation. As such, some explanations may act as heuristics to guide further exploration. For example, some systems may appear opaque until post-hoc explanations provide a foundation on which to build a testable theory of the system’s mechanisms, potentially rendering it interpretable.

The language used by Hoffman et al. (2018) and Lipton (2016) indicates that the types of explanation available are closely related to the type of system being explained. Lipton’s notion of post-hoc explanation clearly shares a great deal of overlap with Doran et al. (2017)’s notion of comprehensible systems. Hoffman et al. (2018)’s explanation types refer to system features which may or may not be observable (i.e. the mechanisms of interpretable and opaque systems). The types of explanation identified in these papers can also be understood as being situated across different levels of analysis (which will be addressed shortly). Furthermore, these studies indicate that we can understand types of explanation not only according to the process by which they explain the system, but also by their quality and appropriateness to the context in which they are to be deployed (e.g. as part of a particular investigation or for a particular audience).

2.3 Types of stakeholder

The third theme, types of stakeholder, illustrates that the nature of a system or explanation is partially observer-relative. For instance, as previously mentioned, the comprehensibility of a system depends upon an observer’s willingness and ability to interpret the symbols it outputs, and the quality of those symbols depends in part on the observer’s ability to draw valid conclusions from them.

Preece et al. (2018) set out to describe the relevant stakeholder groups in explainable AI. They identified four stakeholder types, each with particular interests, motivations and concerns regarding XAI. These are:

• Developers - people who build AI applications and so are concerned with using XAI to aid system development and perform quality assurance.

• Theorists - people interested in using XAI to advance AI as a field and to deepen our understanding of existing AI systems.

• Ethicists - people interested in using XAI as a tool for ensuring the accountability and auditability of AI systems.

• Users - people who engage with AI systems and who may require explanations of those systems (possibly provided by another stakeholder group). XAI may help users understand, justify and react to the behaviour of AI systems.

Whilst each group is distinct, an individual may belong to more than one. However, each of these groups requires different types of explanations. Post-hoc or comprehensibility-based explanations may be appropriate for users, but lack sufficient reliability for ethicists, sufficient detail for developers or sufficient objectivity for theorists. Similarly, mechanistic explanations may be valued by developers and theorists, but be incomprehensible or irrelevant for ethicists and users. As such, determining an explanation’s audience(s) and tailoring it to their needs is essential to ensuring explanation quality.

2.4 Levels of analysis

As mentioned above, different system elements and explanations may be situated at different levels of analysis, which observers may use to understand a system. Zednick (2019) suggests that XAI can benefit from invoking a foundational framework from cognitive science, Marr’s tri-level hypothesis (Marr 1982). Marr suggested that information processing systems should be understood across three independent but interrelated levels:

• The computational level - addresses what the system does or why it does it. Because opaque systems can be understood in terms of their inputs and outputs, they can be described well at this level. Lipton’s (2016) examples of post-hoc analysis, Hoffman et al.’s (2018) contrastive explanations and counterfactual reasoning can all be situated here. Preece et al.’s (2018) ethicist and user groups would most likely be interested in this level of analysis.

• The algorithmic level - addresses how the system does what it does. This relates to the previous level in that it is concerned with describing the specific rules, processes or mechanisms governing how the system’s computations are carried out. An opaque system’s opacity is specifically due to its resistance to description or understanding at this level. Hoffman et al.’s (2018) mechanistic explanation is situated here.

• The implementation level - addresses where the computations or algorithms are instantiated. In opaque systems, even when we have a complete description of the implementation level, inspecting this will not reveal how the algorithmic level functions.

Using Marr’s levels, we can be more explicit about the types of questions we wish to ask about an AI system and the types of methods that might be appropriate to answer those questions. This framework also indicates that whilst a system may not meet Doran et al. (2017)’s definition for interpretability, it can be said to be understood in some sense if a useful explanation can be advanced at another level of analysis.

In addition to delineating different stakeholder groups, Preece et al. (2018) show that these groups may bring distinct perspectives which provide alternative levels of analysis. They suggest integrating perspectives from developers, specifically system engineering, with those of theorists and ethicists, specifically epistemology. From system engineering, they adopt the verification vs. validation framework:

• Validation - is concerned with building the right system, i.e. ensuring that the system is designed to meet its user’s specified needs. In XAI, a validation-based approach is one that aims to design systems which are inherently transparent or interpretable (or at least comprehensible).

• Verification - is concerned with building the system right, i.e. ensuring that the system actually does what it is designed to do. In XAI, verification is associated with post-hoc methods which can explain what the system does and why.

These perspectives share some overlap with Marr’s levels (validation is mainly concerned with the algorithmic and implementational levels, whereas verification addresses issues at the computational level), although they are more oriented to system design than system analysis. From epistemology, they adopt the knowns and unknowns framework to identify system elements we may wish to explain:


• Known knowns - include things that are understood by stakeholders and in the scope of the system, such as the system’s training, test and validation sets. They suggest the goal of transparency-based methods is to “trace” the processing of known knowns from the system’s input to output.

• Known unknowns - refers to the wider space in which the system is designed to operate (e.g. inputs from “the real world” following deployment).

• Unknown knowns - refers to things outside the scope of the system, but potentially known by stakeholders (e.g. training set bias). Unknown knowns are often of particular interest to ethicists.

• Unknown unknowns - includes general uncertainties in the world beyond the scope of the model. Since these undermine the robustness and transferability of AI systems, how systems handle unknown unknowns is of great interest to theorists.

This framework offers a distinct, albeit complementary, perspective to Marr’s levels. Whereas Marr’s levels aid the analysis of the system itself, Preece et al. (2018)’s interpretation of the knowns and unknowns framework illustrates that we may also direct our analysis to elements that are extrinsic to the system, but nonetheless important for understanding its functioning (i.e. its inputs and its environment or context at large).

Preece et al. (2018) suggest combining these perspectives from system engineering and epistemology to generate “comprehensive explanation objects” spanning multiple levels of analysis. Their suggested object is a three-tiered set of explanations which can satisfy multiple stakeholder groups. The first layer is concerned with “traceability” and aims to determine whether the system did the right thing. This layer contains focused, technical transparency-based explanations which refer to the internal states of the system. The second is concerned with “justification” and aims to determine whether the system did the thing right. This layer consists of post-hoc explanations which can be linked back to those at the traceability layer. The final layer is concerned with “assurance” and is the most general and least technical layer. It aims to address whether the system generally does the right thing (i.e. exhibits desirable behaviour across a range of circumstances, including in the face of unknown unknowns). This layer consists of post-hoc explanations which can be linked to and summarise those of the justification layer.

These three perspectives show that different types of multi-layered analyses can be useful in XAI. Specifically, these can be applied to refine our approach to developing an AI system, describing and analysing an AI system, and describing or analysing the information that the system is designed to process. By combining different multi-layered perspectives, we may develop frameworks that integrate various types of explanation and package them for a variety of audiences.

2.5 Targets of explanation

The final theme from the literature is targets of explanation. Whilst the general goal of XAI explanations is to aid an observer’s understanding of the entire system, any explanation must make reference to some aspect of the system, its behaviour or context. Hoffman et al. (2018) implicitly acknowledge this in their review, as a suggested take-away from the AI literature is the importance of distinguishing between local explanations, which address the system’s behaviour in specific circumstances, and global explanations, which address the system’s behaviour in general (note the overlap with Preece et al. (2018)’s “justification” and “assurance” layers). Preece et al. (2018) also acknowledge that the focus of analysis may be shifted from the system (i.e. validation vs. verification) to the system’s context (i.e. knowns and unknowns) and that these approaches can be combined for more comprehensive explanations. Whilst these perspectives illustrate some engagement with this theme, they are not especially developed analyses.

The most explicit and in-depth discussion of this theme comes from Zednick (2019), who aims to describe a recipe for overcoming the black box problem. They argue that system opacity is observer-relative. Specifically, it is dependent upon the observer’s knowledge (or lack thereof) of “epistemically relevant elements” (EREs). EREs are aspects of the system which an observer may reference to explain or understand the system. Thus, to generate an explanation which can make an opaque system understandable, we must first determine which observers require an explanation and which EREs are most appropriate for them. Whilst this has some similarity to Preece et al. (2018)’s explanation object (which can be unpacked according to the needs of specific stakeholders), it has the distinct advantage of acknowledging that different audiences not only require different types of explanations, but explanations that refer to different system elements.

This raises the question: what features might be considered EREs? Unfortunately, Zednick (2019) does not delve too deeply into specific examples of EREs from existing AI systems. However, they do invoke Marr’s levels to illustrate how an observer’s questions about a system can guide our attention towards the EREs. For instance, algorithmic-level “how…” questions call for EREs that refer to internal states and properties of the system. In contrast, computational-level “why…” questions refer to representational content, and as such the EREs must include that which is represented, i.e. features in the system’s environment. Zednick (2019) also argues that methods that capture a system’s abstract mathematical properties and recast them as semantic interpretations of the system are necessary, because raw parameter values will rarely be sufficient EREs. This suggests that EREs will often be derived from comprehensibility-based methods. A more detailed analysis is needed to outline a method for determining which elements from these domains are the most epistemically relevant for particular observers, or how parameters ought to be selected for processing to generate EREs. However, by describing an approach for conceptualising and selecting particular targets of explanation, Zednick (2019) makes an extremely valuable contribution to the XAI literature.

2.6 Discussion

This review demonstrates the highly interdisciplinary nature of XAI, showing that theorists are drawing on a diverse range of fields (including linguistics, psychology, philosophy, cognitive science, computer science and systems engineering, amongst others) to develop frameworks to tackle this complex subject. Despite the diversity of the perspectives drawn on in the literature, our review shows a significant degree of overlap and conceptual harmony between the frameworks being advanced. This is promising, as it suggests that the field is coalescing around commonly accepted principles. Nonetheless, the range of approaches and themes tackled indicates that these principles may be flexibly employed across the broad range of enquiries in the space of XAI.

Whilst this literature lays out a valuable set of theoretical foundations, when considered from the perspective of those who wish to employ its insights in practice, it has two main weaknesses. Firstly, throughout these papers, there is relatively minimal engagement with the actual tools available for XAI. This is understandable, as there is an enormous range of approaches to AI itself, never mind XAI tools, and elaborating on these is not within the scope of a single theoretical paper. However, an unfortunate side effect of this abstracted approach is that it tends to present explanation in a somewhat idealised light. For instance, Doran et al. (2017) are correct to note that comprehensibility-based explanations have the potential to mislead, and both Preece et al. (2018) and Hoffman et al. (2018) are right to emphasise the importance of developing comprehensive explanations which hold across many circumstances. However, for many problems (such as explaining how a computer vision AI solves its task) methods which meet these concerns may not yet exist. In these instances, it is unclear how to use their frameworks to compare, contrast and deliberate between available methods to determine their appropriateness, or how to account for the limitations of those methods. The second weakness of this body of work is that, whilst it raises many overlapping concerns that practitioners of XAI may wish to consider, it is not clear which of these concerns to prioritise or how they might be balanced in the selection of particular XAI tools.

3. Aims and approach

Given the limitations noted above, this paper aims to present a conceptual model which reflects the insights of theoretical XAI work whilst simultaneously bridging the gap to the current tools and methods available for XAI. Whilst it addresses similar themes to those mentioned above, it aims to overcome the limited practical use of that work by specifically focusing on describing and classifying currently available tools, not just the types of systems they analyse or the types of explanations they provide. To concentrate the scope of this paper, we exclusively consider ANNs in a supervised learning paradigm (i.e. opaque, but possibly comprehensible systems) and take our primary audience to be AI theorists. As such, we do not address the tailoring of explanations to specific types of system or types of stakeholder.

This paper’s core approach is to define a conceptual space within which existing interpretability methods for artificial neural networks (IMANN) can be situated. This aims to serve three practical purposes, 1) describing the key axes of the space of IMANN, thus placing individual methods in a wider context and helping users of IMANN to understand and navigate that space, 2) providing criteria through which IMANN can be easily and directly compared, contrasted and perhaps combined, and 3) identifying some of the shortcomings of the field as it stands.

To illustrate the scope of these axes, we describe and evaluate several existing IMANN which serve as exemplars of the axes’ poles. These exemplars were chosen to exhibit the broad range of methods and problem spaces that this framework may account for. However, this paper does not aim to provide a complete description of the contents of the space of IMANN and as such does not give a comprehensive overview of all currently available methods. Furthermore, this paper does not aim to produce a definitive theory for understanding XAI (e.g. as Zednick (2019) does), but instead a practical “conceptual map” for understanding IMANN. As such, this framework may need to be reviewed and refined to reflect future theoretical and methodological advances that shift the terrain of XAI. Finally, as with any map-making, this process will necessarily entail some simplification of the subject’s complexity. Given this, the limitations and validity of the proposed axes will be evaluated relative to their usefulness.

Before describing the axes in depth, we will review how ANNs learn and why this makes them opaque models. The functioning of some specialist ANN designs referenced in later sections is reviewed in the appendix.

4. Why are ANNs opaque systems?

ANNs are commonly understood to be opaque systems. To understand why, it is instructive to consider their design and how they learn to carry out their tasks. An ANN is a graph of artificial neurons (aka ‘nodes’ or ‘units’) organised into a series of layers and connected via edges, along which signals can be propagated. The first layer is the ‘input layer’, the last is the ‘output layer’ and all intermediary layers are referred to as ‘hidden layers’ (see Fig. 1).

Fig. 1. This diagram illustrates a simplified wiring diagram for a basic feedforward neural network. The circles correspond to neurons and the arrows correspond to edges, along which activation signals are propagated. From: https://en.wikipedia.org/wiki/

Each neuron holds a real number describing the extent of that neuron’s activation. In a standard feed forward neural network (FFNN), the neurons in a particular layer propagate their activation signal forwards to the neurons to which they are connected in the following layer. However, this signal is not propagated unchanged. The signal passing along any particular edge is modified by that edge’s weight, which may be understood as the ‘strength' of the connection between the two neurons. Each neuron also has a bias, which can increase or decrease its overall activation. Finally, each neuron’s activation may be scaled by a non-linear activation function (e.g. the tanh function) before being output to the next layer. As such, the activation of any neuron (except those in the input layer) is determined by the weighted sum of the output of the neurons to which it is connected in the previous layer. Formally:

$a_j = f\left(\sum_{i=1}^{n} a_i w_{ij} + b_j\right)$

where $a_j$ is the activation of neuron $j$, $f$ is the activation function, $w_{ij}$ is the weight of the edge connecting neuron $i$ in the previous layer to neuron $j$, and $b_j$ is the bias. The only exception to this is the input layer, where the neurons’ activations are determined by the input. As such, the activation signal is first determined by the input and then propagated to and modified by each layer according to the above equation, until the output layer has been reached. In a supervised learning paradigm, the network is trained on a set of labelled data. The number of neurons in the output layer is typically equal to the number of labels in the dataset, allowing each output neuron to correspond to a label. As such, once the network’s signal has propagated from the input layer to the output layer, the output values can be compared to the input’s label. The difference between the output and the input label is then summarised by an error function. When the output is similar to the label, the value of the error function will be small, whereas when the output is dissimilar to the label the value of the error function will be large. Thus, for the network to perform its task well, it must learn to implement a function that reliably transforms the activation signals produced by its inputs into outputs that minimise the network’s total error for all inputs.
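To make this concrete, the following minimal sketch (in NumPy) implements the forward pass described by the equation above for a toy network; the layer sizes, random weights and choice of tanh activation are purely illustrative assumptions, not details of any particular network discussed in this paper.

```python
import numpy as np

def layer_forward(a_prev, W, b, f=np.tanh):
    # Computes a_j = f(sum_i a_i * w_ij + b_j) for every neuron j in the layer at once.
    return f(a_prev @ W + b)

rng = np.random.default_rng(0)

# Illustrative 4-5-3 network: 4 input neurons, 5 hidden neurons, 3 output neurons
# (one per label). Weights are random stand-ins; a trained network would have learnt them.
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

x = rng.normal(size=4)                   # input layer activations are set by the input
hidden = layer_forward(x, W1, b1)        # propagate to the hidden layer
output = layer_forward(hidden, W2, b2)   # propagate to the output layer
print(output)
```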

The network may be able to implement this function if its adjustable parameters (i.e. its weights and biases) are set correctly. However, because neural networks have such a large number of adjustable parameters, analytically determining which combination of settings would optimally reduce the network’s error over its entire training set is extremely difficult. Instead, the parameter values are adjusted automatically using an optimisation algorithm (e.g. stochastic gradient descent, momentum, ADAM (Kingma & Ba 2014), etc.) and the back-propagation of error (Rumelhart et al. 1986). These methods are used to calculate changes to each parameter’s value that will “step down” that parameter’s local error gradient. By repeatedly taking small steps down the error gradient of each parameter, the network’s cost function is gradually minimised until a set of parameter values is found which implements a function mapping inputs to outputs within some acceptable margin of error. Crucially, this optimisation is achieved without the direct intervention or oversight of a programmer.
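Schematically, each optimisation step applies an update of the form sketched below (a deliberately simplified rendering of a single gradient descent step; the computation of the gradient itself by back-propagation is assumed and not shown):

```python
def gradient_descent_step(param, grad, learning_rate=0.01):
    # "Step down" the parameter's local error gradient; grad is assumed to hold the
    # partial derivative of the error with respect to this parameter, as computed
    # by back-propagation over the current batch of inputs.
    return param - learning_rate * grad
```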

As described above, we can observe and quantify the system’s inputs and outputs in their entirety. We also have a complete description of the structure of the network and the rules that govern the flow of signals through that structure. Furthermore, we can observe any of the network’s learnt parameters and we know the rules that govern how the values of those parameters were set. In fact, we have a complete description of the system’s implementation. Given this, why is this system opaque? Despite the wealth of implementational details available, there is currently no known way to clearly translate these into an algorithmic-level description of how these settings correctly map inputs to outputs. In some cases, we may be able to assign mechanistic functions to components of a network (such as the input and output layers, or specialist features such as CNNs’ pooling layers or LSTMs’ forget gates), but in these instances we are often describing generic mechanisms without elucidating their specific roles in relation to semantic features of the input. To borrow Zednick (2019)’s terminology, implementational information about an ANN does not provide us with the EREs to understand how it completes its task. However, even if we did have a means to bridge these two levels of analysis, there are further complications. The extremely large number of parameters the ANN uses to implement its function, and the negligible influence of the system’s users in setting those parameters, mean that the system has poor simulatability. The complexity, non-linearity and high dimensionality of the interaction of the network’s parameters also make it extraordinarily difficult to meaningfully analyse them in isolation. Furthermore, these parameters are not optimised with regard to a single input, but to an entire training set, and typically even single iterations of optimisation are performed in response to batches of inputs, rather than single instances. As such, the system also has poor decomposability. Whilst these properties underlie the success of ANNs, allowing them to find surprising solutions to complex problems, they are also the reason why they belong to the category of opaque systems.

5. Defining the axes

As mentioned previously, the goal of this paper is not to advance a new theory of XAI, but instead to connect existing ideas from the theoretical and methodological literatures to produce a new conceptual framework to guide the practical application of IMANN. As such, the axes to be defined draw from and overlap with both of these literatures. However, instead of offering these axes as standalone frameworks (as their precursors have appeared in these literatures), the aim is to show that they are conceptually complementary and ought to be used in conjunction with each other.


The first axis addresses the style of analysis that a particular IMANN allows for, whereas the second and third address where methods place the locus of their analysis. Specifically, the axes are:

• Hypothesis-driven methods/data-driven methods.

• Dataset-centric methods/network-centric methods.

• Local methods/global methods.

In the following three sections, each axis is described in depth and its influences in the prior literature are noted. For each axis, we describe two IMANN which can be considered exemplars of the axis’s two poles, along with some findings illustrating the use of each exemplar IMANN (Table 1 shows various IMANN, including the exemplars, classified according to our framework). We also consider the validity and usefulness of each axis. In the example findings provided, some of the exemplars have been applied to specialist ANN architectures (specifically convolutional neural networks, long short-term memory networks and gated recurrent units); descriptions of these architectures are provided in the appendix for reference.

6. The hypothesis-driven/data-driven axis

The first axis to outline is hypothesis-driven/data-driven methods. This axis originates outside of the XAI literature, first appearing in Kriegeskorte et al. (2008), a systems neuroscience paper introducing Representational Similarity Analysis (a technique adopted in XAI, as is discussed later). This axis has also been referenced in recent XAI research papers (e.g. Jumelet et al. 2019). However, the general concept predates both systems neuroscience and machine learning, as it is essentially informed by the scientific method. Hypothesis-driven methods are designed to test theoretically informed, pre-determined hypotheses about how a network or some component of the network completes its task. As such, they help to explain neural networks’ functioning by eliminating some possible explanations whilst finding evidence consistent with others.

Method | Hypothesis-driven/data-driven | Dataset-centric/network-centric | Local/global
Diagnostic classifiers (Hupkes et al. 2018) | hypothesis-driven | network-centric | local
Contextual Decomposition (Murdoch et al. 2018) | data-driven | dataset-centric | local
Class Model Visualisation (Simonyan et al. 2014) | data-driven | network-centric | global
Class Saliency Mapping (Simonyan et al. 2014) | data-driven | dataset-centric | local
Layer-wise Relevance Propagation (Bach et al. 2015) | data-driven | dataset-centric | local
Spectral Relevance Analysis (Lapuschkin et al. 2019) | data-driven | dataset-centric | global
Representational Similarity Analysis (Kriegeskorte et al. 2008) | hypothesis-driven | network-centric | global
Deconvolution (Zeiler & Fergus 2013) | data-driven | dataset-centric | local

Table 1. This table illustrates how the axes outlined in this paper may be easily combined to describe any of the methods discussed throughout.

Conversely, data-driven methods are more exploratory in nature, and often involve processing large amounts of data about a network or its inputs (or both) without theoretical assumptions. These typically produce rich representations (e.g. visualisations) which researchers may interpret to produce explanations or develop hypotheses. Whilst these outputs are often comprehensible (e.g. heat-maps), they may lose their usefulness if their scale or complexity becomes so large that they are no longer simulatable (e.g. an entire dataset’s worth of heat-maps). Furthermore, these outputs tend to be open to subjective interpretation and, as such, interpretative bias. In contrast, hypothesis-driven methods often constitute a form of complexity reduction (since they can reduce high-dimensional data to one or a few figures, e.g. the output of a statistical test) and thus admit more objective interpretations. Whilst this may suggest that hypothesis-driven methods should be favoured, this is not necessarily the case. The explanations produced via hypothesis-driven methods are only as good as the theories that inform them and, as is discussed at the end of this section, producing useful hypotheses about neural networks is not always easy. As such, data-driven methods should be preferred when it is difficult to determine a relevant hypothesis and exploratory analysis is required.

From the prior literature, this axis draws on Lipton (2016)’s notions of transparency and comprehensibility, as well as Lipton (2016) and Preece et al. (2018)’s delineation between transparency-based methods and post-hoc explanations. Data-driven methods typically generate comprehensible symbols to inform post-hoc explanations. Conversely, whilst hypothesis-driven methods may not always increase system transparency, they are necessary to do so (this is elaborated on in this axis’s discussion). Furthermore, hypothesis-driven methods may be seen as supporting a contrasting explanation style to post-hoc explanation, as they aim to test the quality of a priori candidate explanations.

The next two sections introduce two methods as exemplars of hypothesis-driven and data-driven methods. These are diagnostic classifiers (DCs) and contextual decomposition (CD), respectively. For each method, we discuss two studies that employ it in their analysis. This is followed by a discussion of the axis.

6.1 Diagnostic classifiers, as a hypothesis-driven exemplar

DCs are an IMANN introduced in Hupkes et al. (2018) that allow researchers to analyse the internal dynamics of an ANN. They are specifically designed for use with RNN architectures, although it is possible to see how their general concept could be adapted for use with other designs. DCs’ core concept is founded on the assertion that if an ANN is representing and performing computations upon a particular feature or variable of an input, then it should be possible to observe this by extracting the relevant information from the ANN’s internal state space. Given this, to determine whether an ANN is tracking certain information or processing an input variable in a particular manner, we must make predictions about what values that information or variable should take during processing. This series of predictions essentially corresponds to a hypothesis about how the ANN performs some aspect of its task. These predictions can be used alongside DCs to test the hypothesis. Each DC is a simple linear classifier trained to predict one value from the sequence using the hidden representations of the ANN. By observing the accuracy of the DCs’ predictions, the validity of the hypothesis can be assessed. This framework provides a general principle for quantitatively testing hypotheses about how ANNs solve their tasks, which can easily be scaled up as the ANN’s dimensionality increases.
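As a minimal sketch of this general recipe (assuming the relevant hidden representations have already been extracted from the ANN; the array shapes, random stand-in data and use of scikit-learn’s logistic regression are illustrative choices rather than details of Hupkes et al. (2018)’s implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: one row per extracted hidden representation, and the value our
# hypothesis says the network should be tracking at that point (here a binary variable).
# With real extractions these would come from the ANN under study, not a random generator.
hidden_states = rng.normal(size=(1000, 50))
hypothesised_values = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, hypothesised_values, random_state=0)

dc = LogisticRegression(max_iter=1000)   # the DC itself is a simple linear model
dc.fit(X_train, y_train)

# High held-out accuracy indicates the hypothesised information is linearly decodable
# from the hidden states; chance-level accuracy (as here, given the random stand-ins)
# counts as evidence against the hypothesis.
print("diagnostic accuracy:", dc.score(X_test, y_test))
```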


Any hypothesis which can be expressed as a set of target values for the DCs to predict can be tested in this way. Hypotheses may range from simple predictions (such as the existence of detectors for specific features of the input) to fully fledged symbolic strategies expressed at the algorithmic level. Hupkes et al. (2018) conducted an experiment illustrating the latter. They investigated how various neural models process hierarchical compositional semantic structures by using DCs to determine how neural models process phrases in an artificial arithmetical language. This language was selected as it features hierarchical compositionality, lexical items with clearly defined meanings and it avoids potentially confounding features of natural languages, such as semantic ambiguity. The language consists of words for the integers between -10 and 10, the plus and minus operators, and open and closed brackets. Grammatically correct phrases consist of bracketed arithmetic expressions whose meaning is the solution to the expression. For instance:

(one plus (two minus three))

is a grammatically correct expression whose meaning is:

(zero)

The highly symbolic nature of this language allows for the formulation of specific strategies for determining the meaning of any input expression. These strategies may be treated as competing hypotheses explaining the network’s functioning at the algorithmic level. Importantly, they can also be formalised as sequences of values for diagnostic classifiers to predict.

Hupkes et al. (2018) define two possible strategies that their neural models may use. These are the “recursive strategy” and the “cumulative strategy”. The recursive strategy is performed by computing each subtree of the expression according to its order in the expression’s hierarchy until a value for the entire expression can be determined. The cumulative strategy is performed by continuously computing a prediction for the value of the entire expression by accumulating the value of the expression’s items as they are encountered (i.e. adding or subtracting from left to right). Since these two strategies involve different series of computations, they also require the networks to store different memory contents. These can be used as the basis for the predicted sequence.
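To illustrate how such a strategy can be formalised as a sequence of target values for DCs to predict, the sketch below computes the running result that a simplified rendering of the cumulative strategy would hold after each token of a toy expression; the digit-based tokenisation and the choice to track only the running result (rather than the full set of variables considered by Hupkes et al. (2018)) are simplifying assumptions.

```python
def cumulative_result_trajectory(tokens):
    """Running prediction of the expression's outcome after each token.

    A simplified rendering of the cumulative strategy: keep an accumulated result and a
    current mode (+1 for adding, -1 for subtracting); the stack records the mode to
    restore when a bracketed subexpression closes.
    """
    result, mode, stack = 0, 1, []
    trajectory = []
    for tok in tokens:
        if tok == "(":
            stack.append(mode)
        elif tok == ")":
            mode = stack.pop()
        elif tok == "plus":
            pass                      # keep the current mode
        elif tok == "minus":
            mode = -mode              # subsequent operands are subtracted
        else:
            result += mode * int(tok)
        trajectory.append(result)
    return trajectory

# ( 5 minus ( 2 plus 3 ) ) evaluates to 0; the trajectory provides one DC target per token.
print(cumulative_result_trajectory("( 5 minus ( 2 plus 3 ) )".split()))
# -> [0, 5, 5, 5, 3, 3, 0, 0, 0]
```

The recursive strategy would instead yield the value of each completed subtree, giving a different target sequence for its own set of DCs.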

Gated Recurrent Unit (GRU) models (described in the appendix) were trained on expressions of various set lengths and tested on expressions of lengths that had and had not been previously encountered (assessing the network’s ability to generalise to new structures). The GRUs completed this task with sufficiently high accuracy to suggest they had found a solution that incorporates syntactic structure. DCs were then applied to determine whether this solution was most similar to the recursive or cumulative strategy. The DCs trained to predict the intermediary values required by the cumulative strategy attained the highest accuracy (although their performance was not entirely consistent with the strategy, as the DCs’ accuracy for right-branching sentences was poor). The nature of this task allows a prediction to be made for every time step for each hypothesis. As such, a highly detailed, temporally sensitive analysis of the ANN’s internal dynamics can be made to discriminate between hypotheses (see Fig. 2).

Fig. 2. Graphs from Hupkes et al. (2018) showing the prediction trajectories (solid lines) and hypothesised targets (dashed lines) for the DCs trained to predict the cumulative (green, top) and recursive (blue, bottom) strategies. The cumulative classifier’s prediction trajectory corresponds more closely to its targets than the recursive classifier’s, indicating that the GRU’s strategy is most similar to the hypothesised cumulative strategy. As these illustrate, DCs allow for a very fine-grained analysis of the evidence for competing hypotheses.

Not only can this level of detail be used to eliminate hypotheses that produce poorly fitting predictions, it can also indicate what must be accounted for when refining better-fitting hypotheses (e.g. accounting for right-branching expressions, or explaining why the internal state changes smoothly). As such, this example illustrates how hypothesis-driven methods, such as DCs, may help provide algorithmic-level explanations of how an ANN solves its task. As the following example will illustrate, they may also help tie algorithmic-level hypotheses to specific implementational features.

LSTMs are highly competent at tracking long-term dependencies in language (and elsewhere). However, it is not well understood how they do this. Giulianelli et al. (2018) investigated this ability, specifically using DCs to observe what number information LSTMs track when predicting subject-verb agreement in English (a subject and its corresponding present tense verb must agree on number, i.e. singular/plural; e.g. “the dogs howl” and “the dog howls” are grammatically correct, whereas “the dogs howls” and “the dog howl” are not, and this rule is maintained regardless of how many words separate the subject and verb, even when some may be potential candidates for agreement). Initially, they trained LSTMs to complete a subject-verb agreement task (modified from Gulordava et al. (2018) and Linzen et al. (2016)). They used these trained networks to generate two sets of DC training data, corresponding to whether the LSTM correctly or incorrectly predicted the input sentence’s verb number. Each dataset consisted of a series of matrices of intermediary representations from the LSTMs; specifically, the values of five LSTM components (the hidden activation, cell memory, input gate, forget gate and output gate) over the LSTM’s two hidden layers and n time steps (where n corresponds to the input’s length). These matrices were labelled with the number of the input sentence’s main verb. The DCs were trained to predict the LSTM’s classification from the intermediate LSTM representations. Ten DCs were trained in total, one per component per layer. The DCs’ prediction accuracy varied between the layers and components; DCs trained on the deeper layer’s components were generally more accurate than their shallow-layer counterparts, and the most accurate overall were those trained on the hidden activation and memory cell.

This initial exploratory analysis allowed for two observations, which informed testable hypotheses. The first observation was that the accuracy of the DCs fell when predicting from representations corresponding to time steps where neither the subject nor the verb were being presented to the LSTM. Despite this, the LSTMs could still make accurate predictions of the verb number, indicating that they must still have been tracking the subject number in some form. This informed the hypothesis that number information is dynamically encoded, using a “shallow” representation when it is encountered (i.e. when the subject or verb are being presented) and a “deep” representation to store it across other time steps. This hypothesis was tested by training DCs on internal representations from single time steps, but then testing their ability to predict the classification using data from other time steps. This produces a “temporal generalisation matrix”, which can be used to assess the DCs’ ability to generalise to other time steps. If the LSTMs represent number information uniformly (i.e. they do not implement dynamic encoding), then the DCs should generalise well. If they are unable to generalise, then this would be consistent with the dynamic encoding hypothesis. The results showed that DCs trained on time steps where neither the subject nor the verb appeared (i.e. where the encoding is hypothesised to be “deep”) could generalise well to other time steps, whereas those trained on the subject or verb time steps generalised poorly. These findings support the dynamic encoding hypothesis. A further exploratory analysis applied the above technique to LSTM components, rather than time steps, to produce a “spatial generalisation matrix”. This found that deep-encoded number information was best represented at the final time step by the hidden activation and memory cell of the LSTMs’ deeper hidden layer. This finding ties the algorithmic-level dynamic encoding hypothesis to implementation-level architectural features of the LSTM.
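A temporal generalisation matrix of this kind can be assembled along the following lines (a sketch only; the array shapes, random stand-in data and scikit-learn classifier are illustrative assumptions rather than Giulianelli et al. (2018)’s implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_inputs, n_steps, n_features = 500, 8, 32

# Stand-in data: extracted hidden states, shaped (inputs, time steps, features),
# and one label per input (e.g. singular vs. plural main verb).
states = rng.normal(size=(n_inputs, n_steps, n_features))
labels = rng.integers(0, 2, size=n_inputs)

train, test = np.arange(0, 400), np.arange(400, n_inputs)

# Cell (t, t') holds the accuracy of the DC trained on time step t when tested on
# time step t'. Poor off-diagonal generalisation is consistent with dynamic encoding;
# uniform encoding would produce good generalisation across the whole matrix.
matrix = np.zeros((n_steps, n_steps))
for t_train in range(n_steps):
    dc = LogisticRegression(max_iter=1000).fit(states[train, t_train], labels[train])
    for t_test in range(n_steps):
        matrix[t_train, t_test] = dc.score(states[test, t_test], labels[test])

print(matrix.round(2))
```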

A second observation arose from the initial exploratory analysis. DCs trained on the dataset formed from internal LSTM representations corresponding to incorrectly processed sentences were less accurate at predicting the LSTMs' classifications from the time step they were trained on than DCs trained on the dataset of correctly processed sentences. This indicates that the LSTMs failed to properly encode the number information at the point they encountered it (as opposed to it being corrupted or forgotten later). The hypothesis that incorrectly encoded information contributes to prediction errors can be tested by intervening on and "correcting" an LSTM's intermediary representations using DCs and then comparing its accuracy to that of unchanged LSTMs. As Giulianelli et al. (2018) hypothesise that number information is "corrupted" at the point it is encountered, they allowed trained LSTMs to process sentences until they reached the subject. At this stage, they paused the processing, extracted the hidden activation and memory cell values, and trained DCs to predict the sentence's verb number. Next, they used the error of the DCs' predictions to update the extracted values using the delta rule. The LSTMs then continued processing the sentences as usual. This intervention increased the LSTMs' average prediction accuracy by ~5-7% (on two related datasets), thus supporting their failed-encoding hypothesis.
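The intervention step itself can be written compactly. The following is a minimal sketch under stated assumptions rather than the authors' code: it presumes a trained linear diagnostic classifier with weight vector `dc_w` and bias `dc_b`, and a hidden state (or memory cell) vector `h` extracted while the LSTM is paused at the subject.

# Illustrative sketch: "correcting" an intermediate representation with the delta
# rule, guided by a diagnostic classifier. `h` is the paused hidden state
# (hidden_dim,), `dc_w`/`dc_b` are the trained DC's parameters, `target` is the
# true verb number (0 or 1), and `lr` is the update step size (all hypothetical).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_correction(h, dc_w, dc_b, target, lr=0.5):
    prediction = sigmoid(dc_w @ h + dc_b)  # DC's current belief about the number
    error = target - prediction            # prediction error to be reduced
    # Delta rule applied to the representation rather than the weights: nudge h
    # in the direction that lowers the DC's error, then resume LSTM processing
    # from the corrected state.
    return h + lr * error * dc_w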

These studies indicate that hypothesis-driven methods, such as DCs, can be used by researchers to determine both specific implementation-level details, such as where and when particular information is encoded, and more abstract algorithmic-level details, such as encoding strategies or general task strategies. Furthermore, Giulianelli et al. (2018)'s move from simply observing evidence that supports a hypothesis to actively intervening in the processing of a black-box ANN in accordance with a particular hypothesis illustrates the considerable explanatory power afforded by this class of methods. Nonetheless, this approach faces one clear limitation: suitable, testable hypotheses may not always be available (especially when the network is processing less symbolic and formally structured inputs, such as images or video). This is where data-driven methods excel.

6.2 Contextual Decomposition, as a data-driven exemplar

Contextual Decomposition (CD) is an IMANN introduced by Murdoch et al. (2018), developed for use with LSTMs. For any input sequence an LSTM may receive, each token in that sequence (e.g. the words in a sentence) can be treated as an 'input variable'. The aim of CD is to extract information about which input variables contributed to the LSTM's final prediction. Specifically, it allows users to select a sequence of input variables of arbitrary length and determine its contribution to the prediction. In doing so, it aims to show how the input variables were combined to produce the final prediction.

Murdoch et al. (2018) introduced CD as a solution to the difficult problem of explaining why LSTMs make particular decisions in sentiment analysis tasks. Note that, whilst still in the domain of linguistics, sentiment analysis is a considerably more ambiguous and less rule-based problem than the tasks tackled in the previous section. This makes it a particularly appropriate problem to which to apply a data-driven method. One of the challenges of explaining sentiment analysis decisions is that sentiment is highly compositional. To give a simplistic example, take the phrase:

"The student's thesis was not good"

To correctly assess the sentiment of this phrase, a language model must avoid evaluating the positive word "good" positively. Instead, the model should identify it as part of the compound phrase "not good", where "not" negates the sentiment of "good". As phrases become more complex, this problem may become compounded:

“The student’s thesis was not good - it was great!”

Examples like these show it is insufficient for a model to simply sum the sentiment of the words in an input. Instead, the model must be able to recognise and evaluate the sentiment of compositional expressions. LSTMs are known to overcome this challenge more effectively than most alternative models. Given this, if we wish to explain an LSTM sentiment analysis prediction that was arrived at using compositionality, then, to be valid, the explanation must respect and refer to that compositionality. This is what CD aims to do.
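As a toy numerical illustration of this point (with made-up lexicon values, not taken from any of the cited work), a bag-of-words score labels the first example phrase positive because the contribution of "good" outweighs "not":

# Toy illustration: summing word-level sentiment ignores composition, so the
# negation in "not good" is lost. Lexicon values are invented for the example.
lexicon = {"not": -0.2, "good": 0.8}

def naive_sentiment(sentence):
    return sum(lexicon.get(word.lower(), 0.0) for word in sentence.split())

print(naive_sentiment("The student's thesis was not good"))  # 0.6 -> wrongly positive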

CD is based on a simple observation. The final hidden state h_T of an LSTM (which, after being processed via a SoftMax function, constitutes the LSTM's prediction) is determined by the total contribution of evidence made by the input variables it has received. This contribution may be positive, negative or neutral. The role that a particular sequence has played in determining the LSTM's prediction may be explained by determining what evidence it has contributed to h_T. This requires us to decompose the total contribution of evidence into two terms: the contribution of evidence made by input variables within the sequence, β (referred to as 'in focus'), and the contribution of evidence made by input variables outside the sequence, γ. This observation can be formalised as:

p = \mathrm{SoftMax}(W h_T) = \mathrm{SoftMax}(W \beta_T + W \gamma_T)

β_T provides a quantifiable measure of how much the phrase in focus contributed to the LSTM's final prediction. To determine the value of β_T, the contribution must be partitioned from h_T into β and γ by working backwards through the operations that define an LSTM cell and disambiguating the interactions between the gates. This is achieved by linearising the activation functions³, partitioning any terms dependent upon previous states (i.e. h_{t-1}) into β and γ, and then sorting all terms according to whether they contribute to β or γ.

³ Murdoch et al. (2018) define linearisations of the sigmoid and tanh functions, allowing the gates to be expressed as linear sums of contributions from their different factors. For concision, these are not reproduced here.


For instance, assuming that x_t is currently in focus, the forget gate equation

f_t = \sigma(W_f x_t + V_f h_{t-1} + b_f)

may be rewritten as follows. First, the activation function is linearised:

f_t = L_\sigma(W_f x_t) + L_\sigma(V_f h_{t-1}) + L_\sigma(b_f)

Next, the terms determined by previous states are partitioned by focus:

f_t = L_\sigma(W_f x_t) + L_\sigma(V_f \beta^h_{t-1}) + L_\sigma(V_f \gamma^h_{t-1}) + L_\sigma(b_f)

Finally, the terms are sorted by focus:

f_t = L_\sigma(V_f \beta^h_{t-1} + W_f x_t) + L_\sigma(V_f \gamma^h_{t-1}) + L_\sigma(b_f)

Note that, were x_t not in focus, it would be included in the γ half of the equation.

By repeating these steps for the i_t, g_t and o_t equations, they are able to decompose the two products in the c_t equation (i.e. f_t ⊙ c_{t-1} and i_t ⊙ g_t)⁴. These are decomposed independently of each other, producing a set of β and γ contributions for each product. Those contributions may then be combined and summed by type to complete the decomposition of c_t.

⁴ These equations can be seen in full in the appendix.

Having decomposed c_t, it is comparatively simple to decompose h_t. The tanh activation function is linearised and c_t is partitioned into the decomposed contributions determined above⁵:

h_t = o_t \odot \tanh(c_t) = o_t \odot L_{\tanh}(\beta^c_t) + o_t \odot L_{\tanh}(\gamma^c_t) = \beta_t + \gamma_t

⁵ Murdoch et al. (2018) note that they could in principle decompose o_t as was described above for f_t, but [...]
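Tying this back to the formalisation above, the sketch below shows how the in-focus component β_T of the final hidden state yields the phrase's contribution to each class logit through the softmax layer's weights. The variable names are hypothetical and the snippet assumes β_T and γ_T have already been produced by the CD recursion; it is not Murdoch et al. (2018)'s released code.

# Illustrative sketch: from the decomposed final hidden state to CD's
# phrase-importance scores. `beta_T` and `gamma_T` are the in-focus and
# out-of-focus parts of h_T, and `W` is the softmax layer's weight matrix
# of shape (n_classes, hidden_dim); all inputs are assumed to be given.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cd_phrase_scores(beta_T, gamma_T, W):
    phrase_logits = W @ beta_T    # additive contribution of the phrase in focus
    context_logits = W @ gamma_T  # contribution of everything out of focus
    prediction = softmax(phrase_logits + context_logits)  # equals SoftMax(W h_T)
    return phrase_logits, prediction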

Murdoch et al. (2018) empirically validated CD using a series of sentiment analysis tasks. The tasks were carried out on two datasets. The first was the Stanford Sentiment Treebank (SST) from Socher et al. (2013). It consists of film reviews and has sentiment labels for both the reviews and the phrases of which they consist. The second was the Yelp review polarity dataset from Zhang et al. (2015). This dataset similarly consists of review data; however, it only has review-level sentiment labels. They trained LSTMs to predict the sentiment labels of the reviews and analysed how the input variables contributed to the LSTMs' predictions using CD and four other interpretability methods: cell decomposition, integrated gradients, leave one out, and gradient times input. First, they tested the proficiency of the interpretability methods at determining the importance of single words to the LSTMs' predictions. This was done by correlating the importance scores with


logistic regression coefficients, which (when accurate) are considered highly interpretable. A positive correlation was taken to signify a good measure of word importance. For SST, CD was found to have the strongest correlation with the regression coefficients. For Yelp, CD was competitive with the best-performing alternative.
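A minimal version of this validation, assuming per-word CD scores and bag-of-words logistic regression coefficients are already available in the hypothetical dictionaries `cd_scores` and `logreg_coefs`, might look as follows:

# Illustrative sketch: agreement between word-importance scores and logistic
# regression coefficients. `vocab` lists the words present in both inputs.
import numpy as np
from scipy.stats import pearsonr

def importance_agreement(cd_scores, logreg_coefs, vocab):
    cd = np.array([cd_scores[w] for w in vocab])
    lr = np.array([logreg_coefs[w] for w in vocab])
    r, p_value = pearsonr(cd, lr)  # a strongly positive r indicates good agreement
    return r, p_value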

Having verified CD's proficiency in producing reliable importance scores for single words, they tackled more challenging tasks involving compositionality. First, the methods' ability to identify dissenting subphrases was investigated. To be effective, an interpretation method should be able to assign significantly different scores to positive and negative dissenting subphrases within a phrase. For phrases of up to 5 words, such as "used to be my favourite", none of the existing methods could successfully identify dissenting subphrases. Unlike the other methods, CD could identify both positive and negative subphrases (such as "favourite" and "used to") in a single phrase. This pattern held for both the SST and Yelp datasets, illustrating that CD is the only method amongst those tested which can uncover how the LSTMs identify the sentiment of the subphrases underlying compositional phrase-level sentiment. The authors also considered higher-level compositionality. Specifically, they addressed instances where the sentiment of between one third and two thirds of the review differed from the LSTM's prediction. Of the methods tested, only CD could correctly identify the sources of dissent in the review.

Whilst these findings demonstrated that CD is an effective method for determining the contributions of phrases and subphrases to overall sentiment, this is not the method's only advantage. Murdoch et al. (2018) also show that CD may inform our understanding of how that contribution is computed by capturing instances of negation. They do this by searching for negation phrases within SST and then extracting the negation interaction from each phrase by computing the CD score for the entire phrase and subtracting the CD scores of the phrase being negated and of the negation term. By comparing positive and negative negations, they are able to show that CD assigns highly distinct scores to the two.
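Expressed as a sketch, with `cd_score` standing in for a hypothetical function that returns the CD contribution of a given token span, the extraction step amounts to:

# Illustrative sketch: the negation interaction is what remains of the phrase's
# CD score after removing the scores of its parts taken in isolation.
def negation_interaction(cd_score, phrase, negation_term, negated_part):
    return (cd_score(phrase)
            - cd_score(negated_part)
            - cd_score(negation_term))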

By comparison, leave one out's scores show a high degree of overlap, indicative of false negatives. In sum, Murdoch et al. (2018) show that CD can produce interpretable scores describing the degree to which particular input variables (such as words) or sequences of input variables (such as phrases) contribute to an LSTM's prediction. Furthermore, CD provides insight into how LSTMs compute the interactions of those input variables, reflecting the LSTMs' ability to solve tasks requiring compositionality. Notably, these insights are produced without any prior expectations or hypotheses to guide the analysis. As such, CD is a strong example of a data-driven method.

CD can also be applied to carry out analyses of a less exploratory and more targeted nature. Jumelet et al. (2019) illustrate this by describing a generalised version of CD (GCD), which they apply to investigate how LSTMs handle particular non-trivial language patterns. Rather than solely analysing the contribution of what is in focus using a decomposition based on a fixed set of interactions, they define several sets of alternative interactions which allow for particular targeted analyses. These sets are:

• IN: includes terms for the analysis of the interaction between the token in focus and tokens out of focus, whilst disregarding intermediary tokens. This is useful for analyses of how a model handles long-term dependencies between tokens.

• INTERCEPT: ignores input embeddings, allowing for an analysis of the model’s biases (which Jumelet et al. (2019) refer to as intercepts, to avoid confusion with prediction biases the model may exhibit).

• NO INTERCEPT: ignores interactions involving the gates, allowing for a comparative analysis showing the dependence of the input on the gate biases.
