
Master Thesis

Transparency for

Text Classification Models

by

Mathijs Pieters

s12369705

July 9, 2020

48EC September 2019 - July 2020

Supervisor:

Wilker Aziz

Assessor:

Jasmijn Bastings


Neural Networks (NNs), a popular class of machine learning models, are generally considered to be black boxes because their inner workings are difficult to interpret. When used for real-world applications, transparency of a model's predictions can increase trust and is therefore of great importance. This thesis focuses on the development of transparent models for text classification at various levels of the task. First, we propose a method that learns which input words are important for the classification task at hand. This is done by learning a sparsely gated word embedding matrix, where sparsity is promoted by learning a subset of the original vocabulary that is useful for classification. This method is generally applicable to NNs that use an embedding layer. We then extend this method by learning which words are important for predicting a specific class, using class-specific gated embedding matrices. Using numerous experiments, we demonstrate that these methods can learn task- and class-specific words that are important for classification.

Apart from gated embedding matrices that provide global transparency, we develop a method that allows for transparency of a single prediction. We use the Word Mover Distance (WMD), a transparent distance metric defined in word embedding space, and combine it with prototype selection techniques. Classification is done based on the similarity of the prototypes in the train and test documents, where prototypes capture the meaning of the original document. We adapt this model to learn a unique weighting of the prototypes per class using a Mixed-Membership Model (MMM), hereby creating transparency of the complete model while maintaining transparency of single predictions. We show that the proposed methods are easily amenable to inspection, and that the MMM learns topic words related to the different classes. The proposed techniques yield a substantial improvement in inference time compared to the WMD.


I would like to start with thanking Wilker for his supervision throughout this project. Without his continuous stream of ideas, great enthusiasm, and excellent feedback I would not have been able to write this thesis. Next, I would like to thank Jos van de Wolfshaar for our productive talks and extensive feedback, and Jasmijn Bastings for taking the time to assess my work. Additionally, I would like to thank MessageBird, and especially the data team, for giving me the opportunity to do this project. To my family and friends: without your support, and the distractions in my spare time, writing this thesis would not have been possible.

Mathijs Pieters
Utrecht, July 9, 2020


Abstract iii

Acknowledgements iv

Notation xi

List of Abbreviations xv

List of figures xviii

List of tables xix

List of algorithms xxi

1 Introduction 1

1.1 Scope of this thesis . . . 4

1.2 Outline . . . 4

2 Sparse Latent Vocabularies 7
2.1 Introduction . . . 7

2.2 Related Work . . . 8

2.3 Learning Latent Vocabularies . . . 10

2.3.1 Learning a Task-Specific Vocabulary . . . 12

2.3.2 Learning Class-Specific Vocabularies . . . 16

2.4 Datasets and Models . . . 17

2.5 Experiments . . . 18

2.5.1 Empirical Results . . . 18


2.5.2 Effect of Selection Rate . . . 20

2.5.3 Sentiment Words . . . 22

2.5.4 Topic Coherence . . . 22

2.5.5 Generalisability of Indicator Words . . . 23

2.5.6 Detecting Data Leakage . . . 25

2.5.7 Deterministic Gating . . . 25

2.5.8 Linear Probe - Logistic Regression with ARD . . . 26

2.5.9 MessageBird . . . 26

2.6 Conclusion . . . 28

3 Transparent Text Classification using Word Mover Distance and Prototype Selection 29
3.1 Introduction . . . 29

3.2 Related Work . . . 31

3.2.1 Measuring textual similarity . . . 31

3.2.2 Prototype Selection . . . 32

3.2.3 Word Mover Distance . . . 33

3.3 Method . . . 37

3.3.1 Prototype Selection . . . 37

3.3.2 Mixed Membership Model . . . 42

3.3.3 RWMD supervision . . . 47

3.4 Datasets and Models . . . 48

3.5 Experiments . . . 50

3.5.1 Classification Accuracy . . . 50

3.5.2 Inference Time . . . 50

3.5.3 Topic words MMM . . . 50

3.5.4 Visualisation Model Predictions . . . 55

3.6 Conclusion . . . 58

4 Conclusion 61
Bibliography 62
A Appendix 81
A.1 Algorithmic Description . . . 81

A.2 Architectures . . . 81

A.3 Extra Experiments . . . 83

A.4 Comparison to Deterministic Gating . . . 85

A.5 Latent Vocabulary Visualisation . . . 91


Below we introduce the notation used in this thesis. This includes the recurring symbols, operators, distributions and measures.

Sequences  With $x = \langle x_1, \ldots, x_L \rangle$ we denote a sequence of $L$ ordered items, with $x_i$ the $i$-th item in the sequence. The elements $x_i$ can be numbers, vectors or matrices. The length of the sequence $L$ is also denoted by $|x|$. The concatenation of elements $\langle x_i, \ldots, x_j \rangle$ is denoted by $x_{i:j}$.

Matrices and vectors  We reserve the use of boldface symbols or letters for matrices and vectors, with uppercase tokens for matrices (e.g. $\mathbf{X}$, $\boldsymbol{\Sigma}$) and lowercase for vectors (e.g. $\mathbf{x}$, $\boldsymbol{\mu}$). In this research all values are real numbers $\mathbb{R}$. A vector $\mathbf{x}$ of length $n$ is denoted by $\mathbf{x} \in \mathbb{R}^n$. Similarly, a matrix $\mathbf{X}$ with $m$ rows of length $n$ is denoted by $\mathbf{X} \in \mathbb{R}^{m \times n}$, with $X_{ij}$ the element on the $i$-th row and the $j$-th column. We denote the complete $i$-th row of $\mathbf{X}$ by $\mathbf{X}_i$.

Softmax  The softmax maps a vector in $\mathbb{R}^n$ to the probability simplex $\Delta^{n-1}$, with the components in the interval $(0, 1)$ and summing to 1. For a vector $z \in \mathbb{R}^n$ the $i$-th component is:

$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{n} \exp(z_j)} \qquad (1)$$

Sigmoid  The sigmoid function is a monotonically increasing function with a first derivative that is bell shaped. It is often applied to map a real number to $(0, 1)$. Many instantiations exist; in this research we use the logistic function:

$$\sigma(x) = \frac{1}{1 + \exp(-x)} \qquad (2)$$
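As a quick numerical illustration (an addition to this text, not part of the thesis), the following NumPy sketch evaluates Equations 1 and 2; subtracting the maximum inside the softmax is a standard stability trick and a choice of this sketch.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([2.0, -1.0, 0.5])
p = softmax(z)
print(p, p.sum())      # components in (0, 1), summing to 1
print(sigmoid(0.0))    # 0.5: the logistic function maps 0 to the middle of (0, 1)
```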


Euclidean norm  For a vector $x \in \mathbb{R}^n$ the Euclidean norm is:

$$\|x\|_2 = \sqrt{x_1^2 + \cdots + x_n^2}, \qquad (3)$$

also denoted as $\|x\|$. It measures the distance from $x$ to the origin. For two vectors $x, y \in \mathbb{R}^n$, $\|x - y\|_2$ measures the distance between the two points.

Gaussian distribution  The Gaussian distribution (or normal distribution) is a probability distribution with probability density function:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \qquad (4)$$

with $\sigma^2$ the variance and $\mu$ the mean of the distribution. We can generalise the univariate Gaussian distribution to higher dimensions (multivariate) as follows:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{\exp\left(-\tfrac{1}{2}(x - \mu)^{\mathsf{T}} \Sigma^{-1} (x - \mu)\right)}{\sqrt{(2\pi)^k |\Sigma|}}, \qquad (5)$$

where $\mu$ denotes the mean, $\Sigma$ the covariance, and $k$ is the dimension of $x$.1 The covariance matrix $\Sigma$ can be full, meaning that we allow for correlations between different elements; diagonal, meaning that elements have no correlations but each variate has its own variance; or spherical, where all elements share the same variance and there are no correlations between elements.

vMF distribution  The von Mises-Fisher (vMF) distribution is a probability distribution on the unit sphere2 in $\mathbb{R}^{n-1}$. The probability function is:

$$\mathrm{vMF}(x \mid \mu, \kappa) = C_p(\kappa) \exp\left(\kappa \mu^{\mathsf{T}} x\right) \qquad (6)$$

where $\kappa \geq 0$, $p$ is the dimensionality of $x$, and the normalisation constant $C_p(\kappa)$ is:

$$C_p(\kappa) = \frac{\kappa^{p/2 - 1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)} \qquad (7)$$

with $I_v$ the modified Bessel function of the first kind. Here $\mu$ denotes the mean direction and $\kappa$ the concentration parameter.

1 In the definition in Equation 5, $|\Sigma|$ denotes the determinant.

2 For an $n$-dimensional vector the distribution is in $\mathbb{R}^{n-1}$ because we have $n - 1$ free parameters.

Kullback-Leibler divergence  For two discrete probability distributions $P$ and $Q$ over the same space $\mathcal{X}$, the Kullback-Leibler (KL) divergence is:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right) \qquad (8)$$
$$= \sum_{x \in \mathcal{X}} P(x) \log P(x) - \sum_{x \in \mathcal{X}} P(x) \log Q(x) \qquad (9)$$
$$= \mathbb{E}_{x \sim P}[\log P(x)] - \mathbb{E}_{x \sim P}[\log Q(x)] \qquad (10)$$

For two continuous probability distributions $P$ and $Q$ the KL divergence is:

$$\mathrm{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right) dx \qquad (11)$$

where $p(x)$ and $q(x)$ denote the probability density functions of $P$ and $Q$. It is important to note that the measure is asymmetric: $\mathrm{KL}(Q \,\|\, P) \neq \mathrm{KL}(P \,\|\, Q)$.
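The following NumPy sketch (an illustration added here, not from the thesis) evaluates Equation 8 for two small discrete distributions and shows the asymmetry numerically.

```python
import numpy as np

# Two discrete distributions over the same three-element support.
P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.2, 0.5, 0.3])

def kl(p, q):
    # KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
    return np.sum(p * np.log(p / q))

print(kl(P, Q))   # approximately 0.40
print(kl(Q, P))   # a different value: the KL divergence is asymmetric
```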

Big O notation  In order to describe the time complexity of algorithms we use Big O notation. It describes the worst-case running time for an algorithm to finish. For an input of size3 $n$, $O(n^2)$ denotes that in the worst case the algorithm needs $c \times n^2$ running time to finish, where $c$ is a constant and does not depend on the input.

3 The precise definition of the size differs per problem instance; e.g. it can refer to the length of the input.


AI Artificial Intelligence. 1, 3
BOW Bag-Of-Words. 31, 32, 39
CNN Convolutional Neural Network. 12
DL Deep Learning. 1
ELBO evidence lower bound. 30, 45, 46, 48
EM expectation maximization. 41, 43, 92
EMD Earth Mover Distance. 31, 34, 35, 48
GMM Gaussian Mixture Model. 41-43, 46
k-NN k-Nearest Neighbors. 30, 31
KL Kullback-Leibler. xiii, 31, 42, 45
LDA Latent Dirichlet Allocation. xix, 22-24, 32
LSTM Long Short-Term Memory. 12, 18, 26
MCMC Markov chain Monte Carlo. 44
MLP Multi-Layer Perceptron. 12
MMM Mixed-Membership Model. iii, viii, ix, xvii, xix, 30, 32, 33, 37, 38, 43, 44, 46, 47, 49-52, 56-58, 61, 92, 93
Mo-vMF Mixtures of von Mises-Fisher. 41
MSE Mean Squared Error. 11, 21
nBOW normalized Bag-Of-Words. 33, 35, 36
NLP Natural Language Processing. 2-4, 12, 28, 31, 32, 34
NN Neural Network. iii, 1, 2, 8, 9, 12, 28, 29, 61
RWMD Relaxed Word Mover Distance. 36, 37, 42, 47, 48
TF-IDF Term-Frequency Inverse-Document-Frequency. 31, 32, 39
VI Variational Inference. 44, 45
vMF von Mises-Fisher. xii, 41, 42
WCD Word Centroid Distance. 35, 36

2.1 HardKumaraswamy distribution . . . 15

2.2 Accuracy vs selection rate on SST2 . . . 20

2.3 Accuracy vs selection rate on AG News . . . 21

2.4 MSE vs selection rate on BeerAdvocate . . . 21

2.5 Distribution over sentiment classes . . . 23

2.6 Class specific distribution over sentiment classes . . . 23

2.7 Classification results of simple model fitted on top word from LV-LSTM 24
2.8 Word Leakage experiment on SST2 . . . 25

2.9 Visualisation for MessageBird dataset . . . 27

3.1 Visualisation for the different WMD variants . . . 38

3.2 Mixture Model . . . 42

3.3 Mixed Membership Model . . . 44

3.4 Visualisation for WMDSKmeans classification . . . 55

3.5 Prediction using MMM . . . 57

3.6 Wrong prediction using MMM . . . 58

A.1 Accuracy versus selection rate for the SST2 dataset . . . 85

A.2 Accuracy versus selection rate for the AG News dataset . . . 85

A.3 Accuracy of simple models on AG News . . . 87

A.4 Accuracy of simple models on Beeradvocate . . . 87

A.5 Results from the deterministic gate for the SST2 dataset . . . 88

A.6 Results from the probabilistic gate for the SST2 dataset . . . 88

A.7 Results from the deterministic gate for the BeerAdvocate dataset . . . 89

A.8 Results from the probabilistic gate for the BeerAdvocate dataset . . . 89

A.9 Results from the deterministic gate for the AG News dataset . . . 90
A.10 Results from the probabilistic gate for the AG News dataset . . . 90
A.11 Latent vocabulary model visualisation 1 . . . 91
A.12 Latent vocabulary model visualisation 2 . . . 91


2.1 Dataset Statistics . . . 17

2.2 Selected words for three datasets using LV-LSTM . . . 19

2.3 Selected words for BeerAdvocate aspects using LV-LSTM . . . 19

2.4 Mean TC-NPMI for LV-LSTM and LDA . . . 24

2.5 Logistic Regression with ARD on SST2 . . . 26

2.6 MessageBird results . . . 27

3.1 Parameters of the Gaussian MMM . . . 44

3.2 Dataset Statistics . . . 49

3.3 Classification results for WMD and variants . . . 51

3.4 Run time for WMD and variants . . . 51

3.5 Train time for MMM . . . 52

3.6 Mixed Membership Model results - Twitter . . . 53

3.7 Mixed Membership Model results - BBC sport . . . 53

3.8 Mixed Membership Model results - Amazon . . . 54

3.9 Mixed Membership Model results - Classic . . . 54

A.1 Top-10 selected words using LV-MLP . . . 83

A.2 Top-10 selected words using LV-CNN . . . 84

A.3 Classification results for different selection rates on the three datasets 84
A.4 Topic words for SST2 and AG News from LV-2-model and LV-4-model 86
A.5 Topic words determined by LDA . . . 86

A.6 Mixed Membership Model results - Ohsumed . . . 93

A.7 Mixed Membership Model results - 20News . . . 94


3.1 K-means clustering . . . 39

3.2 Spherical k-means clustering . . . 40

3.3 Generative algorithm for Mixed Membership Model . . . 46

A.1 Latent Vocabulary Model . . . 82


Chapter 1

Introduction

Language is a system which “makes infinite use of finite means”

Wilhelm von Humboldt — 1836

It is often hypothesised that what sets humans apart from other animals is the ability to use language. Without spoken and written language it would be very difficult to communicate at the global scale as we do nowadays. Noam Chomsky, often considered the father of modern linguistics (Tymoczko and Henle, 2004), quotes Wilhelm von Humboldt's phrase "infinite use of finite means", by which he means that we can create an infinite number of new sentences from a finite set of words (Chomsky, 1965, Preface). Language is also far from static, since new words appear in our vocabulary on a day-to-day basis.1 That language is important for intelligence is indisputable.

In fact, Alan Turing already developed an intelligence test for machines, based solely on the use of natural language, in 1950 (Turing, 1950), called the Turing test. This test is performed in textual form and is based on the idea that when machine-generated responses are indistinguishable from those of humans, the machine must be intelligent.

70 years after the proposal of the Turing test it has become even more relevant. Machines have become increasingly good at domain-specific tasks, focusing on one specific task at hand. At some tasks, such as lip-reading (Assael et al., 2016) and playing video games (Mnih et al., 2013), machines even outperform humans. These advances can be attributed to some important developments in Artificial Intelligence (AI). Most state-of-the-art techniques are powered by deep Neural Networks (NNs), which were already hypothesised by Fukushima (1980), but were infeasible to optimise at the time. The introduction of the back-propagation algorithm (Rumelhart et al., 1986) made it possible to efficiently train NNs (LeCun et al., 1989). More recently, developments in Deep Learning (DL) allow for networks with a bigger modelling capacity, leading to many state-of-the-art results (LeCun et al., 2015).

1New words in April 2020:


These milestones also had their effect on the field of Natural Language Processing (NLP), where the interactions between machines and human language are studied. Although the start of NLP is often attributed back to the Turing test, developments of the last ten years have had a great effect on the current state of the art. Mikolov et al. (2013), for instance, introduced Word2vec, an algorithm that allows for large-scale unsupervised training of word embeddings and thus does not require human-annotated data. These word embeddings can capture many semantic relations, which can be used for other downstream tasks. The availability of word embeddings and the increasing computing power enabled the adoption of NNs in many NLP tasks.

The increasing performance of NLP systems has gone hand in hand with the number of real-world use cases. Performance is of great importance for systems that are used in production, since incorrect predictions can have considerable effects, especially when used for high-stakes decision making. However, prediction accuracy on its own is not enough. Liang et al. (2018) show that it is possible to create adversarial text samples that resemble the original text while being classified as any desirable class with high certainty. This makes it easy to bypass text classifiers, such as spam or fake news detectors. For systems that are used in production this exposes a lot of problems, since such systems cannot directly be trusted. To gain trust, it is important to understand the underlying process on which the decision making is based. In addition, since the introduction of the General Data Protection Regulation (GDPR) in May 2018 it is also required by law that "meaningful explanations of the logic involved" are provided when automated decision making is used (Selbst and Powles, 2017). Furthermore, transparency is important for preventing discrimination (Hardt et al., 2016). For black-box models it is difficult to detect biases towards a particular group. In order to prevent biases, research focus has gone to creating balanced datasets where different groups are well represented (Atwood et al., 2020) (e.g. men and women, local minorities). However, without transparency models can still learn undetectable biases.

Interpretability of models is thus of great importance, and research interest has picked up on this trend. There is no clear consensus about the definition of the term interpretability; interpretability and explainability are even difficult to define in the field of psychology (Keil, 2006). There is also no clear consensus about what precisely is considered to be interpretable machine learning. Lipton (2018) provides an extensive discussion of the desiderata and methods related to model interpretability. Within this thesis we will not focus on this debate. The purpose is to develop systems that provide insights at some level of the task and augment models to help humans criticise the model and its predictions. To what extent these methods are truly considered interpretable is a discussion left to the reader. Nonetheless, we will discuss the pros and cons of our methods, and will also do this in relation to some methods that are generally considered to be interpretable.


Hereby we do not imply that our methods are also interpretable. To prevent confusion, we will often resort to the term transparency in this thesis.

In order to gain insights into the model, we can consider many different types of methods. First, one can differentiate between intrinsic and post-hoc methods (Molnar, 2019). Intrinsic methods are designed to generate insights themselves (e.g. decision trees and simple linear models). Post-hoc methods make use of a second model to create insights for the model of interest. These post-hoc methods can be model-agnostic and generally applicable to machine learning models, or model-specific methods specialised for a certain model (Du et al., 2019). Additionally, we can differentiate between local and global methods. Local methods provide relations between a specific input instance and the prediction of the model, whereas global methods give insights into the overall working of the model, either at the level of the model or of the dataset. Global methods can increase transparency by showing the interplay between different parts of the model. Using local methods we obtain information about the causal relationship between an input instance and the corresponding prediction. Note that global methods can still provide insights into the prediction for a particular input, but this insight is not generated specifically for that input. For both local and global explanations it is important that they accurately portray the underlying decision-making process behind the predictions of the model, also referred to as the faithfulness of the explanation (Jacovi and Goldberg, 2020).

This work is performed in collaboration with MessageBird,2 a Communications Platform-as-a-Service (CPaaS) company solving the omni-channel communication problem. With over 15,000 customers from all over the world, they process millions of messages per day. MessageBird provides several automated machine learning pipelines to their customers, for example to detect spam messages. For the customers to gain trust in the systems provided by MessageBird, the classification accuracy should be high. Additionally, the users can increasingly trust the systems when model transparency is provided. For MessageBird, explainable AI presents an opportunity to improve their own understanding, as well as their customers' understanding, of the provided NLP systems. These systems include spam blocking, automated FAQ answering for building simple chatbots, and named entity recognition. By better understanding their algorithms, it becomes easier to pinpoint problems in data quality, skewness, bias and other aspects.


1.1 Scope of this thesis

In this thesis we focus on the design of transparent NLP models. More specifically, we concentrate on intrinsically transparent text classification models.

The objectives of this thesis are as follows:

• Establish a globally transparent model for detecting feature importance that is generally applicable to neural networks. Global transparency provides the user with easy-to-inspect results. With the rapid development of new techniques (e.g. dropout, batch-norm and novel activation functions), we aim for a technique that is suitable for various neural networks.

• Develop a method for both local and global transparent predictions that are faithful to the model. Faithfulness is required to truly trust the insights into the model's predictions. We aim to develop a method that is transparent at both a local and a global level, hereby allowing the user to gain insights into specific predictions as well as the overall workings of the model.

These objectives are verified using datasets from academia. Additionally, a dataset provided by MessageBird is used. This enables verification of the results on a real-world problem and illustrates to what extent the used methods are applicable to problems outside academia.

1.2 Outline

First, in Chapter 2, we introduce the use of sparse latent vocabularies for determining feature importance. We start with a brief introduction in Section 2.1, followed by the related work in Section 2.2. In Section 2.3 we introduce the methods used for learning both task-specific and class-specific vocabularies. Then, in Section 2.4, we describe the datasets and models employed in our experiments in Section 2.5. In that section we illustrate the pros and cons of our method using numerous experiments. We conclude this chapter in Section 2.6.

In Chapter 3 we propose the use of prototype selection techniques for a transparent text classification method. We extend this method to classify based on the nearest class instead of the nearest document, providing global transparency as well as a significantly better run time. In Section 3.1 we give an introduction to Chapter 3, followed by the related work in Section 3.2. We introduce the methods used in Section 3.3. The datasets and models used are discussed in Section 3.4. In Section 3.5 we present the experimental results, and in Section 3.6 this chapter is concluded.


Finally, in Chapter 4 we present an overall conclusion of all three chapters. For additional results and information we refer to the Appendix.


This chapter partly overlaps with a paper under review at the time of writing, namely: Mathijs Pieters, Wilker Aziz, "Sparse Latent Vocabularies for Feature Importance in Text Classification".

Chapter 2

Sparse Latent Vocabularies

Better than a thousand useless words is one useful word, hearing which one attains peace

Buddha

Abstract

The increasing use of machine learning models in real-world applications has created a high demand for methods that give insights into their inner workings. In this chapter we focus on determining feature importance on a global level. To this end, we propose a latent vocabulary model that learns a sparsely gated version of an encoder's embedding matrix. With a penalty for the expected size of the vocabulary we promote a compact and useful vocabulary for classification. This model is then extended by introducing multiple latent vocabularies, each learning a class-specific distribution. We show that our technique allows for learning specific words that are important for classification. Detecting these task- and class-specific words allows for data-driven annotation and the detection of artifacts in the training data.

2.1 Introduction

In this chapter we focus on the explanations of neural text classifiers and aim to provide insights on features that are useful for a certain task on a global level. To this end we propose to learn a subset of the vocabulary of the input data which, though compact, is effective for classification. This can be seen as a form of a latent gating mechanism that learns to retain a subset of the rows in an embedding matrix and sets the remaining rows to zero. Our work is inspired by research on inducing latent rationales (Lei et al., 2016), where a rationale is seen as a compact and sufficient subset of the input text that can be used to justify a prediction (Zaidan et al., 2007).


A rationale is compact in that it reduces the available predictors to a short and easy-to-skim excerpt of the input, and sufficient in that using it instead of the original input for prediction does not hurt the classifier. While we retain some of their main ideas, namely, to learn a sparse latent selector and to employ a compression penalty, we deviate from previous work in that our justification is global to the entire classifier, as opposed to being specific to each sentence. We thereby make the model more transparent as a whole, rather than making each prediction independently more transparent. We learn to mask entire rows of an embedding matrix using binary random variables. In particular, we employ a sparse and differentiable relaxation to binary variables based on the HardKumaraswamy distribution (Bastings et al., 2019). To promote compact vocabularies we employ $L_0$ regularisation (Lei et al., 2016), and in particular a differentiable relaxation of it (Louizos et al., 2018). Our approach is widely applicable because it depends only on the use of a differentiable NN that uses a word embedding layer.

The interest of this chapter is to develop a method that learns task- or class-specific words important for classification. The contributions of this chapter are as follows:

1. we introduce a general approach to learning to classify using only a subset of the known vocabulary, made of words that are overall important for the task;
2. we show this approach extends easily to learning multiple latent vocabularies, where each specialises to representing words relevant to a certain class.

We show the effectiveness of the proposed method in four classification benchmarks.

First, we lay out related work in Section 2.2. Then, in Section 2.3, we explain the methods used. In Section 2.4 the details of the datasets and models are explained, followed by the experiments in Section 2.5. Finally, in Section 2.6 we conclude this chapter.

2.2 Related Work

There is a large body of literature on explainable AI. Nevertheless, evaluating explanations remains challenging, and even the definition of this term is subject to a lot of debate (Jacovi and Goldberg, 2020). We survey the aspects we see as most directly relevant to this chapter, namely determining feature importance.

Explaining trained models. LIME (Ribeiro et al., 2016) is a model-agnostic technique able to provide explanations of complex models by fitting an interpretable model on the local behaviour of the complex model, hence yielding local explanations. In subsequent work, the same authors extend this method to generate if-then rules (or anchors) that give more insight into the interplay between features (Ribeiro et al., 2018).


A related methodology involves explaining a model by understanding what generalisations it is capable of. This involves training diagnostic classifiers or linear probes (Hupkes and Zuidema, 2018; Alain and Bengio, 2016) to detect hidden states that are predictive of a certain generalisation of interest, such as specific linguistic capabilities (e.g. number agreement between subject and verb) (Giulianelli et al., 2018; Ravfogel et al., 2019; Kim et al., 2019) or even algorithmic strategies (Veldhoen et al., 2016; Hupkes and Zuidema, 2017). Li et al. (2016) determine feature importance for a trained model by erasing various parts of the representation (such as word vectors, intermediate hidden units, or input tokens). By observing the model prediction after erasing specific features, insights about the model's decisions and errors are provided. De Cao et al. (2020) propose a similar method, where the contribution of input tokens to the model prediction is determined by learning a model that predicts which tokens to erase.

Interpreting model internals. Abnar et al. (2019) propose a methodology based on analysing the stability of internal representations. By varying single model parameters they can systematically investigate the internal representation spaces. Relatedly, Voita et al. (2019a) use the information bottleneck to analyse how different objectives bias internal representations to quickly adapt to the learning signal or stay closer to the input. In this analysis the information bottleneck is interpreted as the information (activations in the NN) flowing through the network while retaining information regarding the predicted label. Attention mechanisms (Bahdanau et al., 2014), first developed and popularised in a machine translation context, provide easy-to-inspect heat maps that have been hypothesised to correlate with the importance of certain predictors (such as input positions) towards a downstream task (Choi et al., 2016; Mascharka et al., 2018). Recent work has identified shortcomings of directly interpreting softmax-style attention coefficients (Jain and Wallace, 2019; Serrano and Smith, 2019). The issue is certainly not settled, though, as Wiegreffe and Pinter (2019) demonstrate that such coefficients can be regarded as explanations, depending on the precise definition.

Built-in methods. Another direction is to learn models that are more amenable to inspection by design. Strategies include supervising attention mechanisms with human-annotated rationales (Zhang et al., 2016),1 supervised and unsupervised structured hidden representations (Bowman et al., 2016; Yogatama et al., 2016; Choi et al., 2018; Niculae et al., 2018; Peng et al., 2018), and more generally promoting sparsity in various aspects of the model (e.g. weights or output). First, sparse models are typically more compact, and thus potentially more amenable to inspection. Additionally, sparsity penalties tend to force models to give up on spurious correlations and concentrate on important predictors. Several methods exist that promote sparsity in various aspects of the model. Sparse weights can be enforced via sparsity-inducing regularisation (Tibshirani, 1996; Scardapane et al., 2017; Louizos et al., 2018). Sparse outputs as well as sparse hidden layers can be promoted via differentiable sparse projections to the probability simplex (Martins and Astudillo, 2016).2 This can for example be used to make attention mechanisms sparse (Niculae and Blondel, 2017). Other research focusses on making word embeddings (Luo et al., 2015) or sentence embeddings (Trifonov et al., 2018) themselves interpretable by enforcing sparsity. Stochasticity can also be used to induce sparse activations with the purpose of making sure certain inputs are certainly not available to the task (Lei et al., 2016; Bastings et al., 2019). This allows for local interpretability of the classification task. In similar work, Paranjape et al. (2020) use an information bottleneck objective to determine which parts of the input can be omitted without significantly affecting the task performance of a Transformer model. Voita et al. (2019b) use a sparse gate to detect and delete less interpretable attention heads in a large Transformer.

1 Text classifiers based on learning from supervised rationales were introduced by Zaidan et al. (2007).

Detecting feature importance is difficult in neural networks (Olden et al., 2004; Qiu et al., 2018). The approach we propose provides insights into what features are important for text classification, with applications such as criticising the model and learning more about the data. For example, it can help spot annotation artifacts (Gururangan et al., 2018) and, more generally, other forms of data leakage (Kaufman et al., 2011).

2.3 Learning Latent Vocabularies

We aim to learn a compact subset of the vocabulary that is particularly useful for a certain text classification task. Usefulness is measured intrinsically using the model's utility function, and extrinsically in terms of classification performance. Compactness is expressed in terms of effective vocabulary size. We formalise our approach as a latent variable model where global latent variables restrict the set of words available to the classifier's encoder. The variables are considered to be global as they are model-specific. Parameter estimation is based on a variational lower bound on the log-likelihood of the observations, augmented with a complexity penalty based on the expected size of the selected vocabulary.

2 The standard softmax function is not considered sparse because all components are strictly positive.


Text classification. In neural text classification with closed vocabularies, an embedding layer projects tokens in a vocabulary $V$ to vectors in $\mathbb{R}^d$ using an embedding matrix $E$, a global collection of $|V| \times d$ trainable parameters of the classifier. In recent literature it is standard to use pretrained embeddings such as Word2vec (Mikolov et al., 2013) or Glove (Pennington et al., 2014), and during training we can either fix or fine-tune this layer. Fixing the embedding layer limits the capacity to learn dataset- or task-specific embeddings, but allows for faster training. For some tasks pretrained embeddings can be unavailable, for example when languages with low-resource data are considered (Ragni et al., 2014). Therefore we focus on methods that work for both fixed and fine-tuned embeddings, as well as for pretrained and end-to-end trained embeddings.

For an input $x = \langle x_1, \ldots, x_L \rangle$, a sequence of $L$ token identifiers, where $x_i$ identifies the $i$-th token in the sequence, a neural (probabilistic) text classifier is a simple statistical model:

$$Y \mid \theta, x \sim \mathrm{Cat}(\mathrm{softmax}(\eta)) \qquad \eta = f(\mathrm{emb}(x; E); \theta_f). \qquad (2.1)$$

Here $\mathrm{emb}(x; E)$ maps the input sequence $x$ to a sequence $\langle E_{x_1}, \ldots, E_{x_L} \rangle$ of embeddings, and a function $f$ with trainable parameters $\theta_f$ maps from that to the natural parameter $\eta \in \mathbb{R}^{|\mathcal{Y}|}$ of a Categorical distribution over $|\mathcal{Y}|$ target labels. Given some observations $\mathcal{D} = \{(x^{(k)}, y^{(k)})\}_{k=1}^{N}$, the parameters $\theta = \{E\} \cup \theta_f$ of the model are estimated to attain an optimum of the log-likelihood function

$$\mathcal{L}_{\mathcal{D}}(\theta) = \sum_{k=1}^{N} \log p(y^{(k)} \mid x^{(k)}, \theta) \qquad (2.2)$$

via stochastic gradient-based optimisation (Bottou and Cun, 2004). For a regression task, where the target labels are continuous values, we optimise the parameters using stochastic gradient descent to minimise the Mean Squared Error (MSE) between the target and predicted value:

$$\mathcal{L}_{\mathcal{D}}(\theta) = \frac{1}{N} \sum_{k=1}^{N} \left( y^{(k)} - f(\mathrm{emb}(x^{(k)}; E); \theta_f) \right)^2. \qquad (2.3)$$
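To make Equations 2.1 and 2.2 concrete, here is a minimal PyTorch sketch of such a classifier, assuming mean-over-time pooling and a single hidden layer; the class name, sizes and hyperparameters are illustrative and not those used in the thesis.

```python
import torch
import torch.nn as nn

class BaselineClassifier(nn.Module):
    """Eq. 2.1: eta = f(emb(x; E); theta_f), with mean-over-time pooling."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # the embedding matrix E
        self.f = nn.Sequential(                           # the function f
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        e = self.emb(x)            # [batch, length, emb_dim]
        pooled = e.mean(dim=1)     # mean-over-time pooling
        return self.f(pooled)      # natural parameters eta (logits)

# Maximising Eq. 2.2 corresponds to minimising the cross-entropy loss:
model = BaselineClassifier(vocab_size=10000, emb_dim=100, hidden_dim=64, num_classes=2)
x = torch.randint(0, 10000, (8, 20))   # a toy batch of 8 documents of length 20
y = torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
```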

In reality, the set of observations $\mathcal{D}$ can be too big to calculate the loss function for all observations at once. Therefore we resort to using stochastic gradient descent with mini-batches, where we calculate the loss function from Equations 2.2 or 2.3 for a subset of size $B$ before performing a gradient descent step. With an appropriate learning rate schedule this is guaranteed to converge to a local optimum of the objective function (Robbins and Monro, 1951).

The function $f$ in Equation 2.1 can be any differentiable function that maps a collection of word embeddings to a prediction over targets. A common approach is to use a Multi-Layer Perceptron (MLP), which consists of several fully-connected layers with a hidden activation function. Note here that the sequence of word embeddings can have a varying length (unless we truncate the documents to a pre-specified length). In order to use an NN with a fixed input size, we first need to aggregate the word embeddings (e.g. using mean-over-time pooling). The major downside of this method is that the temporal aspect of the data is ignored: the order of words has no effect on the prediction. In order to take the local structure of the sequence into account we can use a Convolutional Neural Network (CNN). Originally, CNNs were applied to computer vision tasks where local structure (correlation between neighboring pixels) is important. With the success of word embeddings, this local structure could also be exploited in the NLP domain (Kim, 2014). The CNN takes local structure into account but does not use all temporal information: using convolutional filters, word order is captured locally, but the temporal information that is captured is limited by the size of these filters. Using a recurrent neural network we can process inputs of varying length and also exploit temporal information covering arbitrarily large spans.3 Many types of recurrent neural networks have been developed. In this research we focus on the Long Short-Term Memory (LSTM) network, a recurrent neural network that excels at modelling long-term dependencies (Hochreiter and Schmidhuber, 1997). Although recurrent neural networks have proven to be very effective at tasks where the order of the data is important, due to their recurrent nature the data has to be processed sequentially. More recently, Transformers have been introduced, which allow for parallel processing of the data during training (Vaswani et al., 2017). Transformers make use of a self-attention mechanism, where different elements in the input sequence are related to each other, hereby creating a representation of this sequence. The introduction of Transformers led to many new models, such as BERT (Devlin et al., 2019) and Reformer (Kitaev et al., 2020), obtaining state-of-the-art results on various NLP tasks.

3 In theory sequences of arbitrary length can be modelled; in practice this turns out to be difficult.

2.3.1 Learning a Task-Specific Vocabulary

The text classification procedure as discussed in the previous section forms the basis for many NNs. We propose to extend this neural text classifier with a collection of latent variables $z = \{z_1, \ldots, z_{|V|}\}$, each $z_v \in \{0, 1\}$ a binary gate used to regulate access to the corresponding token embedding. The proposed model parameterises $|V|$ stochastic gates

$$Z_v \mid \phi \sim \mathrm{Bern}(\sigma(\phi_v)) \qquad v = 1, \ldots, |V| \qquad (2.4a)$$

using a trainable parameter $\phi \in \mathbb{R}^{|V|}$, mapped to $|V|$ independent probability values via sigmoid, and then computes a distribution over targets

$$Y \mid \theta, x, z \sim \mathrm{Cat}(\mathrm{softmax}(\eta)) \qquad \eta = f(\mathrm{emb}(x; z \odot E); \theta_f) \qquad (2.4b)$$

using a masked embedding matrix $z \odot E$, where we use $\odot$ to denote this masking operation (i.e. a row-wise scalar product).
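A minimal PyTorch sketch of the row-wise masking $z \odot E$ in Equation 2.4b. For simplicity it uses the Bernoulli probabilities $\sigma(\phi)$ directly as deterministic gates rather than sampling; the thesis instead samples sparse relaxed gates (described below). All names are illustrative.

```python
import torch
import torch.nn as nn

class GatedEmbedding(nn.Module):
    """Row-wise gating of an embedding matrix, emb(x; z * E) as in Eq. 2.4b.

    Illustration only: sigmoid(phi) is used as a deterministic gate; the thesis
    samples sparse HardKumaraswamy gates instead (Section 2.3.1).
    """

    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # the matrix E
        self.phi = nn.Parameter(torch.zeros(vocab_size))    # one gate logit per word

    def forward(self, x):
        z = torch.sigmoid(self.phi)                         # [vocab_size], one gate per row
        gated = z.unsqueeze(1) * self.emb.weight            # z (row-wise) times E
        return nn.functional.embedding(x, gated)            # look up the masked rows

layer = GatedEmbedding(vocab_size=10000, emb_dim=100)
out = layer(torch.randint(0, 10000, (8, 20)))               # [8, 20, 100]
```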

Training this model requires computing the marginal log-likelihood function

$$\mathcal{L}_{\mathcal{D}}(\theta, \phi) = \log \sum_{z \in \{0,1\}^{|V|}} p(z \mid \phi) \prod_{k=1}^{N} p(y^{(k)} \mid x^{(k)}, z, \theta), \qquad (2.5)$$

which is clearly intractable, since $z$ can take any of $2^{|V|}$ assignments, each specifying a vocabulary for the task. Instead of optimising the likelihood function directly, we optimise a lower bound obtained via Jensen's inequality (Jordan et al., 1999)

$$\mathcal{L}_{\mathcal{D}}(\theta, \phi) \geq \mathbb{E}\left[ \sum_{k=1}^{N} \log p(y^{(k)} \mid x^{(k)}, z, \theta) \right], \qquad (2.6)$$

where the expectation is expressed with respect to the parameterised prior $p(z \mid \phi)$.4

4 A reader familiar with variational autoencoders (VAEs; Kingma and Welling, 2014; Rezende et al., 2014) may think of assuming $p(z \mid \phi) = q(z \mid x)$ as unreasonably simple. We thus remark that unlike the typical VAE, the prior $p(z \mid \phi)$ here is not fixed, but rather estimated along with the rest of the classifier.

Without a fixed prior, we cannot directly express our preference for small vocabularies, a shortcoming we remedy via a weighted penalty

$$-\lambda \sum_{v=1}^{|V|} \underbrace{P(Z_v \notin \{0\} \mid \phi)}_{= \sigma(\phi_v)} \qquad (2.7)$$

against selecting with high probability. Here, $\lambda$ denotes the weighting of the penalty and the sum calculates the effective size of the vocabulary for parameters $\phi$. First, note that the probability of keeping the $v$-th entry of the vocabulary is precisely the $v$-th Bernoulli parameter $\sigma(\phi_v)$, thus easy to compute and clearly differentiable. Second, note this can be seen as an approximation to $L_0$ regularisation, where $L_0$ is computed in expectation (Louizos et al., 2018).5

5 Note that because of the independence assumption in Equation (2.4a), the penalty factorises over the vocabulary.
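Under the Bernoulli parameterisation, the penalty of Equation 2.7 is simply the sum of the keep probabilities. A short sketch (illustrative names, and an illustrative value of $\lambda$):

```python
import torch

def expected_vocabulary_size(phi):
    # Eq. 2.7: sum_v P(Z_v != 0 | phi) = sum_v sigmoid(phi_v),
    # i.e. the expected number of words kept in the vocabulary.
    return torch.sigmoid(phi).sum()

phi = torch.zeros(10000, requires_grad=True)
penalty = 0.01 * expected_vocabulary_size(phi)   # weight lambda = 0.01 (illustrative)
penalty.backward()                               # differentiable with respect to phi
```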

We cannot compute the objective, nor its gradients with respect to $\theta$ and $\phi$, in closed form, since the expectation in (2.6) remains intractable. Nonetheless, from the point of view of optimisation, it is sufficient to follow unbiased gradient estimates obtained via Monte Carlo sampling (Hoffman et al., 2013; Robbins and Monro, 1951). The challenge in this case is that sampling from a Bernoulli distribution is a non-differentiable operation, which prevents us from estimating $\phi$. Learning models with discrete latent variables calls for application of the score function estimator (Rubinstein and Kreimer, 1983), a.k.a. REINFORCE (Williams, 1992), a rather noisy estimator that typically requires variance reduction techniques based on control variates (Greensmith et al., 2002). To circumvent the need for REINFORCE and control variates, we follow Bastings et al. (2019) and employ their sparse relaxation to Bernoulli random variables. Effectively, we replace Bernoulli-distributed gates by sparse HardKumaraswamy-distributed gates. Unlike dense gates, sparse gates evaluate to exactly 0 or exactly 1 with non-zero probability. Unlike binary gates, sampling a sparse gate is a differentiable function of the parameters of the distribution. Finally, unlike other sparse relaxations (Maddison et al., 2017; Jang et al., 2017) based on the straight-through estimator (Bengio et al., 2013), gradient estimates are unbiased, a formal requirement of stochastic gradient-based optimisation.

HardKumaraswamy. The Kumaraswamy distribution (Kumaraswamy, 1980) is a two-parameter univariate distribution with support on the open interval $(0, 1)$. Drawing Kumaraswamy samples can be done by mapping a probability, uniformly sampled between 0 and 1, through the distribution's known and differentiable inverse cumulative distribution function (cdf). This technique, also known as a differentiable reparameterisation (Kingma and Welling, 2014), has been used to train VAEs using Beta priors and Kumaraswamy approximate posteriors (Nalisnick and Smyth, 2017). Bastings et al. (2019) stretch a base Kumaraswamy to an interval slightly bigger than $(0, 1)$ and then map negative samples to 0 and samples beyond 1 to 1, thus creating a mixture of two point masses (one at 0 and one at 1) and a continuous density in between. Sampling is then differentiable and produces sparse outcomes (i.e. 0 or 1) with non-zero probability. From a practical perspective, a HardKumaraswamy layer can be thought of as a stochastic sigmoid layer that sometimes produces sparse outcomes. In Figure 2.1 we show the probability density function for different parameterisations of the distribution. Our model specification changes only slightly; in particular, we replace a product of $|V|$ independent Bernoulli distributions by a product of $|V|$ independent HardKumaraswamy distributions:

$$Z_v \mid \phi \sim \mathrm{HKuma}(\alpha_v, \beta_v) \qquad [\alpha, \beta] = \mathrm{softplus}(\phi) \qquad (2.8)$$

Parameterising this component takes two strictly positive parameters per word in the vocabulary, which we achieve via a softplus around our unconstrained trainable parameters $\phi \in \mathbb{R}^{2|V|}$. The probability mass in Equation 2.7 corresponds to $(1 - (1/11)^{\alpha_v})^{\beta_v}$ (Bastings et al., 2019, Appendix A) and is necessary to determine the effective vocabulary size.

Figure 2.1: Probability density function for the HardKumaraswamy distribution for various $\alpha$ and $\beta$ values. For every instance we stretch the distribution to the interval $(-0.1, 1.1)$ before rectifying. The values $\alpha$ and $\beta$ control the level of sparsity; the bars indicate the probability mass at either 0 or 1.
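A sketch of the reparameterised sampler described above, following the stretch-and-rectify construction of Bastings et al. (2019): uniform noise is pushed through the Kumaraswamy inverse CDF, stretched to an interval slightly larger than $(0, 1)$, and rectified. The clamping constants and names are assumptions of this sketch.

```python
import torch

def sample_hardkuma(alpha, beta, l=-0.1, r=1.1):
    """Sample stretched-and-rectified Kumaraswamy ("HardKuma") gates.

    Differentiable in alpha and beta via the inverse-CDF reparameterisation;
    outcomes are exactly 0 or exactly 1 with non-zero probability.
    """
    u = torch.rand_like(alpha).clamp(1e-6, 1 - 1e-6)           # uniform noise
    k = (1.0 - (1.0 - u) ** (1.0 / beta)) ** (1.0 / alpha)     # Kumaraswamy inverse CDF
    t = l + (r - l) * k                                        # stretch to (l, r)
    return t.clamp(0.0, 1.0)                                   # rectify: point masses at 0 and 1

phi = torch.zeros(10000, 2, requires_grad=True)
alpha, beta = torch.nn.functional.softplus(phi).unbind(dim=-1)  # as in Eq. 2.8
z = sample_hardkuma(alpha, beta)                                # one sparse gate per word
print((z == 0).float().mean(), (z == 1).float().mean())         # non-zero mass at both ends
```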

On Lagrangian Relaxation. The proposed penalty in Equation (2.7) can be seen as a Lagrangian relaxation of a constraint on the size of the active vocabulary, where we would like this size to be below a certain pre-specified value. Bastings et al. (2019) use this view to justify gradient updates to the weight $\lambda$ of the penalty. We experimented with their approach and found it unstable for two out of four of our datasets. A simpler scheme, based on only including the penalty in batches where the constraint is violated and having $\lambda$ as a fixed hyperparameter, worked well throughout. For a selection rate $R$ and a weight $\lambda$, we penalise the objective by

$$-\lambda \max\left(R,\; \frac{1}{|V|} \sum_{v=1}^{|V|} P(Z_v \notin \{0\} \mid \phi)\right), \qquad (2.9)$$

where we scale the total probability by $1/|V|$, so that we can interpret $R$ as a proportion.
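A sketch of the penalty in Equation 2.9 written as a loss term to be minimised (hence with a positive sign); $R$, $\lambda$ and the variable names are illustrative.

```python
import torch

def selection_penalty(p_keep, R, lam):
    # Eq. 2.9: lam * max(R, mean_v P(Z_v != 0 | phi)). Below the selection rate R the
    # term is constant, so it only produces a gradient when the constraint is violated.
    expected_proportion = p_keep.mean()
    return lam * torch.clamp(expected_proportion, min=R)

p_keep = torch.sigmoid(torch.zeros(10000, requires_grad=True))  # illustrative keep probabilities
loss_term = selection_penalty(p_keep, R=0.05, lam=0.01)
```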

2.3.2 Learning Class-Specific Vocabularies

So far we assumed that for each task there is a single vocabulary, shared by different target classes, that is compact and sufficient. In reality, words that are relevant for detecting one class might be irrelevant for another class. Even more so, depending on the intended reading of a document, that is, depending on the potential target class, a word might exhibit a different connotation.

We operationalise this intuition by introducing $|\mathcal{Y}|$ latent variables of the kind developed in the previous section, each associated with one particular target $c \in \mathcal{Y}$. That is, now $z = \{z^{(1)}, \ldots, z^{(|\mathcal{Y}|)}\}$ and $z^{(c)} = \{z^{(c)}_1, \ldots, z^{(c)}_{|V|}\}$ with $z^{(c)}_v \in \{0, 1\}$ for $c \in \{1, \ldots, |\mathcal{Y}|\}$ and $v \in \{1, \ldots, |V|\}$. Likewise, we have $|\mathcal{Y}|$ embedding matrices,6 each associated with one particular target $c$ and denoted by $E^{(c)}$. The model specification changes as expected, that is, we parameterise $|\mathcal{Y}| \times |V|$ latent selectors

$$Z^{(c)}_v \mid \phi \sim \mathrm{HKuma}(\alpha_v, \beta_v) \qquad [\alpha, \beta] = \mathrm{softplus}(\phi^{(c)}) \qquad (2.10)$$

and a Categorical distribution over targets

$$Y \mid \theta, x, z \sim \mathrm{Cat}(\mathrm{softmax}(\eta)) \qquad \eta_c = f(\mathrm{emb}(x; z^{(c)} \odot E^{(c)}); \theta_f), \qquad (2.11)$$

though this time $f$ computes one component of the natural parameter at a time depending on its inputs.7 Note that we must encode the text as many times as there are targets, since we have target-specific embedding layers.8 The trainable parameters are $\phi = \{\phi^{(1)}, \ldots, \phi^{(|\mathcal{Y}|)}\}$ and $\theta = \{E^{(1)}, \ldots, E^{(|\mathcal{Y}|)}\} \cup \theta_f$.

6 Preliminary experiments gave better results when the matrices were not shared between classes.

7 This corresponds to replacing a softmax output layer by a logistic regression layer.
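A minimal sketch of Equations 2.10 and 2.11 with one gated embedding matrix per class; as before, $\sigma(\phi^{(c)})$ stands in for the sampled HardKumaraswamy gates, and the mean-pooling encoder is an illustrative simplification.

```python
import torch
import torch.nn as nn

class ClassSpecificGatedClassifier(nn.Module):
    """One gated embedding matrix per target class (Eqs. 2.10-2.11).

    Each class score eta_c is computed by encoding the text with that class's
    masked embeddings; sigmoid(phi_c) is a deterministic stand-in for HardKuma gates.
    """

    def __init__(self, vocab_size, emb_dim, num_classes):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(vocab_size, emb_dim) for _ in range(num_classes)])
        self.phis = nn.ParameterList([nn.Parameter(torch.zeros(vocab_size)) for _ in range(num_classes)])
        self.f = nn.Linear(emb_dim, 1)   # computes one component of eta at a time

    def forward(self, x):
        scores = []
        for emb, phi in zip(self.embs, self.phis):
            gated = torch.sigmoid(phi).unsqueeze(1) * emb.weight   # z^(c) applied row-wise to E^(c)
            h = nn.functional.embedding(x, gated).mean(dim=1)      # encode the text once per class
            scores.append(self.f(h))                               # eta_c
        return torch.cat(scores, dim=-1)                           # [batch, num_classes]

model = ClassSpecificGatedClassifier(vocab_size=10000, emb_dim=100, num_classes=4)
logits = model(torch.randint(0, 10000, (8, 20)))                   # [8, 4]
```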


Dataset        # train   # test   # dev   |V|              length
SST2           9,611     1,821    872     18045 (13320)    20.8
AG News        120k      7.6k     -       90182 (35472)    47.7
BeerAdvocate   220k      31k      -       115786 (57577)   160.6
MessageBird    7,112     1,779    -       16087 (3953)     26.9

Table 2.1: Dataset statistics for the four datasets. We show the number of train/test/dev documents, the vocabulary size in combination with (in parentheses) the number of words that have a corresponding Glove embedding, and the average document length in the training data.

2.4 Datasets and Models

Datasets. In our experiments we use four datasets. The first dataset is the Stanford Sentiment Treebank (SST) (Socher et al., 2013), containing 11,855 movie reviews and their corresponding sentiment. We remove the neutral reviews and use binary labels; this dataset is referred to as SST2. Secondly, we use the AG News dataset (Zhang et al., 2015), consisting of news articles of the genres World, Sports, Business, and Sci/Tech, where every document belongs to precisely one of the four classes. The third dataset we use is the BeerAdvocate dataset (McAuley et al., 2012) with 220,000 beer reviews, where each review has four different aspects (taste, aroma, palate, appearance) labelled on a 0-5 star scale, in addition to the overall rating. Similar to Lei et al. (2016), we rescale the ratings to the range $[0, 1]$. Finally, we use an internal dataset provided by MessageBird. This dataset contains 7,112 documents, labelled as spam or non-spam messages.

Models. We use three different model architectures in order to validate the generalisability of our approach. The first architecture we use is a multilayer perceptron (MLP), where we use mean-over-time pooling of the input sequence in order to create a fixed-size input; this input represents the average of the word embeddings. The second architecture we use is a convolutional neural network (CNN), where convolutional filters are applied to windows of several sizes (Kim, 2014). This is followed by max-over-time pooling, and a fully connected layer is used to map the output to the target size. Here max-over-time pooling is motivated by previous research (Kim, 2014) and good results in preliminary experiments. We use a Long Short-Term Memory (LSTM) network to test our setup with a recurrent neural network (Hochreiter and Schmidhuber, 1997). We denote the models where we use a single latent vocabulary by LV (e.g. LV-MLP), and the models where we use multiple latent vocabularies by LV-c, with c the number of classes (e.g. LV-2-MLP). See Appendix A.2 for implementation details and optimisation hyperparameters.

2.5 Experiments

In this section we demonstrate the effectiveness of our approach. To better represent sensitivity to initial conditions, we train every model 5 times using different random initialisations (in plots we report averages and one standard deviation). For the SST2 dataset we show validation accuracy, and for AG News and BeerAdvocate we report test accuracy. We start by looking at the empirical results, showing for each dataset which words are considered important. Furthermore, we show that the latent vocabulary models learn task-specific words (§2.5.1). Then we look at what effect the selection rate has on the classification accuracy (§2.5.2). For SST2, the words kept in the vocabulary should express strong sentiment, as neutral words are less informative for classification. Similarly, for an LV-2-model trained on SST2, we should observe more positive sentiment words in the latent vocabulary related to the positive class (and vice versa). These hypotheses are tested in §2.5.3. When learning class-specific vocabularies, we create a collection of topic words per class. For these topic words, we can calculate a topic coherence metric (§2.5.4), and use them as input for a simple classification model (§2.5.5). We test how well the model can detect artifacts in the train data; this is done by using our model for the detection of data leakage (§2.5.6). In §2.5.7 we compare our stochastic gating mechanism with a deterministic baseline. This is followed by a comparison with a linear alternative for determining feature relevance (§2.5.8). Finally, we test our model on the real-world dataset provided by MessageBird (§2.5.9).9

9 This dataset is discussed separately as its content cannot be shared publicly for commercial reasons.

2.5.1 Empirical Results

First we focus on which words are kept in the latent vocabularies for the different datasets. To illustrate the vocabularies qualitatively, we sort the words based on their selection probability and list the top 10 in each vocabulary. Using the BeerAdvocate dataset we also demonstrate that latent vocabulary models learn task-specific words. We train four models on the same data, each time using one of the four different aspects as training target (taste, aroma, palate, and appearance). This provides us with insights into what words are important for a specific classification target.


SST2        AG News    BeerAdvocate
solid       AFP        not
best        AP         drinkable
powerful    Internet   but
fun         NASA       easy
bad         .          drain
worst       Linux      very
too         nuclear    great
sweet       Iraq       good
mess        Apple      refreshing
?           Arafat     bad

Table 2.2: Top-10 selected words using LV-LSTM for the three different datasets.

Taste       Aroma      Palate     Appearance
drain       bad        thin       no
great       worst      not        head
bad         not        smooth     nice
good        cheap      water      gorgeous
but         corn       great      thick
tasty       awful      perfect    black
delicious   macro      flat       great
very        drain      nice       quickly
excellent   !          excellent  yellow
balanced    chocolate  creamy     poor

Table 2.3: Top-10 selected words from the BeerAdvocate dataset; each column corresponds to LV-LSTM trained to predict one particular aspect.

Results. In Table 2.2 we show results for the three different datasets using the LV-LSTM model. The empirical results for the other two models are very similar, and can be found in Tables A.1 and A.2 in Appendix A.3. The words selected by the model appear to be indicator words for a specific class in the dataset; in Section 2.5.5 we perform experiments to test this hypothesis. In Table 2.3 the most important words are shown for the four different experiments with BeerAdvocate. Some general sentiment words are important for several aspects (e.g. bad, nice), while other words are aspect-specific (e.g. tasty for Taste, head for Appearance).


Figure 2.2: Accuracy versus selection rate for SST2. Dashed lines indicate baseline performance and the shaded area shows one standard deviation.

2.5.2 Effect of Selection Rate

In our approach we can target a particular vocabulary size. Providing less information to a model may have an effect on performance. However, by assumption, we can learn a subset that is truly informative for the task with only a minor impact on performance. For the three different datasets and three different models, we vary the selection rate and observe the performance. We compare the interpretable models to the baseline models described in Section 2.4, without the latent vocabulary.

Results. Figures 2.2, 2.3 and 2.4 show the effect of different selection rates on the accuracy. A summary of the statistics is also provided in Table A.3 in Appendix A.3. As expected, the lower the selection rate, the lower the accuracy. However, the selection rate has a different effect on each dataset: it appears that the smaller the dataset (both in number of documents and in vocabulary size), the bigger the effect of the selection rate. Note that the latent vocabulary models sometimes outperform the baseline method; this is probably coincidental as the differences are not significant.



Figure 2.3: Accuracy versus selection rate for AG News. Dashed lines indicate baseline performance and the shaded area shows one standard deviation.

Figure 2.4: MSE versus selection rate for BeerAdvocate. Dashed lines indicate baseline performance and the shaded area shows one standard deviation.


2.5.3 Sentiment Words

For every word in the SST2 dataset we aggregate the sentiment annotations on word level,10 and determine their overall sentiment using a majority count. This provides us with a sentiment label per word in the dataset. We compare the total distribution of word sentiments with the distribution of the top-k words in the latent vocabulary model. Secondly, for the class-specific vocabulary model, we compare the sentiment distribution of the words from the different vocabularies.
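A small sketch of the word-level aggregation, where `annotations` is a hypothetical mapping from each word to its list of word-level sentiment labels:

```python
from collections import Counter

def majority_sentiment(annotations):
    """Map each word to its most frequent word-level sentiment label (majority count)."""
    return {word: Counter(labels).most_common(1)[0][0] for word, labels in annotations.items()}

print(majority_sentiment({"great": [4, 4, 3], "dull": [1, 0, 1]}))   # {'great': 4, 'dull': 1}
```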

Results. Figure 2.5 shows the distribution of sentiment classes for the total vocabulary and the top-30 of the latent vocabulary. We notice that the words in the latent vocabulary have a stronger sentiment, both negative and positive, compared to the total vocabulary. The distribution of the latent vocabulary appears much flatter and in particular puts more mass onto very negative and very positive, suggesting that it captures sentiment effectively. In Figure 2.6 we show the sentiment distribution for the LV-2-LSTM model. For the class-specific vocabularies we observe that the sentiment distributions differ: one vocabulary learns mostly positive words, while the other learns negative words. This is in correspondence with the two target classes of the SST2 dataset.

2.5.4 Topic Coherence

Using the class-specific latent vocabularies, we are able to create a list of relevant words per class. This can be seen as a form of supervised topic modelling, where every topic corresponds to a specific class. We use Topic Coherence Normalised Pointwise Mutual Information (TC-NPMI) as a measure of topic coherence (O'Callaghan et al., 2015). We compare the topic words determined by our model with topic words found by Latent Dirichlet Allocation (LDA), using as many topic clusters as there are target classes in the dataset. LDA (Blei et al., 2003) is an unsupervised Bayesian admixture model for discovering topics in natural language.
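For reference, a small sketch of NPMI-based topic coherence computed from document co-occurrences; the exact reference corpus, smoothing and preprocessing used in the thesis may differ.

```python
import math
from itertools import combinations

def tc_npmi(topic_words, documents, eps=1e-12):
    """Mean NPMI over all word pairs in a topic, using document co-occurrence counts."""
    docs = [set(d) for d in documents]
    n = len(docs)

    def p(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in d for w in words) for d in docs) / n

    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_i, p_j, p_ij = p(wi), p(wj), p(wi, wj)
        if p_ij == 0.0:
            scores.append(-1.0)        # common convention when a pair never co-occurs
            continue
        pmi = math.log(p_ij / (p_i * p_j + eps))
        scores.append(pmi / (-math.log(p_ij) + eps))
    return sum(scores) / len(scores)

docs = [["great", "taste", "beer"], ["great", "aroma"], ["bad", "taste"]]
print(tc_npmi(["great", "taste"], docs))
```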

Results Table 2.4 shows the mean TC-NPMI results for the topic clusters. We observe that for SST2, LDA outperforms our method, while for AG News the topic words indicated by our model achieve a better score. In Tables A.4 and A.5 in Appendix A.3 we display the topic words of the two approaches. The words determined by LDA show considerable overlap between the different topics, in contrast to the words from the class-specific vocabularies. Furthermore, the topics from LDA seem incoherent and not specific to a class, unlike the class-specific vocabularies. This is expected, since LDA does not receive direct supervision.


Figure 2.5: Percentage of words in each sentiment class for the top-30 words of the LV-LSTM model trained on SST2, compared to the total distribution of the sentiment classes of the words in SST2.

Figure 2.6: Percentage of words in each sentiment class for the top-30 words of the LV-2-LSTM model, trained on SST2. The two latent vocabularies are portrayed, corresponding to the positive and negative class in the SST2 dataset.


2.5.5 Generalisability of Indicator Words

In order to show that the words kept in the latent vocabulary models are clear indicators for the different target classes, we fit a linear/logistic regression and a random forest (Pedregosa et al., 2011) on a subset of the most important words. We do the same for the words determined by LDA and for the most frequent words in the dataset, and observe classification accuracy (or MSE) as we use increasingly more words.
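A minimal sketch of this probe with scikit-learn, assuming the texts and labels are available and restricting the bag-of-words features to a given word subset (function and variable names are illustrative):

```python
# Sketch of the generalisability probe: fit simple classifiers on a
# restricted word set and report test accuracy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def probe_accuracy(word_subset, train_texts, y_train, test_texts, y_test):
    vec = CountVectorizer(vocabulary=word_subset)   # count only the selected words
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    scores = {}
    for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                      ("rf", RandomForestClassifier())]:
        clf.fit(X_train, y_train)
        scores[name] = clf.score(X_test, y_test)
    return scores
```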


Dataset    LV-LSTM    LDA
SST2       -0.158      0.053
AG News    -0.136     -0.157

Table 2.4: Mean TC-NPMI of the top-10 words for the different classes. For LDA we use the same number of topics as the dataset has classes.

Figure 2.7: Accuracy of logistic regression (indicated by the dashed line), and a random forest (indicated by the solid line) on the SST2 dataset. The models are fitted on the top words determined by the latent vocabulary models, the topic words determined by LDA, and the most frequent words in the data.


Results Figure 2.7 portrays the results of this experiment. For logistic regression we observe that the words found by the interpretable models consistently outperform the two other approaches. For the random forest fitted on fewer than 400 words, the words from the interpretable models allow for a higher accuracy; with more words, the different word sources yield the same accuracy. See Figures A.3 and A.4 in Appendix A.3 for results on the other datasets.


Figure 2.8: Index of the leaked word versus the percentage of documents that contain the leaked label. Tested using the three interpretable models on the SST2 dataset. For all models, one standard deviation is indicated.

2.5.6 Detecting Data Leakage

By sorting the words in the latent vocabulary, we learn about feature importance for the classification task. We use this to test whether data leakage can be detected, using the following setup. First we create a new token that does not yet occur in the dataset. We then insert this token into a selection of documents of the same class, creating an artificial exposure of the target class in the training signal. The index of the new token in the sorted latent vocabulary indicates how well the leakage is detected.

Results Figure 2.8 shows the index of the inserted word for different percentages of documents that contain this word. The relation between the number of documents that contain the inserted word and the detection of this word appears to be exponential. The artificial data leakage is easy to detect even when the inserted word occurs in only 1 out of 200 documents.
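The injection step described above can be sketched as follows; the token and function names are hypothetical:

```python
# Sketch of the leakage injection, assuming `documents` is a list of token
# lists and `labels` the corresponding class labels.
import random

LEAK_TOKEN = "<leak>"  # a token that does not occur in the original vocabulary

def inject_leak(documents, labels, target_class, fraction, seed=0):
    rng = random.Random(seed)
    candidates = [i for i, y in enumerate(labels) if y == target_class]
    leaked = set(rng.sample(candidates, int(fraction * len(candidates))))
    return [doc + [LEAK_TOKEN] if i in leaked else doc
            for i, doc in enumerate(documents)]

# After retraining the LV model on the leaked data, the rank of LEAK_TOKEN in
# the vocabulary sorted by gate value shows how quickly the leakage is picked up.
```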

2.5.7 Deterministic Gating

We perform a comparison to a deterministic model that employs sigmoid gates rather than our sparse differentiable layers. The biggest caveat is that sigmoid gates are dense and thus never evaluate to 0. Neural networks are sufficiently expressive that they may overcome small gate values, and it is hard to claim that a word, however small its gate, is not being used by the classifier. This is partly the same criticism previous work has directed towards interpreting attention weights (Jain and Wallace, 2019; Serrano and Smith, 2019), and a strong motivation for sparsity. Nonetheless, it is a reasonable baseline, which we document in Appendix A.4. For testing, we rank the vocabulary by gate value and retain a percentage of the highest-weighted words. Downstream performance is similar to that of our LV models, but the words retained by thresholding are less specific to the different classes and seem to relate much more to word frequency.
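A minimal PyTorch sketch of such a dense sigmoid-gated embedding layer is given below; the class and parameter names are illustrative, and the sparse gates of the LV models are implemented differently:

```python
# Sketch of the deterministic gating baseline: one free gate logit per
# vocabulary entry, squashed through a sigmoid (never exactly zero).
import torch
import torch.nn as nn

class SigmoidGatedEmbedding(nn.Module):
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gate_logits = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, token_ids):
        gates = torch.sigmoid(self.gate_logits)[token_ids]   # dense in (0, 1)
        return gates.unsqueeze(-1) * self.embedding(token_ids)

# For evaluation we rank the vocabulary by sigmoid(gate_logits) and keep the
# top percentage of words.
```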


Selection rate    ARD-LSTM      LV-MLP
5%                66.7 ± 2.4    79.3 ± 0.2
10%               70.5 ± 2.9    80.5 ± 0.3
20%               76.6 ± 2.2    81.5 ± 0.5
30%               80.0 ± 2.6    82.6 ± 0.3

Table 2.5: Selection rate versus accuracy for SST2, using the most relevant words determined by Logistic Regression with ARD as the vocabulary in an LSTM (ARD-LSTM). For comparison we also show the results of the LV-MLP model.


2.5.8 Linear Probe - Logistic Regression with ARD

We compare our model to a linear alternative for selecting relevant features, where we first determine relevance using Logistic Regression with Automatic Relevance Determination (ARD) (Tipping, 2001), and then use these words as the vocabulary for a neural network. Similar to our LV models, we can vary the number of selected words. In this experiment we use an LSTM11 with the subset of the vocabulary selected by ARD.
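As an illustration of this linear probe, the sketch below uses scikit-learn's ARDRegression on bag-of-words counts. Note that this is a regression (not logistic) variant of ARD applied to binary labels, and therefore only an approximation of the setup described here:

```python
# Illustrative sketch of selecting a word subset with an ARD-style linear probe.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import ARDRegression

def ard_vocabulary(texts, labels, keep=500):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts).toarray()   # dense; feasible only for small corpora
    ard = ARDRegression()
    ard.fit(X, np.asarray(labels, dtype=float))
    order = np.argsort(-np.abs(ard.coef_))           # most relevant words first
    vocab = np.array(vectorizer.get_feature_names_out())
    return vocab[order[:keep]].tolist()              # restricted vocabulary for the LSTM
```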

Results In Table 2.5 we show the accuracy for SST212 for a varying selection rate of the total vocabulary. For a low selection rate, the accuracy is approximately ten percentage points lower compared to the LV models.

2.5.9 MessageBird

In this section we experiment with the proposed latent vocabulary models on the MessageBird dataset. Preliminary experiments indicated that even with a relatively small keep rate, performance was only marginally affected. We therefore fix the keep rate at 5% for the latent vocabulary models. This dataset is particularly interesting because it demonstrates the direct usefulness for MessageBird and its customers. Since the baselines already achieve good results in terms of classification accuracy, the usefulness lies in the insights provided by the latent vocabulary models. We illustrate for an artificial test sentence which words are used for classification.13

11With hyperparameters similar to those of the baseline model described in Appendix A.2.

12We are limited to a comparison with SST2. Logistic Regression with ARD does not scale well,


Model      Accuracy      Macro-Accuracy
MLP        93.2 ± 0.2    93.1 ± 0.2
CNN        98.5 ± 0.3    98.4 ± 0.3
LSTM       96.9 ± 0.2    96.3 ± 0.2
LV-MLP     94.3 ± 0.2    93.1 ± 0.2
LV-CNN     98.6 ± 0.3    98.1 ± 0.3
LV-LSTM    97.2 ± 0.2    97.0 ± 0.2

Table 2.6: Results for the MessageBird dataset, using a keep rate of 5% for the latent vocabulary models.

Figure 2.9: Relative class-specific word scores for two artificial sentences. Red corresponds to a higher value in the latent vocabulary of the spam class, green to a higher value in the latent vocabulary of the not spam class. The latent vocabulary values are determined by the LV-2-LSTM model trained on the MessageBird dataset. The first sentence is classified as spam, the second as not spam.


Results Table 2.6 shows the classification accuracies using the baseline and latent vocabulary models. Using the LV models did not have a significant effect on the classification accuracy. In Figure 2.9 we illustrate the words important for classification using the LV-2-LSTM model. For visualisation purposes, we subtract the class-specific word scores related to the spam class from those related to not spam. Here word scores correspond to the selection probabilities learnt by the class-specific latent vocabulary model. This subtraction provides a relative score, with negative values corresponding to spam and positive values corresponding to not spam.
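The relative scoring behind Figure 2.9 can be sketched as follows, assuming the per-class selection probabilities are available as plain dictionaries (names and values are made up for illustration):

```python
# Sketch of the relative word-score computation used for the visualisation.
def relative_scores(tokens, spam_scores, not_spam_scores):
    # not-spam minus spam: negative leans towards spam, positive towards not spam
    return [(tok, not_spam_scores.get(tok, 0.0) - spam_scores.get(tok, 0.0))
            for tok in tokens]

spam_p = {"free": 0.9, "prize": 0.8}
not_spam_p = {"meeting": 0.7, "invoice": 0.6}
print(relative_scores("claim your free prize".split(), spam_p, not_spam_p))
# [('claim', 0.0), ('your', 0.0), ('free', -0.9), ('prize', -0.8)]
```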

13We can not share insights into the specific words kept in the vocabulary as the content of this


2.6 Conclusion

In this chapter we introduced a technique that learns a latent vocabulary of the words that are most important for a classification task. We show that removing a large portion of the vocabulary has little impact on performance while enabling us to gain insights into which words are important for our problem. These words appear to express strong sentiment and are clear indicators for a class. We demonstrate that we can extend our technique to learn class-specific vocabularies. The results are verified on several datasets, both academic benchmarks and real-world data. Furthermore, the model is generally applicable to NNs that use word embeddings.

The insights provided by the model are on a global level. The words kept in the latent vocabularies tell us which words in a dataset are important for a specific class and model. These insights are easily amenable for inspection; even non-experts can draw conclusions based on the words kept in the latent vocabulary. This makes it possible to detect problems that are not obvious when observing classification accuracy alone, such as data leakage or overfitting the train (and possibly test) data.

In future work, we intend to extend our investigation to other NLP tasks, such as natural language inference (NLI), where our technique can help spot annotation artifacts (Gururangan et al., 2018). Additionally, experiments using other NNs are needed. We also note three current limitations of our approach. First, we learn to select words from an existing finite vocabulary. Second, for now we assume access to a meaningful tokenization, which may be more challenging for languages with highly productive morphology. Finally, the provided insights are on the input level and the NN itself remains a black box.


Chapter 3

Transparent Text Classification using Word Mover Distance and Prototype Selection

Word mover distance example

Abstract

The Word Mover Distance (WMD) is a locally transparent distance metric that compares text documents at the word level, assigning smaller distances to documents that have more semantically similar words. The computational complexity of this distance metric limits its widespread use. We propose the use of prototype selection techniques in combination with WMD, such that both local transparency and computation speed are improved. By shifting from document-wise to class-wise classification we obtain dataset-specific prototypes, thereby creating global transparency. This further improves computation speed. Using several datasets we show that the proposed techniques are easy to inspect at a local level and can be extended to allow for global transparency.

3.1 Introduction

Transparency of text classification models can be provided at various levels, as discussed in Chapter 1. The latent vocabulary models introduced in Chapter 2 provide transparency at a global level by giving insights into which input features are important for classification. The downstream model, parameterised by a NN, is not amenable for inspection and remains difficult to interpret. In contrast to globally
