
MSc Artificial Intelligence

Master Thesis

Probabilistic Models for Joint Classification

and Rationale Extraction

by

Lina Murady

10776389

January 27, 2020

36 EC, January 2019 - December 2019
Supervisor: Dr. W. Ferreira Aziz
Assessors: Dr. I.A. Titov, Dr. M.A. Ríos Gaona

"The grandeur of space, dig it. Zillions of stars, each one gets its own pixel." "Awesome."
"Maybe, but it's code's all it is."
~ Thomas Pynchon

Abstract

Text classification has been dominated by solutions based on neural networks (NNs). These NN-based models are often considered to be black boxes and are very difficult to interpret. In this work, we aim to train text classifiers that are amenable to interpretation (by humans). We do so by letting the classifier provide justifications for its decisions, i.e. which words contribute towards classification. This has been known in the literature as latent rationale extraction, where a rationale is a short and informative piece of the input text. Previous work has proposed models based on binary latent variables as well as sparse relaxations to binary variables. The former is trained with reinforcement learning techniques and the latter with techniques based on variational inference and reparameterised gradients. Furthermore, in order to achieve compact rationales, previous works have employed sparsity-inducing penalties which require either careful tuning of hyperparameters or the use of constrained optimisation. Our novel contribution is a probabilistic approach to jointly learn to classify and extract rationales, where we express our preference towards compactness via priors. We also introduce a hierarchical prior which can be seen as a differentiable relaxation of the classic Beta-Bernoulli model. Our model learns a posterior distribution that agrees on average (under the data distribution) with our prior specification. We show that our method learns compact rationales without compromising classification performance and dispenses with constrained optimisation. In this thesis, we focus on sentiment classification using the Stanford Sentiment Treebank.

Acknowledgements

I would like to thank Dr. Wilker Aziz for his great supervision during the thesis. He was always enthusiastic about the work and kept me motivated. Furthermore, I would like to thank Dr. Miguel Ríos and Dr. Ivan Titov for being a part of the committee and for reading and assessing this work.


Contents

1 Introduction
2 Background
  2.1 Text classification
    2.1.1 Architecture
  2.2 Variational Inference
    2.2.1 Gradient estimations
  2.3 Stretch-and-Rectify
    2.3.1 Gradients
  2.4 Interpretation in Deep Learning for NLP
3 Probabilistic interpretable text classifiers
  3.1 Models
    3.1.1 A probabilistic approach to rationale extraction
    3.1.2 Sparse relaxations to binary variables
    3.1.3 Hierarchical prior for text classification
  3.2 Experiments
    3.2.1 Dataset
    3.2.2 Architecture and settings
    3.2.3 Evaluation
    3.2.4 Results
  3.3 Analysis
    3.3.1 Selection rate
    3.3.2 Selecting sentiment
    3.3.3 Rationale inspection
    3.3.4 Plotting the probability of selecting
  3.4 Related work
4 Conclusion
A Additional
  A.1 ELBO derivations
  A.2 HardKuma0
    A.2.1 KL-divergence for two HardKuma0 distributions
    A.2.2 Numerical Instabilities
    A.2.4 Sampling procedure for truncated stretched distribution
B Validation results

Chapter 1: Introduction

Deep neural networks tend to work very well and have been widely adopted in the field of Artificial Intelligence (AI). However, their success also seems to go hand-in-hand with a call for more interpretability. In general, we do not know why a network made a certain decision, i.e. we have no access to justifications. We can assess predictions with different evaluation metrics, but we cannot know why a certain decision has been made. This becomes more important when we employ models in real-world settings, e.g. detecting hate speech and fake news. Often, we want to be able to trust the model in its predictions, and this is closely related to being able to understand what the model does (Ribeiro et al., 2016). However, interpretation is not a very well-defined concept in the field of deep neural networks (Lipton, 2016). In this work we focus on one aspect of interpretation, namely transparency, i.e. having a better understanding of the model.

We do so by focusing on text classification, where neural networks have become a dominant paradigm. We extend a text classifier with a mechanism that allows it to be explicit about which words do or do not contribute towards classification. This has been known in the literature as latent rationale extraction, where a rationale is a compact and sufficient piece of the input text. The classifier makes a prediction solely based upon the extracted rationale. By a sufficient rationale we mean one that is sufficient for prediction, i.e. it does not hurt performance; we can inspect whether this is the case by examining the classification loss. By compact rationales we mean rationales that are short. This is not encoded in our objective, and previous work has employed sparsity-inducing penalties to achieve compactness (Bastings et al., 2019; Lei et al., 2016). These penalties are controlled either manually by a hyperparameter or automatically by a Lagrange multiplier. We can view these penalties as posterior constraints on the rationales. Besides achieving compactness, we also require a method to acquire rationales, i.e. a rationale selector. Lei et al. (2016) do this by modelling the latent rationales as binary variables and employing Bernoulli-distributed selectors. Bastings et al. (2019) propose a sparse relaxation to binary variables and employ HardKuma-distributed selectors.

We take a probabilistic stance towards rationale extraction, where we express our preference towards compactness via (Bayesian) priors. Rather than penalising the distribution once obtained, we guide how the distribution is obtained. Furthermore, by positing a prior we no longer need to manually tune a hyperparameter or turn to constrained optimisation. We also introduce a hierarchical prior which can be seen as a differentiable relaxation of the classic Beta-Bernoulli model. We employ this distribution to learn a trade-off between compact and sufficient rationales rather than expressing our preference directly. In both cases, neural networks learn a posterior distribution that acts as a rationale extractor and which on average (under the data distribution) agrees with our prior specification. Additionally, we argue for a simplification of the sparse relaxation to binary variables proposed by Bastings et al. (2019). The motivation is that the original distribution has up to three modes, whereas it would be better to have two states, since this more closely resembles a binary variable than one with multiple states.

Goals and Contributions In summary, our research questions are:

1. Can we take a probabilistic approach to rationale extraction and express preferences for compactness via a prior?

Previous work has expressed preferences via sparsity-inducing penalties, requiring careful tuning or constrained optimisation. The goal is that, by positing a prior, we can dispense with constrained optimisation and have easier-to-interpret hyperparameters. We show that in sentiment classification we can achieve performance similar to previous work.

2. Can we simplify the recently proposed HardKumaraswamy (Bastings et al., 2019) making it behave more closely to a variable with two states?

Bastings et al. (2019) propose a sparse relaxation to binary variables by stretching and rectifying a reparameterisable base distribution. The resulting distribution lives in [0, 1] and essentially has three states: one, zero and in-between. We argue that a distribution on the interval [0, 1) more closely resembles a binary variable by having only two states: zero and not zero. We show that our approximation performs as well as the HardKumaraswamy.

3. Does a hierarchical prior allow us to learn a selection rather than pre-specifying one beforehand?

We propose a hierarchical prior and show that posterior inference finds a compression that is a good compromise given our goal, namely, to have sufficient and compact rationales.

Notation Some brief notes on notation. Throughout the thesis, when we write x we mean a sequence x_1^N = ⟨x₁, x₂, ..., x_N⟩. We use bold letters, e.g. h, to denote vectors. We denote probability density functions (pdf) with lowercase letters, i.e. p(·), and use uppercase letters, i.e. P(·), to denote cumulative distribution functions (cdf).

Chapter 2: Background

In this chapter we will review the different topics that are discussed within the thesis. Our goal is to build a probabilistic interpretable text classifier. Therefore, we will first formulate a general text classification model in Section 2.1. Since our probabilistic approach consists of latent variable modelling, we need approximate inference in order to learn. Consequently, in Section 2.2 we will discuss variational inference. Furthermore, we consider in Section 2.3 the stretch-and-rectify technique that allows for a reparameterisable distribution with discrete behaviour. Moreover, in Section 2.4 we will sketch an overview of work done on interpretability in the field of NLP.

2.1 Text classification

In text classification the task is to find the corresponding label y to a certain piece of text x. Nowadays most of the classifiers make use of neural networks and learn from the data. The data can either be supervised i.e. we know the true label for a given text or unsupervised where we have no access to the true label. In this section we will describe the procedure for a supervised setting.

We can learn a text classifier by learning to predict a categorical distribution over K possible labels given the input text. The categorical distribution over K labels is parame-terised by a K-dimensional probability vector. Thus we need to learn a mapping from our input space to this probability vector. We can use a neural network to learn that function and say that the categorical is parameterised by a neural network:

Y |x ∼ Cat(f (x; θ)) (2.1)

where f(·; θ) is a neural network with its own parameters θ.

In order to find the optimal values for θ we need to optimise the log-likelihood function. For a given set of N i.i.d. observations, D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}, we can write it down

as follows:

L_D(θ) = Σ_{k=1}^{N} log p(y^(k) | x^(k), θ)   (2.2a)
       = Σ_{k=1}^{N} log Cat(y^(k) | f(x^(k); θ)).   (2.2b)

We can then use the log-likelihood function to estimate the parameters of the model:

θ* = argmax_{θ ∈ Θ} L_D(θ)   (2.3)

In order to find a local optimum of the log-likelihood function we can resort to, e.g., stochastic gradient-based methods, which only require an estimate of the gradient (Robbins and Monro, 1951; Bottou and Cun, 2004). In fact, as long as the function is differentiable with respect to the input and the parameters, we can resort to automatic differentiation packages to obtain gradient estimates (Abadi et al., 2016; Paszke et al., 2017). For a trained classifier, prediction involves searching for the label that attains maximum probability given an input text, i.e. y* = argmax_y p(y | x, θ).

2.1.1 Architecture

In this section we briefly describe a possible architecture for text classification. In general, such an architecture can be summarised in three layers: an embedding layer, an encoding layer and an output layer:²

e_i = emb(x_i)   (2.4a)
h_i = rnn(h_{i−1}, e_i; θ_rnn)   (2.4b)
f(x; θ) = softmax(linear_K(h_n; θ_linear))   (2.4c)

The first step converts tokens into embeddings, giving a continuous representation for the neural network to work with. These can be optimised during training, but in our description they are fixed. The second step encodes the embeddings with a recurrent neural network (Hochreiter and Schmidhuber, 1997), which outputs a hidden state for each time step, i.e. each token in the sequence. The last step is to obtain the probability vector that parameterises the categorical distribution. We realise this by applying a linear transformation with a K-dimensional output, equal to the number of labels, to the last hidden state; a softmax activation is then applied to obtain a properly normalised vector.

² We describe the layers of an architecture as: output = layer(input; parameters).

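To make the three layers concrete, the following is a minimal PyTorch sketch of such a classifier; the specific layer sizes, the LSTM encoder and the use of the last hidden state are illustrative assumptions rather than the exact configuration used later in this thesis.

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Embedding -> recurrent encoder -> linear + softmax over K labels."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, num_labels):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # e_i = emb(x_i)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_labels)   # linear_K

    def forward(self, x):
        # x: [batch, time] token ids
        e = self.emb(x)                    # [batch, time, emb_dim]
        h, _ = self.rnn(e)                 # one hidden state per time step
        logits = self.out(h[:, -1, :])     # use the last hidden state
        return torch.log_softmax(logits, dim=-1)   # log of the Cat parameters

# Training maximises the log-likelihood, i.e. minimises cross-entropy:
#   loss = nn.NLLLoss()(model(x), y)
```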


2.2 Variational Inference

We now describe variational inference (Jordan et al., 1999), a method to approximate probability densities. Furthermore, we discuss how variational inference can be combined with neural networks and explain how we can compute the gradients of such a model.

In Bayesian inference we want to infer the posterior density, i.e. the probability of the latent variable given the observed data. In order to infer this, we need a model of the joint probability distribution of the latent and observed variables. Such a distribution factorises as follows:

p(x, z) = p(x|z) p(z)   (2.5)

where we can specify a likelihood p(x|z) and a prior p(z). Now we want to condition on the data, and using Bayes' rule we can write the posterior density as:

p(z|x) = p(x|z) p(z) / p(x).   (2.6)

However, computing Equation 2.6 is not straightforward, since computing the marginal probability of the observations means marginalising out the latent variable:

p(x) = ∫ p(x|z) p(z) dz.   (2.7)

Intuitively, this means we need to consider all possible assignments for z in order to obtain the marginal. In fact, this is in general either intractable or computationally very costly.

In variational inference this problem is tackled by approximating the posterior through optimisation. Variational inference posits a family Q of densities over the latent variables and then tries to find the member that is closest to the true posterior (Blei et al., 2017). In other words, we want to find the optimal q*(z):

q*(z) = argmin_{q(z) ∈ Q} KL[q(z) || p(z|x)].   (2.8)

However, as shown in Equation 2.7, we cannot compute p(z|x). Instead, an objective that lower-bounds the marginal is optimised:

log p(x) ≥ −KL[q(z) || p(z)] + E_{q(z)}[log p(x|z)]   (2.9a)

This lower bound is also called the evidence lower bound (ELBO). We show a derivation in Equation A.1 of the Appendix. This shows that maximising the ELBO is equivalent to minimising Equation 2.8, but involves computing neither p(z|x) nor the marginal p(x). We

(13)

could also derive this lower bound by starting from our original goal in Equation 2.8 and show that minimising the KL equals maximising the ELBO:

KL[q(z) || p(z|x)] = E_{q(z)}[log (q(z) / p(z|x))]   (2.10a)
                   = KL[q(z) || p(z)] − E_{q(z)}[log p(x|z)] + log p(x)   (2.10b)

During optimisation we can consider log p(x) to be a constant since it does not depend on q(z), so minimising this KL with respect to q(z) is equivalent to maximising the ELBO. We provide a derivation in Equation A.2 of the Appendix. We have now seen that by maximising the ELBO we push the variational distribution q(z) to be close to the true posterior p(z|x). Thus we can consider q(z) to be an approximation to p(z|x).

We have now defined an objective but have no strategy to optimise the ELBO. In general, optimising the ELBO requires a way to estimate variational parameters. We can either do this manually, by specifying a set of update equations or we can resort to stochastic-gradient based optimisation where we parameterise the posterior approximation by a neural network. In the next section we will define the ELBO when both the likelihood and the variational distribution are parameterised by a neural network. Furthermore, we will discuss how we can estimate gradients for such an objective.

2.2.1 Gradient estimations

We now discuss how to estimate gradients when both the variational distribution and the likelihood are parameterised by neural networks. We can write down the objective as follows:

argmax_{λ,θ}  E_{q(z|x,λ)}[log p(x|z, θ)] − KL[q(z|x, λ) || p(z)]   (2.11)

where λ are the variational parameters and θ the generative parameters, both of which are neural network parameters.

As long as we can compute the gradients of Equation 2.11 with respect to the parameters, we can employ automatic differentiation. Since q(z|x, λ) is not a function of θ, and integration is with respect to z and not θ, we can interchange differentiation with respect to θ and integration, and rewrite the gradient with respect to the generative parameters as an expectation, which we can then MC estimate:

∂/∂θ E_{q(z|x,λ)}[log p(x|z, θ)] = E_{q(z|x,λ)}[∂/∂θ log p(x|z, θ)]
  ≈ (1/S) Σ_{s=1}^{S} ∂/∂θ log p(x|z^(s), θ),   z^(s) ∼ q(z|x, λ)   (2.12)

For the variational parameters λ, however, the expectation is taken with respect to a distribution that itself depends on λ, so we cannot simply move the gradient inside the expectation.


We will discuss two options to compute gradients for variational parameters namely the score function estimator and reparameterised gradients.

Score Function Estimator

A famous and generic gradient estimator is the score function estimator (SFE) (Rubinstein and Shapiro, 1990), also known as REINFORCE (Williams, 1992). The idea behind this estimator is that we rewrite the gradient by making use of the log-derivative trick:

∂/∂λ E_{q(z|x,λ)}[log p(x|z, θ)] = ∂/∂λ ∫ q(z|x, λ) log p(x|z, θ) dz   (2.13a)
  = ∫ log p(x|z, θ) (∂/∂λ log q(z|x, λ)) q(z|x, λ) dz   (2.13b)
  = E_{q(z|x,λ)}[log p(x|z, θ) ∂/∂λ log q(z|x, λ)]   (2.13c)
  ≈ (1/S) Σ_{s=1}^{S} log p(x|z^(s), θ) ∂/∂λ log q(z^(s)|x, λ),   z^(s) ∼ q(z|x, λ)   (2.13d)

This is an unbiased estimator, but it is known to suffer from high variance (Ranganath et al., 2014). The estimator does not exploit any functional dependency of the likelihood on z; in fact, the direction of the gradient is controlled only by q, while the likelihood can only scale it. However, it is possible to reduce the variance of the SFE by using control variates (Greensmith et al., 2004; Grathwohl et al., 2017).
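As an illustration of how the SFE is used in practice, the sketch below builds a surrogate loss whose gradient is the MC estimate in Equation 2.13 for a Bernoulli variational distribution; the function and argument names are placeholders, not part of any particular library.

```python
import torch
from torch.distributions import Bernoulli

def sfe_surrogate(log_likelihood_fn, probs, num_samples=8):
    """Surrogate loss whose gradient is E_q[ log p(x|z) * d/dlambda log q(z|lambda) ].

    `probs` are Bernoulli parameters predicted by the inference network (they
    carry the gradient w.r.t. the variational parameters lambda), and
    `log_likelihood_fn(z)` returns log p(x|z, theta) for a sampled z.
    """
    q = Bernoulli(probs=probs)
    surrogate = 0.0
    for _ in range(num_samples):
        z = q.sample()                          # non-differentiable sample
        reward = log_likelihood_fn(z).detach()  # treated as a constant w.r.t. lambda
        surrogate = surrogate + reward * q.log_prob(z).sum(-1)
    # Backpropagating through `surrogate` yields the MC gradient estimate;
    # note that the likelihood only rescales the score function.
    return surrogate / num_samples
```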

Reparameterised gradients

One other way to estimate gradients is to employ the reparameterisation trick (Kingma and Welling, 2013; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014). The main idea is that we re-express, i.e. reparameterise, our variational distribution as a deterministic transformation of the variational parameters and an auxiliary random variable from a fixed distribution. We can describe the procedure as follows:

ε ∼ φ(ε)   (2.14a)
z = t(ε; λ)   (2.14b)

where φ(·) denotes some fixed distribution and t(·) a deterministic transformation. If t(·) is differentiable and invertible, then q(z|x, λ) can be re-expressed as φ(t⁻¹(z; λ)) |det J_{t⁻¹}(z)|, and an expectation with respect to q(·) can be re-expressed as an expectation with respect to φ(·). Essentially, the transformation moves the dependence on λ from the measure of integration to a transformation inside

the integral. We can then rewrite our objective in Equation 2.11 in terms of the fixed distribution:

argmax_{λ,θ}  E_{φ(ε)}[log p(x | t(ε; λ), θ)] − KL[q(z|x, λ) || p(z)]   (2.15)

The reparameterised gradient estimator is unbiased and is known to have low variance. We briefly describe the reparameterisation trick for the case where our approximate distribution belongs to the location-scale family and for the case where it has a tractable inverse CDF.

Location-scale families The set of distributions that can be parameterised by a location parameter and a scale parameter belongs to the location-scale family. We can reparameterise this family of distributions by an affine transformation of a sample from the standard member of the family. As an example, take the Normal distribution:

ε ∼ Normal(0, 1),   z = µ + σ ε ∼ Normal(µ, σ²)   (2.16a)
and conversely   ε = (z − µ) / σ ∼ Normal(0, 1)   (2.16b)

where µ is the location parameter and σ the scale parameter, which in our case are predicted by a neural network.

Tractable inverse CDF When the approximate distribution has a tractable inverse CDF we can transform samples from the uniform distribution. As an example, we take the Kumaraswamy distribution (Kumaraswamy, 1980). This is a continuous distribution defined on the open interval (0, 1). It has two parameters that control the shape of the distribution. Importantly, it resembles the Beta distribution but has a closed-form CDF. We can reparameterise the distribution as follows:

U ∼ Uniform(0, 1),   z = F⁻¹(u | a, b) ∼ Kuma(a, b)   (2.17a)
and conversely   U = F(z | a, b) ∼ Uniform(0, 1)   (2.17b)

where a and b are the predicted shape parameters of the Kumaraswamy and where F⁻¹(u | a, b) = (1 − (1 − u)^(1/b))^(1/a). This has been used to perform reparameterised gradient estimation in deep generative models (Nalisnick and Smyth, 2016).
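A minimal sketch of this inverse-CDF reparameterisation for the Kumaraswamy, assuming the standard CDF convention F(z; a, b) = 1 − (1 − z^a)^b:

```python
import torch

def sample_kuma(a, b, shape=()):
    """Reparameterised Kumaraswamy sample via the inverse CDF.

    F(z; a, b) = 1 - (1 - z^a)^b, so F^{-1}(u) = (1 - (1 - u)^(1/b))^(1/a).
    Gradients flow to a and b because u is drawn from a fixed Uniform(0, 1).
    """
    u = torch.rand(shape)          # u ~ Uniform(0, 1)
    return (1.0 - (1.0 - u).pow(1.0 / b)).pow(1.0 / a)
```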

2.3 Stretch-and-Rectify

Discrete distributions are not reparameterisable but have many applications in machine learning. Louizos et al. (2017) propose a stretch-and-rectify technique that allows for distributions with both discrete and continuous behaviour. They stretch and rectify a base distribution such that the support becomes the closed interval [0, 1] and the probabilities of the outcomes 0 and 1 are non-zero. Louizos et al. (2017) stretch-and-rectify a BinaryConcrete distribution, and Bastings et al. (2019), building on Louizos et al. (2017), stretch-and-rectify Kumaraswamy samples.

We now describe the procedure in more detail. We start by stretching a continuous base distribution with an affine transformation. This transformation changes the support from (0, 1) to (l, r), where l < 0 and r > 1. The stretched distribution now has 0 and 1 in its support, but since it is a continuous distribution the probability of any single outcome is 0. In order to assign probability mass to 0 and 1, the outside regions are integrated: all density below 0 is collapsed into a point mass at 0, and all density above 1 is collapsed into a point mass at 1. This is achieved by rectifying the stretched distribution, i.e. applying a hard sigmoid. The Dirac delta function is used to model the point masses. An illustration of this technique applied to the Kumaraswamy distribution can be found in Figure 2.1. We can write down the density function as follows:

f(h) = π₀ δ(h) + π₁ δ(h − 1) + π_c · (1_{(0,1)}(h) f_S(h) / π_c)   (2.18)

where π₀ and π₁ correspond to the point masses at 0 and 1 respectively, π_c is the area under the continuous curve in between, and f_S is the density of the stretched distribution.


Figure 2.1: Left) we stretch a base distribution from the open interval (0, 1) to the interval (−0.1, 1.1). We integrate the outside regions i.e. collapse density into a point mass. Right) the resulting distribution with a point mass at 0 and at 1.


2.3.1 Gradients

As long as we can reparameterise the base distribution, we can compute reparameterised gradients for a stretched-and-rectified distribution. This is because the stochasticity in the model comes solely from the base distribution. We now demonstrate that the stretch-and-rectify transformation is differentiable almost everywhere. Consider the sampling procedure of a HardKuma:

u ∼ Uniform(0, 1)   (2.19a)
z = (1 − (1 − u)^(1/b))^(1/a)   (inverse CDF of Kuma(a, b))   (2.19b)
s = l + (r − l) z   (stretch by affine transformation)   (2.19c)
h = min(1, max(0, s))   (rectify by hard sigmoid)   (2.19d)

We want to compute the gradient of a function F with respect to the uniform variable, and by the chain rule we get:

∂F/∂u = (∂F/∂h) (∂h/∂s) (∂s/∂z) (∂z/∂u).   (2.20)

Note that all of these operations are fully differentiable except for the rectification, whose gradient has two discontinuities:

∂h/∂s = 0 if s < 0;   0 if s > 1;   1 if 0 < s < 1;   undefined if s ∈ {0, 1}.   (2.21)

Since the transformation is differentiable everywhere except for s ∈ {0, 1}, a set of measure 0 under the continuous stretched density, we say it is differentiable almost everywhere.
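Putting Equation 2.19 together, the following sketch traces the full sampling path; the stretch bounds l = −0.1 and r = 1.1 follow Figure 2.1 and are otherwise an assumption.

```python
import torch

def sample_hardkuma(a, b, l=-0.1, r=1.1, shape=()):
    """Stretch-and-rectify a reparameterised Kumaraswamy sample (Eq. 2.19)."""
    u = torch.rand(shape)                               # u ~ Uniform(0, 1)
    z = (1.0 - (1.0 - u).pow(1.0 / b)).pow(1.0 / a)     # inverse CDF of Kuma(a, b)
    s = l + (r - l) * z                                 # stretch to (l, r)
    h = torch.clamp(s, min=0.0, max=1.0)                # rectify (hard sigmoid)
    # clamp has zero gradient where s < 0 or s > 1 and unit gradient in between,
    # matching Eq. 2.21; the points s in {0, 1} have measure zero.
    return h
```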


2.4 Interpretation in Deep Learning for NLP

Natural Language Processing (NLP) has been dominated by solutions based on neural networks. These models often perform well but are considered to be black boxes, i.e. they are not very interpretable. However, as noted by Lipton (2016), interpretation is not very well defined in the field and has several different aspects. In this section we give a small survey of different approaches to interpretability in NLP.

General interpretation Lipton (2016) argues that there are two important aspects to interpretation, namely transparency and post-hoc explanations. The latter is concerned with giving explanations after we have trained the models; they might not be informative about how the model works, but they are informative to, e.g., users of a system. Transparency is concerned with giving insight into how the model works, i.e. explaining the black box. However, Rudin (2018) argues that for high-stakes decisions that depend on neural networks, we should rather build interpretable models than try to explain a black box. Our work is closer to this line of work, namely, learning models that are more amenable to interpretation.

Explaining Trained Models Ribeiro et al. (2016) introduce LIME, a technique applicable to any classification task that works by locally approximating the predictions provided by the model with an interpretable model. Another line of work on explaining trained models trains diagnostic classifiers (Hupkes et al., 2018): if the model encodes certain information, we should be able to extract it. For example, Giulianelli et al. (2018) train a diagnostic classifier and show that an LSTM tracks number agreement between subject and verb. Another approach to explaining trained models is to examine the internal representations of a network. One way to do that is to look at the attention mechanism, introduced by Bahdanau et al. (2014). Since attention provides scores for the different parts of the input representation, higher scores are interpreted as the model finding those parts important (Mascharka et al., 2018). However, there is also a line of work arguing that attention as explanation has its drawbacks (Serrano and Smith, 2019; Jain and Wallace, 2019). We too refrain from interpreting attention weights, and instead learn something much like a gating mechanism that either drops positions completely or retains them with an arbitrary non-zero value.

Word embeddings One approach specific to NLP is to make word embeddings more interpretable. Often these embeddings are very high-dimensional and thus hard to interpret. One line of work makes embeddings more interpretable by learning sparse embeddings (Faruqui et al., 2015; Panigrahi et al., 2019; Rios et al., 2018). Another line of work applies rotation methods to the embeddings (Park et al., 2017; Dufter and Schütze, 2019). In the case of sparse embeddings, often a new embedding must be learned, whereas rotation-based methods use the available dense embeddings. There is also a line of work that takes a probabilistic stance towards embeddings (Bražinskas et al., 2017; Vilnis and McCallum, 2014).

Chapter 3: Probabilistic interpretable text classifiers

3.1 Models

In this section we discuss different approaches to rationale extraction and introduce our probabilistic approach. We first describe a probabilistic text classifier with a prior over the latent model in Section 3.1.1. Since we are interested in selecting and dropping words, we discuss sparse relaxations to binary variables in Section 3.1.2. More precisely, we propose an adjustment to an existing relaxation to binary variables, aimed at better mimicking a two-state variable. In Section 3.1.3 we introduce a differentiable relaxation of the classic Beta-Bernoulli model: we impose a sparse Beta prior on the parameter of our Bernoulli prior in an attempt to discover the ideal trade-off between compactness and informativeness for rationales.

3.1.1 A probabilistic approach to rationale extraction

In this section we introduce a probabilistic interpretable text classifier augmented with latent rationales. We will first discuss our probabilistic approach and then discuss how this relates to previous work.

We can build our model by extending a text classifier with a sequence of latent variables z = ⟨z₁, ..., z_N⟩, where each z_i decides whether or not a token is available for classification. Thus, we model each z_i as a binary variable and view z as a mask on x. We can describe the model as follows:

Z_i | b ∼ Bern(b)   (3.1a)
Y | z, x ∼ Cat(f(x ⊙ z; θ))   (3.1b)

where b denotes the parameter of the Bernoulli prior distribution and ⊙ denotes the masking operation. By setting b to a certain value, we directly express our preference towards compact rationales, e.g. b = 0.3 means we target a selection rate of 30%.

In order to estimate the parameters of the model we need to optimise the log-likelihood function:

L_D(θ) = Σ_{k=1}^{N} log Σ_{z ∈ {0,1}^{|x^(k)|}} p(z|b) p(y^(k) | x^(k), z, θ).   (3.2)

However, this is intractable due to the fact that z can take on 2^{|x|} different configurations. Nevertheless, we can resort to variational inference and optimise a lower bound instead:

log p(y|x) ≥ E_{q(z|x,y)}[log p(y|x, z, θ)] − KL[q(z|x, y) || p(z|b)]   (3.3)

where the first term corresponds to an expected classification loss and the second term pushes our variational approximation towards the prior. We model our variational distribution as follows:

q(z|x, y, λ) = q(z|x, λ) = Π_{i=1}^{|x|} q(z_i|x, λ) = Π_{i=1}^{|x|} Bern(z_i | g_i(x; λ))   (3.4)

where g(·; λ) is a neural network that predicts one Bernoulli parameter per position. Furthermore, we make two modelling assumptions: first, that the latent variables are independent of the true label y, which is convenient since we do not know the label at test time; second, that the latent variables are independent of each other given x. The latter is known as the mean-field assumption.

We jointly train the parameters of the classifier f and the rationale extractor g:

L_D(θ, λ) = Σ_{⟨x,y⟩ ∈ D}  E_{q(z|x,λ)}[log p(y|x, z, θ)] − KL[q(z|x, λ) || p(z|b)]   (3.5)

where L_D(θ, λ) is a lower bound on L_D(θ).

For gradients with respect to the generative parameters θ we can take an MC estimate. However, since sampling is a non-differentiable operation and the Bernoulli distribution is not reparameterisable, we cannot compute reparameterised gradients with respect to the variational parameters λ; we therefore train the model with REINFORCE. Furthermore, the KL-divergence between two Bernoulli distributions is available in closed form, and thus we have all the ingredients to train the objective in Equation 3.5.
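For completeness, the closed-form KL between the approximate posterior Bern(q_i) and the prior Bern(b) used in Equation 3.5 can be computed as in the sketch below; the clamping is only there to avoid log 0 and the names are illustrative.

```python
import math
import torch

def bernoulli_kl(q, b, eps=1e-8):
    """KL[ Bern(q_i) || Bern(b) ], elementwise, then summed over positions.

    `q` is the tensor of per-position Bernoulli parameters predicted by the
    extractor; `b` is the fixed scalar prior parameter (e.g. 0.3).
    """
    q = q.clamp(eps, 1.0 - eps)
    kl = q * (torch.log(q) - math.log(b)) \
         + (1.0 - q) * (torch.log(1.0 - q) - math.log(1.0 - b))
    return kl.sum(-1)

# Per-sentence ELBO: E_q[log p(y | x, z)] - bernoulli_kl(q, b), where the
# expectation is estimated by sampling z and the gradient w.r.t. the extractor
# is taken with REINFORCE as described above.
```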

Lei et al. (2016) were the first to model rationales as latent variables. In fact, their work, as viewed by Bastings et al. (2019), can be seen as a probabilistic approach with a data-dependent prior:

q(z|x, y, λ) := p(z|x, λ)   (3.6)

where p(z|x, λ) is, again, learned by a neural network that predicts Bernoulli parameters. Note that in this formulation the KL-divergence evaluates to zero, since the approximate posterior is equal to the data-dependent prior. The graphical models of the two formulations can be found in Figure 3.1.


Figure 3.1: In (a), the probabilistic model with a fixed prior, where b pre-specifies a selection rate. In (b), the model with a data-dependent prior, where a neural network with parameters λ controls the selection rate as a function of the input x.

Since a neural network controls p(z|x), it is not possible to express a preference for compact rationales. Moreover, there is no incentive to discard predictors (i.e. words), since that would presumably hurt the classifier. In order to achieve compactness, Lei et al. (2016) add a sparsity-inducing penalty to their objective, namely L0 regularisation.² This term penalises the number of non-zero selections, and the effect of the penalty is controlled by a fixed regularisation hyperparameter. This hyperparameter decides how much of the text is selected and thus requires careful tuning.

Nevertheless, we do not have a clear interpretation of this hyperparameter, and therefore we cannot directly guide the model to select, say, 30% of the text a priori; we can only do so by evaluating the model for different values of the hyperparameter. To automatically target such a selection rate, Bastings et al. (2019) follow the work of Louizos et al. (2017) and relax the L0 regularisation. They do this by computing the expected value under the

data-dependent prior p(z|x, λ):

E_{p(z|x,λ)}[L0(z)] = Σ_{i=1}^{N} E_{p(z_i|x,λ)}[I[z_i ≠ 0]] = Σ_{i=1}^{N} (1 − Pr(Z_i = 0 | λ)).   (3.7)

In the case of a Bernoulli latent model, 1 − Pr(Z_i = 0 | λ) is simply the Bernoulli parameter b_i (i.e. g_i(x; λ)). The expectation then no longer depends on the sample but only on the parameters of the Bernoulli, and the penalty therefore becomes a differentiable objective. Note that the expected value is computed, whereas in the work of Lei et al. (2016) L0 was computed on sampled latent assignments.
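For reference, the relaxed penalty of Equation 3.7 amounts to summing the probabilities of being non-zero; a small sketch under the stretch convention of Equation 2.19, where the bounds l = −0.1 and r = 1.1 are assumptions:

```python
import torch

def expected_l0(p_nonzero):
    """Relaxed L0 of Eq. 3.7: sum_i (1 - Pr(Z_i = 0)), given directly as p_nonzero.

    For a Bernoulli extractor, p_nonzero is simply the vector of predicted
    Bernoulli parameters b_i = g_i(x; lambda).
    """
    return p_nonzero.sum(-1)

def hardkuma_p_zero(a, b, l=-0.1, r=1.1):
    """Pr(Z = 0) for a HardKuma: the Kuma(a, b) CDF evaluated at (0 - l) / (r - l)."""
    k0 = (0.0 - l) / (r - l)
    return 1.0 - (1.0 - k0 ** a) ** b
```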

By relaxing the penalty, Bastings et al. (2019) can formulate a constrained optimisation problem in which they target a specific selection rate r. Since this is a challenging problem, they resort to Lagrangian relaxation:

max_{µ ∈ ℝ} min_{θ,λ}  −L_D(θ, λ) + µ (R(λ) − r)   (3.8)

where µ is a Lagrange multiplier and R(λ) is a regulariser, in this case the expected L0. In order to optimise the Lagrange multiplier, Bastings et al. (2019) follow Rezende and Viola (2018) and employ dedicated updates for the Lagrange multiplier rather than stochastic gradient-based methods.

² A technique commonly employed to regularise the parameters of a neural network is to add a penalty. Due to its non-differentiable nature, L0 is less commonly employed than its counterparts L1 and L2.

Comparing the probabilistic approach with the penalty-based approach makes our motivation clearer. First, by positing a prior over the latent model, we guide how the distribution is obtained and can express our preference a priori; with a penalty, we cannot guide the distribution, only penalise it after assessing it. Second, positing a prior does not require tuning a hyperparameter or a constraint scheme in order to meet our preferences (i.e. the hyperparameter is the preference).


3.1.2 Sparse relaxations to binary variables

Since our latent-variable model is concerned with learning what to select, we prefer distributions that are binary. A Bernoulli distribution seems a natural choice and was discussed in Section 3.1.1. However, training a model with REINFORCE can suffer from high variance. Ideally, we would have a distribution that behaves like a binary variable but can be trained with reparameterised gradients, which have lower variance than REINFORCE. For this reason, Bastings et al. (2019) introduce the HardKumaraswamy (HardKuma), a distribution that allows for discrete behaviour but has reparameterised gradients. The support of the distribution is [0, 1] and it essentially has three states: zero, in-between and one. We propose a variant of the HardKuma, the HardKuma0, which essentially has two states: zero and non-zero. The reason is two-fold. First, a distribution with two states more closely resembles a binary distribution than one with multiple states. Second, from the perspective of a neural network, it is hard to claim that a non-zero gate encodes a notion of importance in its magnitude: any non-zero value gives the neural network access to the input text, albeit scaled.

As in the case of a HardKuma distribution, we start by stretching a Kumaraswamy base distribution. However, since we are interested in having only one point mass, namely at 0, we transform the support from (0, 1) to (l, 1), where l < 0. Again, 0 is now in the support, but since the distribution is continuous the probability of any single outcome is 0. In order to assign probability mass to the outcome 0, we integrate the region (l, 0) and collapse it into a point mass at 0. A step-by-step illustration for the HardKuma0 can be found in Figure 3.2.

Figure 3.2: Illustration for a Kuma(0.5, 0.5) base. Left) base distribution on (0, 1). Middle) stretched distribution on (−0.1, 1); the shaded area below 0 is collapsed into a point mass at 0. Right) the resulting HardKuma0 distribution on [0, 1).

The distribution of the HardKuma0 is a mixture of a point mass at 0 and a stretched distribution truncated to (0, 1):

f(h) = π₀ δ(h) + π_c f_T(h)   (3.9)

where

π₀ = ∫_l^0 f_S(s) ds   (3.10a)
π_c = ∫_0^1 f_S(s) ds = 1 − π₀   (3.10b)
f_T(t) = 1_{(0,1)}(t) f_S(t) / π_c.   (3.10c)

The sampling procedure for the HardKuma0 can be described as follows:

k ∼ Kuma(a, b)   (sample from the base distribution)   (3.11a)
s = l + (r − l) k   (affine transformation)   (3.11b)
h = max(0, s)   (rectify by a ReLU)   (3.11c)

where l < 0 and r = 1 are the lower and upper bounds respectively; we stretch only the lower end of the support, so the stretched variable lives in (l, 1). Note that our variant is very similar to the HardKuma, but the affine transformation and the rectifier are different. As in the case of the HardKuma, all these operations are fully differentiable except for the rectifier:

∂h/∂s = 0 if s < 0;   1 if 0 < s < 1;   undefined if s = 0.   (3.12)

Again, this transformation is differentiable almost everywhere, since it is differentiable except where s = 0, a set of measure 0 under the continuous stretched density. An important difference compared to the HardKuma01 is that we always have a gradient for selecting, i.e. the HardKuma01 has no gradient when the outcome is 1.⁴
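A sketch of the HardKuma0 sampler and of its point mass at 0, mirroring the HardKuma sketch in Chapter 2 but with the one-sided stretch to (l, 1) and a ReLU rectifier as reconstructed above; the bound l = −0.1 matches Figure 3.2 and is otherwise an assumption, and the base shape b is fixed to 1 as discussed later in this section.

```python
import torch

def sample_hardkuma0(a, l=-0.1, shape=()):
    """Reparameterised HardKuma0 sample: Kuma(a, 1) stretched to (l, 1), ReLU-rectified."""
    u = torch.rand(shape)
    k = u.pow(1.0 / a)          # inverse CDF of Kuma(a, 1): F(k) = k^a
    s = l + (1.0 - l) * k       # stretch the lower end only: s in (l, 1)
    return torch.relu(s)        # all mass below 0 collapses to a point mass at 0

def hardkuma0_p_zero(a, l=-0.1):
    """Pr(H = 0) = Pr(S <= 0) = F_Kuma(-l / (1 - l); a, 1) = (-l / (1 - l))^a."""
    return (-l / (1.0 - l)) ** a
```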

The objective of our probabilistic model does not change and is equal to Equation 3.5. However, HardKuma models do not have a closed-form KL-divergence, and the KL is only defined between two HardKuma distributions.⁵ Thus, when positing a prior we need to posit a HardKuma prior for a HardKuma model. If we denote two HardKuma0 distributions by f_H and g_H and their corresponding point masses at 0 by π₀ and π₀′ respectively, we can write down the KL-divergence as:

KL(f_H || g_H) = π₀ log(π₀ / π₀′) + π_c E_{f_T}[log(f_S(t) / g_S(t))]   (3.13)

where f_S(·) and g_S(·) are the respective densities of the stretched distributions. In order to compute the expectation, we take MC estimates by sampling from the truncated stretched distribution f_T. In Section A.2.4 of the Appendix we explain how to sample from the truncated stretched distribution. Note that our KL-divergence is thus partly exact and partly estimated.

⁴ If s > 1 then h = 1; since ∂h/∂s = 0 for s > 1, by the chain rule there is no gradient.
⁵ Or any other distribution that mixes point masses at exactly the same locations with densities over the same continuous support.

Furthermore, with the HardKuma0 we want to model two types of information namely 0 or more than 0. We want to encode information on not zero rather than a particular non-zero value. We will do this by fixing the shape parameter b of the underlying Kumaraswamy to 1. This prevents an asymptotic mode at 1. Additionally, this turns the distribution into a single parameter distribution which allows us to constrain the point mass at 0, as we will see in Section 3.1.3.

We describe our method to circumvent numerical instabilities in Section A.2 of the Appendix. Furthermore, recall that Bastings et al. (2019) target a selection rate via the expected L0 penalty. This penalty depends on the term Pr(Z_i = 0), which for a HardKuma model is a tractable and differentiable quantity: the Kumaraswamy has a closed-form CDF and is reparameterisable, and since the stretched distribution is obtained by an affine transformation, its CDF can be expressed in terms of the CDF of the Kumaraswamy. We show this in Section A.2 of the Appendix.


3.1.3 Hierarchical prior for text classification

In this section we discuss our model with a hierarchical prior. The idea is that rather than pre-specifying a selection rate, we want to learn a compression rate, i.e. the ideal trade-off between compactness and informativeness for rationales. We do so by putting a sparse Beta prior with a bias towards 0 on the parameter of the Bernoulli prior. This means that the Bernoulli parameter will be close to 0 most of the time and therefore samples will be 0 most of the time; we want the model to learn to push away from this bias. Furthermore, we briefly discuss how we could also target a selection rate with a hierarchical prior.

We can write down the model formulation as follows:

B_i | α₀, β₀ ∼ Beta(α₀, β₀)   (3.14a)
Z_i | b_i ∼ Bern(b_i)   (3.14b)
Y | z, x ∼ Cat(f(x ⊙ z; θ))   (3.14c)

where α₀ and β₀ correspond to the shape parameters of the Beta distribution. Since we are interested in sparse Beta priors with a bias towards 0, we fix β₀ to 1 and vary the value of α₀. A plot of the different Beta distributions can be found in Figure 3.3.

Moreover, we interpret the Beta sample, b, as the probability of selecting. In the case where our latent variable is modelled by a HardKuma0 rather than a Bernoulli, we can find a value for α such that HardKuma0(0 | α, 1) = 1 − b:

α = −log(1 − b) / log(11)   (3.15)

where α is the shape parameter of the underlying Kumaraswamy of the HardKuma0 and b is the Beta sample; the constant log(11) corresponds to the stretch bound l = −0.1, since Pr(H = 0) = (−l / (1 − l))^α. We explain in Section A.2.3 of the Appendix how we obtained this.

The graphical model of our hierarchical model can be found in Figure 3.4. We can write down the generative part of the model as:

p(y, b, z|x) = p(y|x, z) Π_{i=1}^{|x|} p(b_i) p(z_i|b_i) = Cat(y | f(x ⊙ z; θ)) Π_{i=1}^{|x|} Beta(b_i | α₀, β₀) Bern(z_i | b_i)   (3.16)

We can write down the inference model, which approximates the true posterior, as:

q(z, b|x, y) = Π_{i=1}^{|x|} q(b_i|x) q(z_i|b_i) = Π_{i=1}^{|x|} Kuma(b_i | g_i(x; λ)) Bern(z_i | b_i)   (3.17)


where we use a Kumaraswamy distribution to approximate the Beta distribution (Nalisnick and Smyth, 2016). Note that again we make a mean-field assumption, and the variational factors are independent given the data (though we let z_i depend on b_i).

Figure 3.3: Beta priors for different values of α₀, with β₀ fixed to 1. As α₀ increases, the prior becomes closer to uniform and thus moves away from the bias towards 0.


Figure 3.4: Our hierarchical model with a Beta prior.

We can derive our objective as follows:

E_{q(b)}[ E_{q(z|b)}[ log p(y, b, z|x) − log q(z, b|x, y) ] ]
  = E_{q(b)}[ E_{q(z|b)}[ log p(y|x, z) + log p(z|b) + log p(b) − log q(b|x) − log q(z|b) ] ]
  = E_{q(b)}[ E_{q(z|b)}[ log p(y|x, z) ] + log p(b) − log q(b|x) ]
  = E_{q(b)}[ E_{q(z|b)}[ log p(y|x, z) ] ] − KL[ q(b|x) || p(b) ]   (3.18)

where KL[q(b|x) || p(b)] is the KL-divergence between a Kumaraswamy posterior and a Beta prior. Note that log p(z|b) and log q(z|b) cancel each other out, since they are, e.g., Bernoulli distributions evaluated at the same Beta sample b. The result is that we only have to compute the KL-divergence between a Kumaraswamy and a Beta. We can compute this KL via an analytical approximation introduced by Nalisnick and Smyth (2016):

KL[ Kuma(α, β) || Beta(α₀, β₀) ] = ((α − α₀)/α)(−γ − Ψ(β) − 1/β) + log(αβ) + log B(α₀, β₀) − (β − 1)/β + (β₀ − 1) β Σ_{m=1}^{∞} (1 / (m + αβ)) B(m/α, β)   (3.19)

where γ is Euler's constant, Ψ(·) is the digamma function and B(·, ·) is the Beta function. In practice, the infinite sum can be well approximated by its first few terms.
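A sketch of Equation 3.19 with the infinite sum truncated to its first few terms; all arguments are assumed to be tensors and the truncation length is an arbitrary choice.

```python
import torch

EULER_GAMMA = 0.5772156649015329

def log_beta_fn(x, y):
    """log B(x, y) via log-gamma."""
    return torch.lgamma(x) + torch.lgamma(y) - torch.lgamma(x + y)

def kl_kuma_beta(a, b, alpha0, beta0, terms=10):
    """Approximate KL[ Kuma(a, b) || Beta(alpha0, beta0) ] (Nalisnick & Smyth, 2016)."""
    kl = ((a - alpha0) / a) * (-EULER_GAMMA - torch.digamma(b) - 1.0 / b)
    kl = kl + torch.log(a * b) + log_beta_fn(alpha0, beta0) - (b - 1.0) / b
    # Truncated correction term: (beta0 - 1) * b * sum_m B(m/a, b) / (m + a*b)
    acc = 0.0
    for m in range(1, terms + 1):
        acc = acc + torch.exp(log_beta_fn(m / a, b)) / (m + a * b)
    return kl + (beta0 - 1.0) * b * acc
```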

Moreover, besides trying to learn a selection rate, we could also use a hierarchical prior to target a selection rate. Since we interpret the Beta sample as the probability of selecting, we can specify the Beta prior such that on average we recover the desired selection rate. The expected value of a Beta distribution is:

E[X] = α₀ / (α₀ + β₀)   (3.20)

Since in our model formulation β₀ is fixed to 1, we can easily specify a Beta prior whose expected value matches a desired selection rate (for instance, Beta(0.4, 1) has expected value 0.4/1.4 ≈ 0.29, close to a 30% target).

3.2 Experiments

In this section we describe the setup for our experiments and the corresponding architecture. We discuss how to evaluate our models and present results.

3.2.1 Dataset

The task we concentrate on is sentiment classification, and we use the Stanford Sentiment Treebank (SST) (Socher et al., 2013). The dataset consists of movie reviews and their corresponding sentiment labels. There are five classes: very positive, positive, neutral, negative and very negative. Besides providing a sentiment label for each movie review, the dataset also provides labels for subphrases and words. In our work we do not exploit the tree structure, but we do use the word-level annotations for analysis. Furthermore, we use the provided split: 8544 training sentences, 1101 validation sentences and 2210 test sentences.

3.2.2 Architecture and settings

All our models are trained jointly, meaning we train the rationale extractor and the classifier at the same time. We do so by having one neural network that learns the rationales, i.e. provides the latent variable z, and one neural network that classifies using solely the rationales, i.e. makes a prediction based upon the input text x masked by z. We can write down our rationale extractor for a HardKuma0 with hyperparameters l and r as follows:

e_i = emb(x_i)   (3.21a)
h_1^n = birnn(e_1^n; λ_birnn)   (3.21b)
a_i = 0.9624 + tanh(linear(h_i; λ_a)) × 0.9528   (3.21c)
u_i ∼ Uniform(0, 1)   (3.21d)
z_i = s(u_i; a_i, 1, l, r)   (3.21e)

where emb(·) is an embedding layer, birnn(·; λ_birnn) a bidirectional recurrent encoder, and s(·; a_i, 1, l, r) a function of the predicted shape parameter that produces a reparameterised sample. The expression 0.9624 + tanh(linear(·; λ_a)) × 0.9528 is a linear transformation followed by a tanh activation, whose output is scaled and shifted such that a_i lives in (0.0042, 1.9205). See Section A.2 for more details on constraining the shape parameters. This general structure is applicable to all of our models; however, for a HardKuma01 we predict two shape parameters rather than one, and the Bernoulli distribution admits no reparameterisation and is trained with REINFORCE. Moreover, we use different activations for the different models; these are summarised in Table 3.1.

Model Activation

HardKuma01 1.001 + tanh(·)

HardKuma0 0.9624 + tanh(·) × 0.9528

Bernoulli sigmoid(·)

Hierarchical prior softplus(·)

Table 3.1: Different activations used for the different models.

The architecture of our classifier is the same for all models and can be written as:

e_i = emb(x_i)   (3.22a)
h_1^n = birnn(z_1^n ⊙ e_1^n; θ_birnn)   (3.22b)
o = softmax(linear_K(h_1^n; θ_linear))   (3.22c)

where softmax(linear_K(·)) gives a normalised probability vector over the K classes. Each embedding vector e_i is gated by the scalar z_i, which decides whether e_i is available for classification. Note that there are no shared parameters between the rationale extractor and the classifier.

Our hyperparameter settings are taken from Bastings et al. (2019) and are summarised in Table 3.2. We use BiLSTMs to encode our sequence and apply dropout before the final output layer and to the word embeddings.

Optimiser Adam

Learning rate 0.0002

Word Embeddings Glove 300D (fixed)

Hidden size 150

Batch size 25

Dropout 0.5

Weight decay 1 * 10−6

Table 3.2: Hyperparameter settings taken from Bastings et al. (2019)

For our probabilistic models we encountered posterior collapse: the approximate posterior collapses to the prior and does not make use of the data, i.e. the approximation becomes independent of the data, q(z|x) ≈ q(z) ≈ p(z). In order to circumvent posterior collapse we used free bits (Kingma et al., 2016). Furthermore, we employ a KL warm-up phase in which the KL term is not included in the objective for the first 20 epochs.
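Free bits replace the KL term with a version clamped from below, so the optimiser stops pushing the KL of a latent variable once it falls under the free-bits rate r; a minimal sketch of the general technique (not necessarily the exact grouping used in our experiments):

```python
import torch

def free_bits_kl(kl_per_dim, rate):
    """Clamp each per-latent KL from below so it is not driven all the way to zero."""
    return torch.clamp(kl_per_dim, min=rate).sum(-1)
```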

In order to reduce the variance of the REINFORCE estimator we resort to control variates, which take the following form:

E_{q(z|x,λ)}[ (log p(y|x, z, θ) − C) ∂/∂λ log q(z|x, λ) ]   (3.23)

where C is called the baseline. The REINFORCE estimator is often used in the field of Reinforcement Learning, where log p(y|x, z) is called the reward and ∂/∂λ log q(z|x, λ) the policy gradient. For our experiments we used a moving-average baseline, which keeps a running average and running standard deviation of past rewards:

C = ((reward − running avg.) / running std.) × reward.   (3.24)
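The sketch below shows the general form of such a moving-average control variate, standardising the reward with running statistics; it illustrates the idea rather than reproducing the exact scaling of Equation 3.24.

```python
class MovingAverageBaseline:
    """Running statistics of past rewards, used to reduce REINFORCE variance."""

    def __init__(self, momentum=0.95):
        self.momentum = momentum
        self.mean = 0.0
        self.var = 1.0

    def standardise(self, reward):
        # Update running mean / variance with an exponential moving average.
        self.mean = self.momentum * self.mean + (1 - self.momentum) * reward
        self.var = self.momentum * self.var + (1 - self.momentum) * (reward - self.mean) ** 2
        std = max(self.var ** 0.5, 1e-6)
        # The centred, scaled reward replaces the raw reward in the surrogate loss.
        return (reward - self.mean) / std
```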

3.2.3 Evaluation

We use accuracy and F1-score to determine the performance of our classifier:

accuracy = number of correct predictions / total number of predictions   (3.25)
F1 = 2 × (precision × recall) / (precision + recall)   (3.26)

where precision = TP / (TP + FP) and recall = TP / (TP + FN), with TP, FP and FN the numbers of true positives, false positives and false negatives. Since we are dealing with multiple classes, we employ macro-F1: an F1-score is computed for each class by treating that class as positive and all other classes as negative, each class is considered equally important, and the average is taken over the per-class F1-scores. When we report F1-scores, we report macro F1-scores.

We train all our models until convergence, i.e. until there is no improvement in F1-score on the validation set for 20 epochs. Furthermore, at test time we want to make deterministic rather than stochastic predictions. We briefly describe the test-time strategy for the different models.

Bernoulli For a Bernoulli distribution we obtain deterministic predictions by taking the arg max:

z_i = 1 if b_i > 0.5, and 0 otherwise   (3.27)

where b_i denotes the parameter of the Bernoulli.

HardKuma01 For the HardKuma01 we take the arg max over the different configurations of the distribution, and when the continuous interval is the most likely configuration we take the mean µ of the underlying Kumaraswamy:

z_i = 0 if π₀ > π₁ and π₀ > π_c;   1 if π₁ > π₀ and π₁ > π_c;   µ otherwise   (3.28)

where π₀ is the probability of 0, π₁ the probability of 1, and π_c the probability of the continuous interval.

HardKuma0 For the HardKuma0 we also take the arg max over the most likely configuration:

z_i = 0 if π₀ > 0.5, and 1 otherwise   (3.29)

where π₀ is the probability of 0. Note that we do not set z_i to the mean of the Kumaraswamy but rather to 1.⁷

Hierarchical prior In the case of a hierarchical prior we do the following:

z_i = 0 if Pr(B_i < 0.5) > 0.5, and 1 otherwise   (3.30)

where Pr(B_i < 0.5) denotes the probability that the Kumaraswamy sample is lower than 0.5. Note that the decision is based on the approximate Kumaraswamy and not on the Bernoulli distribution.

⁷ The actual value we set this to is irrelevant; we set it to 1 for convenience, but any non-zero value would do.

3.2.4 Results

We now discuss results in two settings: targeting a selection rate and learning a selection rate. For our baseline model we report an average across 5 runs together with the standard deviation; all other models are based on a single run.

Our baseline model follows the architecture of the classifier and has access to the complete sequence. The results can be found in Table 3.3.

Acc. F1
Validation 45.65 (±0.37) 44.20 (±0.41)
Test 46.34 (±0.96) 44.43 (±0.59)

Table 3.3: Performance scores for the baseline classifier without rationales.

Targeting a selection rate

For our models with a rationale extractor we aimed for a selection rate of 30%. The results can be found in Table 3.4. In order to hit the target rate for the models with a fixed prior, we needed to tune the free-bits rate r; in Section B.1 of the Appendix we provide validation results for these models. For the models with a hierarchical prior, we targeted a selection rate by setting the expected value of the prior close to 0.3. We can see in Table 3.4 that most of the models reach approximately the target rate, except the models with a hierarchical prior. Furthermore, we observe that the models that do reach the target selection rate achieve performance close to the baseline.

Strategy | Prior | Rationale extractor | Acc. | F1 | Sel. rate
Expected L0 regulariser | - | Bern | 43.21 | 42.26 | 31.44
Expected L0 regulariser | - | Hk01 | 43.26 | 41.82 | 25.08
Expected L0 regulariser | - | Hk0 | 44.48 | 42.39 | 24.29
Fixed prior | Bern(0.3) | Bern | 44.48 | 41.45 | 31.79
Fixed prior | Hk01(0.0003, 0.17) | Hk01 | 44.30 | 42.14 | 31.39
Fixed prior | Hk0(0.15, 1) | Hk0 | 44.43 | 42.55 | 30.30
Hierarchical prior | Beta(0.4, 1) | Bern | 40.41 | 36.85 | 12.99
Hierarchical prior | Beta(0.4, 1) | Hk0 | 42.67 | 39.49 | 16.51

Table 3.4: Performance scores and selection rates on the test set while targeting a 30% selection rate.

We noticed that specifying a selection rate for the hierarchical model via the expected value did not work. We tried different Beta priors where most of the density was towards 0 and towards 1, i.e. U-shaped priors. However, we ran into numerical instabilities and could not investigate whether this would have helped.

Learning a selection rate

Rather than targeting a selection rate, we try to learn one. In Table 3.5 we show the validation scores for different Beta priors for our Beta-Bernoulli model and our Beta-HardKuma0 model. For both models, a Beta prior with shape parameters (0.6, 1) seems to be a good compromise between performance and selection rate. We observe that for both hierarchical models, as the Beta prior becomes closer to uniform, the model starts to select more. This is in line with what we would expect, since a uniform distribution essentially expresses no preference.

(a) Beta-Bernoulli
Prior | Acc | F1 | Sel. rate
0.1 | 27.06 | 14.24 | 0.03
0.2 | 34.52 | 31.17 | 4.31
0.3 | 37.10 | 33.91 | 7.92
0.4 | 40.41 | 36.85 | 12.99
0.5 | 42.04 | 39.52 | 20.73
0.6 | 43.53 | 41.08 | 30.64
0.7 | 43.12 | 40.97 | 43.47
0.8 | 46.20 | 42.99 | 64.65
0.9 | 44.43 | 41.92 | 68.20
1 | 45.02 | 43.28 | 97.54

(b) Beta-HardKuma0
Prior | Acc | F1 | Sel. rate
0.1 | 25.79 | 15.84 | 0.65
0.2 | 40.77 | 37.09 | 6.67
0.3 | 40.14 | 37.21 | 10.45
0.4 | 42.67 | 39.49 | 16.51
0.5 | 43.57 | 41.88 | 24.65
0.6 | 44.25 | 42.00 | 31.97
0.7 | 44.98 | 42.52 | 44.29
0.8 | 43.30 | 41.87 | 57.65
0.9 | 44.34 | 43.42 | 68.24
1 | 44.80 | 42.13 | 83.63

Table 3.5: Validation scores and selection rates for different Beta(α₀, 1) priors.

3.3 Analysis

In this section we analyse some of our models. We discuss the effect of the selection rate and investigate the sentiment of the words that were selected. Furthermore, we inspect some rationales and plot the probability of selecting.

3.3.1 Selection rate

In Figure 3.5 we can see the performance of the HardKuma0 and Bernoulli model with a fixed prior for different selection rates. We can see that both models achieve a similar performance to the baseline. However, the Bernoulli model drops a bit in performance when selecting more.


Figure 3.5: Selection rates plotted against accuracy.

3.3.2 Selecting sentiment

Since SST provides sentiment annotations at the word level, we can inspect the sentiment of the words that were selected. In line with the analysis of Bastings et al. (2019), we observe that words that express a sentiment are selected more often than words with a neutral sentiment. We also observe that words with a stronger sentiment (very negative, very positive) are selected relatively more often. We investigate this by plotting, for each model, which percentage of each sentiment class was selected.

In Figure 3.7 we compare the models with a fixed prior. We do not observe a large difference between the models. There seems to be a slight bias towards selecting negative words as opposed to positive words. In Figure 3.6 we compare the models with a fixed prior against the models with an L0 penalty. We can see that a Bernoulli model


Figure 3.6: Priors compared to L0


Figure 3.7: Fixed priors

with L0 and a HardKuma01 model with L0 select relatively more than the other models, whereas the HardKuma0 with L0 selects much less than the other models.

In Figure 3.8 we compare the models with a hierarchical prior against the models with a fixed prior. We observe that models with a hierarchical prior select relatively less than models with a fixed prior. Furthermore, the differences in selection are also much smaller than for the models with a fixed prior.


Figure 3.8: Fraction of each sentiment class selected; models with a fixed prior (Hk0-Hk0, Bern-Bern) compared to models with a hierarchical prior (Beta-Hk0, Beta-Bern).

3.3.3 Rationale inspection

Figure 3.9: Correct predictions for the HardKuma0 model with a fixed prior.

We will look into some examples of rationales that are classified either correctly or wrongly. To obtain the rationales we use a HardKuma0 model with a fixed prior and a selection rate of 29.91%. We inspect rationales extracted on the validation set. In Figure 3.9 we show extracted rationales that lead to a correct prediction and in Figure 3.10 rationales that lead to a false prediction. A sentiment label of 4 means very positive and a sentiment label of 0 means very negative.

We can see in Figure 3.9 that when a correct prediction is made, the sentiment of the gold label is often present in the extracted rationale. For some examples, the model is thus capable of selecting words that express the sentiment of the sentence.

In Figure 3.10 we observe that for false predictions many rationales are extracted with sentiment 2, i.e. neutral sentiment. If the model only selects neutral words, it has little evidence left to predict the correct sentiment.


Figure 3.10: False predictions for the HardKuma0 model with a fixed prior.


3.3.4 Plotting the probability of selecting

Figure 3.11: Heatmaps of the probability of selecting for the Beta-Bernoulli model (panels (a)-(d)).

For the hierarchical models we plot the probability of selecting. For each token in the sequence we compute the probability that it is selected and plot the result as a heatmap. Figure 3.11 shows heatmaps for a Beta-Bernoulli model and Figure 3.12 for a Beta-HardKuma0 model. We can see that words that are selected have a higher probability than words that are not. In fact, words that are chosen tend to have a probability higher than 50%. This is not surprising, since at test time we select words with a probability of at least 50%.
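The sketch below illustrates how such a heatmap can be produced for a single sentence; the tokens and probabilities are made up, and this is not the plotting code behind Figures 3.11 and 3.12.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-token selection probabilities p(z_i = 1 | x) for one sentence.
tokens = ["a", "deliciously", "nonsensical", "comedy"]
p_select = np.array([0.08, 0.91, 0.87, 0.35])

# Test-time rule described above: keep tokens whose probability is at least 50%.
rationale = [t for t, p in zip(tokens, p_select) if p >= 0.5]
print(rationale)  # ['deliciously', 'nonsensical']

# Plot the probabilities as a single-row heatmap.
fig, ax = plt.subplots(figsize=(6, 1.5))
ax.imshow(p_select[None, :], cmap="Reds", vmin=0.0, vmax=1.0, aspect="auto")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha="right")
ax.set_yticks([])
fig.tight_layout()
plt.show()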


Figure 3.12: Heatmaps of the probability of selecting for the Beta-HardKuma0 model (panels (a)-(d)).

3.4 Related work

We will now discuss work related to the topics that make up this thesis.

Rationales Modelling rationales as latent variables was first introduced by Lei et al. (2016). However, the idea of rationales was first proposed by Zaidan et al. (2007). These rationales were not latent but rather human-annotated. The aim is to improve the performance of text classifiers by incorporating rationales into the model, the idea being that annotating rationales might be more beneficial than annotating more data (Zaidan et al., 2007; Zaidan and Eisner, 2008; Zhang et al., 2016). This line of work thus has complete supervision for rationales, whereas in latent rationale modelling we want to learn what the rationales are. The work of Titov and McDonald (2008) is also concerned with learning which pieces of the text are informative and jointly learns a classifier and an extractor. However, their aim is to improve summarisation as opposed to providing justifications for classification.


Hierarchical models Using a hierarchical prior is a well-established technique in the field of Bayesian modelling; here we focus on the use of hierarchical priors in machine learning. Goyal et al. (2017); Klushyn et al. (2019) note that the prior commonly used in Variational Auto-Encoders (VAEs) is often too simple and not capable of capturing the data. Therefore, Goyal et al. (2017) propose a tree-based non-parametric Bayesian prior which allows for infinite capacity and can learn, due to its hierarchical nature, an aggregated, structured representation of the data. Klushyn et al. (2019) view VAEs as a constrained optimisation problem and within that view introduce hierarchical priors. However, in this line of work the prior is learned from the data, whereas in our work the prior is fixed.

Rectified distributions The idea of a rectified distribution is not novel and has for example been applied to a Gaussian distribution (Socci et al., 1998). Furthermore, as noted before, Louizos et al. (2017) obtain a rectified distribution by applying the stretch-and-rectify principle to the BinaryConcrete distribution. However, their aim is to achieve sparse parameters as opposed to sparse activations.

Rolfe (2016) learns a relaxation of a discrete random variable by directly specifying a mixture of a point mass at 0 and a continuous normalised density over the unit interval. However, in their work the point mass at 0 and the shape of the distribution over (0, 1) are modelled independently. In the case of the stretch-and-rectify principle the shapes are tied via the base distribution, which ensures some smoothness.

Sparse models Sparsifying models is a way to obtain more interpretable models. Martins and Astudillo (2016) propose to use sparsemax, as opposed to softmax, as an activation that allows for assigning zero probability to certain outcomes. Importantly, they show that it is differentiable. Niculae and Blondel (2017) make use of this fact and, by smoothing the max operator, are able to learn sparse attention mechanisms. However, as noted by Bastings et al. (2019), their work is concerned with K-valued outcomes whereas latent rationale modelling is concerned with binary outcomes.


CHAPTER 4

Conclusion

In this work, we have taken a probabilistic approach to latent rationale modelling. The aim was to express preferences via a prior rather than employing sparsity-inducing penalties, and thereby dispense with hyperparameter tuning and constrained optimisation. While employing a fixed prior, we encountered posterior collapse. We were able to circumvent it by applying a heuristic to the KL term, i.e. free bits. We could tune the free-bits rate such that we would get the desired selection rate and corresponding performance. This approach thus still requires tuning, but it is capable of reaching the pre-specified selection rate while obtaining performance similar to the baseline.

Furthermore, we introduced the HardKuma0, a simplification of the HardKuma01. We observed that, both when the model is employed with a penalty and in the fixed-prior setting, we get similar performance and reach similar selection rates. It seems that by simplifying the HardKuma01 we have not lost any expressiveness, while having fewer zero gradients.

Moreover, we introduced a hierarchical prior which can be viewed as a differentiable relaxation of the classical Beta-Bernoulli model. We used this model to learn a selection rate by positing different Beta priors with a bias towards 0. The goal was to find a Beta prior that yields sufficient performance while remaining compact. The prior we found had a selection rate of around 30% and close-to-baseline performance. However, pre-specifying a selection rate for this model seemed rather unstable: we ran into numerical instabilities that we could not solve.

Overall, we noticed that the models that do work are inclined to select sentiment words rather than neutral words, which is in line with previous work. However, the hierarchical models seem to conform less to this trend.

For future work, one could investigate whether structured inference plays a role. We have chosen to model all variational factors independently of each other given the data. Previous work on sparsity-inducing penalties did employ dependencies, and this could make a difference in the probabilistic setting as well. Furthermore, the proposed models were only applied to sentiment classification; one could view latent rationale modelling as a latent gating mechanism and apply it to different problems. Moreover, the hierarchy of the prior could also be extended by, e.g., putting a Gamma prior over our Beta prior. In fact, the deeper the hierarchy, the less influence the prior is going to have.


Bibliography

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate.

Bastings, J., Aziz, W., and Titov, I. (2019). Interpretable neural predictions with differentiable binary variables. CoRR, abs/1905.08160.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.

Bottou, L. and Cun, Y. L. (2004). Large scale online learning. In Advances in neural information processing systems, pages 217–224.

Bražinskas, A., Havrylov, S., and Titov, I. (2017). Embedding words as distributions with a bayesian skip-gram model.

Dufter, P. and Schütze, H. (2019). Analytical methods for interpretable ultradense word embeddings.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015). Sparse overcomplete word vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1491–1500, Beijing, China. Association for Computational Linguistics.

Giulianelli, M., Harding, J., Mohnert, F., Hupkes, D., and Zuidema, W. (2018). Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information.

Goyal, P., Hu, Z., Liang, X., Wang, C., and Xing, E. P. (2017). Nonparametric variational auto-encoders for hierarchical representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 5094–5102.

Grathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. (2017). Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv
