
MSc Artificial Intelligence
Master Thesis

Applying Latent Graphical Structures to Natural Language Inference

by
Alexander Thomas Geenen
11855924

October 16, 2019

36 ECTS, February 2019 - October 2019
Supervisor: Dr M. Rios Gaona
Assessors: Dr J. Zuidema, Dr W. Ferreira Aziz


Abstract

Natural Language Inference (NLI) is unique in that it combines logical inference with a grounding in natural language. Two sentences, a premise and a hypothesis, are provided, and one has to predict the nature of the entailment relationship between the two. Due to the many facets of language this task is not a formalized setting where strict logical rules can be applied. As such, most neural Natural Language Inference models are unable to demonstrate that they have correctly captured aspects of language to then perform inference in this setting. This has been attributed to artifacts in the current datasets for this task, as well as shortcomings in the models themselves. We attempt to address this by proposing models which incorporate latent graphs. These latent graphical structures are constructed using the external knowledge provided by dependency parses and the WordNet knowledge base, in order to help neural models better understand natural language facets. These graphs are then used as the basis for Graph Convolutional Layers (GCL), which enable the model to learn the relationships within and between sentences using these graphs. We furthermore extend the model to be able to induce these latent structures in a generative manner, in order to enable the model to learn latent structures beyond the proposed graphs. We train this generative model in a semi-supervised setting in order to incorporate the external knowledge into these induced graphs. We find that the induced structures do not reflect the external knowledge introduced in training, but do result in improvements on the task of Natural Language Inference. We show that the proposed latent graphical structure based models improve on general language benchmarks, suggesting that they aid model generalization.


Acknowledgements

This thesis has been a very challenging but satisfying body of work. I would like to thank my supervisor Miguel Rios Gaona for his fantastic support. His guidance and attention kept me on track and inspired. Furthermore, I would like to thank Dr Wilker Aziz and Dr Jelle Zuidema for taking the time to read and assess my thesis. I also want to thank my friends and colleagues in the MSc AI program, and in the masters room. You have made these past two years unforgettable. Finally, I’m grateful to my family, friends, and Alice for their endless support and patience.


Contents

Acknowledgements
1 Introduction
  1.1 Research Questions
2 Background
  2.1 Textual Entailment Datasets
    2.1.1 Dataset Weaknesses
  2.2 Neural NLI Models for Textual Entailment
    2.2.1 Incorporating External Knowledge
    2.2.2 Incorporating Latent Tree Structures
  2.3 Graph Convolutional Neural Networks
  2.4 Variational Inference
    2.4.1 Entropy
    2.4.2 Bayesian Statistics
      The ELBO
      Mean Field Approximation
  2.5 Variational Auto Encoders
    2.5.1 The Reparameterization Trick
  2.6 Learning Discrete Latent Spaces
    2.6.1 The Concrete Distribution
    2.6.2 MuProp
    2.6.3 REBAR
    2.6.4 RELAX
  2.7 Reducing Variance
    2.7.1 Control Variables
3 Applying Graphical Structures To Natural Language Inference
  3.1 Deterministic Graphs
    3.1.1 Syntax Parses
    3.1.2 WordNet
    3.1.3 Graphs incorporating knowledge
  3.2 Latent Graph Generation
    3.2.1 Graphical Model
    3.2.2 Generating Latent Graph Structures
      Latent Dependency Edges
      Latent Alignment Edges
      Graph Induction Model
    3.2.3 Training Latent Variable Induction
      Marginal Likelihood
  3.3 Graph-based Networks
    3.3.1 Graph Comparison Architecture
    3.3.2 Encoder Contribution Architectures
    3.3.3 GraphSIM Architecture
4 Experiments and Results
  4.1 Datasets
  4.2 Baselines
  4.3 Experiments
    4.3.1 Layer Comparisons
    4.3.2 Static Graph Comparisons
      Ablation Graphs
      Graph Performance
    4.3.3 Static Encoder Contributions
    4.3.4 Induced Latent Graph Comparisons
  4.4 Results
    4.4.1 Analysis of Language Understanding
    4.4.2 Generative Model Analysis
5 Conclusion
  5.1 Future Work
Bibliography
A Additional Plots
  A.1 GraphSIM Loss & Accuracy Curves


List of Figures

2.1 Depiction of the Reparameterization Trick, where the Source of Randomness is Moved to Noise ε
3.1 Graphical Model For Deterministic Natural Language Inference
3.2 Dependency Parse
3.3 Dependency Parse
3.4 The Proposed Adjacency Matrix for the Dependency-based Graph of a Sample Premise-Hypothesis Pair
3.5 Dependency Parse
3.6 The Proposed Adjacency Matrix for the Dependency and WordNet-based Graph of a Sample Premise-Hypothesis Pair
3.7 Graphical Model For Latent Graph Generation
3.8 A Visualization of the Composition of the Latent Graph Structure's Adjacency Matrix by the Latent Random Variables
3.9 The Proposed Model For Inducing a Latent Graph Adjacency Matrix
3.12 Proposed Ablation Architectures
4.1 Layer Comparison Loss Curves
4.2 Validation Accuracies Per Epoch For Proposed Graphs & Ablations
4.3 Loss Curves For Proposed Graphs & Ablations
4.4 Visualization of Latent Graphs For An SNLI Entailment Training Example
A.1 GraphSIM Variants Loss Curves
A.2 GraphSIM Variants Accuracy Curves


List of Tables

4.1 Performance Comparison of # of GCN Layers on the SNLI and MultiNLI Validation Datasets using Parse Graphs
4.2 Performance comparison of the Proposed & Ablated Graphs On Both The SNLI & MultiNLI Validation Datasets
4.3 Ablation of the Proposed LSTM-GCN Encoder On Both The SNLI & MultiNLI Validation Datasets
4.4 Comparisons of the Proposed Induced Latent Graphs On Both the SNLI & MultiNLI Validation Datasets
4.5 Test Set Performance of the Baseline and Proposed Graph Networks When Trained on the SNLI Dataset
4.6 Test Set Performance of the Baseline and Proposed Graph Networks When Trained on the MultiNLI Dataset
4.7 Breakdown of GLUE Category Performance for Best Performing


Chapter 1

Introduction

Natural Language Processing (NLP) tasks concern the understanding and manipulation of language by machines. Although many advancements have been made in recent years using neural approaches, on top of the previous rules-based methods, various sub-domains and sub-tasks in this research field are still open problems.

One such task is Natural Language Inference (NLI) [1], [2], also known as Textual Entailment (TE). This is a task in which two sentences are provided, a premise statement and a hypothesis statement, and one must determine whether the claims made in the hypothesis sentence contradict the premise, or are entailed by it. It is also possible that the two sentences have no hard determining factors, leading the hypothesis to be declared as neutral.

This inference task is not as straightforward as logical inference problems since NLI is grounded in natural language, and is not in a formalized setting where strict logical rules can be applied. The natural language grounding of this task not only obscures the logical facet of inference, but it also affects the evaluation of the inference system. Since natural language can contain many ambiguous statements, the conditions in which the sentences are evaluated can be deciding factors for whether or not entailment is present. One such problem is entity coreference, meaning the entities/subjects involved in the two sentences are the same, e.g. "a dog" refers to the same dog in both the premise and the hypothesis. An example of this is as follows:

premise: "A dog is sleeping."
hypothesis: "A dog is running."

In the above case, the "dog" entity could be seen as the same dog. Whether it is labeled as neutral or as a contradiction depends on whether the sentences refer to the same event. This form of ambiguity, and others like it, form a problem that is unique to the domain of natural language. Some approaches have been proposed to address these issues with varying degrees of success [1], [3].

Additionally, if the ideal is to have an NLP system that is able to understand and manipulate language successfully, it would invariably have to be able to perform inference in order to meet end goals. As such, the NLI task is a key part of NLP moving forward.

One of the shortcomings of current NLI models is that they fail to capture aspects of language correctly [4], [5]. It has been suggested that this can be attributed to artifacts in the current NLI datasets, as well as shortcomings in the models themselves. In order to remedy this, we propose methods to incorporate latent graphs, grounded in external knowledge, into a model we refer to as GraphSIM. Through this research we seek to demonstrate that the modeling of latent graphs is of benefit to NLI models and that the introduction of external knowledge can address shortcomings in existing approaches.

1.1 Research Questions

In this work we address the following research questions:

Question 1: Can we use syntax parses and knowledge bases to help neural approaches to better understand natural language facets for use in inference?

In this thesis we apply deterministic approaches to constructing graphs for use in the task of natural language inference (NLI). In particular, dependency syntax parses and knowledge bases such as WordNet [6] are used to construct graphs using deterministic rules for use in NLI networks. We show that the Graph Convolutional Networks [7] that use these graphs perform comparably to baseline approaches on the task of natural language inference, and show similar results on evaluation datasets that test language understanding.

Question 2: Does the introduction of semi-supervised generative methods for latent graphical structures improve a model’s language comprehension and inference?

In addition to applying deterministic approaches to the construction of graphs, we also hypothesize that syntax parses and knowledge bases may introduce their own influences, flaws, and/or biases into NLI models. In order to address these possible issues, we propose the usage of generative models for the creation of the latent graph structures, which are subsequently incorporated into the Natural Language Inference models.


Chapter 2

Background

In this chapter we give an overview of the previous techniques and work that our contributions are based on. In Section 2.1 we describe the task of textual entailment, its datasets, and the existing critiques of the datasets. In Section 2.2 we discuss the existing models for textual entailment, including models that use external knowledge, and those that incorporate and/or induce latent tree structures for use in this task. In Section 2.3 we discuss Graph Convolutional neural networks and their prior applications in Natural Language Processing. In Section 2.4 the family of machine learning techniques known as Variational Inference (VI) is described. In Section 2.5 the Variational Auto Encoder, a model that applies VI, is discussed. In Section 2.6 existing methods to learn discrete latent spaces, and to minimize variance when learning, are discussed. Finally, in Section 2.7, variance reduction techniques are discussed, which are relevant when taking small samples.

2.1 Textual Entailment Datasets

The Stanford NLI (SNLI) dataset is one of the largest datasets for the task of textual entailment [1]. This dataset was constructed using the Amazon Mechanical Turk service and is the first standard dataset for the task with a large number of examples, making it well suited as a dataset for neural approaches. The authors took care to address the issues of coreference of events and entities by using a specific viewpoint to avoid it. Event coreference is when it is unclear whether or not the same event is being referred to, which can lead to confusion about the entailment status of a premise-hypothesis pair. This is because the pair could have a different label depending on the clarification of this ambiguity. The same goes for entity coreference, which is when it is unclear who, what, or where is being referred to.

Another dataset is the MultiNLI dataset [3], which introduces the notion of domains into an NLI dataset. It contains ten unique domains, and test sets which enable testing between domains, in order to test whether or not the trained models are generalizable or are able to handle unseen domains.

Khot et al. [8] introduce the SCITAIL dataset, which differs from the above two datasets. Unlike these datasets, the sentences were not derived solely for use in NLI, and come from science multiple choice questions. The authors argue that these questions are a more real-world example of the task at hand. They also go on to show that existing approaches such as Parikh et al., which perform very well on the SNLI and MultiNLI datasets, struggle on SCITAIL. They introduce a newer architecture that exploits the linguistic structure of the sentences to achieve a higher score.

2.1.1 Dataset Weaknesses

Although the datasets mentioned in Section 2.1 are very large and cover multiple domains, they do not fully test the real-world effectiveness of models. Gururangan et al. [4] highlight that classifiers are able to extract unintended features from the SNLI and MultiNLI datasets in order to perform well without learning the intended task of NLI. These unintentional features allow classifiers to make reasonably accurate predictions without learning the core inference task. Examples include patterns in hypothesis language: linguistic patterns are too class-specific, which makes them easy targets for the models. The sentence length of the hypothesis alone is a relatively strong predictor for the final class label. This leads to current classifiers being able to inflate their real-world performance, since they are exploiting dataset patterns. Further issues are highlighted in Glockner et al. [5], which introduces a new test set that is derived from the SNLI dataset. The changes are relatively small alterations to the dataset, which drastically affect model performance. Emphasis is placed on the performance on contradiction examples, since these are where neural models struggle the most.

Additionally, an adversarial approach to test model performance is introduced in Nie et al. [9], which attempts to unpack and explore how and what aspects of language NLI models are using when they are trained. The two different things that a model should be able to learn for NLI (according to the authors) are lexical information (i.e. which words are present), and compositionality (i.e. how well the semantic understanding is captured by the models). They evaluate current state-of-the-art models, which all struggle when presented with compositionally different examples that are clearly distinguishable by humans.

2.2 Neural NLI Models for Textual Entailment

Many neural models have been proposed to tackle the task of Natural Language Inference. Some effective approaches have focused on attention-based mechanisms, which are the neural equivalent of alignments. Parikh et al. [10] propose a model that is heavily based on these alignments. It uses a three-step process which makes it parallelizable. The model computes soft alignments between the two sentences (the premise and the hypothesis), in addition to attention within the same sentence (called intra-attention), in order to model the relationships between words in the sentences. This is a relatively "Bag of Words" based approach, which means that it is an approach in which the word order is ignored.


Another model that utilizes attention mechanisms is ESIM [11]. ESIM is a relatively complex architecture which uses a layered approach, first encoding the input words by embedding them into a vector space, then using a Long Short Term Memory (LSTM) layer to preprocess them. The LSTM can either be a bidirectional LSTM, meaning that it has a backwards and forwards pass over a sentence, or a tree LSTM, which is based on a syntax tree. Basic attention between phrases in the sentences is then applied, which is concatenated in various combinations. These intermediate attended vectors are then once again fed through an LSTM layer in order to perform Inference Composition, and the result is then classified.

The Densely Interactive Inference Network (DIIN) [12] is a proposed architecture that uses Convolutional Neural Networks (CNNs) to arrive at a mechanism that is similar to attention. An element-wise interaction tensor is created, which is a massive tensor where each word in both sentences interacts with the words in the other sentence. Then CNNs such as AlexNet can be applied to this tensor, which can then classify the sentence pairs.

Finally, the Bidirectional Encoder Representations from Transformers (BERT) [13] model is a language model that has achieved state-of-the-art performance on NLI through transfer learning. The language model itself is based on the Transformer architecture [14], which uses multiple attention mechanisms to arrive at final representations. The BERT model trains this architecture and builds on it using a word token masking strategy where tokens are masked at random in order to aid language learning. The resulting pre-trained BERT model is then transferred to downstream tasks such as Natural Language Inference.

2.2.1 Incorporating External Knowledge

Research has also been conducted into dynamically incorporating external knowledge into models, so models can make use of more domain-specific information and be able to generalize beyond what they were initially trained on.

Weissenborn et al. [15] propose a framework for task-specific word embeddings using contextual information, such as background knowledge. This contextual information is free-text and is used to modify a base general word embedding. These general word embeddings are then combined with trainable character embeddings to produce the context-free embedding matrix. The resultant embeddings can then be fed into any model that uses word embeddings as the primary input.

In addition, KIM [16] is a model that aims to enhance NLI performance by using attention over external knowledge. The external knowledge is structured semantically into categories such as synonyms, antonyms, hypernyms, etc. These are then used to check whether or not words between the premise and the hypothesis are similar. If so, then an indicator function is used, which enhances the knowledge when computing attention. This approach requires relational embeddings such as WordNet [6], and is a model that is trained end-to-end.


2.2.2 Incorporating Latent Tree Structures

One important facet of Natural Language is syntax, which determines the structure of sentences and the ordering of words and phrases. For any syntactically valid sentence, a syntax tree exists which composes the individual words into phrases. These phrases codify the relationships and interactions between words, and are recursively joined together using grammatical rules until they form the complete sentence (at the top of the tree). Recent approaches in NLI have explored the use of syntax trees, and other latent tree structures, to aid inference.

Choi et al. [17] propose a recursive tree model that is used to induce latent tree structures, called the Gumbel Tree-LSTM. It uses the Gumbel trick to create an argmax operation that is differentiable. The LSTM is applied recursively to create the tree structure, computing candidates for the new joined parent nodes at each step, and picks the new parent node based on the highest "validity" estimation. The authors also analyze the trees that are produced. If the same model (the tree LSTM) is separately trained on two different datasets or tasks (NLI and sentiment analysis), then two different structures are produced. This suggests different tasks lead to different trees.

Zhao et al. [18] introduce a model to make more use of the structured nature of sentences. It computes the sentence attention over the premise and hypothesis sentences based on binarized syntax trees. These attentions and entailment probabilities are propagated upwards through the trees in a recursive manner, from the leaves to the root, using LSTMs along the way. The alignment model over the trees can be improved using dual attention (attention from the premise to the hypothesis and vice versa). The elementwise product between these two is taken as the final attention, which helps to eliminate uncertainty.

Williams et al. [19] explore the latent trees produced by neural models. The neural models tend to produce trees that do not really reflect the structure of actual parse trees. The learned trees tend to be shallow (ST-Gumbel), induce constituents from words that are close together in sentences (ST-Gumbel), or are biased in terms of branching (RL-SPINN). These trees do not correspond with formalized syntax such as that found in the Penn TreeBank. However, the learned trees used in the neural models outperform the syntax tree based models. The authors have been unable to find a cause as to why these perform so well; this could also indicate that the formalized trees are less than optimal or desirable for certain tasks.

Additionally, Niculae et al. [20] propose a model which induces sparse latent structures. They use the SparseMAP [21] inference method to jointly induce latent distributions over parse trees, for both the premise and hypothesis sentences. This results in a sparse representation per sentence, which is then fed to a Tree-LSTM for inference.


2.3 Graph Convolutional Neural Networks

Graph Convolutional Neural Networks (GCNs) have been successfully applied to various NLP tasks [22], [23]. These networks employ what are known as Graph Convolutional Layers, as introduced by Kipf et al. [7]. This layer uses a linear approximation of Spectral Graph Convolutions to create a scalable layer. This approximation is used in a local neighborhood (i.e. it only applies to/uses nodes that are one edge away from the node being convolved over). Kipf et al. [7] demonstrate that it is competitive on semi-supervised graph labelling tasks.
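As a concrete illustration (not code from the thesis; the module and function names are ours), such a graph convolutional layer can be sketched in PyTorch as $H' = \mathrm{ReLU}(\hat{A} H W)$, where $\hat{A}$ is the symmetrically normalized adjacency matrix with self-loops:

    import torch
    import torch.nn as nn

    class GraphConvLayer(nn.Module):
        """One GCN layer: H' = ReLU(A_hat @ H @ W), where A_hat is a
        normalized adjacency matrix that already includes self-loops."""

        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim, bias=False)

        def forward(self, adj: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
            # adj: (batch, n, n) normalized adjacency, h: (batch, n, in_dim)
            return torch.relu(adj @ self.linear(h))

    def normalize_adjacency(adj: torch.Tensor) -> torch.Tensor:
        """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
        n = adj.size(-1)
        adj = adj + torch.eye(n, device=adj.device)   # add self-loops
        deg = adj.sum(-1)                             # node degrees
        d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)
        return adj * d_inv_sqrt.unsqueeze(-1) * d_inv_sqrt.unsqueeze(-2)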

Marcheggiani and Titov [22] introduce the application of GCNs to the task of semantic role labeling. The GCN acts on a graph created by the syntactic dependency trees. The node values the GCN iterates over are first fed through a bidirectional LSTM. This helps to implicitly propagate information that would otherwise not be contained in the embeddings of the dependencies in the syntactic graph. The GCN layer also uses edge-wise gating to further improve the flow of relevant syntactic information. It is demonstrated that LSTMs and GCN layers complement one another, and their proposed network achieves SOTA performance.

Bastings et al. [23] propose an application of Graph Convolutional layers to the task of neural machine translation (NMT). They model the sentence structure in a generative manner as a latent variable, from which the authors then sample a graph. This graph is then fed to the GCN layer in the translation portion of their architecture. They find that if the GCN layer is applied before an LSTM, the GCNs do not provide a lot of useful information. However, it was found that if they precede a Convolutional Neural Network encoder, they provide useful dependencies.

2.4 Variational Inference

Variational Inference (VI) [24], also known as Variational Bayes (VB) [25], is a family of machine learning techniques that is used to approximate probability densities that would otherwise be intractable. The approximations can be optimized to improve their closeness to the target distribution.

2.4.1 Entropy

The roots of VI can be found in the field of Information Theory, namely, in entropy. Entropy can be described as the expected minimum number of bits needed to encode the location of a random variable X. This entropy H(X) is formulated as follows:

$$H(X) = -\int p(x)\log p(x)\,dx = \mathbb{E}[-\log p(x)] \tag{2.1}$$

This quantity can be used to measure the similarity (uni-directional) between two probability distributions. The relative entropy of a distribution q(x) with respect to a distribution p(x) can be measured using the Kullback-Leibler (KL) Divergence as follows:

$$\mathrm{KL}(p(x)\,\|\,q(x)) = \mathbb{E}_{p(x)}\left[\log\frac{p(x)}{q(x)}\right] = \mathbb{E}_{p(x)}\left[-\log\frac{q(x)}{p(x)}\right] \tag{2.2}$$

The KL Divergence encodes p(x) with q(x)'s amount of entropy, and then takes the average under the true distribution. This measure can then be used to help approximate intractable distributions using a distribution that is tractable/computable.
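As a small illustration (not part of the thesis), the discrete counterparts of these two quantities can be computed directly; the example distributions below are arbitrary:

    import numpy as np

    def entropy(p):
        """Entropy H(p) = -sum_x p(x) log p(x), in nats."""
        p = np.asarray(p, dtype=float)
        return -np.sum(p * np.log(p))

    def kl_divergence(p, q):
        """KL(p || q) = sum_x p(x) log(p(x) / q(x)); zero iff p == q."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return np.sum(p * np.log(p / q))

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.5, 0.3, 0.2])
    print(entropy(p))            # ~0.802 nats
    print(kl_divergence(p, q))   # ~0.085; note KL(p||q) != KL(q||p)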

2.4.2 Bayesian Statistics

Bayesian inference is founded on Bayes' Rule:

$$p(z|x) = \frac{p(x|z)p(z)}{p(x)} \tag{2.3}$$

This is directly related to the computation of the joint probability:

$$p(x, z) = p(x|z)p(z) = p(z|x)p(x) \tag{2.4}$$

This joint probability, as well as what is known as the posterior probability, are often the distributions that are approximated in VI problems. The posterior probability p(z|x) is often a distribution of latent variables z conditioned on the observed variables x.

The posterior, using Bayes' Rule, can be computed as follows:

$$p(z|x) = \frac{p(x|z)p(z)}{p(x)} \tag{2.5}$$

However, solving for this posterior is often not analytically possible due to the intractability of computing what is known as the "evidence" $p(x) = \int_z p(x, z)\,dz$, since the latent variables z need to be marginalized out.

The ELBO

The log can then be taken of the "evidence" as follows:

$$\log p(x) = \log \int p(x, z)\,dz \tag{2.6}$$

$$\log p(x) = \log \int p(x, z)\frac{q(z)}{q(z)}\,dz = \log \mathbb{E}_{q(z)}\left[\frac{p(x, z)}{q(z)}\right] \tag{2.7}$$

Once we have arrived at this form, we can use Jensen's Inequality, which states that for any concave function f, $\mathbb{E}[f(x)] \le f(\mathbb{E}[x])$. Since the log function is concave, this can be applied to our case:

$$\mathbb{E}_{q(z)}\left[\log \frac{p(x, z)}{q(z)}\right] \le \log \mathbb{E}_{q(z)}\left[\frac{p(x, z)}{q(z)}\right] \tag{2.8}$$


This quantity on the left-hand side that has been introduced is then the lower bound of log p(x). This lower bound, known as the Evidence Lower Bound (ELBO), can then be linked to the difference between the posterior p(z|x) and the function that approximates it, q(z).

This ELBO can be rewritten as $\mathcal{L} = \mathbb{E}_{q(z)}[\log p(x, z)] + H(z)$, where H(z) is the entropy of z under q. It can also be rewritten to see how it will behave when trying to find an optimal q(z):

$$\begin{aligned}
\mathcal{L} &= \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)] \\
&= \mathbb{E}_{q(z)}[\log p(x|z)] + \mathbb{E}_{q(z)}[\log p(z)] - \mathbb{E}_{q(z)}[\log q(z)] \\
&= \mathbb{E}_{q(z)}[\log p(x|z)] - \mathrm{KL}(q(z)\,\|\,p(z))
\end{aligned} \tag{2.9}$$

This first term on the right-hand side encourages the latent variables to help explain the observed data points in x. The second term encourages moving the density of the approximating function towards the prior distribution p(z).

The KL divergence between the true posterior and the approximator of the posterior can be used to provide some more insight into how the ELBO behaves:

$$\begin{aligned}
\mathrm{KL}(q(z)\,\|\,p(z|x)) &= \int q(z)\log\frac{q(z)}{p(z|x)}\,dz \\
&= -\int q(z)\log\frac{p(z|x)}{q(z)}\,dz = -\int q(z)\log\frac{p(x, z)}{q(z)p(x)}\,dz \\
&= -\left(\int q(z)\log\frac{p(x, z)}{q(z)}\,dz - \int q(z)\log p(x)\,dz\right) \\
&= -\int q(z)\log\frac{p(x, z)}{q(z)}\,dz + \log p(x)
\end{aligned} \tag{2.10}$$

The left part of this term is the negative ELBO, so this can be substituted in:

$$\mathrm{KL}(q(z)\,\|\,p(z|x)) = -\mathcal{L} + \log p(x)$$

$$\mathcal{L} = \log p(x) - \mathrm{KL}(q(z)\,\|\,p(z|x)) \tag{2.11}$$

Since the KL divergence is always greater than or equal to zero, the ELBO is always less than or equal to the log evidence log p(x), proving that it is the "lower bound".

Mean Field Approximation

When trying to approximate the posterior distribution, a family of functions for q(z) must be chosen with which to approximate. A very common and simple assumption is that the latent variable functions $q_i(z)$ are independent when conditioned on the data X. The drawback of this assumption is that there is no correlation between any of the latent variables. However, the convenience of this assumption cannot be ignored, since the approximating function can then be written as:

$$q(z) = \prod_{i=1}^{M} q_i(z_i)$$

which means that each z variable is governed by an individual variational factor density $q_i$.

This mean field assumption subsequently enables the iterative optimization of the approximator, through Coordinate Ascent Mean Field VI, or CAVI. The idea behind this technique is a message passing algorithm, in which we iteratively optimize each $q_i$ while keeping all others fixed. This is guaranteed to reach a local optimum. The optimal $q_i^*(z_i)$ is then proportional to $\exp(\mathbb{E}_{-i}[\log p(z_i, z_{-i}, x)])$ (which can still be difficult to compute).

2.5 Variational Auto Encoders

The Variational Auto Encoder (VAE) is a model introduced by Kingma et al. [26] and Rezende et al. [27], which seeks to approximate a posterior p(z|x) parameterized by variables θ, by using an inference model q(z|x) parameterized by variables φ. This inference model can be almost any directed graphical model, which takes the form:

$$q(z|x) = \prod_{j=1}^{M} q(z_j\,|\,Pa(z_j), x) \tag{2.12}$$

In the Auto Encoder setting, the latent space and the inference model are the first of two stages in a pipeline. The second stage is the application of another model (a generative model) that seeks to reconstruct the input x from the latent space. By optimizing the accuracy of the reconstruction, a more accurate mapping in latent space can be learned, such that the inference model learns the latent space. This is done by using the ELBO $\mathcal{L}$, since it is a lower bound on $\log p_\theta(x)$:

$$\mathcal{L} = \mathbb{E}_{q_\phi}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right] \tag{2.13}$$

Since this bound is tight to the log of the evidence $p_\theta(x)$, maximizing it will improve the generative model $p_\theta(x|z)$. This will also improve the encoder (the inference model), since it minimizes the KL Divergence between $q_\phi(z|x)$ and the posterior $p_\theta(z|x)$ (as seen in Section 2.4.2).

2.5.1 The Reparameterization Trick

As the inference and generative models in the VAE are typically parameterized neural networks, it is essential to be able to perform backpropagation on the objective function. Given a dataset $\mathcal{D}$, the ELBO is simply a sum over the ELBOs of the individual data points: $\mathcal{L}_\theta(\mathcal{D}) = \sum_{x\in\mathcal{D}} \mathcal{L}_\theta(x)$.

Then both networks need to be optimized using this ELBO. The gradients for the generative model can be computed as follows:


$$\nabla_\theta \mathcal{L}_\theta = \nabla_\theta \mathbb{E}_{q_\phi}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right] = \mathbb{E}_{q_\phi}\left[\nabla_\theta \log p_\theta(x, z)\right] \approx \nabla_\theta \log p_\theta(x, z) \tag{2.14}$$

The generative model's gradient can be easily computed since the gradient operator can be moved inside of the expectation, as the expectation does not depend on θ. This gradient can then be approximated using a Monte Carlo estimate, as seen above.

FIGURE 2.1: Depiction of the Reparameterization Trick, where the Source of Randomness is Moved to Noise ε

The gradients for the inference model are not as straightforward: $\nabla_\phi \mathcal{L}_{\theta,\phi} = \nabla_\phi \mathbb{E}_{q_\phi}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right]$. Since the expectation depends on φ, this gradient operator cannot be moved inside of the expectation. In order to address this problem, the reparameterization trick can be introduced. Instead of sampling from the inference distribution $q_\phi(z|x)$, the sample can be rewritten based on the family of functions being used. If the approximating functions are Gaussians, then the inference model can be parameterized by a neural network that produces both the mean and standard deviation of the sample. This can then be combined with a random noise sample from a standard normal Gaussian ($\varepsilon \sim \mathcal{N}(0, 1)$). As depicted in Figure 2.1, the sample is then constructed as $z = \mu + \varepsilon\sigma$.

Since the source of randomness is no longer the sample from the latent distribution $q_\phi(z|x)$ but instead a random noise variable, the expectation is then over the random noise ε. This means that the gradient operator $\nabla_\phi$ can be moved inside, so that backpropagation can be performed:

$$\nabla_\phi \mathcal{L}_{\theta,\phi} = \mathbb{E}_{\varepsilon}\left[\nabla_\phi\left(\log p_\theta(x, z) - \log q_\phi(z|x)\right)\right] \tag{2.15}$$


This reparameterization trick is not limited to the Gaussian distribution, but can be applied to any family of distributions where the sample can be rewritten as a deterministic function of the distribution's parameters and an independent source of noise.
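A minimal PyTorch sketch of this Gaussian reparameterization, with illustrative variable names, might look as follows:

    import torch

    def reparameterized_sample(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
        # z = mu + eps * sigma, with eps ~ N(0, I); the randomness lives
        # entirely in eps, so gradients flow into mu and log_sigma.
        eps = torch.randn_like(mu)
        return mu + eps * log_sigma.exp()

    # Minimal check that gradients reach the parameters of q(z|x).
    mu = torch.zeros(4, requires_grad=True)
    log_sigma = torch.zeros(4, requires_grad=True)
    z = reparameterized_sample(mu, log_sigma)
    z.sum().backward()
    print(mu.grad, log_sigma.grad)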

2.6 Learning Discrete Latent Spaces

When applying Variational Inference, the goal is often to learn a latent space. When this latent space is continuous, learning this space is relatively trivial through backpropagation since gradients can be applied. This ease of learning breaks down if the latent space is discrete, e.g. when inducing the edges of a graph. Potential solutions to this problem can be of interest when constructing a latent graphical structure for use in the NLI task.

2.6.1 The Concrete Distribution

The problem that arises is that discrete variables present discontinuities that do not allow gradients to be backpropagated through to their conditioning variables. In both Jang et al. [28] and Maddison et al. [29], a relaxation of these discrete variables is presented, known as the Concrete distribution (or as the Gumbel-Softmax). This Concrete distribution allows the discrete variables to be relaxed and brought into a continuous space, which enables gradients to be backpropagated across them.

The relaxation takes the form of:

$$X_i = \frac{\exp\left((\log(\alpha_i) + g_i)/\lambda\right)}{\sum_{j=1}^{n} \exp\left((\log(\alpha_j) + g_j)/\lambda\right)}$$

where $X_i$ is the resultant continuous class probability in a vector of class probabilities that replaces the discrete variable. The $\alpha_i$ are the un-normalized class probabilities prior to this relaxation, and $g_i$ is an i.i.d. random variable sampled from the Gumbel distribution. The variable λ is a temperature hyperparameter that determines how 'peaked' the softmax is. As λ → 0, the distribution becomes discrete in the limit, whereas the larger λ becomes, the flatter the distribution becomes. Since the discrete variables have now been relaxed into a continuous domain, backpropagation can be performed on them. However, when the relaxation temperature is too low (i.e. approaching the discrete case), the estimator exhibits far higher variance, whereas it exhibits bias when the temperature is too high. This means that the temperature needs to be tuned to balance this bias-variance trade-off.
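A short sketch of drawing a relaxed sample from this distribution (assuming PyTorch; the function name and the example class probabilities are illustrative) could look like this:

    import torch

    def gumbel_softmax_sample(log_alpha: torch.Tensor, temperature: float) -> torch.Tensor:
        """Sample a relaxed one-hot vector from the Concrete / Gumbel-Softmax
        distribution parameterized by unnormalized log-probabilities log_alpha."""
        # Gumbel(0, 1) noise via inverse transform sampling.
        u = torch.rand_like(log_alpha)
        g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
        return torch.softmax((log_alpha + g) / temperature, dim=-1)

    log_alpha = torch.log(torch.tensor([0.1, 0.3, 0.6]))
    print(gumbel_softmax_sample(log_alpha, temperature=0.1))   # near one-hot
    print(gumbel_softmax_sample(log_alpha, temperature=10.0))  # near uniform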

Since the variables are discrete, the gradients that are computed based on relaxations or other methods are estimations of the true gradient. Several methods have been proposed in order to make these estimators unbiased, as well as to reduce variance in the resulting estimations, such as MuProp, REBAR, and RELAX [30]–[32]. These relaxations could potentially be applied to NLI, where the latent tree structures can be modeled as discrete latent variables.


2.6.2 MuProp

MuProp [30] is a method that aims to reduce the variance of estimators used when computing gradients that cannot be solved analytically. It is very similar to the regression estimator, and borrows from the REINFORCE gradient estimator [33] that is often used in Reinforcement Learning. MuProp is an unbiased estimator that uses a modified baseline to reduce the variance while preserving its unbiasedness. The modified baseline is derived from a second network, called the mean field network, that propagates mean values instead of taking samples as in the original network. This can then be used to propagate the first order Taylor expansion that is used as the control variate.

Since it uses a second network to approximate gradients, it does not suffer from discontinuities when estimating the gradients for discrete latent variables. This allows for Taylor expansions regardless of whether the variables are discrete or continuous.

MuProp is shown to perform as well as two other estimators, the Straight-Through and the 1/2 estimators, which are biased methods. In contrast to these other methods, MuProp is unbiased, while still having relatively low variance.

2.6.3 REBAR

REBAR [31] is a similar method to MuProp in that it seeks to incorporate a control variate in order to reduce variance, and is also based on REINFORCE. REBAR seeks to incorporate the Concrete distribution's continuous relaxations for discrete variables. One of the largest contributions, however, has to do with the way that the distributions are parameterized.

Given a standard control variate estimator

$$\nabla_\theta \mathbb{E}_{p(b,c)}[f(b)] = \nabla_\theta\left(\mathbb{E}_{p(b,c)}[f(b) - c] + \mathbb{E}_{p(b,c)}[c]\right) \tag{2.16}$$

where c is the control variate, the desired discrete quantity can be seen as a function of a variable z: $b = H(z)$, where $z = g(\theta, u)$ is a logistic variable parameterized by θ and the random variable $u \sim U(0, 1)$, and H is a hard threshold function.

This logistic variable can then be approximated using the Concrete distribution's relaxation as $b \approx \sigma_\lambda(z)$, which is a sigmoid that is controlled by the temperature parameter λ. The key contribution of this paper is the observation that a control variate can be constructed which is a conditional marginalization of z through b.

An obvious candidate for the control variate would be the score-function term evaluated at the relaxed variable:

$$\mathbb{E}_{p(z)}\left[f(\sigma_\lambda(z))\,\frac{\partial}{\partial\theta}\log p(z)\right] \tag{2.17}$$


A conditional marginalization can be applied to this equation, which results in the following:

$$\mathbb{E}_{p(z)}\left[f(\sigma_\lambda(z))\,\frac{\partial}{\partial\theta}\log p(z)\right] = \mathbb{E}_{p(b)}\left[\frac{\partial}{\partial\theta}\mathbb{E}_{p(z|b)}[f(\sigma_\lambda(z))]\right] + \mathbb{E}_{p(b)}\left[\mathbb{E}_{p(z|b)}[f(\sigma_\lambda(z))]\,\frac{\partial}{\partial\theta}\log p(b)\right] \tag{2.18}$$

The first part of this right-hand term can be estimated using the reparameterization trick, where $\tilde{z} = g(v, b, \theta)$ is a reparameterized estimation of z, with $v \sim U(0, 1)$:

$$\mathbb{E}_{p(b)}\left[\frac{\partial}{\partial\theta}\mathbb{E}_{p(z|b)}[f(\sigma_\lambda(z))]\right] = \mathbb{E}_{p(b)}\left[\mathbb{E}_{p(v)}\left[\frac{\partial}{\partial\theta}f(\sigma_\lambda(\tilde{z}))\right]\right] \tag{2.19}$$

Applying this control variate to the gradient of the original discrete variable b, we arrive at the REBAR estimator, where η is a variable that controls the scale of the control variates, and can be estimated by minimizing the variance of the Monte Carlo estimator (typically using SGD):

$$\frac{\partial}{\partial\theta}\mathbb{E}_{p(b)}[f(b)] = \mathbb{E}_{p(u,v)}\left[\left[f(H(z)) - \eta f(\sigma_\lambda(\tilde{z}))\right]\frac{\partial}{\partial\theta}\log p(b)\Big|_{b=H(z)} + \eta\frac{\partial}{\partial\theta}f(\sigma_\lambda(z)) - \eta\frac{\partial}{\partial\theta}f(\sigma_\lambda(\tilde{z}))\right] \tag{2.20}$$

The relaxation hyperparameter can be learned as well, in an online setting, by optimizing the variance of the REBAR estimator. Here REBAR is shown to outperform MuProp and an implementation of the Concrete distribution for discrete variables.

2.6.4 RELAX

RELAX [32] is a further expansion of the ideas found in MuProp, REBAR, and the Concrete distribution. This estimator is constructed using the score-gradient estimator (i.e. REINFORCE), the reparameterization trick, and control variables.

The key difference separating RELAX and REBAR is the choice of control variables. The control variable in RELAX is the output of a parameterized neural network that can be learned or optimized using the gradient of the variance of the estimator.

The estimator can be applied to the case of discrete random variables by incorporating a Concrete distribution relaxation of the discrete variable. Then, by incorporating the insight of REBAR, the control variate is evaluated both at the relaxed input, as well as at the relaxed input conditioned on the discrete variable.

Since the control variate can be optimized using the variance, the function that is optimized does not necessarily need to be known when constructing the control variate. The authors note that REBAR is a special case of RELAX, where the surrogate that is optimized is $c_\phi(z) = \eta \cdot f(\mathrm{softmax}_\lambda(z))$. As with the improvement of REBAR over MuProp, RELAX shows a significant gain over REBAR.

2.7 Reducing Variance

In the case of discrete latent space modeling, the reparameterization trick is often applied to the sampling of these spaces. Monte Carlo estimates are often used to approximate the expectations in these computations. The issue that can arise is that this approximation comes with a very high variance, since samples and data points may vary widely. Monte Carlo estimators are unbiased, meaning that they are not skewed away from the true mean. When enough samples are taken, the true mean of the quantity estimated will emerge. However, this comes at a price: by relying on the number of samples to allow the estimator to converge, when a lower number of samples is taken, there is a greater chance that the end results will be inconsistent. This variance in the sampling is an issue, since it can decrease the accuracy of models that make use of it. Several methods exist for reducing this variance, among which control variates are counted.

2.7.1 Control Variables

A control variable is a variable that has a known mean, correlated with the quantity that is being estimated. It can be used to anchor the higher variance estimation by exploiting the correlation. One example of how a control variable can be used is in a regression estimator. A regression estimator can estimate the mean of a quantity $f(X_i)$ as follows:

$$\hat{\mu}_\beta = \frac{1}{n}\sum_{i=1}^{n}\left(f(X_i) - \beta h(X_i)\right) + \beta\theta \tag{2.21}$$

where $h(X_i)$ is the control variable, with a known mean θ. The β factor is a value that regulates how strongly the control variable influences the mean. This estimator can be shown to be unbiased as follows:

$$\hat{\mu}_\beta = \frac{1}{n}\sum_{i=1}^{n}\left(f(X_i) - \beta h(X_i)\right) + \beta\theta = \hat{\mu} - \beta(\hat{\theta} - \theta) \tag{2.22}$$

where $\hat{\theta}$ is the sample estimate of the control variate's mean.

$$\begin{aligned}
\mathbb{E}[\hat{\mu}_\beta] &= \mathbb{E}[\hat{\mu} - \beta(\hat{\theta} - \theta)] = \mathbb{E}[\hat{\mu}] - \beta\left(\mathbb{E}[\hat{\theta}] - \mathbb{E}[\theta]\right) = \mu - \beta(\theta - \theta) \\
\mathbb{E}[\hat{\mu}_\beta] &= \mu
\end{aligned} \tag{2.23}$$


As evidenced by the above, the β factor does not have an influence on the bias of a regression estimator. However, it does influence the computation of the variance:

$$\mathrm{Var}(\hat{\mu}_\beta) = \frac{1}{n}\left[\mathrm{Var}(f(X)) - 2\beta\,\mathrm{Cov}(f(X), h(X)) + \beta^2\,\mathrm{Var}(h(X))\right] \tag{2.24}$$

The optimal value of this parameter can be found by differentiating the above variance equation [34], which results in the following:

$$\beta_{opt} = \frac{\mathrm{Cov}(f(X), h(X))}{\mathrm{Var}(h(X))} = \frac{\mathbb{E}[(h(X) - \theta)f(X)]}{\mathbb{E}[(h(X) - \theta)^2]} \tag{2.25}$$

Placing this back into the equation for the variance of this regression estimator results in a variance that is equal to $\frac{\sigma^2}{n}(1 - \rho^2)$, where ρ is the correlation coefficient between the desired quantity f(X) and the control variable h(X). Since the variance of the original Monte Carlo estimator $\hat{\mu}$ is $\mathrm{Var}(\hat{\mu}) = \frac{\sigma^2}{n}$, if the control variable and the target variable f(X) are correlated, this regression estimator will only ever have a variance that is equal to or less than the original estimator, due to the factor $(1 - \rho^2)$.
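As a small numerical illustration of the regression estimator above (not an experiment from the thesis), one can estimate E[exp(X)] for X ~ U(0, 1) using h(X) = X as the control variable, whose mean θ = 0.5 is known exactly:

    import numpy as np

    rng = np.random.default_rng(0)

    def estimates(n):
        x = rng.uniform(size=n)
        f, h, theta = np.exp(x), x, 0.5
        plain = f.mean()                                   # ordinary Monte Carlo estimate
        beta = np.cov(f, h)[0, 1] / h.var()                # sample estimate of beta_opt
        controlled = (f - beta * h).mean() + beta * theta  # regression estimator
        return plain, controlled

    runs = np.array([estimates(100) for _ in range(2000)])
    print("true mean:", np.e - 1)
    print("variance without control variate:", runs[:, 0].var())
    print("variance with control variate:   ", runs[:, 1].var())

Because exp(X) and X are strongly correlated, the controlled estimator's variance is a fraction of the plain Monte Carlo variance, as predicted by the $(1 - \rho^2)$ factor.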

A simpler estimator also exists, which is known as the difference estimator. This is a simplification of the regression estimator, since it is computed without a factor β:

$$\hat{\mu}_{diff} = \frac{1}{n}\sum_{i=1}^{n}\left(f(X_i) - h(X_i)\right) + \theta = \hat{\mu} - \hat{\theta} + \theta \tag{2.26}$$

This difference estimator is commonly applied to gradient estimation, for example in Reinforcement Learning (RL), where it is known as a baseline. This baseline is applied to what is known in the RL literature as the policy gradient $\nabla_\theta J(\theta)$ [35], which is the gradient of the function that determines which action should be taken. The question that then arises is how the baseline variable should be chosen. If the policy gradient takes the form of $\nabla_\theta J(\theta) = \mathbb{E}_\varepsilon[C(\varepsilon)\nabla_\theta \log p(\varepsilon|\theta)]$, then a convenient control variate can be derived from the observation that $\mathbb{E}_\varepsilon[\nabla_\theta \log p(\varepsilon|\theta)] = 0$. This means that for any value of $m \in \mathbb{R}$, $m \cdot \nabla_\theta \log p(\varepsilon|\theta)$ is a good control variable. The optimal value for m is known as the optimal baseline, and just as in the regression estimator, it takes the form of a ratio between two expectations:

$$m^* = -\frac{\mathbb{E}\left[C(\varepsilon)\left(\nabla_\theta \log p(\varepsilon|\theta)\right)^2\right]}{\mathbb{E}\left[\left(\nabla_\theta \log p(\varepsilon|\theta)\right)^2\right]}$$


Chapter 3

Applying Graphical Structures To Natural Language Inference

In this chapter we describe our proposed approaches for constructing latent graphs. In Section 3.1 we describe the proposed methods for building these graphs in a deterministic (rules-based) manner. Next, in Section 3.2, we describe proposed approaches to induce latent graphs in a generative fashion. These graphs are integrated into our proposed models for performing inference on natural language inference datasets, which we then lay out in Section 3.3.

3.1 Deterministic Graphs


FIGURE 3.1: Graphical Model For Deterministic Natural Language Inference

Figure 3.1 depicts the graphical model for the task of Natural Language Inference, where shaded circles indicate observed random variables, and the deterministic quantities are not circled. For each example in N, there are the two sentences (deterministic quantities), the premise p and the hypothesis h, as well as the label y for the sentence pair, which is an observed random variable. The label is also parameterized by the deterministic model parameters θ, which are global to all examples. This means that for each example, we attempt to model the distribution $p_\theta(y\,|\,p_1^n, h_1^m)$, in order to be able to predict the label for a given premise p and hypothesis h. In our case, θ can be seen as encompassing the parameters of an inference network, as well as any external data.


Within this model of NLI, we attempt to add linguistic knowledge into the model, so that it may be able to capture linguistic features in the data. In order to do so, we introduce two different sources of data used to construct the graphs that are passed into the resulting Graph Convolutional Network models. Please note that in this setting we do not introduce any extra latent variables into the task of NLI.

3.1.1 Syntax Parses

The first of these two sources of data is syntax parses. The act of parsing a sentence is that of trying to discover the syntactic structure of a sentence, based on the grammar of the language at hand. The type of syntax parse that we use in our proposed model is a dependency parse. A dependency parse is a parse that links words in a tree-based manner. There is a "root" word, from which dependency arcs flow, and each word that is a dependency informs the "head" word, which is the word from which the dependency flows.

FIGURE 3.2: An example of the dependency arcs linking the words in a sentence.

As can be seen in Figure 3.2, the dependency parse links the words in a sentence in a dependency tree structure, which we use in our model to inform the inference model of intra-sentence linguistic information. These syntax parses are constructed using an existing performant parser proposed by Honnibal and Johnson [36], implemented in the SpaCy NLP package. We place these dependency parses in an undirected graph so that the syntactic links allow information to bubble up as well as travel down the tree in the GCN.

In the case of the natural language inference datasets, two sentences are provided: a premise and a hypothesis. This means that there are two dependency parses per training example. As can be seen in Figure 3.3, the resultant graph has two independent sub-graphs that are not connected. This results in an increased reliance on the components of the network outside of the GCN layers, since information is not exchanged between the two sentences within these layers.

The resulting adjacency matrix for a sample premise-hypothesis pair can be seen in Figure 3.4. The cells in the adjacency matrix are binary in nature: if there is a dependency edge present between two tokens, then an edge is added. This matrix is also symmetrical, in order to allow the information to be propagated both upwards and downwards in the parse. This means that the "head" of the dependency will be able to glean information from the "child", and vice versa.


FIGURE 3.3: An example of the dependency parse graphs for a sample premise and hypothesis pair in the SNLI dataset [1].

FIGURE 3.4: The Proposed Adjacency Matrix for the Dependency-based Graph of a Sample Premise-Hypothesis Pair

In addition to the dependency parse edges, self-loops have been added to the graph, meaning that the information that a node currently holds will also be retained (and will not just flow out from that node).
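A sketch of how such a per-sentence adjacency matrix could be assembled with spaCy is shown below; this is illustrative rather than the thesis implementation, and assumes the en_core_web_sm model is installed:

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def dependency_adjacency(sentence: str) -> np.ndarray:
        """Build a symmetric, binary adjacency matrix from a dependency parse,
        with self-loops so each token also retains its own information."""
        doc = nlp(sentence)
        n = len(doc)
        adj = np.eye(n)                        # self-loops
        for token in doc:
            if token.i != token.head.i:        # skip the root's self-arc
                adj[token.i, token.head.i] = 1.0   # child -> head
                adj[token.head.i, token.i] = 1.0   # head -> child (undirected)
        return adj

    print(dependency_adjacency("A dog is sleeping."))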

3.1.2 WordNet

When using the graphs described in Section 3.1.1, the two sub-graphs are not linked, which means no information flows between them when using Graph Convolutional layers. In order to allow information to flow between the two, we look to the introduction of external knowledge.

The WordNet knowledge graph [6] contains knowledge about words, such as morphological information, conceptual hierarchies, and other such links. We propose using the entity relations contained in this knowledge graph to construct rules for linking words between the premise and the hypothesis.


FIGURE 3.5: An example of the full linked graphs for a sample premise and hypothesis pair in the SNLI dataset [1], using both parse graphs and WordNet

As can be seen in Figure 3.5, the two sentences are linked using rules based on relations in WordNet. For each pair of words, if they are synonyms, antonyms, hypernyms, or co-hyponyms of each other (in either direction), then a link is added between the two words with a weight of one. This results in a single graph that covers both inter- and intra-sentence connections for both the premise and the hypothesis.

FIGURE 3.6: The Proposed Adjacency Matrix for the Dependency and WordNet-based Graph of a Sample Premise-Hypothesis Pair

The proposed adjacency matrix for a sample premise-hypothesis pair can be seen in Figure 3.6. The matrix contains the same values in the intra-sentence quadrants (upper-left and bottom-right) as in the Dependency Parse adjacency matrix, but also contains the binary inter-sentence edges. These are also symmetrical, which means that information can flow from the premise to the hypothesis and vice versa.
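The inter-sentence linking rule could be sketched as follows using NLTK's WordNet interface; this is an illustrative approximation of the rule described above rather than the thesis code, and assumes the WordNet corpus has been downloaded:

    from itertools import product
    from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

    def wordnet_related(word_a: str, word_b: str) -> bool:
        """True if the two words are synonyms, antonyms, hypernyms,
        or co-hyponyms of each other according to WordNet."""
        syns_a, syns_b = wn.synsets(word_a), wn.synsets(word_b)
        for sa, sb in product(syns_a, syns_b):
            if sa == sb:                                        # share a synset: synonyms
                return True
            if sb in sa.hypernyms() or sa in sb.hypernyms():    # hypernym, either direction
                return True
            if set(sa.hypernyms()) & set(sb.hypernyms()):       # co-hyponyms: shared parent
                return True
        # Antonyms are defined on lemmas rather than synsets.
        for sa in syns_a:
            for lemma in sa.lemmas():
                if any(ant.name() == word_b for ant in lemma.antonyms()):
                    return True
        return False

    print(wordnet_related("puppy", "dog"))      # expected True (hypernym relation)
    print(wordnet_related("happy", "unhappy"))  # expected True (antonyms)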


3.1.3 Graphs incorporating knowledge

The resultant graphs created in Sections 3.1.1 and 3.1.2 introduce external knowledge into our proposed models. The key part of the hypothesis tested here is that by discriminating in which words we choose to connect through the use of external knowledge, the model will both perform and generalize better. If we were to not discriminate, and connect every word to each other word in either the intra- or inter-sentence setting, the graph convolutional layer's weights would instead be the sole gates in the layer discriminating as to how the information flows. This discriminative behaviour would, in a neural/backpropagation setting, have to be learned through the backpropagation of error gradients. Additionally, there would be no guarantee that the network would in fact be incorporating learned facets of language, instead of artifacts, into its decision making process. By incorporating the proposed syntax parses and WordNet links, at least a portion of the facets of language (namely syntax and concepts/relations between words) is added to the network before supervised training. We hypothesize that this will lead to the model being able to incorporate these sparser pathways (informed by language) as a way to increase the capacity of the model to learn entailment relations. Overall this will enable the model to better perform inference and generalize more broadly.

3.2 Latent Graph Generation

In addition to the creation of static graphs (i.e. graphs that will not be modified by the network), we also propose methods to learn these graph structures. We model these graph structures as discrete latent variables, which are constructed using a generative model.

3.2.1 Graphical Model

Figure 3.7 depicts the graphical version of Natural Language Inference when modeling it using a generative model for latent graphs. Just as in the deterministic case in Figure 3.1, for each example, the label y depends on the following deterministic variables: the premise p, the hypothesis h, and global parameters θ. However, in the case of the generation of latent graph structures, we choose to model the latent graph structure using four latent variables: $D_p$, $D_h$, $C_{ph}$, and $C_{hp}$. These four latent variables represent the various portions of the latent graph structures generated. These latent variables depend on additional global parameters λ.

As can be seen in Figure 3.8, each of the four latent variables corresponds with a quadrant of the adjacency matrix of the latent graph structure. The premise p's tokens are indicated in blue, and those of the hypothesis h in orange. The first portion of this matrix is the upper-left quadrant: $D_p$.


FIGURE 3.7: Graphical Model For Latent Graph Generation

FIGURE 3.8: A Visualization of the Composition of the Latent Graph Structure's Adjacency Matrix by the Latent Random Variables

This variable is analogous to a dependency parse: it represents the edges that link premise tokens to one another. In the bottom-right quadrant, $D_h$ is similar to $D_p$, but acts on the hypothesis tokens instead.

The latent variables $C_{hp}$ and $C_{ph}$ make up the bottom-left and top-right quadrants of the adjacency matrix respectively. These represent the latent alignments between the premise and the hypothesis. The choice was made to model these alignments as two separate variables in order to avoid the assumption that the alignments from the premise to the hypothesis ($C_{ph}$) and from the hypothesis to the premise ($C_{hp}$) are symmetrical.


3.2.2 Generating Latent Graph Structures

Since the four proposed latent variables $D_p$, $D_h$, $C_{hp}$, and $C_{ph}$ each represent a collection of edges between nodes in a latent graph, we propose to model each variable individually as a Concrete random variable. As mentioned in Section 2.6.1, the Concrete distribution is a relaxation of a discrete categorical random variable as follows:

$$X_i = \frac{\exp\left((\log(\alpha_i) + g_i)/\lambda\right)}{\sum_{j=1}^{n} \exp\left((\log(\alpha_j) + g_j)/\lambda\right)} \tag{3.1}$$

The advantage of using this distribution is that the relaxation provides gradient information, which is useful when performing backpropagation on models containing discrete (or relaxed) random variables. As mentioned in Section 3.2.1, the latent variables can be split into two categories. We propose to model these two categories in slightly different ways.

Latent Dependency Edges

In order to model the 'dependency parse' distributions of the latent variables $D_p$ and $D_h$, we utilize a generative model proposed by Bastings et al. [23]. This model conditions the generated edges on a given source sentence of dimension n, in this case a premise sentence, which results in an n-dimensional probability vector:

$$D_{p,i}\,|\,p_1^n \sim \mathrm{Concrete}(\tau, \lambda_i) \tag{3.2}$$

This probability vector, sampled from the Concrete distribution, forms the i-th row in the adjacency matrix from the premise to itself. This means that this vector contains the weights of the edges from the i-th token in the premise to the rest of the premise (including itself).

The Concrete distribution in Equation 3.2 is parameterized by two values: τ, which is the temperature of the distribution, and the head potentials $\lambda_i \in \mathbb{R}^n$. The input to the generative model consists of the given source sentence of dimension n. This sentence is embedded into a sequence of word embeddings, which are then fed into a bi-directional LSTM. The hidden output sequence of this LSTM, $s_1^n$, is then used to compute the two vectors which are later used to compute the head potentials.

These two vectors are the key and query vectors, which are computed as follows:

$$k_i = W^k s_i, \qquad q_i = W^q s_i \tag{3.3}$$

These resultant vectors are linear projections, using the projection matrices $W^k$ and $W^q$, of the hidden outputs of the LSTM. The scaled dot product is then used to compute the head potentials as follows:

$$\lambda_{ij} = \begin{cases} \frac{1}{\sqrt{d}}\, q_i^T k_j & \text{if } i \neq j \\ -\infty & \text{if } i = j \end{cases} \tag{3.4}$$

This produces a vector of head potentials which masks the edge of the i-th token in the sentence with itself, since the self-loop edge of a token to itself is not what we want to induce. Instead we would like the induced edges to focus on dependencies with other tokens. As previously mentioned, these head potentials are then fed into the Concrete distribution as the parameters.

This generative model is used for both of the 'dependency parse' latent variables, $D_p$ and $D_h$. One important modification is that the sampled latent variables are then masked using identity matrices corresponding to their dimensionalities. This operation adds self-loops to the adjacency matrices, which are necessary to preserve a token's prior information, and avoids inducing these edges using the latent distributions. This further allows the intra-sentence dependencies to be induced by the latent distributions, while still preserving the local token information.

While the latent random variables Dp and Dh are induced separately and have different given source sentences, they utilize the same generative model. This means that they share parameters, which is useful because both latent variables seek to model the intra-sentence edges for a given sentence.

Latent Alignment Edges

The latent 'alignment' edges Chp and Cph are induced in a similar fashion to the latent 'dependency' variables Dp and Dh, using the generative model proposed by Bastings et al. [23], but with a few key modifications.

The variable Cph consists of an adjacency matrix from the premise sentence (of sequence length n) to the hypothesis sentence (of sequence length m). The Concrete distribution produces m-dimensional probability vectors:

$$C_{ph,j} \mid p_1^n, h_1^m \sim \mathrm{Concrete}(\tau, \lambda_j) \tag{3.5}$$

based on the two input sequences. The key and query vectors are computed as follows:

$$k_i = W_k a_i, \qquad q_j = W_q b_j \tag{3.6}$$

where $a_1^n$ and $b_1^m$ are the hidden state outputs of the LSTM for the premise and hypothesis sequences respectively. The key and query vectors are then used to compute the head potentials, using the scaled dot product without masking the diagonal, as follows:

$$\lambda_{ij} = \frac{1}{\sqrt{d}}\, q_i^\top k_j \tag{3.7}$$

Unlike for the dependency edges in Section 3.2.2, no self-loops can be induced here, since the edges form an alignment between two different sentences; masking values in the head potentials is therefore not necessary.

This generative model produces Cph, and the same model is also sampled from in order to produce Chp. Since they are sampled from the same model, the two variables share model parameters, with the key difference being that the given source sentences are switched. The Concrete distribution therefore produces n-dimensional vectors instead, as the latent variable Chp is meant to 'align' the hypothesis to the premise. Additionally, the key and query vector inputs are switched:

$$k_i = W_k b_i, \qquad q_j = W_q a_j \tag{3.8}$$

which is necessary in order to switch the direction of the induced edges.
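For completeness, a short sketch of the cross-sentence head potentials of Equations 3.6 and 3.7 is given below; W_k and W_q stand for the shared projection matrices, a and b for the bi-LSTM states of the two sentences, and all names are ours.

```python
import math
import torch

def alignment_potentials(a: torch.Tensor, b: torch.Tensor,
                         W_k: torch.Tensor, W_q: torch.Tensor) -> torch.Tensor:
    """Cross-sentence head potentials (Eqs. 3.6-3.7): keys from the states a,
    queries from the states b; call with a and b swapped for the other direction."""
    k = a @ W_k.T                 # keys      [len_a, d]
    q = b @ W_q.T                 # queries   [len_b, d]
    return q @ k.transpose(0, 1) / math.sqrt(k.size(-1))   # scaled dot product, no masking
```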

Once the variables Cph and Chp have been sampled, they represent the adjacency matrices between the premise and the hypothesis. The final resultant adjacency matrix A, composed using the above latent variables, takes the following form:

$$A = \begin{bmatrix} D_p & C_{ph} \\ C_{hp} & D_h \end{bmatrix} \tag{3.9}$$
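Assembling this block adjacency matrix from the four sampled components is straightforward; the PyTorch-style sketch below illustrates the composition of Equation 3.9 with placeholder tensors.

```python
import torch

# Placeholder components: D_p [n, n], D_h [m, m], C_ph [n, m], C_hp [m, n]
n, m = 12, 9
D_p, D_h = torch.rand(n, n), torch.rand(m, m)
C_ph, C_hp = torch.rand(n, m), torch.rand(m, n)

# Block composition of Equation 3.9: rows/columns index premise tokens first,
# then hypothesis tokens, giving one (n + m) x (n + m) adjacency matrix A.
A = torch.cat([
    torch.cat([D_p, C_ph], dim=1),   # edges leaving premise tokens
    torch.cat([C_hp, D_h], dim=1),   # edges leaving hypothesis tokens
], dim=0)
assert A.shape == (n + m, n + m)
```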

Graph Induction Model

The induced adjacency matrix in its final form is generated by the model shown in Figure 3.9. It is important to note again that the two intra-sentence modules responsible for inducing Dp and Dh share parameters, meaning that this shared module learns information about sentence structure from both the premise and the hypothesis. The same is true for the inter-sentence module used to induce Cph and Chp, which allows it to jointly learn information about inter-sentence relationships in both directions.

3.2.3 Training Latent Variable Induction

In order to introduce external knowledge into the adjacency matrices induced by the generative model, we propose a two-phase training methodology. The first phase is pre-training the generative model portion of the inference model. The deterministic adjacency matrices E proposed in Section 3.1.2 are used as supervised training targets for the induced adjacency matrices A. By training the generated matrices in this manner we can seed initial external knowledge into the model.


FIGURE 3.9: The Proposed Model For Inducing a Latent Graph Adjacency Matrix

The generative model is trained using backpropagation, with a binary cross-entropy loss as the objective:

$$H_{pre}(q) = -\frac{1}{(n+m)^2} \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} \left[ E_{ij} \cdot \log(A_{ij}) + (1 - E_{ij}) \cdot \log(1 - A_{ij}) \right] \tag{3.10}$$

This can be treated as a binary classification task because of the nature of the knowledge-based adjacency matrices E. Since each cell of this matrix contains either a 0 or a 1, denoting whether or not there is an edge present, we are able to use this as an objective for the induced graph.
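A minimal sketch of this pre-training objective in PyTorch is shown below; the tensors here are random stand-ins, whereas in the model A would come from the generative module of Figure 3.9.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(A: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between induced edge probabilities A and 0/1 targets E (Eq. 3.10)."""
    # binary_cross_entropy averages over all (n + m)^2 cells, matching the 1/(n+m)^2 factor
    return F.binary_cross_entropy(A.clamp(1e-6, 1 - 1e-6), E)

# Toy example: an (n + m) x (n + m) graph with n + m = 21
A = torch.rand(21, 21, requires_grad=True)     # stand-in for the induced adjacency matrix
E = torch.randint(0, 2, (21, 21)).float()      # deterministic knowledge-based edges
pretraining_loss(A, E).backward()              # gradients flow back into the generator
```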

Once the generative portion of the model has been pre-trained, we proceed to the second phase of training, in which two different approaches can be used: supervised multi-class classification, which is also used to train the deterministic graph-based networks, or maximizing the lower bound of the marginal likelihood.

Marginal Likelihood

By training the model using a marginal likelihood, we treat the task of NLI in a generative manner, since we seek to learn an objective based on the joint distribution. Based on the graphical model in Section 3.2.1, the joint factorizes as follows:


$$\begin{aligned} p(y, D_p, D_h, C_{ph}, C_{hp} \mid p_1^n, h_1^m, \lambda, \theta) = {}& p(y \mid D_p, D_h, C_{ph}, C_{hp}, p_1^n, h_1^m, \theta) \\ &\cdot p(D_p \mid p_1^n, \lambda) \cdot p(D_h \mid h_1^m, \lambda) \\ &\cdot p(C_{ph} \mid p_1^n, h_1^m, \lambda) \cdot p(C_{hp} \mid p_1^n, h_1^m, \lambda) \end{aligned} \tag{3.11}$$

Since we would like to infer good values for the latent adjacency matrix, we need to infer good values for the latent random variables Dp, Dh, Cph, and Chp. The distribution we would like to infer is the posterior over these latent variables, which according to Bayes' rule can be computed as follows:

$$p(D_p, D_h, C_{ph}, C_{hp} \mid y, p_1^n, h_1^m, \theta) = \frac{p(y, D_p, D_h, C_{ph}, C_{hp} \mid p_1^n, h_1^m, \lambda, \theta)}{p(y \mid p_1^n, h_1^m, \lambda, \theta)} \tag{3.12}$$

Unfortunately, the denominator of this expression is the marginal likelihood, which is intractable. Instead of computing the posterior, we seek to optimize the Evidence Lower Bound (ELBO) of this marginal, as outlined in Section 2.4. Since the networks that we use to generate the induced adjacency matrices are approximations of the true probability distributions, we denote these generative networks as $Q_\lambda$, with corresponding densities $q_\lambda$. Based on these distribution approximators, the ELBO for our proposed graphical model is as follows:

$$\mathcal{L} = \mathbb{E}_{Q_\lambda}\left[\log p(y \mid D_p, D_h, C_{ph}, C_{hp})\right] + \sum_{D_p, D_h, C_{ph}, C_{hp}} q_\lambda(D_p, D_h, C_{ph}, C_{hp} \mid y)\, \log \frac{p(D_p, D_h, C_{ph}, C_{hp})}{q_\lambda(D_p, D_h, C_{ph}, C_{hp} \mid y)} \tag{3.13}$$

where the first term is the expected log-likelihood, and where the prior p is composed of categorical priors for each latent variable, as proposed by Jang et al. [28].
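As a rough illustration of how this objective could be estimated in practice, the sketch below uses a single relaxed sample and collapses the four latent variables into one generic latent; the classify function and all other names are hypothetical placeholders rather than the thesis implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical, RelaxedOneHotCategorical, kl_divergence

def negative_elbo(potentials, classify, labels, temperature=0.5):
    """Single-latent sketch of Eq. 3.13: E_q[log p(y|z)] minus KL(q || prior)."""
    # q: relaxed categorical over edge targets, parameterized by the head potentials
    q = RelaxedOneHotCategorical(torch.tensor(temperature), logits=potentials)
    z = q.rsample()                                   # reparameterized (differentiable) sample
    log_lik = -F.cross_entropy(classify(z), labels)   # one-sample estimate of E_q[log p(y|z)]
    # KL between the underlying categorical and a uniform categorical prior (as in Jang et al.)
    prior = Categorical(logits=torch.zeros_like(potentials))
    kl = kl_divergence(Categorical(logits=potentials), prior).sum()
    return -(log_lik - kl)                            # minimize the negative ELBO
```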

3.3 Graph-based Networks

We propose a set of architectures in Sections 3.3.1 and 3.3.2 for evaluating the performance of latent graphs, as well as the contribution that a latent graph makes to existing approaches. Finally, in Section 3.3.3, we describe the architecture for our model which combines latent graphs with existing approaches for Natural Language Inference.


FIGURE 3.10: Proposed Model Architecture for Applying Proposed Graphs to NLI

3.3.1 Graph Comparison Architecture

The first network we propose can be seen in Figure 3.10, and is relatively shallow when compared to existing networks such as ESIM [11]. This is because the only mechanisms acting to combine the information from the individual words/tokens are the Graph Convolutional layers, which facilitates graph-versus-graph benchmarking.

This architecture first embeds the word tokens into a 300-dimensional vector space, and then feeds them into a GCN encoder with n layers. The GCN encoder can utilize any adjacency matrix, either deterministic or induced. This encoder applies a Rectified Linear Unit [37] after each layer, in order to introduce non-linearities into the network, and each successive GCN layer halves the dimensionality of the vector space (e.g. after one layer the encoded vector space has 150 dimensions). The resultant encoded vectors are then combined per sentence, using both average and max pooling operations. These resultant vectors, one for the premise and one for the hypothesis per pooling operation, are then concatenated, before being fed to the feed-forward output prediction layer. This produces a log probability for each of the three classes: entailment, neutral, or contradiction.
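A compact, unbatched PyTorch-style sketch of this architecture is given below; a plain A·X·W graph convolution is assumed for the GCN layers, and the class names, two-layer depth, and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph convolution: aggregate neighbours via the adjacency matrix, project, ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                 # x: [n+m, in_dim], adj: [n+m, n+m]
        return F.relu(self.proj(adj @ x))

class GraphComparisonModel(nn.Module):
    def __init__(self, vocab_size, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.gcn1 = GCNLayer(300, 150)         # each GCN layer halves the dimensionality
        self.gcn2 = GCNLayer(150, 75)
        # avg + max pooling for premise and hypothesis -> 4 * 75 concatenated features
        self.out = nn.Linear(4 * 75, num_classes)

    def forward(self, premise_ids, hypothesis_ids, adj):
        x = self.embed(torch.cat([premise_ids, hypothesis_ids]))   # [n+m, 300]
        x = self.gcn2(self.gcn1(x, adj), adj)
        n = premise_ids.size(0)
        p, h = x[:n], x[n:]
        pooled = torch.cat([p.mean(0), p.max(0).values, h.mean(0), h.max(0).values])
        return F.log_softmax(self.out(pooled), dim=-1)             # class log-probabilities
```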

3.3.2 Encoder Contribution Architectures

In order to measure the contribution that the proposed graphs make to existing architectures and encoders, we propose the LSTM-GCN architecture that can be seen in Figure 3.11. This architecture is similar to the architecture proposed in Section 3.3.1, but incorporates a bi-directional LSTM layer between the embedding and GCN encoder layers, which takes the 300-dimensional vectors output by the embedding layer and converts them into 600-dimensional vectors, since a bi-directional LSTM encodes information in both the forwards and backwards directions. This architecture enables us to see whether the graph structures applied in a GCN encoder can augment the information that an LSTM is able to model.

FIGURE 3.11: Proposed LSTM-enhanced architecture for applying proposed graphs to NLI

FIGURE 3.12: Proposed Ablation Architectures: (A) LSTM Ablation Architecture; (B) Embedding Ablation Architecture

In order to accurately gauge this difference, we also propose the ablation architectures in Figure 3.12, which model the NLI task solely using embeddings, and solely using the LSTM encoder. These architectures allow a more directly comparable measurement of the contribution of our proposed GCN graphs.

3.3.3 GraphSIM Architecture

The final architecture we propose is the Graph Sequential Inference Model (GraphSIM) network. As can be seen in Figure 3.13, this network incorporates the Graph Convolutional Layers into an ESIM-like architecture [11]. The premise (of length n) and hypothesis (of length m) sentences are embedded into 300-dimensional embedding vectors pi and hj, which are initialized using pre-trained 840B GloVe embeddings [38].

FIGURE 3.13: Proposed GraphSIM architecture for applying proposed graphs to NLI

These embeddings are then fed into a bi-directional LSTM with one layer, resulting in encoded premise and hypothesis vectors ai and bj. Dot-product attention is used between these two encodings in order to produce attended vectors ci and dj. These vectors are then combined with the encoded vectors in the following fashion:

$$\mathrm{comb}_p = [a_i,\ c_i,\ a_i - c_i,\ a_i \ast c_i], \qquad \mathrm{comb}_h = [b_j,\ d_j,\ b_j - d_j,\ b_j \ast d_j] \tag{3.14}$$

just as in Chen et al. [11], which helps to combine information from the attention mechanism and the original encodings. These vectors are then projected back into 300-dimensional space as ei and fj respectively.
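A small PyTorch-style sketch of this attention-and-combination step is shown below, with a and b denoting the bi-LSTM encodings of premise and hypothesis; the function name is ours.

```python
import torch

def attend_and_combine(a: torch.Tensor, b: torch.Tensor):
    """Soft dot-product attention followed by the ESIM-style combination of Eq. 3.14."""
    scores = a @ b.transpose(0, 1)                          # [n, m] similarity scores
    c = torch.softmax(scores, dim=1) @ b                    # premise attends over hypothesis
    d = torch.softmax(scores, dim=0).transpose(0, 1) @ a    # hypothesis attends over premise
    comb_p = torch.cat([a, c, a - c, a * c], dim=-1)        # [n, 4 * hidden]
    comb_h = torch.cat([b, d, b - d, b * d], dim=-1)        # [m, 4 * hidden]
    return comb_p, comb_h
```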

Next, the vectors are passed into the Graph Convolutional Layers. These layers take an adjacency matrix G, which in the deterministic setting is constructed using the methods proposed in Section 3.1.2. In the generative setting, this matrix is generated using the latent graph generative module proposed in Section 3.2.2. It is important to note that this module uses the original embeddings pi and hj to induce G. Together with the adjacency matrix G, the encodings ei and fj are passed through two Graph Convolutional Layers with projections to 450-dimensional space and then 600-dimensional space.

The outputs from the final GCN layer are passed to pooling operations, which seek to combine the learned information from the vectors (since each token in both the premise and the hypothesis has a corresponding vector). The vectors in the premise are combined with one another, and the same is done for the hypothesis vectors. Two kinds of pooling are performed: average-pooling, which averages the vectors, and max-over-time pooling, which takes the maximum values for each dimension from the sequence of vectors [39]. These operations result in four final vectors, which are concatenated and fed to the feed-forward output prediction layer to produce the class prediction.
