
MASTER THESIS

Lexical ambiguity with density matrices

by FRANCOIS MEYER 12198145

June 25, 2020

48 EC, November 2019 - June 2020
Supervisor: Dr Martha Lewis
Assessor: Dr Willem Zuidema


I would like to thank my supervisor, Martha Lewis, for making my work a lot easier over the past 8 months. She was always available for meetings and her feedback has significantly improved the quality of this thesis. I especially appreciated her help with some of the more perplexing aspects of the category theory that I got to delve into.

I would also like to thank Gijs Wijnholds for contributing to early discussions about some of the ideas that ended up in this thesis.

I gratefully acknowledge the funding I received from Zuid-Afrikahuis during my thesis.

I am also grateful to the people at Ambrite for the opportunity to work part-time during periods of my masters, and for being understanding when my studies kept me busy.

Even though they will never see this, I would like to thank Waxahatchee and Phoebe Bridgers for making music that often kept me company during the final busy months of my thesis.

Last but not least, I would like to thank my parents, Maretha and Frans, for enthusiastically supporting my two-year adventure in Amsterdam.


In natural language processing (NLP) the standard approaches to modelling word meaning are vector-based. These approaches model lexical ambiguity by either representing all the senses of a word with a single vector, or by representing the different senses with separate vectors. As an alternative to vector-based semantics, density matrices have been proposed to model word meaning. Their mathematical properties allow for explicitly encoding certain aspects of language, such as ambiguity and entailment. Furthermore, density matrices also allow us to operate within the categorical compositional distributional model of meaning (DisCoCat). This framework provides us with a category theoretic method for composing the meaning of a sentence from the meaning of its constituent words.

We propose three new models for building density matrices, all of which draw inspiration from well-established neural word embedding models. Our models are designed to encode lexical ambiguity. We also survey several composition methods for density matrices from DisCoCat. We evaluate how well our density matrices encode ambiguity and to what extent the composition methods achieve disambiguation. We find that our models exhibit varying degrees of success on different tasks, but that one of our models (called multi-sense Word2DM) emerges as the best density matrix model overall. When paired with a particular composition method (Phaser), multi-sense Word2DM outperforms all other models (including existing baselines) on most of the disambiguation tasks. This experimentally validates the idea of modelling lexical ambiguity with density matrices, and demonstrates the advantage of working within the theoretical framework of DisCoCat.


1 Introduction
    1.1 Research Questions
    1.2 Outline
    1.3 Contributions
2 Background
    2.1 Distributional semantics
        2.1.1 Static word embeddings
        2.1.2 Multi-sense embeddings
        2.1.3 Contextualised word embeddings
    2.2 The categorical compositional distributional model of meaning (DisCoCat)
        2.2.1 Category theory
        2.2.2 Pregroup grammars
        2.2.3 Composing sentence vectors
        2.2.4 Frobenius algebra
    2.3 Density matrices
        2.3.1 Building density matrices
        2.3.2 Representing verbs and adjectives
        2.3.3 Composition with density matrices
3 Methods
    3.1 Density matrix models
        3.1.1 BERT2DM
        3.1.2 Word2DM
        3.1.3 Multi-sense Word2DM
4 Experiments
    4.1 Experimental setup
        4.1.1 Baselines
        4.1.2 Training
    4.2 Word similarity
    4.3 Disambiguation results
        4.3.1 Data sets
        4.3.2 Results and discussion
    4.4 Ambiguity analysis
        4.4.1 Ambiguity and polysemy
        4.4.2 Ambiguity and composition
5 Conclusion
Bibliography
Appendix A Composition equations
    A.1 ML2008
    A.2 GS2011


1 Introduction

Vector-based models have emerged as the standard approach to modelling word meaning in natural language processing (NLP), many of which are based on the distribution of words in a corpus. Modelling sentence meaning is a related, but much harder, problem. Compositional semantics tries to do this by composing a sentence representation from the representations of its constituent words. The categorical compositional distributional model (DisCoCat) [11] has been proposed as a way to achieve this within a vector-based framework. The theoretical framework of DisCoCat, which is based on category theory, combines word vectors in such a way that the resulting sentence vector represents the meaning of the sentence.

DisCoCat has been extended by moving from representing words as vectors to repre-senting them as density matrices [2, 45, 3, 50]. Density matrices were originally proposed in quantum theory to describe a quantum system as a statistical ensemble of states. This is useful if there is uncertainty regarding the state that a system is in. Similarly, they have been proposed to represent polysemous words, since there is uncertainty regarding which sense of the word is being employed in a specific context. Since density matrices encode ambiguity, they allow for richer representation of word meaning.

1.1 Research Questions

The goal of this thesis is to design new methods for building density matrices and to evaluate them in the context of modelling and resolving ambiguity. We can elaborate this goal as three research questions.

1. Can we design novel methods for building density matrices inspired by recent word embedding models? Modelling word meaning with density matrices was proposed within DisCoCat as an extension of distributional semantics [45]. A few methods for building density matrices have been proposed [2, 45, 3, 50]. However, these methods have ignored recent advances in NLP. The first aim of this thesis is to design methods for building density matrices that rely on, or are based on, neural methods for learning word embeddings. There are two classes of word embedding models - those that produce static embeddings, such as skip-gram [37, 38] and GloVe [43], and those that produce contextualised embeddings, such as ELMo [44] and BERT [14]. We design algorithms for building density matrices inspired by models from both of these classes.

2. Can we implement methods for composing density matrices of words into phrase representations that capture phrase meaning? DisCoCat is a framework for composing word representations to form phrase and sentence representations. One of the main challenges with implementing DisCoCat is finding methods of composition that are computationally feasible. The composition computations originally proposed within DisCoCat, as well as with its density matrix extension, can be too computationally expensive to implement. Some computationally simpler composition methods for density matrices have been proposed [3, 9, 35, 10]. We investigate these and evaluate their suitability for composing phrase density matrices from word density matrices.

3. Do our density matrices and composition methods model ambiguity? To evaluate our methods we apply them to the problem of resolving ambiguity. We test our methods on existing disambiguation data sets [41, 19, 46], which consist of phrases that provide the necessary context to disambiguate the words in them. We build density matrices with our models and compose phrase representations with our composition methods. Then we assess how well our methods are able to disambiguate phrase meanings, comparing them to existing techniques and each other. We also perform a quantitative analysis of how well our models capture ambiguity on the word level. To do this we use the notion of von Neumann entropy, which allows us to measure the ambiguity encoded in density matrices.

1.2 Outline

In Chapter 2 we provide an overview of the body of research on which this thesis builds. We start the chapter by reviewing the field of distributional semantics. We then turn to DisCoCat, the theoretical framework underlying this work. We introduce the mathematics that it relies on (category theory and pregroup grammars) and then present the mathematical framework itself. In the last section of Chapter 2 we introduce density matrices and survey existing methods for building and composing them.


We start Chapter 3 by presenting our three new density matrix models - BERT2DM, Word2DM, and multi-sense Word2DM. We outline each of their learning algorithms and motivate the design decisions behind them. We derive the gradients of Word2DM, which are shown to be suboptimal for learning semantic representations. Multi-sense Word2DM is presented as a modification of Word2DM designed to overcome these gradient issues and to explicitly model ambiguity. We then introduce the composition methods used in our experiments. We demonstrate their mathematical properties by going through a few practical examples of composition in the presence of ambiguity.

In Chapter 4 we introduce our experiments and present our results. We compare our models to existing methods and discuss the implications of our results. Chapter 5 concludes this thesis by reiterating its main contributions and summarising its findings. We also propose possible directions for future research.

1.3 Contributions

The main contributions of this thesis are as follows:

• We propose three new models for learning density matrices that encode word meaning. As far as we know these are the first neural models for doing so. Two of the models (Word2DM and multi-sense Word2DM) are the first neural architectures designed specifically to learn density matrices from scratch i.e. without using the pretrained weights or embeddings from an existing model.

• We perform a systematic comparison of several composition methods that have been proposed within the DisCoCat framework. Density matrices were originally introduced as a way to model lexical ambiguity, but the question remains as to which composition method most effectively resolves ambiguity. We present the first extensive experimental comparison of composition methods for density matrices.

Taking a broader view of the literature on which this work builds, these contributions continue a line of research that aims to move from the theory of DisCoCat to a practical application of it. In contrast to the applied nature of deep learning NLP models, DisCoCat is a theoretically driven approach to semantics. While a number of techniques exist that model ambiguity, our methods exist within DisCoCat, and so are accompanied by a theoretically sound model of composition. Much of the literature on DisCoCat is focused on extending its theoretical framework. This work operates within this framework, but focuses on implementing models for it and subjecting them to practical experimentation.


2 Background

2.1 Distributional semantics

In the past decade deep learning has become the predominant paradigm in NLP. Deep learning models are now state-of-the-art in practically all common NLP tasks [49], which include problems as diverse as machine translation, part-of-speech tagging, and grammatical error correction. At the heart of deep learning lies representation learning - training a model to build representations of data points (e.g. images, graphs, sentences) that are useful for the task at hand (e.g. classification, regression). This allows the model to discover which features are important and how features interact. The model can then build representations for data points that encode this information. In deep learning these representations are usually continuous vectors in some n-dimensional space.

In NLP we often want to build representations for words that encode their meaning. Sometimes this will be in the context of a specific task, in which case a model can be trained to build representations that are useful for the task. However, it is also of interest to build word representations that are task-agnostic. These representations should encode the semantic and syntactic properties of words. If they do this effectively, they should be more generally usable in a wide range of tasks.

Distributional semantics is a field within NLP that aims to build such word representations based on how words are used in practice. More formally, techniques in distributional semantics build word representations by leveraging the distributional information of a word (its co-occurrence statistics) obtained from a large corpus. The theoretical idea behind this approach is the Distributional Hypothesis, which states that words occurring in similar contexts generally have similar meanings [21, 18]. Distributional semantic models use this idea to build representations that capture word meaning. Distributional semantics has played an important role in applying deep learning to NLP. Deep learning models rely on representation learning and distributional semantic models learn useful representations for words. These word representations can be incorporated into deep learning models, which can utilise the linguistic information encoded in them for their task. In the deep learning literature the vector representations learned by distributional semantic models are referred to as word embeddings. Many word embedding algorithms have been proposed and two types of models have emerged - static word embeddings and contextualised word embeddings.

2.1.1 Static word embeddings

Distributional semantic models take a large corpus as input and produce word embeddings for all the words in the corpus as output. They produce one embedding for each word, based on how that word co-occurs with other words in the corpus. Such embeddings are known as static word embeddings, because they are fixed after training. If we use them to represent words in some NLP task, we use a single embedding to represent each word, irrespective of the context in which the word is used. There are two types of static word embedding models - count-based and prediction-based.

Count-based models work by collecting the co-occurrence statistics of all the words in a corpus (how many times each word occurs in the same context as every other word). They use this to construct a |V|-dimensional vector representation for each word, where V is the vocabulary being modelled. The basis vectors of this vector space correspond to words in the vocabulary. Most models then proceed by applying some weighting scheme (e.g. pointwise mutual information) sometimes followed by dimensionality reduction (e.g. singular value decomposition or non-negative matrix factorisation). The most popular recent count-based model is GloVe [43], in which the matrix factorisation step retains semantic relations between word vectors.
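The count-based pipeline can be sketched in a few lines of numpy. The toy corpus, window size, and output dimensionality below are illustrative choices for this sketch, not settings used elsewhere in this thesis.

```python
import numpy as np

corpus = "the bright star shines the clever student shines".split()
window = 2

# 1. Collect co-occurrence counts within a symmetric context window.
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            counts[idx[w], idx[corpus[j]]] += 1

# 2. Reweight with positive pointwise mutual information (PPMI).
total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (p_w * p_c))
ppmi = np.maximum(pmi, 0)
ppmi[np.isnan(ppmi)] = 0  # guard against empty cells

# 3. Reduce dimensionality with a truncated SVD.
d = 2
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :d] * S[:d]  # |V| x d word vectors
print(embeddings.shape)
```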

Prediction-based models do not collect co-occurrence counts explicitly, although they have been shown to be implicitly related to count-based models [34]. Instead, they frame the problem of learning word embeddings as a prediction task in which we try to predict the contexts in which a word occurs. The most popular of these is the skip-gram model of Word2Vec [37, 38]. Skip-gram works by iterating through a corpus and predicting the words that occur around each word (within some context window). The objective function in skip-gram is such that by optimising for prediction accuracy, high-quality word embeddings are trained in the process. Words that occur together often, or words that are often used in the same context, have similar embeddings.

Static word embeddings are computationally inexpensive to train and query. They are incorporated into a model by feeding them as input representations, which can result in improved performance on NLP tasks. Word embeddings trained on large corpora with models like Word2Vec and GloVe are publicly available and easy to add to a model. In doing so the model can utilise the semantic and syntactic information encoded in the embeddings.

2.1.2 Multi-sense embeddings

Word embeddings represent all possible meanings of a word with a single embedding. In doing so they produce low-dimensional embeddings that can be used off the shelf, but they fail to deal properly with polysemy. One approach to modelling polysemy is to move from word embeddings to sense embeddings, where each word is represented by multiple vectors that correspond to different senses. Doing so requires combining word sense induction with a method for learning sense embeddings.

Schütze [51] proposed context-group discrimination, an algorithm for automatic word sense disambiguation that relies on pretrained word embeddings. The approach starts by representing all the contexts of a word in a corpus as the average of the embeddings of the context words. These context representations are then clustered into a predefined number of clusters, after which the centroids of these clusters are used as sense embeddings.

Neelakantan et al. [42] proposed multi-sense skip-gram, an extension of the skip-gram model for learning sense embeddings. They also represent each context in which a word occurs as the average of the context word embeddings, and cluster these into clusters corresponding to senses. However, they maintain separate embeddings that represent the senses of a word; these are compared to the centroids of the context clusters to predict which sense is being employed. They learn the sense and context embeddings of words with a training procedure based on skip-gram with negative sampling.
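A minimal sketch of the clustering idea shared by these two approaches, assuming a dictionary of pretrained word embeddings is already available (here filled with random stand-ins); the cluster centroids act as the induced sense embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

def sense_embeddings(occurrences, word_vectors, n_senses=2):
    """Derive sense embeddings for one target word.

    occurrences: list of contexts, each a list of context words
    word_vectors: dict mapping words to pretrained embeddings (assumed given)
    """
    # Represent each occurrence by the average of its context word embeddings.
    contexts = np.stack([
        np.mean([word_vectors[w] for w in ctx if w in word_vectors], axis=0)
        for ctx in occurrences
    ])
    # Cluster the context vectors; each cluster stands for one induced sense.
    km = KMeans(n_clusters=n_senses, n_init=10, random_state=0).fit(contexts)
    # The cluster centroids serve as the sense embeddings.
    return km.cluster_centers_

# toy usage with random stand-ins for pretrained embeddings
rng = np.random.default_rng(0)
vocab = ["river", "money", "account", "water", "fish", "loan"]
vecs = {w: rng.normal(size=50) for w in vocab}
contexts = [["river", "water"], ["money", "account"], ["loan", "money"],
            ["fish", "water"], ["account", "loan"], ["river", "fish"]]
print(sense_embeddings(contexts, vecs, n_senses=2).shape)  # (2, 50)
```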

2.1.3 Contextualised word embeddings

In reality the exact meaning of a word can vary greatly and depends on the context in which it is used. Contextualised word embeddings model this explicitly. Instead of producing a fixed number of embeddings for each word, they produce embeddings for words that are a function of the context in which they occur. In doing so they can encode the distributional information of a word, as well as how its meaning varies in different contexts.

Contextualised word embeddings are produced by pre-trained language models, which have recently taken NLP by storm. One of the first such models is ELMo [44], which is a bidirectional Long Short-Term Memory network (biLSTM) [23] trained as a language model. ELMo produces context-dependent word embeddings that are a function of the entire sentence in which a word occurs. It achieves this by computing a linear combination of all the internal representations of a word in the biLSTM. The weights of the linear combination are learned for specific tasks. Different layers of the biLSTM capture different linguistic features, so different NLP tasks will require different weighting schemes. Task-independent contextualised embeddings can also be obtained by taking the word representations in the final layer.

Following ELMo a number of pre-trained language models were proposed, but these followed a different approach. ELMo produced contextualised embeddings that can be used as input representations to models that have to be trained from scratch. The models that came after ELMo utilised transfer learning - training a language model on a large corpus (billions of words) and fine-tuning the pre-trained language model itself for downstream tasks. This eliminates the need to train a new model from scratch and leverages the linguistic information learned by pre-trained language models for other tasks. Howard and Ruder [24] drew inspiration from transfer learning in computer vision and proposed ULMFiT, a pre-trained LSTM language model that demonstrated the effectiveness of transfer learning in NLP. Radford et al. [47] proposed OpenAI GPT, which employed the Transformer architecture [53] to capture long-term dependencies in its pre-trained language model. The most popular model to emerge from this new paradigm is BERT [14]. BERT drew largely from the ideas of previous models but innovated in important ways. Firstly, BERT is a bidirectional Transformer, so it combines the left and right context of a word when computing its representations. Secondly, it is trained on two tasks during pre-training: masked language modelling (predicting missing words in a sentence) and next sentence prediction (predicting whether or not a candidate sentence follows the current sentence). These changes enabled BERT to produce high-quality contextualised word embeddings and sentence embeddings. When BERT was proposed it achieved state-of-the-art results on a wide variety of NLP tasks by fine-tuning its parameters or utilising its contextualised embeddings.

2.2 The categorical compositional distributional model of meaning (DisCoCat)

In NLP, word embeddings, whether static or contextual, have emerged as the standard approach to modelling the meaning of words. Modelling the meaning of sentences is considered a much more challenging problem. Various models have been proposed for this [32, 30, 12, 7, 14], but there is currently no standard approach. Distributional information is not sufficient and the model must somehow take into account how words combine to produce phrase and sentence meanings. Compositional distributional semantics tries to model sentence meaning by combining the distributional vectors of words in such a way that it encodes the meaning of a sequence of words. Different methods for combining word vectors have been proposed. Mitchell and Lapata [41] evaluate simple composition methods, such as element-wise addition and multiplication. Baroni and Zamparelli [4] proposed modelling adjectives as matrices and multiplying them with noun vectors to model adjective-noun phrase composition. Coecke et al. [11] proposed the categorical compositional distributional model of meaning, otherwise known as DisCoCat.

DisCoCat is a model for combining the vectors of the words in a sentence according to the rules of a formal grammar. It is a mathematically sound composition method that unifies two seemingly orthogonal approaches to natural language semantics - distributional semantics and formal semantics. The mathematical unification is achieved using category theory. Here we introduce the relevant category-theoretic concepts and summarise the mathematical theory underlying DisCoCat. For an extensive introduction to category theory see Leinster [33] and for a complete description of DisCoCat see Coecke et al. [11].

2.2.1 Category theory

Category theory is a field in abstract mathematics that studies mathematical structures and the relationships between them. The basic concept in category theory is a category C, which consists of:

• objects A, B, C, ... ∈ ob(C);

• morphisms f, g, h, ... ∈ C(A, B) from A to B, for each pair of objects A, B ∈ ob(C);

• a composition operator that composes any f ∈ C(A, B) and g ∈ C(B, C) to produce g ◦ f ∈ C(A, C). The composition operator must satisfy associativity and identity (each object has an identity morphism that, when composed with another morphism, leaves it unchanged).

Another basic concept in category theory is a functor, which is a map between categories. If C and D are categories, then a functor F : C → D consists of:

• a function that maps each object A ∈ ob(C) to an object F(A) ∈ ob(D);

• a function that maps each morphism f ∈ C(A, B) to a morphism F(f) ∈ D(F(A), F(B)). This function must preserve identity morphisms (i.e. F(1_A) = 1_F(A)) and preserve composition of morphisms (i.e. F(g ◦ f) = F(g) ◦ F(f)).

Many mathematical structures (e.g. sets, groups, fields, rings) can be formalised as categories. To demonstrate this we consider a category that is used in DisCoCat - the category of finite dimensional vector spaces, referred to as FVect. Its objects are finite dimensional vector spaces and its morphisms are linear maps. Ordinary function composition is the composition operator and identity functions are the identity morphisms. DisCoCat formalises distributional semantics by casting the semantic vector spaces that distributional models produce as objects in FVect. While most distributional semantic models do not distinguish between word types, DisCoCat assigns different vector spaces to different word and phrase types. Nouns live in the vector space N and sentences live in S, but relational word types that operate on other words live in higher order tensor product spaces. For example, adjectives live in the tensor product space N ⊗ N (which means they are matrices) and transitive verbs live in N ⊗ S ⊗ N (which means they are order-3 tensors).

2.2.2 Pregroup grammars

Formal semantics is an approach to modelling meaning in natural language with mathematical models. One of the goals of Coecke et al. [11] in proposing DisCoCat was to unify formal semantics and distributional semantics. A key ingredient is pregroup grammar. A pregroup grammar is a mathematical model of grammar and composition proposed by Lambek [31]. Informally, a pregroup grammar consists of a vocabulary of words and a set of grammatical types to which the words are mapped. The mathematical elements of a pregroup grammar are its basic types. Each basic type p has a left adjoint p^l and a right adjoint p^r that satisfy the following equations:

p^l · p ≤ 1 ≤ p · p^l    (2.1)

p · p^r ≤ 1 ≤ p^r · p    (2.2)

Grammatical types can be atomic or complex, depending on whether they are represented by one of the basic types or by a composite of basic types. A simple example of a pregroup grammar has the basic types n (nouns) and s (sentences). Nouns and sentences are represented by these atomic types. Transitive verbs are represented by the complex type n^r s n^l. To see why this is the case, consider the example sentence "Bob likes Joan", with the type assignments

Bob : n    likes : n^r s n^l    Joan : n

Applying the inequalities of equations 2.1 and 2.2 as type reductions, this reduces to the sentence type:

n (n^r s n^l) n → (n n^r) s (n^l n) → 1 s 1 → s    (2.3)

A pregroup grammar can also be formalised as a category P. The objects of P are all the grammatical types, simple and complex, of the pregroup grammar. There exists a morphism p → q in P if p ≤ q in the pregroup grammar.
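As a toy illustration of how such reductions can be computed, the sketch below cancels adjacent adjoint pairs using equations 2.1 and 2.2. It only handles adjoints of basic types and is not part of any implementation described in this thesis.

```python
# Toy pregroup type reducer: a type is a list of basic types, where "n.r"
# denotes the right adjoint n^r and "n.l" the left adjoint n^l.
def reduce_types(types):
    changed = True
    while changed:
        changed = False
        for i in range(len(types) - 1):
            a, b = types[i], types[i + 1]
            # cancel p p^r -> 1 and p^l p -> 1 (equations 2.1 and 2.2)
            if b == a + ".r" or a == b + ".l":
                del types[i:i + 2]
                changed = True
                break
    return types

# "Bob likes Joan": n (n^r s n^l) n
sentence = ["n"] + ["n.r", "s", "n.l"] + ["n"]
print(reduce_types(sentence))  # ['s']
```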


2.2.3 Composing sentence vectors

We have introduced two categories that model different aspects of natural language. FVect models the meaning of words with vector representations produced by distributional semantic models. P models sentence composition with a type-logical approach to grammar. The main contribution of Coecke et al. [11], and the central idea of DisCoCat, is the unification of these two models. This can be done by defining a functor F : P → FVect that maps each grammatical type in P to a vector space in FVect and each type reduction in P to a linear map in FVect. The functor maps atomic types to vector spaces:

F(n) = N    F(s) = S    (2.4)

Complex types are mapped to tensor product spaces:

F(n^r s n^l) = N ⊗ S ⊗ N    (2.5)

Type reductions in P are mapped to linear maps in FVect. For example, the type reduction for a transitive sentence α : n (n^r s n^l) n → s (expanded in equation 2.3) is mapped to a linear map:

F(α) : N ⊗ N ⊗ S ⊗ N ⊗ N → S    (2.6)

The introduction of such a functor achieves two things. Firstly, it pairs each word vector in FVect with a grammatical type in P. Secondly, it means that each grammatical type reduction implies a linear map from the word vector spaces to the sentence vector space. This provides us with a mathematical formulation for computing the meaning vector of a sentence from the meaning vectors of its constituent words:

1. Compute the grammatical type reduction α : p_1 p_2 ··· p_n → s of a sentence, based on the grammatical types of its words.

2. Consider the tensor product of the vectors of all the words in the sentence, v_1 ⊗ ··· ⊗ v_n, which lives in the tensor product space F(p_1 p_2 ··· p_n) = F(p_1) ⊗ ··· ⊗ F(p_n) = V_1 ⊗ ··· ⊗ V_n. This assigns vector spaces to the words based on their grammatical types.

3. Compute the sentence vector v_1···v_n = F(α)(v_1 ⊗ ··· ⊗ v_n).

In short, the type reductions of a sentence in P determine the sequence of linear maps applied to the vectors in FVect of the words in the sentence. The result of this sequence of linear maps is a vector representation for the meaning of the entire sentence. This is achieved by composing the vectors, matrices, and tensors of the words with a sequence of compositions specified by the sequence of type reductions. Relational words that operate on a single word are modelled as matrices, so composition is computed with matrix multiplication. Relational words that operate on multiple words are modelled as higher order tensors, so composition is computed with tensor contraction (a generalisation of matrix multiplication).
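For the transitive sentence of equation 2.3, this sequence of contractions amounts to a single einsum. The sketch below uses random stand-ins for the noun vectors and the verb tensor, with the verb living in N ⊗ S ⊗ N as in equation 2.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim, s_dim = 4, 3   # dimensions of the noun space N and sentence space S

subj = rng.normal(size=n_dim)                   # |Bob>  in N
obj = rng.normal(size=n_dim)                    # |Joan> in N
verb = rng.normal(size=(n_dim, s_dim, n_dim))   # "likes" in N (x) S (x) N

# F(alpha): contract the verb tensor with the subject and object vectors.
# sentence_k = sum_ij verb_ikj * subj_i * obj_j, a vector in S.
sentence = np.einsum("isj,i,j->s", verb, subj, obj)
print(sentence.shape)  # (3,)
```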

2.2.4 Frobenius algebra

The composition model of DisCoCat is theoretically sound, but proves difficult to implement. This is because of the computational challenges around building higher order tensors. For example, according to the pregroup grammar types, transitive verbs have to be represented as three-dimensional tensors. If atomic types are represented by vectors of size 20 (which is already much smaller than the size of standard word embeddings), then the transitive verb tensors would require 8000 entries each. Kartsaklis et al. [26] use a Frobenius algebra to overcome this computational problem. A Frobenius algebra defined on the vector space V with basis vectors {e_i}_i equips us with the following operations:

Δ :: e_i ↦ e_i ⊗ e_i        ι :: e_i ↦ 1        (2.7)

μ :: e_i ⊗ e_i ↦ e_i        ζ :: 1 ↦ e_i        (2.8)

These are morphisms that allow us to encode vectors as tensors in higher order tensor spaces. The Δ morphism is a copying mechanism that embeds a vector as the diagonal entries of a diagonal matrix. The μ morphism is an uncopying mechanism that extracts the diagonal entries of a matrix and embeds them as a vector. The ι and ζ morphisms respectively delete and create basis vectors. These operations also generalise to higher order tensor spaces.

Instead of building higher order tensors for complex words directly, we can build lower order representations and encode them in higher order tensor spaces with the Δ morphism. Doing so results in sparse tensor representations, thereby reducing the space complexity of the model. It also simplifies the composition computations - it can be shown that when the Frobenius algebra is used, the tensor contractions specified by equation 2.6 simplify to element-wise multiplications between the lower order tensors. For an extensive introduction to the role of the Frobenius algebra in DisCoCat, see Kartsaklis et al. [28]. We do not elaborate on it here, but refer back to it at a later stage in the thesis. It is sufficient to know that equations 2.7 and 2.8 make it possible to avoid building higher order tensors, while staying faithful to the category theoretic framework of DisCoCat.
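A small numpy sketch of this simplification for the simplest case: lifting a verb vector onto the diagonal of a matrix with Δ and then contracting it with a noun vector gives exactly the element-wise product of the two vectors. The vectors here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
noun = rng.normal(size=5)
verb_vec = rng.normal(size=5)      # lower order representation of the verb

# Delta embeds the verb vector as the diagonal of a matrix in N (x) N.
verb_matrix = np.diag(verb_vec)

# Composition by contracting the lifted verb with the noun...
composed = verb_matrix @ noun
# ...coincides with element-wise multiplication of the lower order representations.
assert np.allclose(composed, verb_vec * noun)
print(composed)
```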


2.3 Density matrices

In the original formulation of DisCoCat the meanings of words are represented by vectors. However, there may be other mathematical structures more suitable for representing word meaning. One such alternative is density matrices, which have been proposed within DisCoCat as an alternative to vectors. Before we describe density matrices mathematically we must introduce the notation used throughout this thesis, known as bra-ket notation. Dirac [15] introduced bra-ket notation as a method for working with quantum states. We make use of the following features of the notation:

• A column vector is denoted by a ket |v⟩.
• A row vector is denoted by a bra ⟨v|.
• The inner product of two vectors is denoted by a bra-ket pairing ⟨v|u⟩.
• The outer product of two vectors is denoted by a ket-bra pairing |v⟩⟨u|.

Density matrices are used in quantum theory to model mixed states - when the state of a quantum system is unknown it can be modelled by a statistical ensemble of pure states. The quantum theoretic definition of an n × n density matrix is

ρ = ∑_i p_i |v_i⟩⟨v_i|,    (2.9)

where p_1, p_2, ... are the probabilities assigned to the quantum states |v_1⟩, |v_2⟩, ..., and for real-valued vectors |v_i⟩⟨v_i| is the outer product of the n-dimensional vector |v_i⟩ with itself. From this definition it follows that a valid density matrix must satisfy the following conditions:

1. Hermiticity: ρ† = ρ. Since we will only work with real numbers this is equivalent to symmetry, i.e. ρ⊤ = ρ.

2. Positive semi-definiteness: ⟨x|ρ|x⟩ ≥ 0 for all |x⟩ ∈ R^n.

3. Unit trace: tr(ρ) = 1.

DisCoCat has been extended by representing the meaning of words with density matrices instead of vectors. In moving from vectors to density matrices, DisCoCat gains some expressive power. In the natural language definition of a density matrix, equation 2.9 can be viewed as a mixing of all the possible meanings of a word. For example, the word bright is a polysemous word that could either mean something similar to shining or clever. Suppose that when the word bright is used, it is twice as likely to mean shining as it is to mean clever.


The density matrix for bright, denoted ⟦bright⟧, is computed as follows:

⟦bright⟧ = (2/3) |shining⟩⟨shining| + (1/3) |clever⟩⟨clever|

We refer to ⟦bright⟧ as a mixed state, since it is a mixture of the two pure states |shining⟩⟨shining| and |clever⟩⟨clever|. A pure state corresponds to a specific word sense and a mixed state corresponds to a polysemous word, which can take on different meanings with different probabilities. This allows density matrices to explicitly model aspects of natural language that are ignored by word embeddings, such as ambiguity [5, 45] and entailment [2, 3, 50].
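A minimal numpy sketch of this example, with random unit vectors standing in for the two sense embeddings. It builds ⟦bright⟧ as the stated mixture, checks the three density matrix conditions, and computes the von Neumann entropy that is used later in the thesis as a measure of ambiguity.

```python
import numpy as np

rng = np.random.default_rng(0)

def pure_state(dim):
    """Random unit vector standing in for a sense embedding."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

shining, clever = pure_state(10), pure_state(10)

# [[bright]] = 2/3 |shining><shining| + 1/3 |clever><clever|
bright = (2 / 3) * np.outer(shining, shining) + (1 / 3) * np.outer(clever, clever)

# Conditions of a valid density matrix.
assert np.allclose(bright, bright.T)         # symmetric (Hermitian over the reals)
eigvals = np.linalg.eigvalsh(bright)
assert np.all(eigvals >= -1e-12)             # positive semi-definite
assert np.isclose(np.trace(bright), 1.0)     # unit trace

# Von Neumann entropy S(rho) = -sum_i lambda_i log lambda_i quantifies ambiguity:
# zero for a pure state, larger for more mixed (more ambiguous) states.
nonzero = eigvals[eigvals > 1e-12]
entropy = -np.sum(nonzero * np.log(nonzero))
print(entropy)
```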

The category theoretic formulation of DisCoCat must be modified to incorporate the shift from vectors to density matrices. This theoretical extension is outlined by Piedeleu et al. [45]. The basis of the extension is Selinger's CPM-construction [52], which maps a category C to its category of completely positive maps CPM(C). Applying the CPM-construction to FVect results in a new category CPM(FVect) in which the objects are density matrices and the morphisms are completely positive maps (linear maps that map density matrices to density matrices). Most of the work on density matrices within DisCoCat has focused on extending the mathematical theory of the framework. Here we review existing work related to the more practical aspects, such as building density matrices for words and composing them into sentences.

2.3.1 Building density matrices

Piedeleu et al. [45] proposed the use of density matrices within DisCoCat. They introduced them to model lexical ambiguity, distinguishing between homonymy and polysemy. The density matrix of a word is obtained by mixing the vectors of its homonymous meanings (the word bank will have separate vectors for the financial and river-related meanings). The polysemous meanings of a word are still collapsed into a single vector, since these meanings are relatively coherent (bank as a financial institution and as that institution's building is represented by the same vector). This means that ambiguity is modelled at the level of homonymy.

They obtain the polysemous senses of a word through a word sense induction algorithm. The method requires an initial set of meaning vectors assigned to each word in the vocabulary through some distributional semantic model. A vector is then computed for each context in which a word occurs by averaging the vectors of the words in the context. Now that each context is represented by a vector, the contexts are clustered into clusters representing distinct senses of the word. The centroids of these clusters are taken as the vectors representing the pure meanings of a word. The probabilities of the different meanings are computed as their relative frequencies. They tested their model with a few preliminary experiments. They use von Neumann entropy to model ambiguity and show how ambiguity evolves when meanings are composed with the DisCoCat model. They find that, as expected, modifying a noun reduces some ambiguity. They also informally compare composite semantic representations and find that for a few cases the composite representations of semantically related phrases are close to each other.

Balkir et al. [2], Bankova et al. [3], and Sadrzadeh et al. [50] use density matrices to model entailment. They build density matrices in ways that capture some degree of lexical entailment and specify asymmetric measures that reflect hyponymy-hypernymy relations. These works extend the theoretical constructions of DisCoCat to model entailment with density matrices. They also propose methods for building density matrices and perform small-scale experiments to demonstrate their effectiveness, but focus more on the category theoretic aspects of their extensions to DisCoCat.

Balkir et al. [2] use taxonomies to distinguish between pure word meanings (leaves in the taxonomy) and mixed word meanings (non-leaf nodes in the taxonomy). They use word co-occurrence frequencies to build density matrices for words with mixed meanings. Bankova et al. [3] use taxonomies to find hyponyms for words. They build the density matrix of a word by summing (and normalising) the density matrices of its hyponyms, as computed from their word vectors. Sadrzadeh et al. [50] propose two methods for building density matrices. The first uses a word's co-occurrence counts with all other pairs of words to build an upper- or lower-triangular matrix. They expand the triangular matrix to a symmetric matrix and enforce the properties of a density matrix. The second method obtains vector representations for all the contexts in which a word occurs (e.g. by averaging all the word vectors for a context). To compute a word's density matrix they sum the density matrices computed from its context vectors and normalise the result.

Independent of DisCoCat, Blacoe et al. [5] proposed a semantic space model in which the meaning of a word is represented by a density matrix. They do not consider composition at all, focusing only on building density matrices for individual words. They learn the density matrix of a word from its dependency relations in a corpus. They cluster relation types together (based on syntactic similarity) and assign each cluster a subspace of the semantic space. This allows them to build a representation for each dependency sub-tree in which a word occurs. They assume that a word’s usage is uniform throughout the same document and compute a word’s density matrix for a specific document by summing all of its sub-tree representations. Finally, they compute a word’s density matrix by summing the density matrices learned from all the documents in which a word occurs (and normalising the result). They evaluate their model on word similarity and association tasks, and achieve results comparable to the state-of-the-art distributional semantic models of the time.

(24)

2.3.2 Representing verbs and adjectives

The methods surveyed above build density matrices for atomic word types. Unlike most distributional semantic models, DisCoCat distinguishes between atomic word types (essentially nouns) and complex word types. Complex word types are those that are represented by the product of multiple atomic types in the pregroup grammar. These are words, like verbs and adjectives, that take other words as arguments.

Past works have proposed different methods for building the density matrices of atomic words and complex words. This is motivated by the fact that atomic words and complex words convey meaning in different ways. Atomic words are used to refer to some concept, while complex words act like functions on atomic words. For example, an intransitive verb acts like a function on a noun, taking the meaning of the noun as input and producing the meaning of the noun-verb combination. Furthermore, since complex words are represented by compound types in the pregroup grammar, they must be represented by higher order tensors in the semantic space. This alone requires different methods for building complex word and atomic word density matrices.

Piedeleu et al. [45] follow Grefenstette and Sadrzadeh [19], who proposed building the representation of a complex word by summing the representations of its arguments in a corpus. In the case of a complex word type with multiple arguments (e.g. a transitive verb has a subject and object) they sum the tensor products of its argument representations. They build tensor representations for complex words by summing the vectors of their arguments, and then taking the outer product of the computed tensor with itself to produce a density operator. However, they note that this can lead to a mismatch between the order of complex types in the pregroup grammar and the semantic space. To resolve this mismatch they use the copying mechanism of a Frobenius algebra (the Δ morphism of equation 2.7) to expand the tensors of complex word types. This expansion is done by encoding the elements of a tensor on the diagonal of a higher-order tensor.

Sadrzadeh et al. [50] modify this approach by also point-wise multiplying the summed argument representations by the complex word’s own distributional representation. This enriches the resulting representation for the complex word, since it adds distributional information about it. Balkir et al. [2] also follow the approach of summing the argument representations of a complex word, but omit the Frobenius algebra tensor expansion. They ignore the mismatch between the order of types in the pregroup grammar and the semantic space. Bankova et al. [3] do not distinguish between atomic and complex word types when building density matrices. However, they do encode complex word types in higher order tensors using the copying mechanism of a Frobenius algebra.

(25)

2.3.3 Composition with density matrices

The theoretical shift from FVect to CPM(FVect) introduces a new computation for meaning composition. Instead of computing the sentence vector from its constituent word vectors, the sentence density matrix has to be computed from its constituent word density matrices. In FVect the meaning of a sentence is computed via morphisms in the category. These morphisms are linear maps, i.e. linear transformations that map vectors to vectors through matrix multiplication or tensor contraction (depending on the order of the tensors involved). In CPM(FVect) the meaning of a sentence is still computed via morphisms, but now these morphisms are completely positive maps, which are linear maps that map valid density matrices to valid density matrices through tensor contraction. The density matrix of a sentence is computed through a series of tensor contractions in which complex words (represented by higher order tensors) operate on atomic words (represented by density matrices). Therefore, when we move from vectors to density matrices, the composition computation of DisCoCat moves from tensors and matrices operating on vectors to tensors operating on density matrices.

Although this method of composition is theoretically sound, it presents a number of practical challenges for implementation. Firstly, complex word types in the pregroup grammar have to be represented by higher order tensors in the semantic space. This emerges from the category theory underlying DisCoCat. Simple word types like nouns (n) are represented by density matrices that live in N ⊗ N (the tensor product of the vector space N with itself). Complex word types are represented by higher order density operators. For example, adjectives (n n^l) are order-4 tensors that live in N ⊗ N ⊗ N ⊗ N. Transitive verbs (n^r s n^l) are order-6 tensors that live in N ⊗ N ⊗ S ⊗ S ⊗ N ⊗ N.

Building high order tensors is computationally difficult because of the space complexity. Furthermore, the maps that are used for meaning composition are tensor contractions (a gen-eralisation of matrix multiplication). Tensor contraction is computationally expensive if high order tensors are involved. These computational limitations make it difficult to implement and apply DisCoCat. This has led to a number of past works proposing computationally efficient alternatives to tensor contraction as composition methods. These alternatives, which we survey in this section, get rid of the need to directly build and store high order tensors for complex words. Instead they build density matrices for complex words and lift them to the theoretically required higher order tensor spaces. Computing composition through tensor contraction with these lifted tensors is equivalent to simpler composition computations. We now review a number of such simplified composition computations that have been proposed for density matrices in DisCoCat.

(26)

Bankova et al. [3] propose two simpler computations for verb-noun and noun-verb composition. Their first option for composition is point-wise multiplication of the noun and verb density matrices:

⟦s⟧ = ⟦n⟧ ⊙ ⟦v⟧,    (2.10)

where ⟦n⟧ and ⟦v⟧ are the density matrices of the noun and verb respectively, and the goal is to compute the density matrix ⟦s⟧ of the sentence. This method is computationally more feasible than having to build high order tensors for verbs. It arises out of the introduction of a Frobenius algebra in DisCoCat (reviewed in section 2.2.4). The Frobenius algebra is used to encode the density matrices of complex words in higher order tensors without having to learn extra parameters. This is done to resolve the mismatch between the order of types in the pregroup grammar and the semantic space. The composition computation then simplifies to equation 2.10.

Their second option, referred to by Coecke and Meichanetzidis [10] as the Fuzz, involves finding the square root of the verb density matrix, i.e. the matrix ⟦v⟧^(1/2) such that the matrix product ⟦v⟧^(1/2) ⟦v⟧^(1/2) is equal to ⟦v⟧. Composition is then computed as

⟦s⟧ = ⟦v⟧^(1/2) ⟦n⟧ ⟦v⟧^(1/2).    (2.11)

Another simplified composition method, referred to by Coecke and Meichanetzidis [10] as the Phaser, is proposed by Lewis [36] and Coecke [9]. The verb density matrix is factorised with spectral decomposition as ⟦v⟧ = ∑_i p_i P_i, where p_1, p_2, ... are its eigenvalues and P_1, P_2, ... are its orthogonal projectors (the outer products of its eigenvectors). Composition is then computed as

⟦s⟧ = ∑_i p_i P_i ⟦n⟧ P_i.    (2.12)

Equations 2.10, 2.11, and 2.12 ensure that composition returns a positive semi-definite Hermitian matrix, which can be normalised to produce a valid density matrix.
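The three composition operations can be sketched directly in numpy; the noun and verb density matrices below are random mixtures built as in equation 2.9, and the matrix square root and spectral decomposition both come from an eigendecomposition. This is only an illustrative sketch, not the implementation used in our experiments.

```python
import numpy as np

def random_density_matrix(dim, n_senses, rng):
    """Mixture of random pure states, normalised to unit trace (equation 2.9)."""
    rho = np.zeros((dim, dim))
    for _ in range(n_senses):
        v = rng.normal(size=dim)
        rho += np.outer(v, v)
    return rho / np.trace(rho)

def mult(noun, verb):
    """Equation 2.10: element-wise multiplication."""
    return noun * verb

def fuzz(noun, verb):
    """Equation 2.11: [[v]]^(1/2) [[n]] [[v]]^(1/2)."""
    vals, vecs = np.linalg.eigh(verb)
    sqrt_verb = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    return sqrt_verb @ noun @ sqrt_verb

def phaser(noun, verb):
    """Equation 2.12: sum_i p_i P_i [[n]] P_i over the verb's spectral decomposition."""
    vals, vecs = np.linalg.eigh(verb)
    out = np.zeros_like(noun)
    for p, v in zip(vals, vecs.T):
        P = np.outer(v, v)
        out += p * P @ noun @ P
    return out

rng = np.random.default_rng(0)
noun = random_density_matrix(6, 2, rng)
verb = random_density_matrix(6, 3, rng)
for compose in (mult, fuzz, phaser):
    s = compose(noun, verb)
    s /= np.trace(s)  # normalise to a valid density matrix
    print(compose.__name__, np.trace(s))
```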

Besides these composition methods, all of which have been proposed for density matrices, we can also extend composition methods that have been proposed for vectors in DisCoCat to density matrix equivalents. We extend one such method, referred to as Tensor. Tensor was first proposed by Grefenstette and Sadrzadeh [20], and subsequently also used by Wijnholds and Sadrzadeh [55], as a method for creating matrices for complex words (specifically verbs) from their word vectors. It achieves this by taking the outer product of a word's vector with itself. Grefenstette and Sadrzadeh [20] showed that DisCoCat composition then simplifies to equations involving element-wise multiplications. We extend this method to density matrices by computing the tensor of a complex word as the Kronecker product (a generalisation of the outer product from vectors to matrices that produces a 4-dimensional tensor) of the word's density matrix with itself. Tensor composition is then computed, according to the DisCoCat framework, through tensor contractions between the tensors of complex word types and the density matrices of simple word types. This again simplifies to element-wise multiplications between the density matrices of words, the details of which depend on the grammatical structure of the phrase being composed (we list these simplified equations, along with all the composition equations that we use in our experiments, in appendix A).


3 Methods

Applying DisCoCat in practice consists of two independent steps. Firstly, some model has to build word representations (density matrices in our case) for all the words in a vocabulary. Secondly, when we want to compute the representation for a sentence, a method is needed for composing word representations. In this chapter we describe the methods that we will use for both of these steps. To build density matrices we propose two new models inspired by recent word embedding models. We outline their learning algorithms and show how they relate to neural methods. To compose density matrices we make use of four existing composition methods. We demonstrate how they compose words into sentences with some simple examples of composition.

3.1 Density matrix models

We propose three novel methods for building density matrices, which we refer to as BERT2DM, Word2DM, and multi-sense Word2DM. BERT2DM makes use of BERT - a pre-trained language model that produces contextualised word embeddings. Word2DM is inspired by the skip-gram model of Word2Vec - a prediction-based model that produces static word embeddings. Multi-sense Word2DM is a modification of Word2DM that explicitly models ambiguity.

3.1.1 BERT2DM

BERT is a 12-layer bidirectional transformer with weights that have been trained on the tasks of masked language modelling and next sentence prediction. The pre-trained model is publicly available and can be used to produce contextualised embeddings for a sequence of words. When BERT processes a sentence it produces, at each of its 12 layers, a vector representation for all the words in the sentence. All the representations are contextualised in the sense that they are computed as a function of the entire sentence. The transformer architecture of BERT uses the attention mechanism [1] to compute the representation of a word at each layer as a weighted sum of all the word representations in the preceding layer. This enables the model to incorporate information about a word's context into its representations. Since each layer has different weights, they incorporate different contextual information. It has been shown that different layers of BERT encode different aspects of linguistic structure [8, 25].

The contextualised embeddings produced by BERT encode the meaning of a word in a specific sentence. The representation of a word will be different, depending on the context in which it is used. This is the advantage of contextualised word embeddings over static word embeddings - they encode context-specific meaning. Polysemic words, disambiguated by their contexts, are allowed separate embeddings for different uses. The simplest way to obtain contextualised embeddings from BERT is to extract the final layer of word representations. In the original paper, Devlin et al. [14] also experiment with other ways of extracting contextualised embeddings. They extract different layers individually and test them as features. They also extract multiple layers and sum or concatenate them. Wiedemann et al. [54] showed that the contextualised embeddings produced by BERT for a polysemic word form clusters that correspond to its senses.

BERT2DM uses the contextualised embeddings of BERT to build density matrices that encode multiple senses of a word. BERT is applied to a corpus and all the contextualised embeddings for a word are combined to compute the word’s density matrix. This is achieved by summing the outer products of all the contextualised embeddings produced by BERT for a specific word when processing the corpus. For example, consider the following toy corpus consisting of three sentences, each containing the word star.

1. Astronomers discovered a star orbited by twenty planets.

2. Tom Hanks is the biggest star in Hollywood.

3. She is considered a rising star in the Labour party.

Processing each of these sentences with BERT produces three contextualised representations for star, which encode its meaning in each of the sentences. If these representations are |star_1⟩, |star_2⟩, and |star_3⟩ (refer back to the start of section 2.3 for an overview of the bra-ket notation used throughout this chapter), then the density matrix for star, denoted ⟦star⟧, is computed as

⟦star⟧ = |star_1⟩⟨star_1| + |star_2⟩⟨star_2| + |star_3⟩⟨star_3|.


1. Iterate the training corpus and for each sentence s:

(a) Process the sequence of words in s with BERT.

(b) Extract and store the contextualised embeddings produced by BERT for the words in s.

For a training corpus of length T this produces contextualised embeddings |w1⟩, |w2⟩, ..., |wT⟩ for all the words in the corpus.

2. Discard all contextualised embeddings corresponding to stop words.

3. (Optional) Cluster the contextualised embeddings of each word (using hierarchical agglomerative clustering for k = 2, ..., 10 and the variance reduction criterion to select the number of clusters) and subsequently use the cluster centroids as the contextualised embeddings of a word.

4. Apply dimensionality reduction (PCA or SVD) to the contextualised embeddings of all the words.

5. For each word v in the vocabulary of the corpus, compute its density matrix as

⟦v⟧ = ∑_{i ∈ ind(v)} |w̃_i⟩⟨w̃_i|,

where ind(v) are the indices at which the word v occurs in the corpus and |w̃_i⟩ is the reduced contextualised embedding |w_i⟩.

6. Normalise all density matrices by dividing by their traces.

Figure 3.1: Procedure for building density matrices with BERT2DM.

The density matrix of star is a mixed state, computed as a mixture of the pure states of star_1, star_2, and star_3. The procedure is outlined in figure 3.1. BERT2DM builds the density matrix of a word by combining contextualised embeddings obtained from different examples of the word's use. Assuming that some of these usage examples correspond to different senses of the word, the model encodes information about the different senses into the word's density matrix, thereby modelling the ambiguity of the word.
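A minimal sketch of steps 4-6 of figure 3.1, assuming the contextualised embeddings have already been extracted from BERT and grouped per word (random arrays stand in for them here); PCA from scikit-learn stands in for the dimensionality reduction step.

```python
import numpy as np
from sklearn.decomposition import PCA

def bert2dm(embeddings_per_word, d=50):
    """Steps 4-6 of figure 3.1.

    embeddings_per_word: dict mapping each word to an array of shape
    (num_occurrences, 768) of contextualised embeddings (assumed precomputed).
    """
    words = list(embeddings_per_word)
    # Stack all occurrences so that dimensionality reduction is fitted jointly.
    all_embs = np.concatenate([embeddings_per_word[w] for w in words])
    reduced = PCA(n_components=d).fit_transform(all_embs)

    density_matrices, start = {}, 0
    for w in words:
        n = len(embeddings_per_word[w])
        occ = reduced[start:start + n]
        start += n
        # Sum of outer products of the reduced contextualised embeddings...
        rho = occ.T @ occ
        # ...normalised by the trace to give a valid density matrix.
        density_matrices[w] = rho / np.trace(rho)
    return density_matrices

# toy usage with random stand-ins for BERT embeddings
rng = np.random.default_rng(0)
toy = {"star": rng.normal(size=(3, 768)), "planet": rng.normal(size=(2, 768))}
dms = bert2dm(toy, d=4)
print(dms["star"].shape, np.trace(dms["star"]))  # (4, 4), trace 1
```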

Contextualised embeddings can be extracted from BERT in different ways. We experiment with extracting individual layers and summing or concatenating multiple layers. BERT representations are 768-dimensional and concatenated representations are even larger. Taking the outer product of such vectors directly would lead to density matrices that take up a very large amount of memory (a vocabulary of 20,000 words, each of which has a 768 × 768 density matrix, would require almost 50GB to store). To avoid this, dimensionality reduction is applied to all the contextualised embeddings obtained from the corpus (step 4 in figure 3.1). This produces smaller contextualised representations for all the words in the corpus. The outer products of these smaller representations are then used to compute density matrices for words. Consider again the example of the word star in the toy corpus. When BERT processes the three sentences, it produces 768-dimensional vectors for all the words in the corpus. These can be stored in a T × 768 matrix, where T is the size of the training corpus (26 words in the case of the toy example). Among the rows of this matrix will be |star_1⟩, |star_2⟩, and |star_3⟩. Dimensionality reduction is applied to this matrix, resulting in a T × d matrix, where d ≤ 768. Among the rows of this new matrix will be the reduced vectors for star, denoted as |star̃_1⟩, |star̃_2⟩, and |star̃_3⟩. Then the d × d density matrix for star, denoted as ⟦star⟧, is computed as

⟦star⟧ = |star̃_1⟩⟨star̃_1| + |star̃_2⟩⟨star̃_2| + |star̃_3⟩⟨star̃_3|.

This ensures that the resulting density matrices are computationally feasible, but might also discard useful contextual information encoded by BERT.

We experiment with using singular value decomposition (SVD) and principal component analysis (PCA) for dimensionality reduction. The most important difference between SVD and PCA is that PCA centers the data (subtracts the mean) before reducing its dimensionality. Centering shifts the values of each dimension to have zero mean, which ensures that all dimensions have a more similar range of values. This can be useful if the particular range of a dimension is irrelevant or misleading. However, it is possible that BERT does encode contextual information in the actual values of a dimension. In that case centering the data would discard relevant information. Moreover, centering the BERT representations would alter relations between word representations (as measured by cosine similarity). Since it is not known whether the dimensions produced by BERT are shift-invariant, we experiment with both SVD and PCA.
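The difference between the two options comes down to centering, as the sketch below makes explicit; the random matrix stands in for the T × 768 matrix of contextualised embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(100, 768))   # stand-in for the T x 768 BERT matrix
d = 10

def svd_reduce(X, d):
    # Plain truncated SVD: keeps the original offsets of each dimension.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :d] * S[:d]

def pca_reduce(X, d):
    # PCA: identical to SVD after first subtracting the per-dimension mean.
    return svd_reduce(X - X.mean(axis=0), d)

print(svd_reduce(X, d).shape, pca_reduce(X, d).shape)  # (100, 10) (100, 10)
```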

We remove the contextualised embeddings of stop words before applying dimensionality reduction. Stop words don’t add significant semantic information and including them might lower the quality of the reduced representations.

We also experiment with clustering the contextual embeddings of a word and applying dimensionality reduction to the cluster centroids instead of the contextualised embeddings. The motivation for this is that clustering contextualised embeddings can produce clusters that correspond to distinct senses (as shown by Wiedemann et al. [54]). Using cluster centroids could eliminate some of the noise present in contextualised embeddings, and lead to reduced representations that better capture the distinct meanings of a word. This clustering procedure is also similar to the approach of Schütze [51], in which the context embeddings of a word are clustered and the resulting cluster centroids are used as multi-sense embeddings. In our case, however, we use the cluster centroids to build a density matrix that encodes the different senses of a word.
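A minimal sketch of this clustering variant, assuming k-means is used and the number of clusters per word is fixed in advance (both assumptions are made here purely for illustration; word_embs would hold the contextualised embeddings collected for a single word):

import numpy as np
from sklearn.cluster import KMeans

def sense_centroids(word_embs, n_clusters=3):
    # Cluster a word's contextualised embeddings; each centroid approximates one sense.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(word_embs)
    return km.cluster_centers_

The centroids would then be reduced and summed as outer products, exactly as the individual contextualised embeddings are in figure 3.1.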

3.1.2 Word2DM

The skip-gram model of Word2Vec is a procedure for training static word embeddings. It starts by randomly initialising the embeddings for all the words in the vocabulary of a corpus. It then processes the corpus one word at a time and, at each word, tries to predict the surrounding words (within a context window of some size). The objective function that it optimises at each prediction with regard to the model parameters θ is

   J(θ) = log σ(v_t⊤ v_c) + Σ_{k=1}^{K} log σ(−v_t⊤ v_{w_k}),    (3.1)

where v_t is the embedding of the target word, v_c is the embedding of the context word, and v_{w_1}, v_{w_2}, ..., v_{w_K} are the embeddings of the K negative samples (words drawn from some noise distribution that are false contexts of the target word). By optimising equation 3.1 over a large corpus, skip-gram learns word embeddings that encode distributional information. Maximising equation 3.1 adjusts the embeddings of words occurring in the same context to be more similar and adjusts the embeddings of words that don't occur together to be less similar. This becomes clear when we consider the gradients used to update embeddings during training. We briefly recall the details of the gradient calculation so as to refer back to it later in this section. The derivative of equation 3.1 with respect to the target vector v_t is

   ∂J/∂v_t = (1 − σ(v_t⊤ v_c)) v_c − Σ_{k=1}^{K} (1 − σ(−v_t⊤ v_{w_k})) v_{w_k},    (3.2)

which is used to update the target vector as follows:

   v_t ← v_t + α ∂J/∂v_t    (3.3)

The target vector is updated by adding the scaled context vector to it and subtracting the scaled negatively sampled vectors from it. The true context vector is scaled by how dissimilar it is to the target vector, and each negative sample by how similar it is to the target vector. This ensures that the target vector is “pulled closer” to the true context vector and “pushed away” from the negative context vectors. It is this computationally simple training procedure which makes skip-gram with negative sampling so effective.
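For reference, a minimal numpy sketch of one such update of the target vector (equations 3.2 and 3.3); the embedding matrices, indices and learning rate are hypothetical placeholders rather than the actual Word2Vec implementation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_target(V_t, V_c, t, c, neg_ids, lr=0.025):
    # V_t and V_c hold the target and context embeddings as rows.
    v_t = V_t[t]
    # "Pull" towards the true context vector, scaled by its dissimilarity to v_t.
    grad = (1 - sigmoid(v_t @ V_c[c])) * V_c[c]
    # "Push" away from each negative sample, scaled by its similarity to v_t.
    for k in neg_ids:
        grad -= (1 - sigmoid(-v_t @ V_c[k])) * V_c[k]
    V_t[t] = v_t + lr * grad   # equation 3.3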

We would like to design an equivalent training procedure for density matrices. When training density matrices, we must ensure that they satisfy the conditions resulting from their definition. Density matrices must be (1) Hermitian, (2) positive semi-definite, and (3) with trace equal to one. Condition (1) can be satisfied by enforcing symmetry, since we only work with real numbers. Condition (3) is trivial to enforce, since any matrix can be normalised to have unit trace. Condition (2) is the most challenging to enforce, since the optimisation techniques used for training adjust weights (the entries of the matrix) indiscriminately according to their influence on the objective function. Adjusting the entries of a density matrix during optimisation could cause the matrix to lose its positive semi-definiteness, so it would no longer be a valid density matrix. To avoid this we utilise the following property of positive semi-definiteness:

Property 1. For any matrix B, the product BB⊤ is positive semi-definite.
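This follows directly from the definition: for any real matrix B and any vector x, x⊤BB⊤x = (B⊤x)⊤(B⊤x) = ∥B⊤x∥² ≥ 0, and (BB⊤)⊤ = BB⊤, so BB⊤ is both symmetric and positive semi-definite.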

We enforce positive semi-definiteness on the matrices that we train by training the weights of a matrix B and computing our density matrix as A = BB⊤. By updating the weights of B and computing A we indirectly train positive semi-definite matrices. This allows us to train valid density matrices with a procedure similar to Word2Vec's skip-gram with negative sampling. To achieve this we modify the training objective of skip-gram to maximise the similarity of the density matrices of co-occurring words. Word2Vec uses the inner product of two word embeddings as a proxy for semantic similarity during training. Similarly, we can use the trace inner product defined on matrices as a proxy for semantic similarity when training density matrices. The objective function at each target-context prediction is then

   J(θ) = log σ(tr(A_t A_c)) + Σ_{k=1}^{K} log σ(−tr(A_t A_{w_k})),    (3.4)

where A_t and A_c are the density matrices of the target and context words respectively, A_{w_1}, A_{w_2}, ..., A_{w_K} are the density matrices of the K negative samples, and θ is the set of weights of the intermediary matrices B_t, B_c and B_{w_1}, B_{w_2}, ..., B_{w_K}. Computing this objective function naively requires multiple matrix multiplications. For each tr(A_t A_c) term (including the terms of the K negative samples), the matrices A_t and A_c have to be computed as A_t = B_t B_t⊤ and A_c = B_c B_c⊤, and then the matrix product A_t A_c has to be computed. This means that, for each target-context prediction, we require 3(K + 1) matrix multiplications. One of the most attractive features of Word2Vec is its computational efficiency, which enabled training on very large corpora in reasonable time. The introduction of multiple matrix multiplications into the objective function means that much of this efficiency is lost. In order to reduce the complexity of our model, we make use of the following property and lemma to find a new objective function that is computationally simpler, but equivalent to equation 3.4.

Property 2. The trace of the product of two matrices can be expressed as the sum of the element-wise products of their elements. If A is an n × m matrix and B is an m × n matrix, then the trace of the n × n matrix AB can be computed as

   tr(AB) = Σ_{i=1}^{n} Σ_{j=1}^{m} a_{ij} b_{ji}

Lemma 3. If B_t and B_c are n × m intermediary matrices, then the trace of the matrix product A_t A_c can be written as the sum of the squared elements of the m × m matrix C = B_c⊤ B_t:

   tr(A_t A_c) = Σ_{i=1}^{m} Σ_{j=1}^{m} c_{ij}²

Proof. We can express tr(A_t A_c) as a trace computation involving the intermediary matrices B_t and B_c:

   tr(A_t A_c) = tr(B_t B_t⊤ B_c B_c⊤)

Then we can use the cyclic property of the trace function to rewrite this as the trace of the product of a matrix C and its transpose:

   tr(A_t A_c) = tr(B_c⊤ B_t B_t⊤ B_c)
               = tr(B_c⊤ B_t (B_c⊤ B_t)⊤)
               = tr(CC⊤),   where C = B_c⊤ B_t

Now we can use property 2 to express this as the sum of the element-wise products of the elements of C and its transpose:

   tr(A_t A_c) = Σ_{i=1}^{m} Σ_{j=1}^{m} [C]_{ij} [C⊤]_{ji} = Σ_{i=1}^{m} Σ_{j=1}^{m} c_{ij} c_{ij} = Σ_{i=1}^{m} Σ_{j=1}^{m} c_{ij}²
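As a quick numerical sanity check of lemma 3 (not part of the training procedure itself), the identity can be verified for random intermediary matrices:

import numpy as np

n, m = 8, 5
B_t, B_c = np.random.randn(n, m), np.random.randn(n, m)
A_t, A_c = B_t @ B_t.T, B_c @ B_c.T

lhs = np.trace(A_t @ A_c)              # direct computation of tr(A_t A_c)
rhs = ((B_c.T @ B_t) ** 2).sum()       # sum of squared elements of C = B_c⊤ B_t
assert np.isclose(lhs, rhs)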


1. Iterate the training corpus and build a vocabulary of words. For each word v in the vocabulary, randomly initialise an n × m matrix B_v.

2. Iterate the training corpus and at each word w_t, iterate the context words w_c = w_{t−l}, ..., w_{t−1}, w_{t+1}, ..., w_{t+l} within a window of size 2l, and at each word w_c:

   (a) Sample K negative samples from some noise distribution.

   (b) Maximise equation 3.5 with regard to B_t, B_c, and B_{w_k} for k = 1, ..., K.

3. For each word v in the vocabulary, compute A_v = B_v B_v⊤ and normalise it as Ã_v = A_v / tr(A_v) to obtain the final density matrix of v.

Figure 3.2: Procedure for building density matrices with Word2DM.

This allows us to rewrite equation 3.4 to find an equivalent objective function that requires fewer computations than straightforward matrix multiplication would. The objective function at each target-context prediction becomes

   J(θ) = log σ( Σ_{i=1}^{m} Σ_{j=1}^{m} [B_c⊤ B_t]²_{ij} ) + Σ_{k=1}^{K} log σ( − Σ_{i=1}^{m} Σ_{j=1}^{m} [B_{w_k}⊤ B_t]²_{ij} ).    (3.5)

By using the result of lemma 3 we have reduced the number of matrix multiplications required for each target-context prediction from 3(K + 1) to (K + 1). Density matrices are trained by maximising equation 3.5 with respect to the intermediary matrices B_t, B_c, B_{w_1}, ..., B_{w_K} over a large corpus. The training procedure is outlined in figure 3.2.
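A sketch of how the objective of equation 3.5 could be computed for a single target-context prediction, assuming the intermediary matrices are stored as a parameter tensor B of shape (V, n, m) with one matrix per vocabulary word; the tensor layout and function names are assumptions for illustration, and the returned value is the negated objective so that it can be minimised with stochastic gradient descent:

import torch
import torch.nn.functional as F

def word2dm_loss(B, t, c, neg_ids):
    B_t, B_c, B_neg = B[t], B[c], B[neg_ids]                       # (n, m), (n, m), (K, n, m)
    pos = ((B_c.t() @ B_t) ** 2).sum()                             # tr(A_t A_c), by lemma 3
    neg = ((B_neg.transpose(1, 2) @ B_t) ** 2).sum(dim=(1, 2))     # one value per negative sample
    return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum())

After training, A_v = B_v B_v⊤ is computed for every word and normalised to unit trace, as in step 3 of figure 3.2.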

In addition to basing our model's objective function on skip-gram, we also use other aspects of Word2Vec's training algorithm introduced by Mikolov et al. [37, 38]. We use a dynamic window size, i.e. the size of each context window is sampled uniformly between 1 and the maximum window size. We also discard words that occur fewer times than some minimum threshold and sub-sample frequently occurring words. Negative samples are drawn from a unigram distribution raised to the power of 3/4. Furthermore, we train two density matrices for each word - one that represents it as a target word and another that represents it as a context word. After training we use the target density matrices as our final density matrices.
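A small sketch of drawing negative samples from the smoothed unigram distribution, assuming counts maps each vocabulary word to its corpus frequency (the names and the sampling call are illustrative):

import numpy as np

def noise_distribution(counts, power=0.75):
    words = list(counts)
    probs = np.array([counts[w] for w in words], dtype=float) ** power
    return words, probs / probs.sum()

def sample_negatives(words, probs, K=5):
    return list(np.random.choice(words, size=K, p=probs))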

The model is trained using stochastic gradient descent. We now derive the gradients used to update B_t during training, and subsequently show that these gradients lead to suboptimal updates to the density matrices during training. Deriving the gradient with respect to B_c and B_{w_k} would proceed similarly. To compute the gradients of equation 3.5 we first rewrite it in terms of the elements of the n × m matrices B_t, B_c, and B_{w_k}:

   J(θ) = log σ( Σ_{i=1}^{m} Σ_{j=1}^{m} ( Σ_{l=1}^{n} b^c_{li} b^t_{lj} )² ) + Σ_{k=1}^{K} log σ( − Σ_{i=1}^{m} Σ_{j=1}^{m} ( Σ_{l=1}^{n} b^{w_k}_{li} b^t_{lj} )² ),    (3.6)

where b^x_{pq} denotes the pq-th element of B_x. We derive the gradient of this objective function with respect to b^t_{pq}, an element of the intermediary target word matrix B_t. In order to use the chain rule in gradient calculations we rewrite J(θ) as a composite function:

   J(θ) = log σ(y(θ)) + Σ_{k=1}^{K} log σ(z_k(θ)),    (3.7)

where

   y(θ) = Σ_{i=1}^{m} Σ_{j=1}^{m} ( Σ_{l=1}^{n} b^c_{li} b^t_{lj} )²   and   z_k(θ) = − Σ_{i=1}^{m} Σ_{j=1}^{m} ( Σ_{l=1}^{n} b^{w_k}_{li} b^t_{lj} )².

The derivative of J with respect to b^t_{pq} can now be computed as follows:

   ∂J/∂b^t_{pq} = (∂ log σ/∂σ)(∂σ/∂y)(∂y/∂b^t_{pq}) + Σ_{k=1}^{K} (∂ log σ/∂σ)(∂σ/∂z_k)(∂z_k/∂b^t_{pq})
                = (1/σ(y)) σ(y)(1 − σ(y)) (∂y/∂b^t_{pq}) + Σ_{k=1}^{K} (1/σ(z_k)) σ(z_k)(1 − σ(z_k)) (∂z_k/∂b^t_{pq})
                = (1 − σ(y)) (∂y/∂b^t_{pq}) + Σ_{k=1}^{K} (1 − σ(z_k)) (∂z_k/∂b^t_{pq})
                = (1 − σ(y)) Σ_{i=1}^{m} 2 b^c_{pi} Σ_{l=1}^{n} b^c_{li} b^t_{lq} − Σ_{k=1}^{K} (1 − σ(z_k)) Σ_{i=1}^{m} 2 b^{w_k}_{pi} Σ_{l=1}^{n} b^{w_k}_{li} b^t_{lq}
                = (1 − σ(y)) 2 [B_c B_c⊤ B_t]_{pq} − Σ_{k=1}^{K} (1 − σ(z_k)) 2 [B_{w_k} B_{w_k}⊤ B_t]_{pq}

The last line in the above derivation is obtained by rewriting the summation expressions as equivalent matrix multiplications. We can now write the derivative of J with respect to the full intermediary matrix B_t:

   ∂J/∂B_t = (1 − σ(y(θ))) 2 B_c B_c⊤ B_t − Σ_{k=1}^{K} (1 − σ(z_k(θ))) 2 B_{w_k} B_{w_k}⊤ B_t    (3.8)


As opposed to the gradients of Word2Vec (equation 3.2), the gradients of Word2DM do not lead to simple and easily interpretable training updates. As discussed in the paragraph following equation 3.3, in Word2Vec the target vector is made more similar to the context vector and less similar to the negative context vectors. Ideally we would like something similar to occur in Word2DM with density matrices, but equation 3.8 shows that we lose the intuitive training updates of Word2Vec through the introduction of intermediary matrices. Furthermore, we can show that the gradients of Word2DM sometimes lead to unwanted consequences in training.

Consider the case where the density matrices of a target and context word are highly dissimilar. Recall from equation 3.4 that y in equation 3.8 is the trace inner product of the density matrices A_t and A_c (the measure we use to quantify semantic similarity). The minimum value of the trace inner product of two density matrices is zero (this follows from the fact that density matrices are positive semi-definite), so two density matrices are highly dissimilar when their trace inner product is close to zero, i.e. y ≈ 0. From equation 3.5 we can recall how y can be written in terms of the intermediary matrices:

   y = Σ_{i=1}^{m} Σ_{j=1}^{m} [B_c⊤ B_t]²_{ij}

Consider that y ≈ 0 if and only if the elements of B_c⊤ B_t are close to zero in value, since squaring the elements in the summation makes them all positive. We have established the following equivalence:

   tr(A_t A_c) ≈ 0 ⟺ B_c⊤ B_t ≈ O,

where O is the m × m matrix with all zero entries. Consider how this will affect the target-context update during training. The first term of the gradient in equation 3.8 becomes

   (1 − σ(y(θ))) 2 B_c B_c⊤ B_t ≈ (1 − σ(0)) 2 B_c O = O,

so the target-context update becomes ineffective for true contexts. The update should make the density matrix of the target word more similar to that of the context word, but the gradient is so small that it makes this impossible. Moreover, the more dissimilar the target and context density matrices are before the update, the less effective the update will be. This is the opposite of the intended effect (achieved by Word2Vec) in which the magnitude of the target-context update should increase if the target and context representations are less similar. This is an example of how the introduction of intermediary matrices in Word2DM leads to suboptimal training updates. We ensure that our density matrices are positive semi-definite, but lose the guarantee that the algorithm will learn high-quality semantic representations.
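The problem can be made concrete with a small numerical example, using hypothetical intermediary matrices whose column spaces are orthogonal (so that B_c⊤ B_t is exactly the zero matrix):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

B_t = np.array([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])
B_c = np.array([[0., 0.], [0., 0.], [1., 0.], [0., 1.]])

y = ((B_c.T @ B_t) ** 2).sum()                        # trace inner product of A_t and A_c
grad_pos = (1 - sigmoid(y)) * 2 * (B_c @ B_c.T @ B_t)
print(y, np.abs(grad_pos).max())                      # both 0.0: no "pull" towards the context word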
