
Entailment between noun-verb phrases with compositional distributional semantics

Emma van Proosdij
10663657

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: dr. M.A.F. Lewis
Institute for Language and Logic
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Contents

1 Introduction
2 Theoretical foundations
  2.1 Distributional semantics
  2.2 Formal semantics
  2.3 Compositional distributional semantics
  2.4 Categorical compositional distributional semantics
  2.5 Entailment
    2.5.1 Entailment in formal semantics
    2.5.2 Lexical entailment in distributional semantics
    2.5.3 Entailment between phrases and sentences
3 Methods
  3.1 Verb-noun sentences
  3.2 Function words
  3.3 Creating sentence spaces for the training set
  3.4 Identifying entailment
    3.4.1 Quantifier-verb matrices
    3.4.2 Linear maps between sentence spaces
    3.4.3 Maps obtained by linear regression
    3.4.4 Analytical solutions
    3.4.5 Separate verbs
    3.4.6 Other composition rules
  3.5 Experimental setup
    3.5.1 Selection of the vocabulary
    3.5.2 Applied approaches
    3.5.3 Evaluation
4 Results
5 Discussion
  5.1 Evaluation of results
  5.2 Critiques on methods
  5.3 Further research
    5.3.1 Implementation
    5.3.2 Other methods
6 Conclusion

Abstract

Distributional semantics represents words as co-occurrence vectors obtained from a large corpus. A similarity measure between words can be obtained by comparing these vectors, but this does not scale up to phrase or sentence level. Formal semantics represents sentences as logical formulas and provides sophisticated models for sentence meaning, but is less focused on individual words. Compositional distributional semantics combines these two approaches: nouns are represented as vectors and other parts of speech such as verbs and adjectives as linear maps. The similarity measures obtained by these methods are somewhat loose, and many AI applications such as question answering, information extraction and paraphrase acquisition need a tighter similarity criterion such as entailment. Entailment between two clauses in formal semantics is defined as: "it cannot be the case that the former is true and the latter is false". In this thesis, entailment between verb-noun phrases in combination with different quantifiers is obtained by learning verb matrices from an entailment-based sentence space. These matrices are applied to distributional noun vectors to obtain entailment relations between different sentences. Different quantifiers and function words are examined as well. Although the results look promising, more research is needed to estimate the practical value of this approach.

1 Introduction

Semantics is the study of meaning in language. A major contribution of artificial intelligence to this broad and complex field is to provide models that simulate human understanding of language. These models are necessary for multiple language-related tasks such as machine translation, information extraction and information retrieval [14]. Traditionally there have been two fields of semantics: distributional semantics and formal semantics.

Distributional semantics is founded on the hypothesis that "meaning is use", which Wittgenstein proposed in 1953 [14], or, as the linguist Firth put it, "you shall know a word by the company it keeps" [14]. Meaning in this field is often modelled as similarity, and depends on the notion that similar words occur in similar contexts. A co-occurrence matrix can be constructed from a large corpus. This matrix can be interpreted as a collection of vectors in a semantic vector space, where each vector represents a word.

This approach has demonstrated useful results in many experiments [14], but primarily at the lexical level. Once combinations of words or whole sentences are involved, performance drops significantly. Either the co-occurrence of complete phrases or sentences must be counted directly, which will rarely be observed in a corpus, or words are combined as a 'bag of words' and the structure of the sentence is not taken into account [11].

Whereas distributional semantics has the word as its basic unit, formal semantics focuses on the sentence as its primary notion [3]. Sentences are composed from representations of their parts and are represented as logical formulas.

This has provided sophisticated models of sentence meaning, but these are primarily functional at the sentence level. At the lexical level the focus is often put on function words, while nouns, verbs and adjectives are represented as sets of entities. Combinations of these sets are represented as intersections, but this model has proven to fail in many cases [2].

Distributional semantics and formal semantics have somewhat opposing strengths and weaknesses [4]. Distributional semantics performs well at the lexical level, but does not scale up to the sentence level. Formal semantics by definition scales up to the sentence level but is weaker at the individual word level.

The aim of compositional distributional semantics is to combine these two fields and construct meaning in a compositional way. Baroni & Zamparelli [3] have proposed an approach where nouns are represented as distributional vectors and other parts of speech, such as verbs, adjectives and logical connectives, as linear maps. When these linear maps are applied to noun vectors, the result represents the meaning of the combination.

Coecke et al. [4] have extended this notion by taking the syntactic structure of a sentence into account. By applying a function that represents the grammatical decomposition to the tensor product of the representations of the words, the meaning of a sentence can be calculated.

With these approaches the similarity between two sentences can be calculated, so, for example, a high similarity score will be obtained between the sentences 'all men talk' and 'all fathers talk'. This notion of similarity is, however, somewhat loose. Many semantic information-oriented applications like Question Answering, Information Extraction and Paraphrase Acquisition require a similarity criterion that is more specific [5]. Specifically, these applications need information on whether one sentence can be replaced by another and still be true. In the example provided above this would be possible, since 'all men talk' automatically implies that 'all fathers talk'. This notion is called 'entailment'. In formal semantics entailment between two clauses is defined as: it cannot be the case that the former is true and the latter is false [2].

Geffet & Dagan [5] have proposed a hypothesis to identify entailment between words. Entailment above the word level was researched by Baroni et al. [2]. They trained a classifier on positive and negative examples of entailment between simple sentences and managed to identify such entailment relations accurately. Balkir et al. [1] have approached entailment above the word level in a compositional manner. By applying the structure of Coecke et al. [4] to the entailment hypothesis of Geffet & Dagan [5], they formulated a hypothesis to identify entailment between sentences.

This research aims to identify entailment between phrases and sentences in a compositional manner, similar to Baroni et al. [2] and Balkir et al. [1]. Specifically, verb-noun phrases are considered, such as 'fathers talk'. These sentences are extended to contain different function words such as quantifiers and negations. In contrast to Baroni et al. [2] and Balkir et al. [1], the method used to identify entailment between sentences is similar to that of Baroni & Zamparelli [3].

A matrix that represents a verb is learned from distributional noun vectors and the entailment relations between training sentences. This matrix is then applied to a distributional noun vector and returns a vector that represents whether the corresponding sentence entails each of the training sentences or not. Different approaches are discussed to extend these sentences to contain function words. An experiment on a simplified grammar is conducted to test these approaches.

Specifically, this research investigates whether semantic matrices for verbs and function words, obtained by linear regression on an entailment-based sentence space and distributional noun vectors, can identify entailment vectors of sentences.

To examine this, simple sentences consisting of one verb and one noun are considered first, and linear regression on an entailment-based sentence space is hypothesised to be a suitable method for this task. After that, sentences are extended to contain different function words, and several approaches for identifying entailment between these somewhat more complex sentences are examined. An experiment on a small vocabulary is conducted to test the usefulness of these approaches.

This thesis is organised as follows. First, the theoretical foundations of the subject are discussed in more detail. After that, the research approach is examined, first providing a theoretical overview of the considered methods and then more information about the test implementation. The results are presented and discussed afterwards.

2 Theoretical foundations

2.1 Distributional semantics

In distributional semantics, words are represented as vectors. The simplest way to obtain such vectors is with matrix factorisation methods [12]. For every word in a large corpus its neighbouring words are counted and represented in a term co-occurrence matrix. The columns of that matrix are the vectors that represent the words. For instance, the word 'father' might occur 10 times in the neighbourhood of 'mother', 2 times in the neighbourhood of 'football' and 0 times in the neighbourhood of 'pineapple'. If only these three context words are considered, the vector for 'father' can be represented as [10, 2, 0]. Similarity between two words can be obtained by calculating the cosine similarity between their distributional vectors.
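As a minimal sketch of this setup (the co-occurrence counts for 'mother' and 'pineapple' below are invented for illustration, not taken from a corpus), the cosine similarity between two count vectors can be computed as follows:

```python
import numpy as np

# Toy co-occurrence counts over three context words (mother, football, pineapple).
father = np.array([10.0, 2.0, 0.0])
mother = np.array([8.0, 1.0, 0.0])      # hypothetical counts
pineapple = np.array([0.0, 0.0, 12.0])  # hypothetical counts

def cosine(v, w):
    """Cosine similarity between two co-occurrence vectors."""
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

print(cosine(father, mother))     # high: similar contexts
print(cosine(father, pineapple))  # zero: no shared contexts
```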

A disadvantage of this simple approach is that frequent words contribute a disproportionate amount to the similarity measure [12]. Words such as 'the' and 'a' occur often but provide little information about semantic similarity. A number of techniques are available to account for this problem, such as entropy- or correlation-based normalisation, or a root-type transformation in the form of Hellinger PCA [12].

Another way to represent word vectors is with shallow window-based methods. These methods use neural networks to make predictions within local context windows [12].

The two approaches perform better on different tasks. GloVe [12] provides vectors that exploit the advantages of both techniques, yielding meaningful vector spaces that perform well on a word analogy task. The GloVe vectors have been published for open use, and they are used in this research.
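The pre-trained GloVe vectors are distributed as plain text files with one word followed by its vector components per line. A minimal loader might look like the sketch below; the file name is only an example, since the thesis does not state which pre-trained set was used.

```python
import numpy as np

def load_glove(path):
    """Read GloVe text-format vectors (one word plus its floats per line) into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float64)
    return vectors

glove = load_glove("glove.6B.300d.txt")   # hypothetical file name
father = glove["father"]
```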

2.2 Formal semantics

Whereas distributional semantics focuses on individual words, formal semantics focuses on sentences and represents them as logical formulas [3]. At the lexical level, formal semantics focuses on function words, words that have little lexical meaning and express grammatical relationships between other words within a sentence [3]. One of its main goals is to formulate rules for composing complex sentences using these function words [3].

Because of this focus on function words, the meanings of nouns and verbs are often considered as pure extensions: nouns, verbs and adjectives are properties and thus denote sets of entities. 'Fathers' is considered as 'the set of entities that are a father', and 'talking' as 'the set of entities that talk'. In the simplest case the meaning of an adjective-noun phrase is considered as the intersection between the two sets of entities, so 'red car' is represented as red objects ∩ cars [3]. However, this method of combination fails in many cases [3]: for instance, a fake gun is not a gun [3]. These problems have led to a demand for other representations of adjectives, such as functions that map the meaning of a noun to that of a modified noun, but it remains unclear how these functions should be constructed [3].

Formal semantics has provided sophisticated models of sentence meaning, but it does not provide a representation of verbs, nouns and adjectives that scales up to real-life tasks [2].

2.3 Compositional distributional semantics

Representing the meaning of a sentence with only distributional semantics is close to impossible, because the sentence itself, together with its contexts, would have to occur multiple times in a corpus, which is highly unlikely; this holds even more for multiple sentences or larger pieces of text.

A naive way to represent the meaning of a sentence is the 'bag of words' model. A sentence is considered to be an unordered set of words, and its meaning can be calculated by addition or multiplication of the distributional representations of the words. The word order and the grammatical relations between the words are not considered in this model, so 'Cats eat mice' would have the same meaning as 'Mice eat cats', which evidently does not mean the same thing to a human reader.

Mitchell & Lapata [11] have proposed more sophisticated models to construct sentence meaning, in which they take more information about the roles of words in sentences into account: additive models and multiplicative models. The additive model is described by:

p = A·v + B·w

where v and w are word vectors and p represents the meaning of a phrase; A and B are matrices that represent syntactic or contextual information. The multiplicative model is described by:

p = C·v·w

where v and w are word vectors, C represents syntactic or contextual information, and p is the phrase vector. Although their evaluation showed that these models work better than simple 'bag of words' models, the matrices were simplified to reduce the number of parameters, and part of the parameters remain unclear [15]. Zanzotto et al. [15] have explored the additive models by learning a non-simplified version of the matrices. They obtained these matrices by training them on dictionary entries, because dictionary entries tend to be multi-word descriptions of one word. This provided more sophisticated models that proved to be more accurate than the simplified ones.

Baroni & Zamparelli [3] have explored adjective-noun phrases with a model that represents adjectives as linear maps that transform a noun into an adjective-noun phrase [3], similar to formal semantics. The model they examine is related to the multiplicative models of Mitchell & Lapata [11]:

p = A·n

where A is a matrix that represents an adjective, n a noun vector and p the phrase vector. Every adjective is represented by a different matrix in this model. The matrices are learned by collecting noun and adjective-noun phrase vectors in a training set and applying linear regression. This approach proved to outperform similar methods when tested on adjective-noun phrases that did not occur in the training set.
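To make this regression step concrete, the sketch below estimates an adjective matrix A from pairs of noun vectors and observed phrase vectors. The random arrays stand in for real corpus-derived vectors, and ordinary least squares (via the pseudoinverse) is used here for brevity, whereas Baroni & Zamparelli used partial least squares regression.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 200                     # vector dimension and number of training pairs (made up)
nouns = rng.normal(size=(d, k))    # columns: distributional noun vectors
phrases = rng.normal(size=(d, k))  # columns: observed adjective-noun phrase vectors

# Find A minimising ||A @ nouns - phrases||_F, i.e. A = phrases @ pinv(nouns).
A = phrases @ np.linalg.pinv(nouns)

new_noun = rng.normal(size=d)
predicted_phrase = A @ new_noun    # composed meaning of 'adjective + unseen noun'
```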

2.4 Categorical compositional distributional semantics

Compositional distributional semantics examines meaning at the phrase or sentence level. In categorical compositional distributional semantics the specific grammatical structure of a sentence is considered, and meanings of phrases and sentences are functions of that structure [1]. Coecke et al. [4] have constructed a mathematical framework to identify a similarity measure between grammatically different sentences. This framework depends on 'pregroup grammar' to identify syntactic structure, and on 'tensor spaces' to represent the meanings of words. Because these representations share the property of compact closure, meanings of sentences can be mapped to the same 'sentence space'. By applying a function that represents the grammatical decomposition to tensor products of representations of words, the meaning of a sentence can be calculated. First, every word in a sentence is assigned a grammatical type. Using pregroup grammar [4], the way a sentence is composed is identified; if a sentence is properly constructed, there is a specific way to reduce all grammatical elements to a sentence with pregroup grammar. After that, a vector space is assigned to every word according to its grammatical type. A meaning vector is then constructed for every part, and the tensor product of these elements is taken to obtain a meaning representation of the sentence.

Coecke et al. [4] have proposed this mathematical framework, but did not provide a concrete implementation. Grefenstette et al. [7] gave specific instances of how it could be realised. Grefenstette & Sadrzadeh [6] have implemented an experimental version of this framework and obtained satisfactory results on a word disambiguation task.

2.5 Entailment

Compositional distributional semantics approaches manage to identify a similarity measure between two phrases. Using these methods, a probable outcome would be that the sentences 'all men talk' and 'all fathers communicate' have a high similarity. However, many semantic information-oriented applications like Question Answering, Information Extraction and Paraphrase Acquisition require more information than this somewhat loose notion of similarity [13]. In particular, all these applications need to know when the meaning of one phrase can be inferred from another phrase, so that one phrase could substitute the other in some contexts [2]. In the case of 'all men talk', 'all fathers communicate' can be inferred because all fathers are men and talking is a type of communication. Formal semantics and distributional semantics have taken different approaches to identifying entailment relations between words and sentences.

2.5.1 Entailment in formal semantics

Entailment is a core notion of logic, which is used in formal semantics [3]. The entailment relation between two sentences states that it cannot be the case that the former is true and the latter is false [2]. Formal semantics allows this notion of entailment to hold not only between sentences, but also between phrases and words. For example, father |= man, because all fathers are men. The formal semantics account of entailment has gradually become somewhat looser: if a human observer judges that one sentence follows from another, they are taken to be entailing [2].

Pattern-based approaches are often used to identify these entailment relations [2]. A popular method is that of Hearst [8] that searches for patterns like ’... such as...’ in a large corpus. Occurrences of ’insects such as beetles’ will indicate that beetles |= insects. While approaches like this have been successful, they are mostly restricted to the word level.

MacCartney & Manning [10] describe the role of quantifiers in entailment between phrases. The meaning of a compound expression can be seen as the result of function application on mathematical representations of other words. A function is 'upward monotone' (↑) if replacing its argument by a more general one preserves entailment: 'tango in Paris' |= 'dance in France', since 'tango' |= 'dance' and 'Paris' |= 'France'. A function is 'downward monotone' (↓) if replacing its argument by a more specific one preserves entailment: 'didn't dance' |= 'didn't tango', because 'tango' |= 'dance' [10]. Some quantifiers are considered as binary functions having different monotonicities in their different arguments. For instance, 'every' is downward monotone in its first argument but upward monotone in its second argument: 'every fish swims' |= 'every shark moves', because 'shark' |= 'fish' and 'swims' |= 'moves'.

2.5.2 Lexical entailment in distributional semantics

Entailment between words is not immediately apparent from their distributional vector representations, because these are constructed based on similarity, not entailment. However, various researchers have suggested that entailment is the ability of one term to "substitute" for another [2]. For example, in a sentence where the word 'father' occurs, the word 'man' could also occur, but vice versa this need not be true. 'Father' could therefore be considered "narrower" than "man", and this would imply that father |= man. Geffet & Dagan [5] have proposed that "given a pair of words, this relation holds if there are some contexts in which one of the words can be substituted by the other, such that the meaning of the original word can be inferred from the new one", and refined this notion with two hypotheses [13]:

Hypothesis I: If vi => wj, then all the characteristic (syntactic-based) features of vi are expected to appear with wj.

Hypothesis II: If all the characteristic (syntactic-based) features of vi appear with wj, then we expect that vi => wj.

With these hypotheses, various measures have been crafted to identify entailment between distributional vectors, such as the balAPinc measure proposed by Kotlerman et al. [9].
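The sketch below illustrates the intuition behind such directional measures with a deliberately simplified inclusion score: the fraction of one word's feature mass that also occurs with the other word. This is not balAPinc itself (which additionally ranks and balances features), only a toy stand-in for the distributional inclusion idea.

```python
import numpy as np

def inclusion_score(v, w):
    """Fraction of v's co-occurrence mass that also appears with w.

    If (nearly) all contexts of v also occur with w, v is a candidate
    hyponym of w, i.e. v |= w (a crude version of Hypothesis II above)."""
    if v.sum() == 0:
        return 0.0
    shared = np.minimum(v, w)          # feature mass of v that w also has
    return float(shared.sum() / v.sum())

father = np.array([10.0, 2.0, 0.0])    # toy co-occurrence vectors
man = np.array([12.0, 5.0, 1.0])
print(inclusion_score(father, man))    # close to 1: 'father' |= 'man' plausible
print(inclusion_score(man, father))    # lower: the reverse direction is weaker
```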

2.5.3 Entailment between phrases and sentences

Most research on entailment relations has been conducted at the lexical level, but Baroni et al. [2] attempted to identify entailment between phrases above the word level. They trained classifiers on annotated data to identify entailment between adjective-noun (AN) phrases and quantifier-noun (QN) phrases, and managed to accurately identify entailment between AN and QN phrases that were not in their training set.

Balkir et al. [1] have taken a different approach and attempted to extend the mathematical framework for compositional distributional semantics proposed by Coecke et al. [4] to identify entailment between sentences. They applied the entailment hypothesis of Geffet & Dagan [5] to this framework. In contrast to early formal-logic accounts of entailment, they did not consider sentences to be simply entailing or non-entailing, but treated entailment as a scale with various degrees. They conducted an experiment to test their hypothesis on a simplified vocabulary: human annotators were asked to rate sentences on their degree of entailment, and the entailment measures obtained with their proposal were similar to those of the human observers.

3 Methods

The aim of this research is to identify entailment between sentences using compositional distributional semantics. The meaning of a sentence will be represented as a vector that encodes entailment with respect to all other possible sentences. For instance, in a domain where the only nouns are 'father' and 'man' and the only verbs are 'talk' and 'whisper', the meaning of the sentence 'a father talks' will be represented by the vector [1, 0, 1, 0]^T, because the indices that are 1 correspond to sentences that are entailed by 'a father talks':

*                   A father talks
A father talks      1
A father whispers   0
A man talks         1
A man whispers      0

Different approaches are taken to obtain entailment vectors for sentences. Simple sentences that contain only one noun and one verb, like 'fathers talk', will be discussed first. These sentences are considered to implicitly contain the quantifier 'all'. After that, different approaches are discussed for extending these sentences to contain other function words. The discussed approaches that are not analytical solutions were tested on a small sentence space; the details of the implementation of this experiment are described in section 3.5.

3.1 Verb-noun sentences

Entailment between simple sentences is identified by learning matrices that represent verbs on a training set, similar to Baroni & Zamparelli [3]. This training set should consist of a set of simple sentences and the entailment relations between them. These can be represented in a matrix S that represents the sentence space. An example would be:

*                 fathers talk   fathers whisper   men talk   men whisper
fathers talk           1               0              1           0
fathers whisper        1               1              0           0
men talk               1               0              1           0
men whisper            1               1              1           1

Baroni & Zamparelli [3] represented adjective-noun phrases as the application of an adjective matrix to a distributional noun vector, using A·n = p, where A represents the adjective matrix, n a distributional noun vector and p the vector that represents the adjective-noun phrase. They learned the adjective matrices by applying partial least squares regression to a set of training examples. In this experiment a similar approach is used, but instead of adjectives, representations of verbs are learned with:

W·n = p

where W represents the verb matrix, n a distributional noun vector and p the phrase vector. The verb matrices are learned with the Moore-Penrose pseudoinverse:

W = S · N+

where W represents the verb matrix, S the part of the sentence space that is relevant for that verb, and N a matrix whose columns are the distributional noun vectors placed next to each other.

For instance, the verb ’talk’ is learned by

W = [humans talk, babies talk, ..., scientists talk] · [humans, babies, ..., scientists]+

Once such a verb matrix is learned, it can be applied to a distributional noun vector to represent the meaning of a sentence.
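A minimal numpy sketch of this training step is given below. The random arrays are placeholders for the real GloVe vectors and the real entailment-based sentence space, and the dimensions are arbitrary choices, but the computation W = S · N+ and its application to an unseen noun follow the description above.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 300                    # dimension of the distributional noun vectors (GloVe)
n_train = 15               # number of training nouns
n_sentences = 30           # sentences in the training sentence space (assumption)

# Columns of N_dist: distributional vectors of the training nouns (placeholders).
N_dist = rng.normal(size=(d, n_train))

# Columns of S_talk: entailment vectors of 'humans talk', 'babies talk', ...,
# one column per training noun (placeholder 0/1 values).
S_talk = rng.integers(0, 2, size=(n_sentences, n_train)).astype(float)

# W = S · N+ with the Moore-Penrose pseudoinverse.
W_talk = S_talk @ np.linalg.pinv(N_dist)

# Applying the learned verb matrix to an unseen noun vector gives a real-valued
# entailment vector for the new sentence (rounded later in the evaluation).
scientist = rng.normal(size=d)
entailment_vec = W_talk @ scientist
```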

3.2 Function words

As discussed in section 3.1, simple sentences can be constructed with selected nouns and verbs, such as 'babies talk' and 'children act'. These sentences implicitly use the quantifier 'all'; for simplicity, 'babies talk' is considered to be the same as 'all babies talk' in this experiment. When the existential quantifier is also included, more sentences can be constructed, such as 'a baby talks' and 'a child acts'. Entailment between two sentences depends on which quantifier is used. As described in section 2.5.1, depending on their quantifier, sentences can be upward monotone (↑) or downward monotone (↓) in their arguments, possibly differing between arguments. Different quantifiers applied to verb-noun phrases therefore produce different entailment relations. For instance:

• All children talk |= All babies talk
• All children talk |= All children communicate
• All children talk |= All babies communicate

This indicates that ’all’ is downward monotone in its first argument and upward monotone in its second argument.

• A child talks |= A human talks
• A child talks |= A child communicates
• A child talks |= A human communicates

This indicates that ’a’ is upward monotone in both arguments.

When the word ’not’ is available, more sentences can be created such as ’not all babies talk’. The relations between negated sentences are for instance:

• not (All children talk) |= not (All humans talk)
• not (All children talk) |= not (All children whisper)
• not (All children talk) |= not (All humans whisper)

This shows that 'not all' is upward monotone in its first argument and downward monotone in its second argument.

• not (A child talks) |= not (A baby talks)
• not (A child talks) |= not (A child whispers)
• not (A child talks) |= not (A baby whispers)

This shows that 'not a' is downward monotone in both arguments.

3.3 Creating sentence spaces for the training set

Sentence spaces for sentences with different quantifiers can be created by applying the appropriate functions for the quantifiers. The quantifier functions discussed in the previous section can be summarised as:

• ∀(N V) |= ∀(N↓ V↑)
• ∃(N V) |= ∃(N↑ V↑)
• ¬∀(N V) |= ¬∀(N↑ V↓)
• ¬∃(N V) |= ¬∃(N↓ V↓)

To establish entailment relations between the sentences, the entailment between individual words must be known. This information can be stored in a matrix where a 1 represents entailment and a 0 non-entailment. For nouns, entailment can be modelled by hypernym/hyponym relations and stored in a matrix N. A simplified example of N would be:

*       human   child   baby
human     1       0       0
child     1       1       0
baby      1       1       1

This denotes the hyponym relations; for instance, child is a hyponym of human. The transpose of this matrix represents the opposite, the hypernym relations. N^T would then be:

*       human   child   baby
human     1       1       1
child     0       1       1
baby      0       0       1

Child is not a hypernym of human, but human is a hypernym of child. A similar matrix V can be constructed for verbs with meronym relations, where a meronym denotes a specific type of an action: talk is a meronym of communicate.

To identify entailment relations between all possible sentences with a specific quantifier, the Kronecker product (⊗) can be used. This is a type of tensor product that multiplies all possible combinations of entries in two vectors or matrices. For downward monotone arguments the N and V matrices should be used, and for upward monotone arguments the transposed matrices N^T and V^T. For every quantifier a sentence space of entailment relations between all possible sentences can then be constructed:

• all: N ⊗ V^T
• a: N^T ⊗ V^T
• not all: N^T ⊗ V
• not a: N ⊗ V

A small fragment from the all matrix would be:

*                          all humans talk   all children communicate   all babies whisper
all humans talk                   1                      1                       0
all children communicate          0                      1                       0
all babies whisper                0                      0                       1
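A small numpy sketch of this construction is shown below. The toy N and V matrices mirror the examples above (the verb matrix is assumed to have the same triangular shape as the noun matrix); the exact row and column ordering of the resulting spaces depends on how N and V are oriented.

```python
import numpy as np

# Toy word-level entailment matrices (1 = entailment).
# Nouns: human, child, baby; verbs: communicate, talk, whisper.
N = np.array([[1, 0, 0],   # human
              [1, 1, 0],   # child
              [1, 1, 1]])  # baby
V = np.array([[1, 0, 0],   # communicate
              [1, 1, 0],   # talk
              [1, 1, 1]])  # whisper

# One sentence space per quantifier, built with the Kronecker product as listed above.
S_all     = np.kron(N, V.T)    # 'all'
S_a       = np.kron(N.T, V.T)  # 'a'
S_not_all = np.kron(N.T, V)    # 'not all'
S_not_a   = np.kron(N, V)      # 'not a'

# Each space has one row and column per noun-verb sentence,
# ordered as (human, child, baby) x (communicate, talk, whisper).
print(S_all.shape)   # (9, 9)
```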

3.4 Identifying entailment

The previous section discussed how entailment between sentences can be identified when the entailment matrices for nouns (N) and verbs (V) are known. For a sentence outside the training set, these relations are generally not known. In this section several approaches are discussed to extend the method of section 3.1 to work for different quantifiers.

3.4.1 Quantifier-verb matrices

A simple way of extending the method of section 3.1 is to observe that the matrices that represented the verb in section 3.1 were actually quantifier-verb matrices, because a sentence like 'babies talk' implies 'all babies talk'. So instead of a general representation W of the verb, a representation W_all was learned. Using that knowledge, a similar approach can be taken to learn the other quantifier-verb (QV) matrices on their appropriate training sets:

• W_all: (N ⊗ V^T) · N+
• W_a: (N^T ⊗ V^T) · N+
• W_not all: (N^T ⊗ V) · N+
• W_not a: (N ⊗ V) · N+

When the entailment vector of a sentence is computed, the right quantifier should be identified beforehand so that the matching quantifier-verb matrix can be applied. This method ensures that no performance is lost on the quantifiers, but it is not completely compositional, because the representation of a group of words is contracted together.

3.4.2 Linear maps between sentence spaces

The previous section discussed how matrices for a particular quantifier-verb combination can be learned with linear regression on a training set, and noted that this is not completely compositional. To improve compositionality, mappings between the different sentence spaces can be determined, so that only one quantifier-verb matrix has to be learned per verb. To obtain an entailment relation for another quantifier, a mapping can be applied to the phrase meaning obtained with that QV matrix. To obtain a vector for 'all fathers talk', the 'a → all' operator can be applied to 'a father talks'; to obtain a vector for 'not all fathers talk', the 'all → not all' matrix can be applied to the previous one. Different approaches can be taken to represent these operators.

3.4.3 Maps obtained by linear regression

The operators can be represented as matrices and learned by linear regression, similar to the method to identify entailment relations between simple sentences discussed in section 3.1. The mappings can be learned by:

• M_a→all: (N ⊗ V^T) · (N^T ⊗ V^T)+
• M_all→a: (N^T ⊗ V^T) · (N ⊗ V^T)+
• M_a→not a: (N^T ⊗ V) · (N ⊗ V^T)+

This method depends on the specific values in the training set. Since a pseudoinverse is taken and the maps are additionally learned from training data, the performance of this implementation of the quantifier maps will probably be worse than that of the basic implementation.
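As a small sketch of this construction (reusing the toy N and V matrices from the Kronecker example, so the numbers are illustrative only), a map from the 'a' space to the 'all' space can be computed with the pseudoinverse:

```python
import numpy as np

# Toy word-level entailment matrices as in the sketch for section 3.3.
N = np.tril(np.ones((3, 3)))       # nouns: human, child, baby
V = np.tril(np.ones((3, 3)))       # verbs: communicate, talk, whisper

S_a   = np.kron(N.T, V.T)          # sentence space for 'a'
S_all = np.kron(N, V.T)            # sentence space for 'all'

# Map from the 'a' space to the 'all' space, mirroring
# M_a->all = (N x V^T) . (N^T x V^T)+ above.
M_a_to_all = S_all @ np.linalg.pinv(S_a)

# Applied to an 'a'-space entailment vector (here simply a column of S_a),
# it yields an approximation of the corresponding 'all'-space vector.
approx = M_a_to_all @ S_a[:, 4]
print(np.round(approx, 2))
```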

3.4.4 Analytical solutions

There also exists an analytical solution to represent the maps, but it cannot be expressed by a simple matrix. A higher-order tensor is required to map a matrix to its transpose. This tensor is not considered in detail here, but is denoted by T, so that T · A = A^T. An example of a mapping that transforms a sentence from one sentence space to another is:

(T ⊗ I) · (N ⊗ V^T) = (T · N) ⊗ (I · V^T) = N^T ⊗ V^T

This example shows that the operator (T ⊗ I) can take a sentence from ’all’ to ’a’. The general mappings will be:

• M_a→all: (T ⊗ I)
• M_all→a: (T ⊗ I)
• M_a→not a: (T ⊗ T)
• M_all→not all: (T ⊗ T)

As is apparent from the mappings above, the 'not' maps are the same and the maps between 'a' and 'all' are identical in both directions. This is a useful property, since a universal representation of 'not' is convenient in natural language. This method of representing maps from one sentence space to another will not decrease the results, because it is an analytical solution.
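One concrete way to realise the tensor T numerically, without materialising it, is as an index permutation on the sentence space: reshape the Kronecker product into a four-way array, swap the two noun axes, and reshape back. The sketch below is an illustration of this idea (not an implementation used in the thesis); it verifies that the operation takes the 'all' space N ⊗ V^T to the 'a' space N^T ⊗ V^T.

```python
import numpy as np

def transpose_noun_factor(S, n, v):
    """Apply '(T x I)' to a sentence space S = A kron B (A is n x n, B is v x v):
    transpose the noun factor A by permuting the indices of S."""
    S4 = S.reshape(n, v, n, v)                # axes: noun_row, verb_row, noun_col, verb_col
    return S4.transpose(2, 1, 0, 3).reshape(n * v, n * v)

N = np.tril(np.ones((3, 3)))                  # toy noun and verb entailment matrices
V = np.tril(np.ones((3, 3)))

S_all = np.kron(N, V.T)
S_a   = np.kron(N.T, V.T)

# (T x I) . (N x V^T) should equal N^T x V^T, i.e. map the 'all' space to the 'a' space.
assert np.array_equal(transpose_noun_factor(S_all, 3, 3), S_a)
```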

3.4.5 Separate verbs

The most compositional method of obtaining meaning representations of a sentence would be not to learn representations for 'quantifier verbs' but only for the verb itself. This cannot, however, be obtained by linear regression on a training set as in section 3.1, since every simple sentence implicitly contains a quantifier in this experiment; a sentence without a quantifier would be incomplete and thus cannot be trained on.

Not enough information is available to separate the QV matrix into a Q and a V matrix manually. The different verb matrices would be represented as:

Q · V_1 = QV_1
Q · V_2 = QV_2
...
Q · V_n = QV_n

That would represent a vector of matrices: Q · v = qv. Since both Q and v are unknown, there would be too many solutions for Q. It would also have to hold for the other quantifiers:

All · v = qv_all
A · v = qv_a

This, however, does not determine values for the quantifiers; it only gives a relation between 'All' and 'A', which is exactly what was already obtained in the previous section. Therefore it is not considered to be a useful method for this experiment.

3.4.6 Other composition rules

All methods described in the previous sections focus on entailment between sentences with the same quantifier, for instance between 'all fathers talk' and 'all men talk'. Entailment between different structures can be identified as well. Basic rules can be established to identify entailment between sentences with different quantifiers:

• All → A (1:1)
• A → All (1:0)
• Not all → Not a (1:0)
• Not a → Not All (1:1)

These rules do not capture relations that hold only for a particular set of nouns and verbs. For instance, in a restricted vocabulary 'not child' could mean 'adult', because all nouns fall in the category of human, but this would not hold in a real-life setting where other domains are also included. The rules can either be combined into one large matrix over all possible sentences and learned on that matrix, or applied as a map from one vector to another.

Entailment between sentences with other function words like 'and' and 'or' can also be determined. To include 'and', the sentence can be split into two parts: the sentence 'a baby talks and a father whispers' can be split into p1: 'a baby talks' and p2: 'a father whispers', and the entailment relations can be identified for the separate clauses. All the sentences that are entailed by one of the clauses are also entailed by the larger sentence, and so are all combined sentences that are entailed by both clauses, represented by p1 ⊗ p2.

Similar to 'and', to include 'or' a sentence should be split into two parts. In this case only the combined sentences p1 ⊗ p2 are entailed by the larger sentence, not the individual parts.
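Under one reading of this description (an assumption, not spelled out further in the thesis), the conjunction's entailments over single sentences can be taken as the element-wise maximum of the two clause vectors, and its entailments over combined sentences as their Kronecker product:

```python
import numpy as np

def and_entailments(p1, p2):
    """Entailment vectors for 'clause1 and clause2' (one possible reading of section 3.4.6)."""
    single = np.maximum(p1, p2)   # sentences entailed by at least one clause
    paired = np.kron(p1, p2)      # combined sentences entailed by both clauses
    return single, paired

p1 = np.array([1, 0, 1, 0])       # toy entailment vector for 'a baby talks'
p2 = np.array([0, 1, 0, 0])       # toy entailment vector for 'a father whispers'
single, paired = and_entailments(p1, p2)
```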

type             words
verbs            act, interact, communicate, talk, write, spell, whisper, shout, yell, ask, consult, interrogate, sign, wink, applaud, gesture, nod, shake, type, handwrite
training nouns   human, child, toddler, baby, daughter, boy, adult, father, uncle, woman, mother, employee, banker, surgeon, biologist
test nouns       infant, teenager, girl, son, man, scientist, physicist, doctor, professional, parent, aunt

Table 1: selected verbs and nouns

3.5 Experimental setup

This section describes the implementation of the experiments that were conducted to test the methods described in the previous sections. Only methods that depend on a training set were implemented, since the analytical solutions would not change the performance. Therefore the simple verb-noun matrices learned by linear regression were tested, and the maps from one sentence space to another obtained by linear regression on training data were implemented as well. Maps between different quantifiers and other function words are not discussed, since performance would be exactly the same on the training and test set.

3.5.1 Selection of the vocabulary

The experiment was run on a toy vocabulary consisting of 20 verbs and 28 nouns within a particular domain. The selected verbs were all in the domain of 'communication', and the nouns were in the domain of 'human'. The nouns were split into a training and a test set. The selected words are presented in Table 1. Entailment relations between nouns were extracted from hypernym/hyponym relations in WordNet, and entailment relations between verbs were extracted from meronym relations in WordNet. They were represented as matrices N and V, and the different sentence spaces were constructed with the Kronecker product as described in section 3.3.

Distributional noun vectors for all the nouns were obtained from GloVe.

3.5.2 Applied approaches

Entailment relations between verb-noun sentences were tested for the different sentence spaces. Quantifier-verb matrices were learned with:

QV = S_q · N+

where QV represents the quantifier-verb matrix, S_q the part of the sentence space for a particular quantifier that is relevant for that verb, and N the matrix that consists of all distributional noun vectors.

The linear maps that transform one sentence space into another were obtained by linear regression as:

M_q1→q2 = S_q2 · S_q1+

where q1 and q2 are different quantifiers, S_q1 and S_q2 the sentence spaces that belong to those quantifiers, and M the linear map that takes one sentence space to the other.

3.5.3 Evaluation

A test set was created of nouns that did not occur in the training set. Their entailment relations with the training nouns were established with hypernym/hyponym relations in WordNet. The appropriate Kronecker products for every quantifier, described in section 3.3, were applied to construct sentence spaces. These sentence spaces were used as a gold standard.

The obtained quantifier-verb matrices were applied to the distributional representations of the test nouns to obtain representations of the sentences:

s = QV · n

where QV is the obtained quantifier-verb matrix, n a distributional noun vector and s an entailment-based meaning vector. The obtained vectors were rounded: all values below 0.5 were considered non-entailing and all values above 0.5 entailing. These rounded vectors were compared to the gold standard, and precision, recall, f-measure and accuracy were calculated and averaged over all possible sentences.
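A minimal sketch of this evaluation step (the threshold and metric definitions follow the description above; the example numbers are made up):

```python
import numpy as np

def evaluate(predicted, gold):
    """Round predicted entailment values at 0.5 and compare with a gold 0/1 vector,
    returning precision, recall, f-measure and accuracy."""
    pred = (np.asarray(predicted, dtype=float) > 0.5).astype(int)
    gold = np.asarray(gold).astype(int)
    tp = int(np.sum((pred == 1) & (gold == 1)))
    fp = int(np.sum((pred == 1) & (gold == 0)))
    fn = int(np.sum((pred == 0) & (gold == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = float(np.mean(pred == gold))
    return precision, recall, f1, accuracy

# Hypothetical usage: predicted = QV @ noun_vector, gold from the WordNet-derived space.
print(evaluate([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 1]))
```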

The results were compared to a baseline of cosine similarity between the nouns. Because the trained verbs could not be varied, entailment between verbs is always correct, so the verbs were considered to be correct in the baseline as well. Cosine similarity was thresholded at the proportion of entailing sentences in the training set: if 20 percent of the sentences in the training set entail, the 20 percent with the highest cosine similarity are considered to be entailing. Cosine similarity is not designed to recognise entailment, so comparison with this baseline says nothing about the accuracy of cosine similarity itself, but it provides an absolute baseline.
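One way to read this baseline is sketched below (the reduction to nouns is possible because the verbs are held constant; the vectors and the 20 percent rate are placeholders):

```python
import numpy as np

def cosine_baseline(test_noun, train_nouns, entail_rate):
    """Mark the top entail_rate fraction of training nouns, ranked by cosine
    similarity with the test noun, as 'entailing'."""
    sims = np.array([
        n @ test_noun / (np.linalg.norm(n) * np.linalg.norm(test_noun))
        for n in train_nouns
    ])
    k = int(round(entail_rate * len(train_nouns)))
    labels = np.zeros(len(train_nouns), dtype=int)
    labels[np.argsort(sims)[::-1][:k]] = 1
    return labels

rng = np.random.default_rng(2)
train = rng.normal(size=(15, 300))    # placeholder vectors for the training nouns
test = rng.normal(size=300)
print(cosine_baseline(test, train, 0.2))
```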

4 Results

The results of the linear regression method for the different quantifiers are presented in Table 2. As is apparent from this table, the results for 'all' and 'not a' are identical, as are those for 'a' and 'not all'. This is because the verbs are fixed by the training set and are never replaced by unseen verbs at test time; entailment relations between verbs are therefore 'stored' in the verb matrix, and entailment between the verbs is always correctly identified.

quantifier              precision   recall   f-measure   accuracy
all, not a                  0.41      0.70        0.51       0.98
baseline all, not a         0.15      0.70        0.25       0.39
a, not all                  0.80      0.96        0.87       0.99
baseline a, not all         0.15      0.68        0.25       0.40

Table 2: results test set linear regression

quantifier        precision   recall   f-measure   accuracy
M_a→all               0.36      0.80        0.50       0.98
M_all→a               0.71      0.96        0.83       0.97
M_all→not all         0.36      0.80        0.50       0.98
M_a→not a             0.71      0.96        0.83       0.97

Table 3: maps with linear regression

Because 'all' and 'not a' (and likewise 'a' and 'not all') differ only in the entailment relations between the verbs, the noun relations are the same and the same results are obtained. The results of linear regression on the maps from one sentence space to another are presented in Table 3. The same reasoning applies to these results.

In Figure 1, the precision, recall and f-measure of the simple quantifier-verb matrices, of the map between 'all' and 'a', and of the cosine similarity baseline are depicted for the quantifiers 'a' and 'not all'. As discussed above, these have the same results because of the constant performance of the verbs. A similar representation is presented in Figure 2 for the 'all' and 'not a' quantifiers.


Figure 2: results for ’all/not a’

5 Discussion

5.1 Evaluation of results

The results for the 'a' and 'not all' quantifiers are far above the cosine similarity baseline and are fairly high in absolute terms. The method appears promising, but it should be compared to other methods to get an indication of its accuracy. Linear regression for the 'all' and 'not a' quantifiers performs much worse, although still well above the cosine similarity baseline. In part, this discrepancy can be due to a lack of entailing sentences. If the selected nouns are represented in a tree-like structure, it becomes clear that words higher up in the tree are less common than the lower ones. There are more specific words than general words, so there are more specific words in the test set as well. Because with the 'all' quantifier the entailed sentences contain more specific nouns, there are very few words left to entail if the test noun is already very specific. This results in a gold standard with very few positive instances, so a single wrongly identified sentence has a large effect on precision. Accuracy also counts true negatives, and these remain similar across quantifiers. If more data were used, more positive examples of entailment could be identified, and the results would likely become more similar to those for the 'a' quantifier. When the maps are learned by linear regression, around 10 percent of precision is lost. This may be manageable for these short sentences, but would be problematic if sentences become larger and performance is lost for every applied word. The performance loss is most likely unnecessary, because an analytical solution is available; it cannot, however, be implemented with a simple matrix, since a higher-order tensor is needed. In further research the analytical solution should be implemented to ensure that no performance is lost.

A common mistake of the linear regression method is the misidentification of nouns that have a male and a female variant. For example, 'a girl talks' is identified as entailing 'a boy talks'. This is probably due to the high similarity between the vectors of these words, and could probably be prevented by using a method that identifies opposites.

The presence of male and female variants also causes a large difference in performance when the training and test sets are divided differently. If all female variants are in the test set and all male variants in the training set, the performance is worse than when both variants occur in the same set. This may be solved by using a much larger vocabulary that includes more nouns without a male and a female variant.

5.2 Critiques on methods

As discussed above, the simple linear regression method worked fairly well on this test set. Part of this performance is due to the constancy of the verb vectors: because only the noun vectors differ in the test set, the relations between the verbs are static and captured in the verb matrix. This may become a problem when the sentence spaces grow much larger, because the relations between all verbs would have to be stored in that matrix. It would, however, be a sparse matrix, since many domains do not overlap.

For extending this method to different sentence spaces, different approaches to handling the quantifiers were discussed. In the first method, QV matrices are learned beforehand. As discussed in section 3.4.1, this loses part of the compositionality, since it stores a chunk of words together. When this approach is extended to more words, it may also need separate matrices for those chunks. What may work on a larger scale is to determine specific composition rules for function words and to learn matrices once this structure is identified. This may, however, be slow to compute for a large sentence space.

The maps between different sentence spaces discussed in section 3.4.2 are more compositional, since different quantifiers are represented by different maps. Only one quantifier-verb matrix has to be learned beforehand, and all quantifiers can be applied to it: 'all fathers talk' would be 'all' applied to 'a father talks'. This does, however, require one quantifier to be the 'basic unit', which makes the approach asymmetrical. In general the analytical solution for the maps would be preferred over the solution obtained by linear regression, but it has not yet been implemented. In theory no performance would be lost, the 'not' maps would be the same, and the 'a'/'all' maps would be the same. When this approach is scaled up to more complex sentences, the relations between the quantifiers should be re-examined to see whether they still apply.

Separating out the quantifiers and verbs was not considered to be a feasible option for quantifier-verb matrices, but it would give the most general representation and should be examined further with other training methods.

Maps between different sentence spaces were only discussed in a simplified manner, and did not include relations like 'not adult' |= 'child'. Such a relation would technically only be correct in the simplified sentence space where all nouns are children or adults, but in natural language a phrase like 'not child' would probably indicate an adult. These relations could be examined in further research.

5.3 Further research

5.3.1 Implementation

This project was tested on a very small set of sentences, which most likely influenced the results. To test the discussed methods appropriately, they should be evaluated on a much larger selection of verbs and nouns. This would most likely also reduce the difference in results when the training and test set are divided differently. Once reliable results are obtained, different parameters could be examined, such as distributional vectors obtained in different ways and distributional vectors of different lengths.

An implementation of the analytical solution for the maps between sentence spaces should be made to ensure it provides the desired results.

The evaluation of the results should also be improved, and the methods should be compared to other methods, such as the classifier method that Baroni et al. [2] proposed or the entailment method that Balkir et al. [1] proposed.

5.3.2 Other methods

The methods described in this thesis are limited to phrases that contain one noun and one verb and a small selection of function words. To make the representations more realistic, more parts of speech could be examined, such as adjectives. The selection of quantifiers could also be extended.

6 Conclusion

Quantifier-verb matrices obtained by linear regression on a training set provide promising results for obtaining entailment vectors. However, experiments on a larger vocabulary and comparisons to other entailment measures are required to fully estimate the value of this method. Different quantifiers lead to different sentence spaces that can be composed with Kronecker products. The most promising way to represent quantifiers as separate linear maps is to analytically solve the equation that turns one sentence space into another. More research on this topic is needed to extend these approaches to more complex sentences.

References

[1] Esma Balkir, Dimitri Kartsaklis, and Mehrnoosh Sadrzadeh. Sentence entailment in compositional distributional semantics. arXiv preprint arXiv:1512.04419, 2015.

[2] Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 23–32. Association for Computational Linguistics, 2012.

[3] Marco Baroni and Roberto Zamparelli. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193. Association for Computational Linguistics, 2010.

[4] Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. Mathematical foundations for a compositional distributional model of meaning. arXiv preprint arXiv:1003.4394, 2010.

[5] Maayan Geffet and Ido Dagan. The distributional inclusion hypotheses and lexical entailment. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 107–114. Association for Computational Linguistics, 2005.

[6] Edward Grefenstette and Mehrnoosh Sadrzadeh. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1394–1404. Association for Computational Linguistics, 2011.

[7] Edward Grefenstette, Mehrnoosh Sadrzadeh, Stephen Clark, Bob Coecke, and Stephen Pulman. Concrete sentence spaces for compositional distributional models of meaning. In Computing Meaning, pages 71–86. Springer, 2014.

[8] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, pages 539–545. Association for Computational Linguistics, 1992.

[9] Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389, 2010.

[10] Bill MacCartney and Christopher D. Manning. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 193–200. Association for Computational Linguistics, 2007.

[11] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, 2008.

[12] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[13] Peter D. Turney and Saif M. Mohammad. Experiments with three approaches to recognizing lexical entailment. Natural Language Engineering, 21(3):437–476, 2015.

[14] Dominic Widdows and Trevor Cohen. The semantic vectors package: New algorithms and public tools for distributional semantics. In Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on, pages 9–15. IEEE, 2010.

[15] Fabio Massimo Zanzotto, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. Estimating linear models for compositional distributional semantics. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1263–1271. Association for Computational Linguistics, 2010.
