
MASTER'S THESIS

Unsupervised Relation Discovery for Prepositions and Noun Compounds

Author: Benno KRUIT (10576223)
Supervisor: Dr. Ivan TITOV
Co-supervisor: Diego MARCHEGGIANI

A thesis submitted in fulfillment of the requirements for the degree of MSc Artificial Intelligence (42 ECTS) in the Language and Computation Group, Institute for Logic, Language and Computation


Abstract

Faculty of Science
Institute for Logic, Language and Computation
MSc Artificial Intelligence

Unsupervised Relation Discovery for Prepositions and Noun Compounds

by Benno KRUIT 10576223

This thesis introduces the problem of unsupervised discovery of semantic relations expressed by prepositions and noun compounds. These relations are rarely present in annotated resources, but contain useful information. I evaluate a model that induces a feature-rich semantic labeler for prepositions and noun compounds by coupling the distribution over the outputs of a classifier to a reconstruction component. First, I induce relations only for prepositions, but they do not correspond to gold-standard annotations. While prepositions are undeniably polysemous, the baseline assumption that they are not is hard to beat. Second, I evaluate the architecture of the reconstruction component on the task of predicting preposition paraphrases for noun compounds. Trained by factorizing the prepositions as relation labels, its output distribution is close to the distribution of a set of paraphrase annotations. Third, I induce a combined set of semantic relations for prepositions and noun compounds. The reconstruction-based models often predict the same relation for noun compounds and their preposition paraphrase, while the baselines don’t. While the results show that rich latent representations are beneficial for relating prepositions and noun compounds, relation induction for this class of phrases remains an open problem.


Acknowledgements

First, I would like to thank my thesis advisor Ivan Titov, who inspired me to tread the deep waters of unsupervised learning and gave me the freedom to try anything and everything. Second, I want to thank Diego Marcheggiani, who tirelessly helped me through those all-too-common “dips in intelligence”. Third, I would like to thank my fellow students in the AI program for their friendship, help and motivation: Sara, Joost, Otto and Bas.

Finally, I want to thank my parents, who have seemingly unconditional pride and the most enticing footsteps, and most of all my girlfriend Hester, who keeps me happy, confident and sane.


Contents

Abstract
1 Introduction
  1.1 Semantic Parsing
  1.2 Reconstruction-error minimization
  1.3 Thesis Contributions
2 Background
  2.1 Distributed Word Representations
  2.2 Supervised Semantic Parsing
    2.2.1 Parsing Prepositions
    2.2.2 Parsing Noun-Noun compounds
  2.3 Information Extraction
    2.3.1 Unsupervised Semantic Parsing and Factorization
    2.3.2 Latent Structured Prediction
3 Model
  3.1 Reconstruction-Error Minimization
  3.2 Encoder
  3.3 Decoder
    3.3.1 Independent Argument Probability
    3.3.2 Joint Argument Probability
  3.4 Optimization
    3.4.1 Stochastic Gradient Descent
    3.4.2 Encoder regularization using its entropy
    3.4.3 Extension: decoder approximation by sampling
4 Experiments
    4.0.1 Initialization and Hyperparameters
    4.0.2 External Data
    4.0.3 Evaluation
  4.1 Discovering Preposition Relations
    4.1.1 Data and Evaluation
    4.1.2 Baselines
    4.1.3 Results
  4.2 Paraphrasing Noun-Noun Compounds with Preposition Relations
    4.2.1 Model
    4.2.2 Data and Evaluation
    4.2.3 Baselines
    4.2.4 Results
  4.3 Discovering Combined Preposition and Noun-Noun Relations
    4.3.1 Data and Evaluation
5 Discussion
  5.1 Future Work
    5.1.1 Task-oriented evaluation
    5.1.2 Variational Methods
    5.1.3 Larger Dataset
    5.1.4 Word Sense Induction and Clustering
  5.2 Conclusion


Chapter 1

Introduction

1.1 Semantic Parsing

The field of semantic parsing aims to design computer programs that interpret the meaning of natural language text and generate data structures which reflect that meaning. Often the first step in that endeavour is to design the structures that we would like the computer to generate. Consider the following example:

(1) John is reading a book on the history of the Netherlands by a professor from Amsterdam.

There are a number of projects that aim to annotate such sentences with well-defined structures. In the FrameNet project (Baker, Fillmore, and Lowe, 1998), for instance, a sentence is assigned one or more semantic frames from an inventory curated by experts. In the example above, the frame would typically be labeled as Reading_perception, involving the frame elements Reader (annotated to the word John) and Text (annotated to the rest of the sentence). We can then use these annotations to train a computer program — a semantic parser — to predict these structures automatically when given an input sentence.

However, there is more information in sentence (1) than only the fact that John is reading something. Consider the prepositions in this sentence. Using not the FrameNet inventory but a relation inventory for prepositions from Srikumar and Roth, 2013b, those prepositions express the semantic relations of Topic (book on history), Attribute (history of the Netherlands), Source (book by professor), and Source (professor from Amsterdam). In these examples, the preposition expresses a relation between the word on the left, called the governor, and the word on the right, called the object. Given enough high-quality annotations, we could train a semantic parser for not only the verb-based frames, but these preposition relations too. Sadly, in most annotated resources these constituents are not annotated with their own structure, and thus many current semantic parsers do not generate these structures. Indeed, there is no project in existence that would annotate noun compounds like history book or Amsterdam professor in this context with this kind of label. This way, the limitations of these datasets of sentence annotations also become the limitations of the parsers that are trained on them.

Supervised parsers Training a semantic parser in a purely supervised manner has a number of shortcomings. First, though annotation projects such as FrameNet cost huge amounts of manual effort, they cannot provide a complete coverage of language use. Additionally, domains such as medicine, law or governance use different vocabularies and discuss different concepts, which require their own massive annotation efforts.

Second, when semantic parsers are used in applications, the generated structures should be suitable for automated reasoning. In question answering, they can be used to parse both the questions and the text that is used to answer those questions. In other information retrieval tasks, they might be used to build a general-purpose knowledge base of information extracted from large volumes of text (e.g. on the web). However, it is often unclear how the annotated structures correspond to the application that the semantic parser should be used for. Structures that may be useful for a task do not necessarily exist in the annotations, or the structures may not be adequate. For instance, annotations might group together antonyms such as open and close, or related terms such as eat and drink.

This problem can be mitigated by using data related to a specific task — such as questions and (structured) answers (Liang, Jordan, and Klein, 2012) — instead of manual annotations. Though those structures are then necessarily useful for the task at hand, designing a task and collecting data for general-purpose semantic parsing is extremely difficult. Thus, the datasets of these tasks are typically constrained to very specific tasks, requiring a new dataset for every task. In those specific tasks, there is no reason for the structures of one task to be useful for another.

Finally, annotated datasets typically do not exist for languages other than English. Translating a semantic parser is virtually impossible due to the language-specific data and features that are used.

Unsupervised parsers Therefore, we would like to train a computer program to discover patterns in language that reflect its meaning without manual annotations. Current work on unsupervised semantic parsing typically uses models with strong independence assumptions, a limited input representation and an inflexible learning system (Titov and Khoddam, 2015). That means these systems have limited ability to model the relations that exist between parts of the sentence, and how concepts relate to each other in general. Additionally, these models have difficulties with languages that have a freer word order than English.

In Chapter 2, I will discuss existing approaches to semantic parsing and unsupervised language processing.

1.2 Reconstruction-error minimization

The use of unsupervised learning with a flexible reconstruction-based objective, as in this work, is a strategy to hopefully overcome these problems. First of all, it allows us to use orders of magnitude more data than are used in supervised approaches. By training on any available text instead of text with annotations, we are not restricted to any domain, task or language. Secondly, reconstruction-based methods have succeeded in other fields as well. In machine learning, there has been a large amount of work on auto-encoders, which learn to encode data into lower-dimensional latent representations that allow the data to be reconstructed as well as possible. These methods find intrinsic patterns in data by minimizing the reconstruction error. We would like to use this successful concept to create a semantic parser. Finally, it allows us to search for structure in parts of sentences that are normally not annotated.

FIGURE 1.1: Inducing preposition relations by reconstruction-error minimization

In this thesis, I will evaluate a program that tries to discover the relations that are expressed by prepositions and noun phrases (figure 1.1). The program is based on a statistical model that reflects the likelihood of those parts of the sentence. For example, the predicted relations for ‘on’ and ‘about’ in book on/about history in contexts such as sentence (1) should become similar. This happens because in the training data book occurs with words similar to history in those contexts, which makes the word pair highly likely to occur in that relation. In other contexts, such as book on shelf, the predicted relation for ‘on’ should be different, because in that relation book occurs with words similar to shelf. Using its own predicted relations, the system tries to reconstruct the governor from the object, and vice versa. By minimizing the reconstruction error, the program jointly learns a measure of word similarity, a clustering of the input data and a probabilistic model of word pairs for each cluster.

The parameters of the model are optimized to make the input dataset as likely as possible, which forces it to distribute the word pairs over different relations. These parameters then express the probabilities of word pairs and relations. Modelling the probabilities of words together allows us to predict one word from the other word in the pair, and the relation that the model associates with the input. This way, the model is more expressive than existing generative models. Hopefully, these properties will allow the program to induce meaningful and useful representations of natural language semantics.

In Chapter 3, I will discuss the details of the model.

Of course, creating a semantic parser without a task to evaluate it on is a pernicious affair. This thesis describes a very limited parser — it is only for interpreting prepositions and noun compounds — which makes it useless for any task on its own. I have no choice but to evaluate the model by comparing the relations that it discovers to annotations that have a linguistic foundation.

I will evaluate the model in three situations: (1) discovering relations that are expressed by prepositions, (2) the performance of the decoder when using prepositions to express noun-noun relations, and (3) finding one set of relations for both prepositions and noun compounds.

In Chapter 4, I will describe my experimental setup and results. In Chapter 5, I will discuss what they imply for the model and the problem of relation induction for prepositions and noun compounds.


1.3 Thesis Contributions

In this thesis,

• I introduce the problem of relation induction for prepositions and noun compounds, and relate it to existing work.

• I evaluate an expressive reconstruction-error minimization model on this problem, which highlights some of its shortcomings in this situation. In particular,

– I induce relations only for prepositions, but they do not correspond to gold-standard annotations. While prepositions are undeniably polysemous, the baseline assumption that they are not is hard to beat.

– I evaluate the architecture of the reconstruction component on predicting preposition paraphrases for noun-noun compounds. Trained by factorizing the prepositions as relation labels, its output distribution is close to the distribution of a set of paraphrase annotations.

– I induce a combined set of semantic relations for prepositions and noun compounds. The model beats the baselines at predicting the same relation for noun compounds and their preposition paraphrase.


Chapter 2

Background

The model described in the next chapter derives its inspiration from many fields, and this chapter is meant to give an overview of the research that has motivated its design. As this thesis introduces the problem of unsupervised preposition and noun-noun compound relation discovery, I will describe its context in the literature on information extraction, semantic parsing, and the meaning of prepositions and noun compounds.

2.1 Distributed Word Representations

The main problem in making language models is sparsity: people are very productive with language, and combine words and phrases in so many ways that it is impossible to encounter all possible patterns in texts (Manning and Schütze, 1999). Natural language processing systems must find a way to keep the probability of unseen patterns high enough, and generalize correctly from observations. If a model assigned a probability of zero to all patterns it had never seen, it would not be useful. This is also called the curse of dimensionality. A popular way to cope with sparsity is distributed word representations (Bengio et al., 2003).

Learning distributed word representations can be seen as a form of dimensionality reduction, where word and phrase tallies are approximated instead of tracked explicitly. The approximation is computed using word representations, which are more constrained than the tallies and are thus lower-dimensional. Those representations are not observed directly, but created by optimizing the model in a certain way. This results in a model that generalizes observations in order to overcome the sparsity problem.

Factorization For instance, a table with the tally of all word pairs in a vocabulary of size V requires V × V cells, but the same information can be approximated using fewer parameters. If we create representations for words that encode the most important parts of how they occur together, we can combine the representations for the words in the column and row of a cell to estimate the tally of that word pair. When the original information is a matrix (such as in this example) and the matrix is approximated using the multiplication of latent representations, this is known as truncated (or rank-reduced) matrix factorization (or decomposition). The reconstruction of the original matrix is known as expansion.
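To make this concrete, here is a minimal NumPy sketch of truncated factorization and expansion. The vocabulary size, dimensionality and tallies are invented for illustration; only the factorize-then-expand pattern follows the description above.

```python
import numpy as np

# Toy co-occurrence tallies for a vocabulary of V words (values invented).
V, d = 1000, 40                      # vocabulary size, latent dimensionality
rng = np.random.default_rng(0)
tallies = rng.poisson(1.0, size=(V, V)).astype(float)

# Truncated (rank-reduced) factorization: keep only the top-d SVD components.
U, s, Vt = np.linalg.svd(tallies, full_matrices=False)
row_reprs = U[:, :d] * s[:d]         # latent representation of each row word
col_reprs = Vt[:d, :].T              # latent representation of each column word

# Expansion: combining the two representations approximates the original tally.
approx = row_reprs @ col_reprs.T
print(np.abs(tallies - approx).mean())   # mean reconstruction error
```

The V × V table needs V² cells, while the two factors need only 2 × V × d parameters, which is what forces the model to generalize.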

The similarity between these latent representations of words often corresponds to our intuitions about language. This observation is central in the field of distributional semantics, which is based on the hypothesis that semantically related words occur in the same contexts (Harris, 1954). In distributional semantics, the model is not tested directly for its language modeling properties, but for the correspondence of the latent representations themselves to human judgements of word similarity and analogy.

Bengio et al., 2003, introduce a method to learn latent word representations by predicting words from the words that surround them, using a neural network. A related but more efficient model is described in Mikolov, Sutskever, et al., 2013, and trained on more data. Pennington, Socher, and Manning, 2014, describe a more direct way of finding latent representations from word co-occurrence statistics. Levy and Goldberg, 2014, explain how prediction-based and count-based methods are related, presenting the former model as implicit matrix factorization.

Some of the models described in the next chapter induce word representations similarly to these approaches. In most experiments, I will also use word representations induced using one of the methods described above. However, the goal of this thesis is not the induction of these representations themselves, but to use them for unsupervised semantic parsing of prepositions and noun-noun compounds.

2.2 Supervised Semantic Parsing

Most current systems for semantic parsing are trained using a corpus of annotations. It is important to understand the choices that are made when creating these corpora, and the performance of the parsers trained on them. This is particularly important for prepositions and noun-noun compounds, because there are significant disagreements on the way their semantics are annotated.

2.2.1 Parsing Prepositions

Prepositions specify relations between two words: the governor (which can be a noun, adjective or verb) and the object (which is usually a noun). Governor, preposition and object form a chain within the syntactic dependency structure of a sentence. Prepositions are highly polysemous (Baldwin, Kordoni, and Villavicencio, 2009), which makes their interpretation a challenging problem in the field of natural language processing.

O’Hara and Wiebe, 2003, introduce the semantic role labeling of prepositional phrases for 7 Penn Treebank labels and 129 FrameNet roles. They train a decision tree on features derived from WordNet for generalization, and show that using high-level features, such as semantic roles, significantly aids prediction. The roles that they predict from FrameNet, however, are quite specific and primarily related to the frame that is evoked by a verb instead of to the preposition semantics itself. The labels attached to the Penn Treebank are coarse and heavily skewed. Neither of the two inventories is based on preposition word senses. Therefore, later approaches have created a different set of relations that is motivated bottom-up from word senses, merging frame elements together.

Hartrumpf, Helbig, and Osswald, 2006, make a distinction between regular and irregular phenomena (though they also note ‘subregular’ phenomena), in which a preposition has a literal or non-literal meaning. They also take into account the linguistic distinction between the prepositional phrase as complement or adjunct, which indicates how strongly it is linked to the governor. Their system for preposition interpretation creates rule-based proposals, and selects from those proposals using a statistical back-off model. Their parse structures are part of the MultiNet paradigm, which is a knowledge representation formalism for natural language semantics used in their semantic annotation efforts of German text. Their rule-based system makes use of extensive rule-based semantic modelling structures that constrain the types of predictions, hindering the comparison of their approach to related work.

Litkowski and Hargraves, 2005, introduce the Preposition Project, which is a dataset of prepositions in FrameNet sentences disambiguated using senses from the Oxford English Dictionary. By labeling prepositions in FrameNet, they create a corpus that was used for the SemEval2007 challenge of preposition word-sense disambiguation (Litkowski and Hargraves, 2007). In this challenge, three teams participated. The team that achieved the highest score used a feature-rich logistic regression classifier.

Tratz and D. Hovy, 2009, later improve preposition sense disambiguation, beating the systems created for the SemEval2007 challenge. They introduce a large number of features, based on syntactic context and WordNet synonyms and hypernyms. These were expanded and analysed in D. Hovy, Tratz, and E. Hovy, 2010, and their performance was compared on the SemEval2007 challenge dataset and the Penn Treebank labels. They conclude that close mutual constraints hold between the elements of the prepositional phrase, which motivates the use of a joint model of preposition arguments in this thesis.

Srikumar and Roth, 2013b, train a supervised model of preposition semantics using a structural SVM. Their Model 1 predicts a class label given a WordNet-based feature-rich representation of the arguments, both on their own and conjoined with the preposition. They extend the latent structural SVM for their Model 2, which also predicts the arguments themselves from a set of candidates proposed by a dependency parser. This work uses their inventory of preposition relations for evaluation, but does not predict the preposition arguments.

Nakashole and Mitchell, 2015, propose other interesting features related to prepositions. They train a log-linear model of prepositional phrase attachment through Expectation-Maximization on extra unlabeled data. They evaluated different linguistic feature sets, and found that the most salient features came from the semantic types of noun phrases (from categorizations such as WordNet), the semantic type of the subject of the clause, and n-grams containing the preposition.

Schneider et al., 2015, extend the inventory of Srikumar and Roth, 2013b, to create a hierarchy of preposition senses, with many connections between different senses. They also broaden the set of prepositions to include multi-word prepositions that were not annotated in the Preposition Project.

Unsupervised Approaches D. Hovy, Vaswani, et al., 2011, introduce unsupervised preposition sense disambiguation, but they don’t use supersenses. They train a separate generative model for each preposition, which finds latent word classes for the governor and object and the preposition itself. They train the model using Expectation-Maximization, Maximum a posteriori EM and Bayesian Inference with Dirichlet priors, and evaluate it by mapping the resulting preposition sense clusters to the labels with which they most frequently overlap. They achieve a score of 0.55 accuracy, which is a significant improvement over the baseline of 0.40.

2.2.2 Parsing Noun-Noun compounds

Noun-noun compounds express a latent relation between the modifier and head word (Downing, 1997).

Compound Splitting Some languages, such as German, compound nouns together into single words. Splitting those compounds reduces sparsity in natural language processing tasks. Some approaches, e.g. Daiber et al., 2015, perform compound word splitting using word embeddings.

In contrast, Dima, 2015, tries to find a function that combines the word embeddings of both nouns in a compound into the word embedding of the compound word itself. The most effective function is one that learns an additional representation for each word in the modifier and head position, and transforms them with a matrix of weights.

However, neither approach attempts to induce a latent relation inherent to the compound. One could argue that compound words might have different senses for every relation that they might express, but in practice compounds themselves are rarely ambiguous.

Paraphrasing with Verbs Downing, 1997, finds it impossible to constrain the relations expressible by noun-noun compounds. Downing identifies a minimum set of 12 relations which commonly occur, but argues that this set does not adequately capture all relations that can occur between nouns in novel compounds. This inspires Nakov, 2007, to use an unconstrained set of verbs (or verbs+preposition) for paraphrasing these relations. The two resulting datasets of search-engine-based and annotator-based paraphrases had significant overlap, indicating clear latent relations. In this way, verbs allow the set of relations to be unconstrained, but the relation does not get one unambiguous label. That makes classification difficult, and cannot result in a semantic parser.

Parsing into Classes Classes make semantic parsing easier, but you have to create a relation inventory.

Levi, 1978, introduces 9 categories of noun-noun compounds, based on ‘recoverably deletable predicates’. However, Warren, 1978, finds 6 major hierarchical categories by analysis of the Brown corpus. Other approaches result in 14 categories (Vanderwende, 1996, statistically from online dictionaries), 8 relations (Séaghdha, 2007, by analysis of the British National Corpus), or 22 relations (Girju, 2009, in a cross-lingual comparison).

Tratz and E. Hovy, 2010, create a taxonomy that was comprehensively compared to and associated with existing sets of relations, and created a large dataset of noun-noun compounds. This taxonomy was used in Tratz and E. Hovy, 2011, to enrich a statistical syntactic parser. The dataset is the largest of its kind, and publicly available online. However, the noun-noun compounds are not in sentence contexts, making them difficult to use for unsupervised semantic parsing.

In recent work, Dima and Hinrichs, 2015, improve the state of the art in noun-noun parsing on this dataset using a 2-layer neural network and pre-trained word embeddings. They also show that the representation in their hidden layer corresponds to our intuition on noun-noun compound semantics.

Paraphrasing with Prepositions With prepositions, there is no need to create a well-defined inventory. However, the ambiguity of prepositions presents a problem for annotators, resulting in low inter-annotator agreement.

Lauer, 1996, uses 8 prepositions for paraphrasing. Motivating this, Lauer explains: “French makes limited use of compounding, while in German it is heavily productive. It is therefore commonly necessary to render a German compound into an alternative French syntactic construction, such as a prepositional phrase.” However, some types of noun-noun compounds are excluded, and the resulting dataset is relatively small. Girju, 2009, compared these paraphrases to a set of classes.

Bos and Nissim, 2015, create more data using a competitive game called “Burgers” as an annotation framework. Players paraphrase noun-noun compounds using prepositions in sentence contexts, earning points for inter-annotator agreement based on the player’s confidence. The pre-selection of a limited number of prepositions to present to the players is done case-by-case in a data-driven fashion using Google n-grams. The sentences are generally in the same domain as FrameNet (e.g. newswire, texts from the American National Corpus). This results in a dataset of player annotations with confidence indications, with noun-noun paraphrases using 24 possible prepositions. I used this dataset for evaluating noun-noun relation learning from preposition data.

2.3 Information Extraction

Information extraction systems build a general-purpose knowledge base from large volumes of text (e.g. on the web). The following methods all assume the knowledge base consists of (subject, relation, object) triples, where the relation always has the same type of subject and object roles.

Lin and Pantel, 2001, introduce DIRT, a system for clustering extracted relations in order to discover inference rules. The relations are extracted using manually created rules for syntactic patterns from full parse trees, and then clustered together based on co-occurrence statistics.

Banko et al., 2007, introduce TEXTRUNNER, a system for large-scale, scalable open information extraction through a single pass over a corpus. It is based on an efficient linear classifier, trained on a small set of syntactic parses, which labels parts of the sentence as the relation and arguments, after which the relation and argument tuples are normalized, stored and assigned a probability. This method is much faster than using a full syntactic parser. However, the extracted structures are still based directly on the words that are used in the text, instead of underlying semantics. This is extended by Kok and Domingos, 2008, who merge relations and arguments together into abstract concepts using Markov Logic. The result is a semantic network grounded in text, but with no way to judge the probability of unseen facts.

Yates and Etzioni, 2009, propose a generative model that estimates the probability that two predicates are synonymous by comparing pairs of arguments. Their model, RESOLVER, then performs hierarchical agglomerative clustering and achieves high precision. However, the model is unable to deal with polysemy.

Carlson, Betteridge, and Kisiel, 2010, introduce NELL, which combines a hand-crafted taxonomy of entities and relations with self-supervised large-scale extraction from the Web, but they require additional processing for linking and integration. They combine multiple high-precision classifiers, and create a corpus of beliefs by combining their predictions and previously extracted facts.

Fader, Soderland, and Etzioni, 2011, describe REVERB, which extracts informative relations by combining a simple syntactic and a lexical constraint. Just like TEXTRUNNER, it forgoes a full syntactic parser for efficiency reasons. However, it extracts higher quality relations using a constraint on part-of-speech tags and noun chunks.

Galárraga et al., 2014, canonicalize relations and arguments extracted by NELL and REVERB. They use high recall extractors, followed by clustering methods to improve the precision. The arguments are merged together using hierarchical agglomerative clustering with various similarity scores, and the relations are merged using association rule mining. It approaches the performance of the method from Krishnamurthy and Mitchell, 2011, which uses the typed taxonomy from NELL relations to disambiguate arguments.

2.3.1 Unsupervised Semantic Parsing and Factorization

To integrate textual information into a knowledge base, you need to create useful representations that merge textual descriptions into abstract concepts. The (subject, relation, object) triples from the above approaches have a number of shortcomings. First of all, relations can hold between more than two entities. Second, every relation in the methods described above can have an inverse form, in which the subject and object are reversed. Therefore, it can be useful to split the relation into a semantic frame and semantic roles, as in the FrameNet project. When the relations or roles are less related to their surface form, it becomes possible to predict unseen facts and perform reasoning tasks.

Lang and Lapata, 2010, introduce the Latent Logistic model for inducing semantic roles in text. Their approach allows for the induction of latent classes using a rich feature set.

Titov and Klementiev, 2012, introduce two Bayesian models for unsupervised semantic role induction. As generative models, they necessarily make several independence assumptions in order to remain tractable. However, they are able to induce a hierarchical model that shares information about semantic roles between relations, using a language-independent feature set.


Yao, Riedel, and McCallum, 2012, do semantic relation clustering using variations on LDA. They construct feature representations from dependency parse patterns and the named entities occurring within them. More specifically, they use clusters from LDA as local features (from entities and words) and global features (from the sentence and document). They partition entity pairs of a path into different sense clusters with sense-LDA, a topic model that draws multiple observations from a hidden topic variable. These sense clusters are merged into semantic relations using hierarchical agglomerative clustering. The graphical model that they employ assumes the features are conditionally independent given the topic assignments, which is not always desirable.

Bordes et al., 2013, introduce a model for inducing latent representations of concepts in a semantic network, in order to predict the probability of unseen connections. Their model factorizes a semantic network using stochastic gradient descent. Factorizing the semantic network that results from a surface-form information extraction system, it is able to integrate multiple data sources and effectively perform entity ranking and word-sense disambiguation. Weston et al., 2013, describe a similar system to jointly perform knowledge-base factorization and semantic parsing. They use a large-scale semantic network to improve semantic parsing.

Riedel et al., 2013, perform a factorization of a matrix of entity pairs and relations, from both shallow (surface-form) relations and relations from knowledge bases. That factorization allows for the reconstruction of other valid relations through asymmetric implicature, e.g. historian-at implies professor-at. Their weakly-supervised model induces latent representations of the relations and entities. The factorization provides inspiration for possible alternatives to the bilinear decoding in our work.

Lewis and Steedman, 2013, discover semantic relations by clustering typed surface-form predicates. Predicates from a categorial grammar are refined by the entity types of their arguments, after which similar typed predicates are merged using LDA. The probabilistic categorial grammar can then be used to parse sentences into precise logical forms, over which reasoning can be performed through logical proofs. They treat prepositions as argument positions of the predicate, which means the model incorporates the preposition into the predicate but does not disambiguate it.

Titov and Khoddam, 2015, introduce the Reconstruction-Error Minimization framework for semantic role induction. They induce an efficient linear classifier by reconstructing the arguments of the latent relation. Because this thesis is primarily based directly on this framework, I will discuss it fully in the Model chapter. For semantic role induction, they report high cluster overlap with semantic role annotation datasets in English and German. Compared to related approaches their method induces fewer roles, which means they are more interpretable.

2.3.2 Latent Structured Prediction

Unsupervised parsing is structure prediction with latent structure. Recently, there has been progress on induction of linear classifiers through reconstruction-error minimization, and the induction of latent representations of structured outputs. As this thesis concerns the induction of a linear classifier through the reconstruction of argument structures, this research has inspired parts of the model described in the next chapter.

Daumé III, Langford, and Marcu, 2009, reduce unsupervised structure learning to supervised binary classification in a reconstruction framework. The resulting search-based structure prediction framework is evaluated on sequence labeling and unsupervised syntactic parsing.

Ammar, Dyer, and Smith, 2015, demonstrate a Conditional Random Field Auto-Encoder that is able to work tractably with global features. It is trained by block coordinate descent, alternating gradient descent and Expectation-Maximization for the encoding and decoding.

Srikumar and Manning, 2014, induce latent representations of structured outputs. They induce a joint model of the output with latent representations, which is related to the decoder used in this thesis.


Chapter 3

Model

3.1 Reconstruction-Error Minimization

FIGURE 3.1: Variables for encoding and decoding preposition relations

In the Reconstruction-Error Minimization framework (Titov and Khoddam, 2015), the goal is to find a clustering of the input into relations that help reconstruct the arguments of that relation. Inspired by neural network auto-encoders, the model consists of two parts, an encoder and a decoder. The encoder expresses how likely relations are given the input features. The decoder expresses how likely the arguments are together given a relation.

In my case, the model is trained on a dataset of parsed sentences, where every preposition usage $x$ has two arguments $a$, consisting of a governor $a_1$ and an object $a_2$.

The model is parameterized by the encoder and decoder parameters $\theta$. These parameters are optimized to maximize the likelihood of the arguments in the training data. Given a set of relations $R$, the following probability is optimized for every preposition usage in the dataset:

$$p_\theta(a \mid x) = \sum_{r \in R} p_\theta(r \mid x)\, p_\theta(a \mid r) \tag{3.1}$$

3.2 Encoder

The encoder expresses the likelihood of relations given the features of the input, $g(x)$. For example, in the phrase a professor from Amsterdam, the governor is professor, the object is Amsterdam, and the features could be represented by a set {gov_professor, obj_Amsterdam, prep_from, POS_NN-IN-NN}. This is expressed as a sparse boolean feature vector by viewing each possible feature (out of $m$ total features) as a dimension which is 1 if that feature is present, and 0 otherwise. The output of the encoder is a distribution over relations $r$, using weight vectors $w_r \in \mathbb{R}^m$.

The encoder is log-linear, in the sense that the log-probability of a relation given the input is proportional to a linear combination of the features and the weights:

$$p_\theta(r \mid x) = \frac{\exp(w_r \cdot g(x))}{\sum_{r' \in R} \exp(w_{r'} \cdot g(x))} \tag{3.2}$$

where the parameters $\theta$ include $w_r \in \mathbb{R}^m$ for $r \in R$. The right-hand term is normalized by dividing it by a summation over $r$. This results in a term equivalent to multinomial logistic regression, also known as the softmax function.

In the reconstruction-error minimization framework, the encoder can generally be any differentiable function as long as the posterior distribution of relations r can be efficiently computed or approximated. After training, the encoder constitutes the final induced classifier. Because it is an efficient log-linear model, such a classifier can be used for open-domain information extraction (Banko et al., 2007). Of course, the specific model that I evaluate in this thesis is not a complete semantic parser but only a component of one. However, it is important to evaluate such a component separately, in order to research its performance on this sub-task.
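As an illustration, the following is a minimal sketch of this log-linear encoder, assuming the active features are given as indices into the sparse boolean vector $g(x)$; the feature names and sizes are invented.

```python
import numpy as np

def encode(active_features, W):
    """Distribution over relations for one preposition usage (eq. 3.2).

    active_features: indices of the active boolean features of g(x), e.g.
                     gov_professor, obj_Amsterdam, prep_from, POS_NN-IN-NN.
    W:               |R| x m matrix holding one weight vector w_r per relation.
    """
    scores = W[:, active_features].sum(axis=1)  # w_r . g(x) for every r
    scores -= scores.max()                      # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()        # softmax over relations

# Toy usage: 33 relations and 10 features, with features 0, 3 and 7 active.
W = np.random.default_rng(0).normal(size=(33, 10))
print(encode(np.array([0, 3, 7]), W).sum())     # sums to 1.0
```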

3.3 Decoder

The decoder reconstructs the arguments given the relation. For example, the decoder should, for some relation, assign a high likelihood to the argument pair $(a_1 = \text{professor}, a_2 = \text{Amsterdam})$ and other pairs that are instances of an origin or source. For other relations, the decoder should assign a high likelihood to pairs such as $(a_1 = \text{professor}, a_2 = \text{statistics})$ (and other attribute pairs) and a lower likelihood to the former. In this way, the decoder expresses the probability of arguments from a vocabulary $V$.

The decoder is log-proportional to a function $\varphi$ of the arguments and relation, parameterized by $\theta$:

$$p_\theta(a \mid r) = \frac{\exp(\varphi(a_1, a_2, r, \theta))}{\sum_{a'_1 \in V} \sum_{a'_2 \in V} \exp(\varphi(a'_1, a'_2, r, \theta))} \tag{3.3}$$

The scoring function $\varphi$ expresses the ‘match’ between the relation and arguments.

As in the encoder (3.2), the right-hand term is normalized by dividing over a summation. Here, the normalization term sums over both arguments.

To fully analyze the properties of the reconstruction-error minimization framework, I will incrementally specify more complex decoders. I will also report the performance of these decoders in the Experiments chapter, in order to illustrate how the model generalizes.

3.3.1 Independent Argument Probability

While I aim at modelling the joint probability of argument pairs, I will first introduce independent argument decoders. These are similar to generative models, which typically assume that arguments are conditionally independent. Additionally, this allows me to break down the performance of the reconstruction-error minimization framework into different aspects.

Categorical The simplest decoder model directly keeps track of independent argument probabilities. This is inspired by Ammar, Dyer, and Smith, 2015, who use a categorical (i.e. multinomial) distribution to model the reconstruction of the input to their Conditional Random Field auto-encoder. However, their work is a graphical model in which the decoder is assigned a Dirichlet prior, thereby benefiting from regularization.

$$\varphi(a_1, a_2, r, \theta) = \theta_{a_1|r} + \theta_{a_2|r} \tag{3.4}$$

where $\theta$ includes $\theta_{a_1|r}, \theta_{a_2|r} \in \mathbb{R}$ for $a_1, a_2 \in V$, $r \in R$. The normalization ensures that the final output is a probability distribution.

Selectional Preference To generalize this model, I factorize the independent argument model using word embeddings. The parameter $\theta_{a_1|r}$ that expresses the score of an argument given a relation is replaced by an inner product of a word vector $u_a$ with a parameter vector $c$. This also allows me to use pre-trained word vectors.

The scoring function now expresses the degree to which the word and argument representations point in the same direction in vector space:

$$\varphi(a_1, a_2, r, \theta) = u_{a_1} \cdot c_{1|r} + u_{a_2} \cdot c_{2|r} \tag{3.5}$$

where $\theta$ includes $u_a, c_{1|r}, c_{2|r} \in \mathbb{R}^d$ for $a \in V$, $r \in R$.

Because these decoders do not express the interaction between arguments, they are more similar to the generative models, which assume the arguments are conditionally independent. However, they do allow me to analyse the interaction between the encoder, which does capture interdependencies, and the working of the decoder.
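A minimal sketch of the two independent scoring functions (eqs. 3.4 and 3.5) follows, with invented sizes; the normalization into a distribution is omitted, as it only divides by a sum over the vocabulary.

```python
import numpy as np

def score_categorical(a1, a2, r, theta_cat):
    """Categorical decoder (eq. 3.4): one scalar per (slot, word, relation)."""
    # theta_cat has shape (2, |V|, |R|): slot 0 = governor, slot 1 = object.
    return theta_cat[0, a1, r] + theta_cat[1, a2, r]

def score_selpref(a1, a2, r, U, c):
    """Selectional preference decoder (eq. 3.5): embeddings x slot vectors."""
    # U: |V| x d word vectors; c: (2, |R|, d) per-slot relation vectors.
    return U[a1] @ c[0, r] + U[a2] @ c[1, r]

# Toy parameters: V=5 words, R=3 relations, d=4 dimensions (all invented).
rng = np.random.default_rng(0)
theta_cat = rng.normal(size=(2, 5, 3))
U, c = rng.normal(size=(5, 4)), rng.normal(size=(2, 3, 4))
print(score_categorical(1, 2, 0, theta_cat), score_selpref(1, 2, 0, U, c))
```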

3.3.2 Joint Argument Probability

In reality I want to model the joint probability of arguments to model a relation. However, that joint signal is very sparse, which precludes the explicit training of argument pair scores. Therefore, a joint argument model is necessarily factorized to overcome sparsity.

While many factorized joint models are possible (e.g. Srikumar and Manning, 2014, Bordes et al., 2012, or Weston et al., 2013), I will examine only a bilinear model due to the scope of this thesis. As the interaction between encoder and decoder will prove to be the primary cause of local minima, that has been the focus of this work.

Bilinear Model To encode the interaction between both arguments, here the scoring function is computed using the product of word representations and a square matrix that represents the relation itself. By taking the left or right product of the relation matrix with one of the word vectors, it is possible to make a prediction about the other argument. Word vectors that are near the result of this operation in vector space are likely candidates for that argument.

For example, if $a_1 = \text{professor}$, the model should learn a relation $r$ for which $\hat{u}_{a_2} = u_{\text{professor}}^T C_r$ is near word representations such as $u_{\text{Amsterdam}}$ or $u_{\text{Stanford}}$.

The motivation of this model is that the signal from the combination of the arguments is much more informative than from the arguments separately. Hopefully, such a signal can replace direct supervision for training semantic parsers. In this work, the relations are induced from a limited signal that is constrained to predicate-argument structures in certain contexts. The experiments will show whether the learning signal is strong enough for such an expressive model.

$$\varphi(a_1, a_2, r, \theta) = u_{a_1}^T C_r\, u_{a_2}$$

where $\theta$ includes $u_a \in \mathbb{R}^d$, $C_r \in \mathbb{R}^{d \times d}$ for $a \in V$, $r \in R$.
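A minimal sketch of the bilinear score and of the prediction-by-left-product idea from the paragraph above; all sizes and values are invented.

```python
import numpy as np

def score_bilinear(a1, a2, r, U, C):
    """Bilinear decoder score: u_{a1}^T C_r u_{a2}."""
    return U[a1] @ C[r] @ U[a2]

def rank_objects(a1, r, U, C):
    """Rank candidate objects: words near the predicted vector u_{a1}^T C_r."""
    predicted = U[a1] @ C[r]                 # estimate of the object vector
    return np.argsort(U @ predicted)[::-1]   # best candidates first

# Toy parameters: V=5 words, R=3 relations, d=4 dimensions (all invented).
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 4))
C = rng.normal(size=(3, 4, 4))               # one d x d matrix per relation
print(score_bilinear(1, 2, 0, U, C), rank_objects(1, 0, U, C))
```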

3.4 Optimization

3.4.1 Stochastic Gradient Descent

The model is trained by stochastic gradient descent, using AdaGrad (Duchi, Hazan, and Singer, 2011) for learning rate adaptation. Stochastic gradient descent is based on iteratively updating the parameters of the model using the gradient of the loss function evaluated on part of the training data. In my case, the unregularized loss function per training example is the negative log-likelihood

$$Q_i(\theta) = -\log p_\theta(a_i \mid x_i)$$

for each input $x_i$.

The training consists of iterating over random subsets $S$ of the data (called ‘mini-batches’), calculating the loss, and updating the parameters accordingly, using a learning rate $\eta$:

$$\theta \leftarrow \theta - \eta \sum_{i \in S} \nabla Q_i(\theta)$$

AdaGrad However, this often causes the updates to perform poorly on parameters that are used infrequently. Additionally, it is difficult to find the right learning rate in practice. To overcome this problem I used AdaGrad, which is a subgradient method that dynamically incorporates information about the problem during the training. Here the gradient is divided by its historical magnitude per parameter $\theta_j$:

$$h_j \leftarrow h_j + \big(\nabla_j Q_i(\theta)\big)^2$$
$$\theta_j \leftarrow \theta_j - \eta \sum_{i \in S} \frac{\nabla_j Q_i(\theta)}{\sqrt{h_j}}$$

This results in faster and more reliable convergence.
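A minimal sketch of the AdaGrad update above, applied to a toy quadratic loss; the small epsilon guarding the square root is a common numerical safeguard added here, not part of the formula above.

```python
import numpy as np

def adagrad_step(theta, grad, hist, lr=0.5, eps=1e-8):
    """One AdaGrad update: divide the gradient by its historical magnitude."""
    hist += grad ** 2                           # h_j <- h_j + (grad_j)^2
    theta -= lr * grad / (np.sqrt(hist) + eps)  # per-parameter adaptive step
    return theta, hist

# Toy usage on Q(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta, hist = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, hist = adagrad_step(theta, theta.copy(), hist)
print(theta)                                    # approaches the minimum at 0
```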

3.4.2 Encoder regularization using its entropy

One problem when training the model described in this chapter is early convergence on local minima. This happens when the encoder initially does not reflect sufficiently high probabilities for all relations. In that case, the learning signal of the decoder is focussed on only the relations predicted by the encoder, which leads the decoder to neglect the remaining relations and discard them in the course of training.

A simple way to overcome this is to add a term to the objective that favors encoder predictions that are more evenly distributed over relations. This is reflected in the entropy of the encoder predictions: the higher the entropy, the more evenly distributed the encoder predictions are. The loss thus becomes:

$$Q_i^H(\theta) = Q_i(\theta) - H\big(p_\theta(r \mid x_i)\big) = Q_i(\theta) + \sum_r p_\theta(r \mid x_i) \log p_\theta(r \mid x_i)$$

In the experiments, I report on both the model with the entropy term, and without. I also show it is possible to anneal the entropy by scaling it during training. This should avoid local minima in the beginning, and focus the model later on.
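A minimal sketch of the entropy-regularized loss for one training example, with a scale argument for the annealing just mentioned; all values are invented.

```python
import numpy as np

def entropy_regularized_loss(p_a_given_x, p_r_given_x, scale=1.0):
    """Per-example loss: negative log-likelihood minus scaled encoder entropy.

    p_a_given_x: model likelihood p(a|x) of the observed argument pair.
    p_r_given_x: encoder distribution over relations for this input.
    scale:       annealing factor for the entropy term (1.0 = no annealing).
    """
    nll = -np.log(p_a_given_x)
    entropy = -np.sum(p_r_given_x * np.log(p_r_given_x + 1e-12))
    return nll - scale * entropy    # high encoder entropy lowers the loss

# A peaked encoder distribution is penalized relative to a flat one.
flat = np.full(33, 1 / 33)
peaked = np.zeros(33); peaked[0] = 1.0
print(entropy_regularized_loss(0.1, flat), entropy_regularized_loss(0.1, peaked))
```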

3.4.3 Extension: decoder approximation by sampling

An extension that would dramatically increase the efficiency of this model is candidate sampling. In the model described in this chapter, the main performance bottleneck is the calculation of the normalization for the argument prediction in (3.3), because it needs to sum over all words in the vocabulary. Many approaches to factorization problems like this approximate the normalization instead of computing it completely (e.g. Mnih, 2013, Mikolov, Chen, et al., 2013, Bordes et al., 2013, Weston et al., 2013, Riedel et al., 2013). Instead of summing over all possible words in order to compute the normalization term, these approaches treat the argument prediction as a ranking problem. The model is thus trained to rank observed arguments higher than random arguments.

For candidate sampling, the decoder log-proportional to $\varphi$ would be replaced by a ranking:

$$p_\theta(a \mid r) \approx \sum_{(\tilde{a}_1, \tilde{a}_2) \in N} f\big(\varphi(a_1, a_2, r, \theta), \varphi(\tilde{a}_1, \tilde{a}_2, r, \theta)\big)$$

where $N$ is a set of random argument pairs following some sampling distribution and $f$ is a monotonically increasing function.

Using the candidate sampling function explored by Titov and Khoddam, 2015, the decoder probability would become:

$$p_\theta(a \mid r) \approx \exp\Big(\log \sigma\big(\varphi(a_1, a_2, r, \theta)\big) - \sum_{(\tilde{a}_1, \tilde{a}_2) \in N} \log \sigma\big(\varphi(\tilde{a}_1, \tilde{a}_2, r, \theta)\big)\Big)$$

where the sampled pairs are generated to always contain one observed argument (i.e. in the form $(a_1, \tilde{a}_2)$ or $(\tilde{a}_1, a_2)$) and $\sigma$ is the logistic sigmoid function.
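A minimal sketch of this sampled objective in log space, assuming a uniform sampling distribution over corrupted arguments and k negative pairs per example; both choices are illustrative, not taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sampled_log_score(phi, a1, a2, r, vocab_size, k=5, seed=0):
    """Score the observed pair against k corrupted pairs in which one
    argument is replaced by a word sampled uniformly from the vocabulary."""
    rng = np.random.default_rng(seed)
    score = np.log(sigmoid(phi(a1, a2, r)))
    for _ in range(k):
        if rng.random() < 0.5:                # corrupt the governor...
            score -= np.log(sigmoid(phi(rng.integers(vocab_size), a2, r)))
        else:                                 # ...or the object
            score -= np.log(sigmoid(phi(a1, rng.integers(vocab_size), r)))
    return score

# Toy usage with an arbitrary scoring function (invented).
print(sampled_log_score(lambda a1, a2, r: 0.1 * (a1 - a2 + r), 1, 2, 0, 100))
```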

While I have experimented with this approximation in the initial stages of my research, preliminary results indicated that its use would obscure the primary contributions of this thesis. Namely, the optimization difficulties arising from the interaction between the encoder and decoder outweighed the gains that would result from efficiency. Although it would allow for a larger training corpus, it would have been unclear how it had contributed to the induced relations. Therefore, I have not used this setup in my final experiments, and I report results only using the full softmax normalization.


Chapter 4

Experiments

Ideally, unsupervised semantic parsing methods should be evaluated in a task-based setting such as question answering or information extraction. However, such domains are too broad for the narrow domain of preposition parsing. Therefore, I compare unsupervised semantic parses to datasets of gold-standard annotations.

To evaluate the reconstruction-error minimization framework for preposition relation discovery, I performed three experiments. Two of these relate preposition relations to noun-noun compound relations. My assumption is that the relations expressed by prepositions and compounds are similar, as described in Chapter 3.

4.0.1 Initialization and Hyperparameters

To prevent an explosion of the number of parameters in the bilinear model, I used a relatively small word representation dimensionality of 40. When using larger pre-trained word representations, I reduced their dimensionality using principal component analysis, and re-scaled them to have zero mean and a unit $L_2$ norm.
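A minimal sketch of that preprocessing, assuming PCA is computed via an SVD of the mean-centered embedding matrix and that "zero mean" applies per vector; both are assumptions about details the text leaves open.

```python
import numpy as np

def reduce_and_rescale(embeddings, d=40):
    """PCA-reduce pre-trained embeddings to d dimensions, then rescale each
    vector to zero mean and unit L2 norm."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ Vt[:d].T                   # top-d principal components
    reduced -= reduced.mean(axis=1, keepdims=True)  # zero mean per vector
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / norms                          # unit L2 norm per vector

# Toy usage: 100 pre-trained 300-dimensional vectors (values invented).
vectors = np.random.default_rng(0).normal(size=(100, 300))
print(reduce_and_rescale(vectors).shape)            # (100, 40)
```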

The encoder parameters $w$ were initialized with a standard gaussian distribution. The parameters of the decoder were initialized differently per type of decoder. The categorical decoder (3.3.1) was initialized with a flat distribution, i.e. all weights set to 1. The selectional preference decoder (3.3.1) was initialized with a standard gaussian, to match the word representations. I initialized the parameters of the bilinear decoder (3.3.2) with a zero-mean gaussian, with a standard deviation of $\sqrt{d}$, where $d$ is the word representation dimensionality. This ensured that its dot product with a standard gaussian vector resulted in a vector of the same magnitude.

All models were optimized using AdaGrad with a step size of 0.5. I used mini-batches of 50 inputs, and trained on 90% of the data until the objective on the held-out 10% stopped falling.

The models were always trained using 33 possible classes, following Srikumar and Roth, 2013a.

4.0.2 External Data

For the word representations, I used Google’s word2vec word vectors, pre-trained on about 100 billion words of Google News (from Mikolov, Chen, et al., 2013). These are 300-dimensional representations of 3 million words and phrases. Some words from my input data (e.g. some names) did not occur in this vocabulary, so for those I used the average over all word representations.

For the input features, I used the 320 Brown clusters from Turian, Ratinov, and Bengio, 2010.

4.0.3 Evaluation

As in previous work on unsupervised semantic parsing, I evaluate my model using purity, collocation and their harmonic mean $F_1$. Purity is the ratio of largest overlaps of the clusters with the gold, and collocation is the ratio of largest overlaps of the gold with the clusters:

$$\mathrm{PU} = \frac{1}{N} \sum_i \max_j |G_j \cap C_i| \tag{4.1}$$
$$\mathrm{CO} = \frac{1}{N} \sum_j \max_i |G_j \cap C_i| \tag{4.2}$$
$$F_1 = \frac{2\, \mathrm{PU} \times \mathrm{CO}}{\mathrm{PU} + \mathrm{CO}} \tag{4.3}$$

where $C_i$ is the set of instances in the $i$-th induced cluster, $G_j$ is the set of instances in the $j$-th gold cluster, and $N$ is the total number of instances.
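A minimal sketch of these three metrics, with the induced and gold clusterings given as mappings from instance id to cluster label (the data format is an illustrative choice).

```python
from collections import defaultdict

def purity_collocation_f1(induced, gold):
    """Cluster-overlap metrics of eqs. 4.1-4.3."""
    by_induced, by_gold = defaultdict(set), defaultdict(set)
    for inst, label in induced.items():
        by_induced[label].add(inst)
    for inst, label in gold.items():
        by_gold[label].add(inst)
    n = len(induced)
    pu = sum(max(len(c & g) for g in by_gold.values())
             for c in by_induced.values()) / n          # eq. 4.1
    co = sum(max(len(g & c) for c in by_induced.values())
             for g in by_gold.values()) / n             # eq. 4.2
    return pu, co, 2 * pu * co / (pu + co)              # eq. 4.3

# Toy usage with five instances (labels invented).
induced = {1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
gold = {1: "X", 2: "X", 3: "X", 4: "Y", 5: "Y"}
print(purity_collocation_f1(induced, gold))             # (0.8, 0.8, 0.8)
```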

4.1 Discovering Preposition Relations

To discover preposition relations, I trained the model on the prepositions in a large parsed dataset of FrameNet sentences (Bauer, Fürstenau, and Rambow, 2012). This resulted in 360469 preposition instances. Then I evaluated it with classes from Srikumar and Roth, 2013a.

4.1.1 Data and Evaluation

The majority of FrameNet sentences is taken from the British National Corpus. Bauer, Fürstenau, and Rambow, 2012, parsed all sentences using the Stanford dependency parser, version 2.0.1, and aligned the FrameNet annotations to the parses. I used the head and dependent of these prepositions in the automatic parses as governor and object, resulting in a vocabulary of 53367 words.

Features From these parses I extracted a set of features that were inspired by Srikumar and Roth, 2013b, who extend the feature set from Tratz and D. Hovy, 2009. Because my aim is to induce a fully unsupervised and language-independent classifier, I did not use any features based on WordNet, a thesaurus, or language-specific heuristics. My feature set therefore consists of only the word, part-of-speech tag, capitalization indicator and Brown cluster for the governor and object. Like Srikumar and Roth, 2013b, I use both these base features, and the features conjoined with the preposition to be classified. This resulted in a set of 320763 features. Evaluating a logistic regression classifier on this feature set gives a cross-validation accuracy of 0.814.


Labels Then, I evaluated the clustering using datasets of preposition supersenses. Using the preposition 33-class supersense mapping from Srikumar and Roth, 2013a, I associated each preposition usage from the SemEval2007 challenge dataset (Litkowski and Hargraves, 2007) to a relation from the inventory. This dataset consists of 14857 preposition annotations, on sentences from the FrameNet corpus. I treated this dataset as the gold standard to evaluate against using the metrics described above.

Interestingly, this is not the only supersense inventory based on the senses from the Preposition Project (Litkowski and Hargraves, 2005). The parser described by Tratz and E. Hovy, 2011, uses a 29-class inventory of preposition supersenses, but these are not described in a publication. However, the supersenses are defined in the data supplied with the parser. Many preposition senses from the Preposition Project are matched to a specific ‘major cluster’, exactly like in Srikumar and Roth, 2013a. However, these are incomplete: some preposition senses are not matched to any relation. Therefore, I chose not to report any results on this relation inventory.

4.1.2 Baselines

Trivial Baselines In order to place the results in perspective, I include a number of trivial baselines in the results table. Every preposition can be assigned a random class from a set of some size (Random) or the same single class (One Class).

Another trivial baseline is assigning every one of the 33 most frequent prepositions to its own class (Preposition Classes). This is equivalent to assuming that every preposition has a single meaning, and that there is no overlap in word senses between different prepositions. We shall see that this assumption is stronger than we might like to imagine.

Non-Negative Matrix Factorization Non-negative matrix factorization is a decomposition method that assumes the data and components are non-negative. It finds a decomposition of the inputs into two matrices of non-negative elements, by optimizing for the squared Frobenius norm. When using a low-rank decomposition, the components can be seen as classes that explain the data. The implementation that I used alternates the minimization between both matrices using coordinate descent and is initialized by nonnegative double singular value decomposition.

Latent Dirichlet Allocation Latent Dirichlet Allocation (Blei and Hoffman, 2010) is a generative model that infers latent classes from a set of sparse inputs. Every input is associated with a mixture of classes (or ‘topics’) that explain the distribution of the input features. It is only trained on the input features, and does not assume that the classes express a relation between arguments.

For my experiments, I used the logistic regression, NMF and LDA implementations from the Python package scikit-learn (Pedregosa et al., 2011). The reconstruction-error minimization models were implemented in Theano (Bergstra et al., 2010).
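A minimal sketch of how the clustering baselines can be instantiated in scikit-learn (the parameter names follow recent scikit-learn releases; older versions call the LDA parameter n_topics, and X here is a random stand-in for the sparse instance-by-feature matrix):

```python
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Random stand-in for the binary instance-by-feature matrix.
X = np.random.RandomState(0).binomial(1, 0.01, size=(1000, 500))

# 33 latent classes, matching the size of the supersense inventory.
nmf = NMF(n_components=33, init='nndsvd', solver='cd')
lda = LatentDirichletAllocation(n_components=33, learning_method='online')

labels_nmf = nmf.fit_transform(X).argmax(axis=1)  # hardened assignments
labels_lda = lda.fit_transform(X).argmax(axis=1)
```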


(A) Categorical model

Size    Instance                      p(r|x)
1/18    clack behind him              1.000
        smile on him                  1.000
        confided in him               1.000
        note for them                 1.000
        trod on him                   1.000
1/23    padded across room            0.999
        lurched across room           0.998
        rustled around room           0.993
        sauntering around room        0.985
        tiptoeing into room           0.983
1/26    heights above he              0.993
        Festival in June              0.991
        ownership of goods            0.984
        criticized as breach          0.982
        threw from bucket             0.974
1/27    wander along road             0.989
        stamped down drive            0.977
        hidden inside skirts          0.976
        disapproval at behaviour      0.951
        meditates on meaning          0.948
1/28    descended on one              0.993
        skip from side                0.984
        take round get                0.975
        steps through woods           0.969
        remarks from Jefferson        0.956

(B) Selectional Preference model

Size    Instance                      p(r|x)
1/17    waded among crowds            0.999
        clattered to ground           0.998
        slithered along veins         0.994
        wound around margins          0.993
        hobbled to side               0.980
1/18    dwelt with me                 0.996
        renting from him              0.996
        arched at him                 0.993
        glanced about him             0.986
        smirked at him                0.986
1/21    certainty about Russia        0.998
        pilgrimage to Grantchester    0.997
        pilgrimage around world       0.997
        consultant in Aberdeen        0.997
        people in Vienna              0.997
1/23    wonderful with orchestras     0.999
        unfriendly to outsiders       0.996
        free for members              0.994
        investigations by groups      0.986
        distinguish between couples   0.967
1/24    draped over table             0.996
        festooned with fruits         0.996
        rested on table               0.996
        reigned over Palestine        0.996
        followed towards table        0.995

(C) Bilinear model

Size    Instance                      p(r|x)
1/12    swimming in gravy             0.999
        aspire to comfort             0.997
        sidled through door           0.981
        aspire to consciousness       0.978
        stared across table           0.971
1/20    walk around Bolton            0.987
        resignation on 2              0.975
        skip from side                0.968
        excitement in eyes            0.954
        revenge on enemies            0.953
1/20    cut through corridor          0.997
        skulking among vegetation     0.995
        reduce to month               0.982
        ambled along riverbank        0.932
        swished over sand             0.919
1/21    hem of jacket                 0.989
        torn into bits                0.976
        decayed into shapes           0.971
        brilliance of writing         0.969
        happened before death         0.961
1/27    head of latter                0.957
        view through haze             0.956
        suspicions against him        0.940
        displeasure in voice          0.926
        astonished at achieved        0.925

FIGURE 4.1: The five most confident predictions for the five largest clusters induced by the reconstruction-error minimization models


Model                  PU    CO    F1
Random                 0.20  0.05  0.08
One Class              0.20  1.00  0.33
Preposition Classes    0.48  0.58  0.52

FIGURE 4.2: Trivial Baselines

                        -prep (classifier: 0.814)    +prep (classifier: 0.816)
Model                   PU    CO    F1               PU    CO    F1
NMF                     0.26  0.42  0.32             0.47  0.47  0.47
LDA                     0.34  0.40  0.37             0.42  0.53  0.47
Categorical             0.23  0.16  0.19             0.26  0.18  0.21
Selectional Preference  0.24  0.20  0.21             0.24  0.19  0.21
Bilinear                0.22  0.30  0.25             0.21  0.37  0.27

FIGURE 4.3: Preposition Relation Induction Results. The left side is excluding the preposition feature template, the right side including it. All reconstruction-error models are trained with an entropy regularization term. The scores between brackets are cross-validated accuracy scores of a logistic regression classifier.

Model                     PU    CO    F1
No entropy term
  Categorical             0.24  0.50  0.32
  Selectional Preference  0.23  0.49  0.31
  Bilinear                0.20  0.52  0.29
Annealed entropy term
  Categorical             0.26  0.20  0.23
  Selectional Preference  0.23  0.20  0.21
  Bilinear                0.21  0.56  0.30

FIGURE 4.4: Preposition Relation Induction Results, with varying entropy term. Here, no preposition feature template was used.


(A) Preposition Classes    (B) NMF

FIGURE 4.5: Confusion matrices for the Preposition Classes baseline, and for NMF when the preposition feature template is used.

4.1.3 Results

From table 4.2, we can see that the prepositions themselves correspond strongly to the annotated classes in the inventories. Inspecting the clustering baselines (NMF and LDA), I found that the features with most weight were always frequently occurring features that were conjoined to the preposition. Indeed, when adding the preposition itself as a feature template, we can observe striking results.

Comparing the left and right side of table 4.3, it is evident that the preposition feature template has profound effects on unsupervised learning performance. However, when training a supervised logistic regression classifier, the difference in performance is much smaller: the cross-validated accuracy is 0.814 without, and 0.816 with the preposition feature. Additionally, from the confusion matrices of the Preposition Classes baseline and the clustering baselines (figure 4.5) it is clear that NMF creates a clustering very similar to the Preposition Classes. Therefore it is safe to say that these clusterings use the strong signal from the prepositions themselves to cluster the data. The features that lead to good clustering performance might be very different from those that lead to good classification performance. Nevertheless, neither of the baselines is able to beat the Preposition Classes baseline, which is based on the naïve assumption that prepositions are not polysemous.

The models based on reconstruction-error minimization also fail to induce relations that correspond to the annotations. I have tried to shed light on the interaction between encoder and decoder training through two additional experiments. In table 4.4, I report the clustering scores for two variations of the models. In the first, they are trained without the entropy term. In the second, the entropy term is gradually removed using a sigmoidal annealing schedule. This means both variants generally end up with a lower number of predicted relations, which is reflected in higher collocation scores. However, it is the purity of the models that we are most interested in, and this is unfortunately hardly affected.
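For reference, a sigmoidal annealing schedule of this kind can be implemented as below; the midpoint and steepness constants are illustrative stand-ins, since the exact values are not stated here.

```python
import math

def entropy_weight(epoch, n_epochs, steepness=10.0, midpoint=0.5):
    """Weight of the entropy term at a given epoch: close to 1 early in
    training, decaying smoothly towards 0 around `midpoint`."""
    t = epoch / float(n_epochs)
    return 1.0 - 1.0 / (1.0 + math.exp(-steepness * (t - midpoint)))

print([round(entropy_weight(e, 10), 2) for e in range(11)])
```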


Qualitative evaluation Why do the reconstruction-error models induce clusters that differ so much from the annotations? In table 4.1, we can inspect the largest clusters, and the training instances that belong to them with highest probability.

For the Categorical model, the two largest clusters seem to be heavily skewed towards a single unique object ('him' and 'room'). These trivial clusters are successful in reconstructing the highest-frequency words, but do not correspond to any interesting semantics. The other clusters do not seem to reflect distinct concepts either.

The clusters in the Selectional Preference model have clearer semantics. The largest cluster has a clear location focus. However, here too there are clusters with trivial interpretations: the second-largest cluster again almost exclusively groups instances together based on the object word. Interestingly, the third-largest cluster has clearly geographically oriented object matches, but does not express the relation that captures them. The fourth is clearly about groups of people, but again does not match an actual relation.

The clusters in the Bilinear model are very hard to interpret. Sometimes they tend towards a location or non-location emphasis, but there is no clear pattern. It is likely that the bilinear model is able to encode subtle vector-space operations in its matrices that optimize argument reconstruction, but do not lead to interpretable relations.

4.2 Paraphrasing Noun-Noun Compounds with Preposition Relations

To investigate whether it is possible to transfer preposition relations to noun-noun compounds, I trained several classifiers to predict a preposition from a pair of nouns. In particular, I wanted to know whether the factorization of the preposition occurrences would correspond to human paraphrases of noun-noun compounds.

First, I factorized the noun pairs of every noun-preposition-noun occurrence in the parsed dataset of FrameNet sentences (Bauer, Fürstenau, and Rambow, 2012). Then, I evaluated the suitability of those factorizations to noun-noun pairs. This was done by comparing the noun-noun pair preposition paraphrases from Bos and Nissim, 2015 (the results of the Burgers game in Wordrobe) to the expansion of the pairs for every preposition factorization.

I also trained a number of classification baselines with which to compare the factorizations.

4.2.1 Model

For this problem, I created a classifier based on the factorization of the noun-noun co-occurrences. Like the bilinear decoder, the word pairs are factorized by finding a matrix for every relation that encodes the compatibility of two word vectors. However, for this problem the matrix is induced by optimizing the prediction of the preposition from the noun pair. More formally, I maximized the likelihood of the labels in the dataset:

\[
p(r \mid n_1, n_2) = \frac{\exp(u_{n_1}^{\top} C_r \, u_{n_2})}{\sum_{r' \in R} \exp(u_{n_1}^{\top} C_{r'} \, u_{n_2})}
\]

FIGURE 4.6: Differences in preposition distribution between Burgers and FrameNet

For the word embeddings, I explored two options: random initializa-tion, and using fixed pre-trained representations.
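The forward computation of this classifier can be sketched in numpy as follows. The parameters here are random stand-ins; in the experiments, the matrices C (and, in the randomly initialized variant, the embeddings) are learned by maximizing the likelihood above.

```python
import numpy as np

def preposition_probs(u1, u2, C):
    """p(r | n1, n2): softmax over the bilinear scores u1^T C_r u2.

    u1, u2 : d-dimensional governor and object noun embeddings
    C      : (R, d, d) tensor with one matrix per preposition relation
    """
    scores = np.einsum('i,rij,j->r', u1, C, u2)
    scores -= scores.max()              # for numerical stability
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.RandomState(0)
d, R = 8, 14                            # rank 8, 14 prepositions
u1, u2, C = rng.randn(d), rng.randn(d), 0.1 * rng.randn(R, d, d)
print(preposition_probs(u1, u2, C).sum())   # sums to 1
```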

4.2.2 Data and Evaluation

The training set consisted of the 125448 noun-preposition-noun triples in the parsed FrameNet corpus, for the 14 prepositions that were used by Bos and Nissim, 2015. This results in a vocabulary of 30236 different nouns.

The Burgers annotations from Bos and Nissim, 2015 are not disambiguated: the noun-noun compounds are paraphrased using different prepositions by different annotators. Therefore, I also evaluate against the distribution of annotations, using the cross-entropy between the annotation distributions and the model prediction distribution for all classes. Table 4.7 shows some examples of the Burgers paraphrases.

The training accuracy was computed by 3-fold cross-validation. The Burgers accuracy was computed by taking the preposition annotated most often as the gold label.
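The distributional comparison can be sketched as follows (a minimal version, where eps guards against zero-probability predictions):

```python
import numpy as np

def annotation_cross_entropy(q, p, eps=1e-12):
    """Cross-entropy H(q, p): q is the relative frequency of each
    preposition among the annotators for one compound, p the model's
    predicted distribution over the same preposition classes."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return -np.sum(q * np.log(p + eps))

# e.g. three of four annotators paraphrased with 'of', one with 'for'
print(annotation_cross_entropy([0.75, 0.25], [0.6, 0.4]))
```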

4.2.3 Baselines

The most frequent preposition in the dataset is of, but the training data is much more skewed towards of than the Burgers annotations (see table 4.6). Always predicting of therefore serves as the majority-class baseline.


Noun compound        Paraphrase
chemical company     company in chemical(s)
rebel hideout        hideout for rebel(s)
grain exports        exports on grain(s)
hunger strike        strike with hunger(s)
fashion houses       houses of fashion(s)
opium production     production of opium(s)
level officials      officials on level(s)
drug trafficker      trafficker in drug(s)
sting operation      operation for sting(s)
Government forces    forces of Government(s)

FIGURE 4.7: Examples of noun-noun paraphrases from Burgers

Logistic Regression The first baseline is a logistic regression model that predicts a preposition from the two nouns. It uses separate parameters for a noun depending on whether it occurs as the governor or as the object.

RESCAL RESCAL is an approximate tensor factorization method based on an alternating least squares algorithm. It has previously been used to factorize YAGO (Nickel, Tresp, and Kriegel, 2012). In this experiment, I performed a grid search over possible values of the 'rank' hyperparameter, which determines the size of the word representations. Through cross-validation, a rank of 8 performed best, and that is the value I report results on. To make the Bilinear model with random initialization comparable to RESCAL, I forced it to use word representations of the same size. It is, however, still possible that the bilinear model would perform differently with a different rank setting.
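As an illustration only, the factorization call with Nickel's reference ALS implementation looks roughly like the sketch below; the import path and return signature are assumptions about that package, not something this text specifies.

```python
from scipy.sparse import lil_matrix
from rescal import rescal_als  # assumed import from Nickel's ALS code

n_nouns, n_preps = 30236, 14
# One sparse adjacency slice per preposition:
# X[r][i, j] = 1 iff nouns i and j co-occur under preposition r.
X = [lil_matrix((n_nouns, n_nouns)) for _ in range(n_preps)]
# (fill X from the parsed FrameNet triples here)

# rank = size of the latent noun representations; 8 won the grid search
A, R, fit, itr, exectimes = rescal_als([x.tocsr() for x in X], 8)
# A: (n_nouns, 8) noun embeddings; R: fourteen (8, 8) relation matrices
```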

Word Vector Classifier To test the effectiveness of the word representations themselves, I trained another logistic regression model on the concatenated word vectors of the nouns, i.e. 80-dimensional dense vectors.
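A sketch of this baseline with toy stand-ins for the embeddings and triples (each noun vector is 40-dimensional here, so the concatenation is 80-dimensional):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
emb = {n: rng.randn(40)                         # toy 40-d noun vectors
       for n in ('company', 'chemical', 'hideout', 'rebel')}
triples = [('company', 'in', 'chemical'),
           ('hideout', 'for', 'rebel')] * 3     # toy training triples

X = np.vstack([np.concatenate([emb[g], emb[o]]) for g, _, o in triples])
y = [prep for _, prep, _ in triples]
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:1]))                       # e.g. ['in']
```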

4.2.4 Results

From table 4.9, it is evident that a simple logistic regression model is hard to beat by factorization-based classification. Often the factorization models do not even outperform the majority-class baseline. However, factorizing the relations using the pre-trained word embeddings does lead to a distribution that is more similar to the human annotations. That seems to indicate that this model generalizes noun-(preposition)-noun relations in a way closer to human intuitions.

However, no matter how strongly the single-relation assumption for prepositions performs (as we saw in the previous section), the fact is that prepositions are polysemous. This is also highlighted by Bos and Nissim, 2015, who stress that prepositional paraphrases of noun-noun compounds are not a panacea for relation annotation. In the next section, I will suggest an evaluation for joint noun-noun and preposition relation induction that strives to overcome this single-relation assumption.
