Master thesis:
Automatic acquisition of qualia roles
using an Italian semantically annotated corpus
by
Marco Trovato Mascali
Supervisor: Gosse Bouma
European Master of Language and Communication Technologies
University of Groningen
University of Nancy 2
August 2011
Declaration
I hereby confirm that the thesis presented here is my own work, with all assistance
acknowledged.
Abstract
This thesis investigates and improves a machine learning approach that automatically recognizes agentive and telic roles for nouns in a large CoNLL-parsed Italian corpus.
Following the work of Yamada and Baldwin [25], I use a token-level supervised classifier to dynamically discover, for each noun in a set, a number of verbs considered to be the agentive or telic role of that noun. Furthermore, after creating three semantically different training sets, corresponding to three subclasses of the SimpleOWL ontology (Location, Artifact and LivingEntity), I run two different experiments to evaluate whether selecting positive and negative instances from the same semantic group can improve the results, and to understand whether semantic features are useful for this goal. The research rests on the assumption of the compositional nature of natural language, as described in the Generative Lexicon (GL) theory by Pustejovsky, and in particular on the importance of the so-called qualia roles. Lexica based on GL can be used in many computational linguistic tasks such as question answering and textual entailment, but they are extremely time-consuming and expensive to develop manually.
Contents
Abstract
1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Outline
2 Related works
  2.1 Existing lexical databases
  2.2 Qualia acquisition techniques
3 Resources
  3.1 The Simple OWL ontology
  3.2 ItalWordnet
  3.3 A semantically annotated Italian corpus
4 Mapping Simple OWL and ItalWordnet
  4.1 Motivation
  4.2 Mapping
  4.3 Remarks and results
5 Creating the Gold Standard
  5.1 Preparing the data
  5.2 Instructions to the annotators
  5.3 Profile of the annotators
  5.4 Analysis of the Gold Standard
6 The machine learning method
  6.1 Features
  6.2 Creating the training data
  6.3 Experiments
7 Evaluation and results
  7.1 Training on only one semantic class
  7.2 Training on different semantic classes
  7.3 Summing up and comparing the results
8 Conclusions and future work
Bibliography
Chapter 1
Introduction
One of the advantages of the Generative Lexicon theory is that it addresses the main issue of semantic knowledge: while lexical semantic knowledge appears to be extremely varied and irregular, computable and unambiguous knowledge is necessary for Natural Language Processing. Pustejovsky's main idea was that lexical semantic knowledge can be formalized through a decompositional process, which led him to identify four basic “oppositions” [16], called qualia roles. Each of these roles is a feature consisting of a logical predicate. Any given lexical unit can have at most four roles (often some roles are not typical for a semantic class):
Formal : the class the entity belongs to.
Constitutive : relations about the internal constitution of an entity.
Agentive : the origin of the entity or the way this was created.
Telic : the typical purpose or function of the entity.
Generalizing a little, we can say that through the qualia roles a noun can be related to other nouns by means of classical relations such as hyperonymy, and to verbs specifying its origin and its typical functions. For example, the noun “beer” would have the following qualia roles:
Formal : beverage
Constitutive : barley
Agentive : to produce
Telic : to drink
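A qualia structure like the one above can be represented directly as a small data structure. The sketch below is only an illustration of the idea (the names `qualia` and `role` are mine; the values simply transcribe the “beer” example):

```python
# Minimal sketch of a qualia structure as a Python dict.
# The four roles follow the Generative Lexicon theory; the values
# for "beer" are the ones given in the example above.
qualia = {
    "beer": {
        "formal": "beverage",      # the class the entity belongs to
        "constitutive": "barley",  # internal constitution
        "agentive": "produce",     # how the entity comes into being
        "telic": "drink",          # typical purpose or function
    }
}

def role(noun, role_name):
    """Return the filler of a qualia role, or None if missing."""
    return qualia.get(noun, {}).get(role_name)

print(role("beer", "telic"))  # -> drink
```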
In this thesis I describe an approach to improve a method for automatically acquiring Agentive and Telic roles for nouns from an Italian CoNLL-parsed and semantically annotated corpus. Similarly to Yamada and Baldwin [25], I use a token-level supervised classifier to achieve my goal. Compared to their work, I also use more complex features for the vector and I split the training data into three subsets according to the semantic classification of each noun in the Simple OWL ontology. The main hypothesis behind my research is that some semantic background knowledge and a semantically annotated corpus can improve the overall system. To evaluate the results I use a manually annotated Gold Standard consisting of 1200 noun-verb pairs indicating agentive and telic roles and a grade describing their level of “relation”.
1.1 Motivation
Despite the influence that the Generative Lexicon (GL) theory by Pustejovsky has had on the field, lexical databases based on it are not yet very developed and common. This is mainly due to the fact that creating this kind of database is extremely expensive and time-consuming for researchers.
Furthermore, the majority of studies have focused especially on the acquisition of other kinds of relations, such as the Constitutive (the internal constitution of an entity) and Formal (the so-called “is-a” relation) roles, neglecting the importance of the Telic and Agentive roles, which can be used, for example, for the interpretation of logical metonymy [12]. Such information can be useful for the interpretation of the following sentences:
(1) Beatrix enjoyed the book.
(2) Beatrix enjoyed the cake.
(1) and (2) are almost identical and syntactically equal, but the (possible) implicit verbs are clearly different:
(1a) Beatrix enjoyed reading the book.
(2a) Beatrix enjoyed eating the cake.
The interpretation of these sentences presupposes a deep knowledge of the semantics of the two entities “book” and “cake”. Going a bit deeper into the analysis of this example, it is possible to state that once we know the characteristics of an entity, we can probably use them for other entities belonging to the same semantic class. So, for example, the interpretation of (2) would be similar if we substituted “cake” with another noun classified as “Food”. Having a large lexicon describing Telic and Agentive roles can help the automatic interpretation of such cases. More generally, GL lexicon databases can be extremely useful for a variety of different tasks, from anaphora resolution to machine translation [14]. Research dedicated to the automatic acquisition of qualia roles has usually led to modest results, so much still has to be done.
1.2 Goals
The goal of my thesis is to improve the results obtained by other works on the automatic discovery of qualia roles. In particular, I want to focus on the importance of filtering the training data and splitting it into different semantic classes. Furthermore, I expect that using features of a semantically annotated corpus should help achieve this goal.
1.3 Outline
In chapter 2 I will discuss some related works; in particular I will introduce two already existing lexical semantic databases annotated following the GL theory, and then I will compare other studies on the automatic acquisition of qualia roles. Chapter 3 will be dedicated to the description of the resources I need for my research: the Simple OWL ontology, an OWL resource based on the Simple model [13]; ItalWordnet, the Italian version of the well-known Wordnet developed at Princeton University; and finally wikiCoNLL, an Italian CoNLL-parsed and semantically annotated corpus extracted from Wikipedia (http://wacky.sslmit.unibo.it/doku.php?id=download). As I need a group of semantically annotated nouns for collecting different training data, in chapter 4 I will explain how I collected them by mapping the Top Ontology of ItalWordnet to the Simple OWL ontology. The topic of chapter 5 will be a description and an analysis of the manually built Gold Standards. Chapter 6 is the main part of the thesis: it contains a description of the features used and of the training data, and it continues by describing the classification algorithms used. In chapter 7 I will show the results of the classification methods, and finally I will present the conclusions in chapter 8.
Chapter 2
Related works
2.1 Existing lexical databases
As already mentioned in the introductory chapter, there are few available lexical databases based on GL. All of them have been manually built through introspection, thanks to the effort and time that some researchers have dedicated to the goal. One of the first projects in this field was the Parole-Simple lexicon [13]. Sponsored by the European Union, Simple's main purpose was to develop “wide-coverage harmonized computational semantic lexica for 12 European languages” [22]. Parole was instead dedicated to the annotation of the morpho-syntactic layer. The combination of these two projects led to the Parole-Simple lexicon, composed of 10,000 word meanings per language, annotated according to a language-independent ontology of semantic types and to morphological and syntactic features specific to each language.
A sample of the final lexicon is publicly available at http://www.ub.edu/gilcub/SIMPLE/simple.html.
Unfortunately the entire Italian lexicon is not free.
Following the Simple specifications, Pustejovsky and other researchers [17] developed a large generative lexicon ontology and dictionary in English to be used for computational tasks: the Brandeis Semantic Ontology (BSO). It is an ongoing project [10] containing 40,000 lexical entries and 3,700 ontological nodes. As in Simple, the ontology is divided into 3 major types: events, entities and properties.
Although the qualia contained in the BSO are always correct (remember that they are the result of introspection), an evaluation performed by Havasi [10], comparing it with the British National Corpus (BNC) and ConceptNet [15], highlighted a lack of the qualia structures used in everyday language. The main differences between Simple and the BSO are the size, the BSO being much larger than any lexicon of any language of the Simple project, and the fact that the BSO lacks parallel lexicons in other languages.
2.2 Qualia acquisition techniques
The problem of the automatic acquisition of qualia is not new: several researchers have tried over the years to find a good method to extract this information.
Cimiano and Wenderoth [8] have suggested a method based on Hearst patterns [11].
Following the assumption that some semantic relations (qualia roles in this case) are related to the syntactic structure of the sentence, Cimiano and Wenderoth manually created for each role a set of queries (called clues) indicating the relation of interest, each connected to a regular expression. For example, “π(t) are used to” is one of the four clues built for the telic role, where π(t) is a variable standing for the term; it is connected to the regular expression (A|a|An|an)? NP0 is{VBZ} used{VBN} to{TO} PURP. Afterwards they sent queries like these to the Google API, then downloaded and part-of-speech tagged the snippets of the first 10 Google hits matching the clue. Finally they matched the regular expressions related to the clue and weighted the qualia candidates according to the Jaccard coefficient.
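As an illustration, the telic clue above can be approximated over plain text with an ordinary regular expression, and the candidate verbs weighted with the Jaccard coefficient. This is only a simplified sketch of the idea (the original method matches POS-tagged snippets, not raw text, and the helper names are mine):

```python
import re

# Simplified, plain-text approximation of the telic clue
# "pi(t) are used to": matches "a/an <noun> is used to <verb>".
TELIC_CLUE = re.compile(r"\b[Aa]n? (\w+) is used to (\w+)")

def telic_candidates(snippets):
    """Collect candidate telic verbs for each noun found in the snippets."""
    candidates = {}
    for snippet in snippets:
        for noun, verb in TELIC_CLUE.findall(snippet):
            candidates.setdefault(noun, set()).add(verb)
    return candidates

def jaccard(a, b):
    """Jaccard coefficient, used to weight qualia candidates."""
    return len(a & b) / len(a | b)

snippets = ["A violin is used to play folk tunes.",
            "A violin is used to perform concertos."]
print(telic_candidates(snippets))  # {'violin': {'play', 'perform'}}
```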
Another interesting work was done by Claveau et al. [9]. In their paper they describe a method to automatically acquire qualia roles for noun-verb pairs from a semantically and POS-tagged French corpus using inductive logic programming, without any predefined patterns. This method is motivated by the linguistic assumption that knowing a priori all the possible patterns describing a certain qualia role is not an easy (or even possible) task. To achieve this goal they extracted noun-verb pairs from a French semantically and syntactically annotated corpus, included some background knowledge (position and distance of the terms), and manually classified the pairs as positive or negative examples of qualia relations. The inductive logic programming step did the rest. In this way they obtained not only the qualia for a given noun, but also rules explaining how noun-verb pairs having a qualia relation differ from those lacking it.
A completely different approach is proposed by Amaro et al. [7]. In their study they highlighted the similarities between the Wordnets and qualia information, noticing how, for example, the hyperonymy/hyponymy relation corresponds to the formal qualia role and the cause relations to the agentive qualia role. On this basis, they presented a table of links between the lexical relations of the Portuguese Wordnet and the qualia roles. Although this is not a method for acquiring qualia from a corpus, it is still a cheap way to obtain a GL-based lexicon by a simple conversion from the existing Wordnet relations.
Finally, Yamada and Baldwin [25] proposed two techniques for discovering a ranked list of agentive and telic roles for noun-verb pairs. The first, similarly to Cimiano and Wenderoth's, is based on fixed templates, while the second is a machine learning method using as feature vector the dependency relations of the noun-verb pairs and their contexts. In my thesis I will follow the idea of this second technique and try to improve it.
Chapter 3
Resources
3.1 The Simple OWL ontology
Simple-OWL [22] is a Generative Lexicon based ontology imported from a pre-existing computational linguistic lexicon called Parole-Simple-CLIPS (the Italian section derived from the European Simple project [13]) and converted into the W3C standard. Thanks to this tool, different word senses (encoded as semantic units in Simple-OWL) can be identified and discriminated from other word senses sharing the same lexical item. The Simple-OWL hierarchy, exactly as in Simple, consists of 153 language-independent types (OWL classes) built under the influence of Pustejovsky's GL model. In fact, the top classes of the ontology (Figure 3.1) are Entity, Agentive, Constitutive and Telic. The Formal qualia role is instead modelled through the owl:subClassOf relation.
OWL object properties and data type properties are used in Simple-OWL respec-
tively to link two semantic units and to link a semantic unit to a value within a closed
Figure 3.1: Simple OWL top hierarchy
range. Each class, delimited by the constraints and conditions created by object and datatype properties, works like a template guiding the encoding of the semantic units.
The OWL file containing the hierarchy and the attributes is freely available upon request. Unfortunately the lexicon is neither included nor free.
3.2 ItalWordnet
ItalWordnet [20] is a lexical-semantic database created in the wake of two other projects, EWN (EuroWordNet) and SI-TAL (Sistema Integrato per il Trattamento Automatico del Linguaggio). Similarly to the Princeton Wordnet, IWN is structured in synsets (synonym sets) containing word senses related through a partial synonymy relation (two lexemes are partial synonyms if they have the same connotation and can be interchanged in a context). It contains around 67,000 word senses, 50,000 lemmas and 130,000 semantic relations, such as metonymy, antonymy, hyponymy, etc. Each synset is assigned to one or more of the 65 classes of the Top Ontology, a language-independent hierarchy of concepts indicating fundamental semantic differences (Artifact, Natural, Dynamic, Static, etc.) which are considered common to all languages. So, for example, the word sense ‘tavolo’ (table) is an instance of a synset which is classified according to the Top Concepts (TCs) as [Artifact, Object, Furniture]. These TCs are distributed over three categories depending on their semantic aspects, without any kind of distinction in terms of part of speech. The first category (1stOrderEntity) (Figure 3.2) was conceived following Pustejovsky's qualia roles and contains only concrete nouns; the second (2ndOrderEntity) was partially influenced by the Aktionsart [23] and contains nouns, verbs, and adjectives referring to properties, events, states or processes; finally, in the third category (3rdOrderEntity) we find abstract nouns.
3.3 A semantically annotated Italian corpus
The semantically and syntactically annotated Italian Wikipedia, which can be downloaded from the WaCky project web page [1], is a corpus (at the moment of the download, in February 2011) of about 200 million tokens and more than 10 million sentences in CoNLL format, built by the University of Pisa. The extraction of the corpus and the annotation were performed automatically and include morpho-syntactic, syntactic and semantic information. Each sentence is separated from the others by a blank line. Each line of the corpus consists of 12 fields separated by a tab character, indicating the following information:
1. ID: a number starting from 1 at the beginning of each new sentence
2. Word form: a word form or a punctuation mark

Figure 3.2: First Order Entities of ItalWordnet

3. Lemma: a lemma or a punctuation mark
4. CPOSTAG: coarse-grained part-of-speech tag
5. POSTAG: fine-grained part-of-speech tag
6. Features: morpho-syntactic features such as gender, number and person
7. Head: 0 (if the token is the root) or the ID of the parent node
8. DEPREL: dependency relation
9. PHEAD: empty
10. PDEPREL: empty
11. empty
12. Semantic label
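A line in this format can be parsed by splitting on tabs and naming the 12 fields. The sketch below is a minimal reader under these assumptions; the example line is invented for illustration (its field values are not taken from the corpus):

```python
# Names for the 12 tab-separated CoNLL fields described above.
FIELDS = ["id", "form", "lemma", "cpostag", "postag", "feats",
          "head", "deprel", "phead", "pdeprel", "unused", "sem"]

def parse_token(line):
    """Turn one corpus line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

def read_sentences(lines):
    """Group token lines into sentences, separated by blank lines."""
    sentence = []
    for line in lines:
        if line.strip():
            sentence.append(parse_token(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence

# Hypothetical line: a noun "violino" depending on token 2.
sample = "3\tviolino\tviolino\tS\tS\tnum=s|gen=m\t2\tobj\t_\t_\t_\tnoun.artifact"
token = parse_token(sample)
print(token["lemma"], token["deprel"], token["sem"])
```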
The tools used to obtain this format are described on the website of the University of Pisa [2]. In particular, the list of the semantic tags used to annotate adverbs, verbs, nouns and adjectives with their SuperSense tagger is available in [3].
Chapter 4
Mapping Simple OWL and ItalWordnet
4.1 Motivation
The goal of this chapter is to create a lexicon of word senses divided into classes corresponding to the concrete Simple-OWL classes (see Figure 4.1), to be used for the training corpus of my machine learning method. Lacking the Simple lexicon, I will show how I obtained a similar lexicon by using and modifying an already existing mapping between the two concept ontologies of Simple-OWL and ItalWordnet. Results and remarks will be discussed in the final section of this chapter.
4.2 Mapping
As I mentioned in the previous chapter, the Simple lexicon is unfortunately not freely available, while ItalWordnet is, for thesis purposes. Although these projects have
Figure 4.1: Concrete Simple OWL classes
different approaches and goals, they can still be related and gain knowledge from one another. In fact, a manual mapping from SimpleOWL to ItalWordnet (IWN) has already been performed by Nilda Ruimy [19] [21] [18]. Following the Simple-OWL structure, a semantic unit can be assigned to only one semantic type; on the other hand, IWN allows a cross-classification along multiple Top Concepts. In spite of this evident difference, and of the fact that we are comparing semantic units (in Simple) with word senses in synsets (in IWN), it is possible to consider their semantic classifications as equivalent, given that both are based on the GL theory. In her paper Ruimy describes the similarities of the two lexica and maps the top ontologies of Simple and IWN by analysing their shared words (semantic units in Simple, variants of a synset in IWN). Despite the unpredictable discrepancies she finds, usually caused by an incomplete encoding in IWN or by differences in the granularity of the senses of the two models, a substantial overlap is evidenced. The list of mappings she obtained indicates for each Simple type the possible corresponding IWN type (or intersection of types). A sample is shown in Table 4.1.
Ruimy’s work is extremely useful for my goal. Having both lexicons, she performed a linking method based on the shared words. As the mapping is already done, it’s now easier to transfer the IWN lexemes to the appropriate Simple categories. Nevertheless, a few problems arise here:
• Ruimy’s mapping [4] is directed from Simple to IWN, while I need the reverse mapping
• The mapping is not 1 to 1
• Reversing it is not straightforward; in fact a group of IWN types can correspond to more than one Simple type (e.g. in Table 4.1 “Place u Part” (IWN) is linked to both “Opening” and “GeopoliticalLocation”)

SimpleOWL types        | ItalWordnet Top Concepts (intersection)
Location               | Solid u Part u Place
Location               | Relation
3DLocation             | Place u Solid
3DLocation             | Place u Natural u Liquid u Part
GeopoliticalLocation   | Part u Place
Area                   | Part u Place u Solid
Opening                | Location u Relation
Opening                | Place u Part

Table 4.1: Mapping SimpleOWL concepts to IWN performed by Ruimy
A way to overcome these obstacles is to use the power of Formal Concept Analysis (FCA), treating Simple-OWL classes as objects and IWN classes as attributes. In all those cases in which a Simple-OWL type is linked to more than one group of IWN types, I split the original Simple class into as many classes as the number of mappings, to avoid any information loss. The example in Table 4.2 represents the last 3 mappings of Table 4.1. The classes on the rows belong to the Simple-OWL hierarchy, while those on the columns belong to IWN. The class “Opening” is thus split into two parts, corresponding to the 2 possible mappings Ruimy found.
Following these guidelines I built an FCA context with 202 objects and 56 attributes. Using Galicia, a software tool created by the University of Montreal [5], I automatically generated the corresponding concept lattice from this information. A section of the lattice is shown in Figure 4.2.

SimpleOWL \ IWN      | Place | Part | Solid | Location | Relation
GeopoliticalLocation |   X   |  X   |       |          |
Area                 |   X   |  X   |   X   |          |
Opening              |   X   |  X   |       |          |
Opening2             |       |      |       |    X     |    X

Table 4.2: Intent and Extent example
Figure 4.2: A section of the lattice created
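The concept derivation that Galicia performs can be sketched in a few lines: treating the rows of Table 4.2 as a formal context, every formal concept is a pair (extent, intent) closed in both directions. The brute-force enumeration below is only illustrative (real lattice software scales far better than this):

```python
from itertools import combinations

# Formal context from Table 4.2: Simple-OWL classes (objects)
# and the IWN Top Concepts (attributes) they map to.
context = {
    "GeopoliticalLocation": {"Place", "Part"},
    "Area": {"Place", "Part", "Solid"},
    "Opening": {"Place", "Part"},
    "Opening2": {"Location", "Relation"},
}
all_attrs = set().union(*context.values())

def intent(objs):
    """Attributes shared by all objects (all attributes for the empty set)."""
    sets = [context[o] for o in objs]
    return set.intersection(*sets) if sets else set(all_attrs)

def extent(attrs):
    """Objects possessing every attribute in attrs."""
    return {o for o in context if attrs <= context[o]}

# A formal concept is a pair (E, I) with I = intent(E) and E = extent(I).
concepts = set()
for r in range(len(context) + 1):
    for subset in combinations(context, r):
        i = intent(subset)
        concepts.add((frozenset(extent(i)), frozenset(i)))

for e, i in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(e), "<->", sorted(i))
```

On this toy context the pair ({GeopoliticalLocation, Area, Opening}, {Place, Part}) comes out as one concept, mirroring the observation that “Place u Part” corresponds to more than one Simple type.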
The concepts of the lattice obtained can be divided into three main groups of analysis:
1. Concepts having as intent a certain number of attributes and only one object as extent. For example, concept 81 (Figure 4.2) has as intent “Container u Object” and as extent “Container”. This indicates that the words classified according to the IWN ontology as “Container u Object” can be mapped to the Simple-OWL “Container” class. There are 58 such concepts in the lattice; among these we also find concepts sharing an intent which I had artificially split (Table 4.3).
Intent (Simple-OWL types) | Extent (intersection of IWN types)
NaturalSubstance          | Solid u Liquid
NaturalSubstance2         | Liquid u Substance
NaturalSubstance3         | Gas u Substance

Table 4.3: NaturalSubstance split in 3 classes
This is due to a finer-grained ontological classification in IWN. (see annexes 1:
mapping 1 to 1).
2. Concepts having as intent a certain number of attributes and, as extent, more than one object belonging to a common Simple-OWL superclass. Often the IWN top ontology is not specific enough to be linked to a Simple-OWL leaf class. However, a mapping is still possible if, climbing up the Simple-OWL ontological hierarchy, a shared parent node is found (see Table 4.4).
Intent (Simple-OWL types)           | Extent (inters. of IWN types) | Common Simple-OWL parent node
EarthAnimal, AirAnimal, WaterAnimal | Animal u Object               | Animal

Table 4.4: Example of a common Simple-OWL parent node
In Table 4.4 we see how the three Simple-OWL types EarthAnimal, AirAnimal and WaterAnimal have the same IWN extent and at the same time share a common parent node, Animal. Certainly this kind of mapping makes us lose some information, but it can still be considered a good approximation. Reversing the situation discussed in the previous paragraph, in these cases the Simple-OWL ontology has the finer classification. This is the case for 21 concepts (see annex 2: mapping with shared parent node).
3. All the other concepts, not belonging to the two previous groups. These concepts are not useful for my purpose: apart from the top and bottom concepts, which are characterized respectively by the lack of intension and of extension, they do not specify any extent referring either to a single Simple-OWL class or to a parent node of the hierarchy of the ontology.
4.3 Remarks and results
As already noticed by Ruimy, the mapping from one resource to the other appears much more challenging when dealing with IWN 2nd Order Entities. This is probably due to an extremely fine-grained ontology under the classes Events and Properties in Simple-OWL, and to an intrinsic complexity in the semantic encoding of the word senses falling under these types.
In this study I will focus my analysis on the Simple-OWL hierarchy types belonging to the supertype ConcreteEntity (Figure 4.1), for at least two reasons: first, because there is a more detailed mapping from IWN to Simple-OWL under this class; second, and more importantly, because Agentive and Telic qualia roles are less problematic to annotate for concrete entities than for events.
Given these considerations, I leave out all the mappings other than those under the ConcreteEntity superclass. With the help of a Python program I selected all the synsets marked as ‘N’ (nouns) and encoded in IWN with the sets of types I needed to map, then I extracted all the variants (word senses) contained in those synsets and automatically assigned them the proper Simple-OWL class. First I ran the program for the first group of concepts (those with 1-to-1 mappings), obtaining 5635 classified word senses. I then applied the same method to the second group of concepts (those sharing a common parent node), obtaining 14501 classified word senses. As I expected, a small subset of the word senses obtained from the first group was also present in the second group. The reason is that in a few cases one of the concepts belonging to the first group was subsumed by a concept of the second group, as shown in Figure 4.3.
Figure 4.3: Example of a subsumed concept
The overlap was solved simply by eliminating from the second group the word senses that appeared in both groups. This reduced the total amount of word senses mapped from 20136 to 20002.
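The transfer step described above can be sketched as follows. The mapping tables and synset records here are hypothetical stand-ins (the real ones come from the FCA lattice and the IWN database), so this only illustrates the shape of the procedure:

```python
# Hypothetical fragments of the two mapping groups; the real tables
# are derived from the lattice (1-to-1 mappings and shared parent nodes).
ONE_TO_ONE = {frozenset({"Container", "Object"}): "Container"}
PARENT_NODE = {frozenset({"Animal", "Object"}): "Animal"}

def classify(synsets):
    """Assign a Simple-OWL class to every variant of each noun synset,
    preferring 1-to-1 mappings and removing duplicate word senses."""
    seen, result = set(), []
    for pos, top_concepts, variants in synsets:
        if pos != "N":                      # keep noun synsets only
            continue
        key = frozenset(top_concepts)
        simple_class = ONE_TO_ONE.get(key) or PARENT_NODE.get(key)
        if simple_class is None:            # no mapping for this type set
            continue
        for word in variants:
            if (word, simple_class) not in seen:
                seen.add((word, simple_class))
                result.append((word, simple_class))
    return result

synsets = [
    ("N", {"Container", "Object"}, ["bottiglia", "fiasco"]),
    ("V", {"Dynamic"}, ["aprire"]),
    ("N", {"Animal", "Object"}, ["cane"]),
    ("N", {"Container", "Object"}, ["bottiglia"]),  # duplicate sense
]
print(classify(synsets))
```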
Chapter 5
Creating the Gold Standard
In this chapter I will introduce the procedure I followed to create three semantically different Gold Standards, randomly selecting a total of 60 nouns from the 3 main SimpleOWL classes belonging to the concrete SimpleOWL superclass.
5.1 Preparing the data
The data was divided into three groups, corresponding to the main Simple-OWL classes under the superclass ConcreteEntity: Location, LivingEntity and Artifacts.
I collected the semantic units obtained after the mapping (see chapter 4) separately for each group. The result of this process was 7257 living entities, 2004 locations and 3740 artifacts. Notice that even though I could sometimes have chosen more specific classes (not always possible, because of the different granularity of IWN and SimpleOWL; see chapter 4), I preferred not to, mainly because I wanted to obtain 3 semantically different groups of nouns not too difficult to annotate manually and general enough to be found in a non-domain-specific corpus. After this choice I randomly took from each group 20 nouns which were linked to at least 20 different verbs (deliberately skipping copulas and auxiliaries) by a dependency relation in the wikiCoNLL corpus. This choice was made to avoid the inclusion of nouns which are extremely rare in the corpus. For each noun, 20 verbs were extracted from the wikiCoNLL corpus. Among these verbs, at least 2 were chosen as representative of the Agentive and Telic roles, taken from the literature (if available) or chosen directly by me. All the other verbs were selected randomly. For example, for the noun violino (violin), belonging to the semantic class “Artifact”, the following verbs were selected:
Suonare (play)
Accordare (tune)
Costruire (build)
Divenire (become)
Impugnare (hold, draw)
Includere (include)
Abbandonare (give up)
Sfoderare (unsheathe)
Introdurre (insert, put)
Eseguire (execute, play)
Entrare (enter)
Produrre (produce)
Comprendere (understand)
Insegnare (teach)
Aggiungere (add)
Strimpellare (strum)
Scomparire (disappear)
Mantenere (maintain)
Ispirare (inspire)
Dedicare (dedicate)
The final data was thus formed by 20 nouns × 20 verbs × 3 semantic groups (Artifact, LivingEntity, Location), for a total of 1200 noun-verb pairs. Note that the same data is used to annotate both Telic and Agentive roles.
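The selection procedure can be sketched as below, where `noun_verbs` stands for the dependency-linked verb sets extracted from the corpus. The data here is synthetic, and the hand-picking of at least two Agentive/Telic representatives per noun, as done for the real Gold Standard, is omitted:

```python
import random

def select_pairs(noun_verbs, n_nouns=20, n_verbs=20):
    """Pick n_nouns nouns linked to at least n_verbs distinct verbs,
    then sample n_verbs verbs for each of them."""
    eligible = [n for n, vs in sorted(noun_verbs.items())
                if len(vs) >= n_verbs]
    chosen = random.sample(eligible, n_nouns)
    return {n: random.sample(sorted(noun_verbs[n]), n_verbs)
            for n in chosen}

# Synthetic stand-in for the noun -> verb sets of one semantic group.
random.seed(0)
noun_verbs = {f"noun{i}": {f"verb{j}" for j in range(30)}
              for i in range(25)}
selection = select_pairs(noun_verbs)
print(len(selection), {len(v) for v in selection.values()})  # -> 20 {20}
```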
5.2 Instructions to the annotators
After having selected the noun-verb pairs to annotate, two volunteers were asked to perform the annotation separately for each semantic group. In other words, they were asked to annotate 3 data sets composed of 400 noun-verb pairs each. Before starting, I gave them a short review of the GL theory. There were two tasks:
1. Judging with a mark the degree of correlation for each noun-verb couple.
2. Deciding for each noun-verb pair if the verb should be classified as an Agentive or a Telic role for that noun.
The marks were based on a numeric scale from 0 to 6. Each degree of relation was described as follows:
0 No relation at all. Not even in any imaginable metaphorical sense.
1 Extremely poor relation.
2 Poor relation.
3 Not clear.
4 Possible case.
5 Very common case of qualia role (either Agentive or Telic).
6 Typical case of qualia role (either Agentive or Telic).
After having judged the degree of correlation, the annotators were asked to decide what kind of relation was linking each noun-verb pair, selecting among three options: Agentive, Telic, or nothing for those cases in which making a clear decision was difficult or in which none of the two roles was suitable for that noun-verb pair. If a noun had more than one meaning, a definition for disambiguation was given to the annotators. All the instructions were given in Italian.
5.3 Profile of the annotators
Two adult Italian native speakers, graduate students of Communication Sciences and
5.4 Analysis of the Gold Standard
The result of the annotation process is a collection of 1200 noun-verb pairs, annotated with a degree and a qualia role and divided into three semantic classes. For example, the first annotator judged 5 of the 20 noun-verb pairs for the noun violino, belonging to the Gold Standard for Artifacts, as follows:
Violino Accordare 4 T
Violino Costruire 6 A
Violino Divenire 0 _
Violino Impugnare 4 T
Violino Includere 0 _
where the first two words are the noun-verb pair the annotator was given, the third entry is the degree of relation (from 0 to 6) and the last entry is the qualia role he/she assigned to the pair, which can be T (for Telic), A (for Agentive) or an underscore (for none of the two roles or for difficult decisions).
To combine the two annotators' versions of each of the three sets, I simply calculated the average of the grades the 2 annotators gave to each noun-verb pair and evaluated their agreement on the qualia roles. For example, the final Gold Standard for 5 of the 20 noun-verb pairs for the noun violino (the same pairs as in the previous example) is:
Violino Accordare 5.0 TT
Violino Costruire 6.0 AA
Violino Divenire 0.5 _A
Violino Provare 5.0 TT
Violino Disdegnare 1.5 _T
This time the correlation is expressed by an average, and the qualia role can be expressed by an agreement (both annotators judged the verb as telic, as agentive, or as neither), as in the first, second and fourth noun-verb pairs, or by a disagreement.
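The combination step is straightforward; the helpers below sketch how the averages and agreement strings above can be produced (the function names are mine, not from the thesis):

```python
def combine(judgement1, judgement2):
    """Merge two annotations (grade, role) into a Gold Standard entry:
    average grade plus the concatenated role letters."""
    (g1, r1), (g2, r2) = judgement1, judgement2
    return (g1 + g2) / 2, r1 + r2

def agree(roles):
    """True when both annotators chose the same role (AA, TT or __)."""
    return roles[0] == roles[1]

entry = combine((4, "T"), (6, "T"))
print(entry, agree(entry[1]))  # -> (5.0, 'TT') True
```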
A few statistics can help in understanding the data obtained after this process. In Table 5.1 I discretized the correlation scores (indicating the goodness of a noun-verb pair) into three categories: low scores, from 0 to 1.5; medium scores, from 2 to 4; and high scores, from 4.5 to 6. Notice that at the moment I am not considering the annotators' agreement on the qualia role.
              | 0 - 1.5 | 2 - 4 | 4.5 - 6 | Tot
Artifact      |   147   |  123  |   130   | 400
Location      |   118   |  156  |   126   | 400
Living-entity |   251   |  129  |    20   | 400

Table 5.1: Scores of the 3 Gold Standards
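Since the averaged grades are always multiples of 0.5, the discretization used in Table 5.1 can be expressed with two thresholds. A minimal sketch (the function name and sample scores are mine):

```python
from collections import Counter

def score_band(avg):
    """Discretize an averaged grade (a multiple of 0.5 between 0 and 6)
    into the three bands of Table 5.1."""
    if avg <= 1.5:
        return "low"       # 0 - 1.5
    if avg <= 4.0:
        return "medium"    # 2 - 4
    return "high"          # 4.5 - 6

# Illustrative averaged grades, two per band.
scores = [0.5, 1.5, 2.0, 3.5, 4.5, 6.0]
print(Counter(score_band(s) for s in scores))
```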
Now (see Table 5.2), analysing only the high scores (those between 4.5 and 6), we can see how many of them have been annotated with a perfect Agentive (AA) or Telic (TT) agreement. The last column instead shows the high-scoring noun-verb pairs with a disagreement on the qualia role.
Tot. high (4.5 - 6) AA TT disagreement
Artifact 130 28 64 38
Location 126 36 63 27
Living-entity 20 6 7 7
Table 5.2: High scores analysis
Table 5.3 shows the mean and the variance (without making any selection according to the qualia role) for each of the 3 Gold Standards (one for each semantic class).
Mean Variance
Artifact 2.797 4.397
Location 3.012 3.297
Living-entity 1.594 2.233
Table 5.3: Mean and Variance of the 3 Gold Standards
This result shows that the annotators judged the 3 Gold Standards in different ways. In particular, the variance for Artifact is higher than for the other classes, which means that more clear-cut decisions were made for this semantic class when choosing a degree of correlation for the noun-verb pairs. In other words, since the variance measures how far a set of numbers spreads from its mean, a high variance here suggests that the annotators often gave either a high or a low mark to the noun-verb relation. More importantly, the Living-entity gold standard shows that, as expected, this class was very difficult to annotate and yielded very few Agentive and Telic roles. This was not surprising, because the literature I used to obtain the Agentive and Telic roles for the nouns, while building the data to be annotated, was extremely poor when it came to Living entities. For these reasons, from now on I will use only the Gold Standards for Artifact and Location, neglecting the Living-entity data.
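The statistics of tables 5.1 and 5.3 can be computed with a short script like the following (my own sketch; the thesis does not state whether the population or the sample variance was used, so I assume the population variance, i.e. division by n):

```python
def statistics(grades):
    """Mean and population variance of a list of grades (0-6 scale)."""
    mean = sum(grades) / len(grades)
    variance = sum((g - mean) ** 2 for g in grades) / len(grades)
    return mean, variance

def discretize(grade):
    """Bin an averaged grade into the categories of table 5.1."""
    if grade <= 1.5:
        return "low"     # 0 - 1.5
    if grade <= 4.0:
        return "medium"  # 2 - 4
    return "high"        # 4.5 - 6
```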
Chapter 6
The machine learning method
This chapter is dedicated to the main part of the thesis. In the first section I'll describe the features extracted from the CoNLL parsed Italian Wikipedia corpus (see Chapter 3.3), then I'll describe the procedure used to obtain and balance the training corpus. The last part instead focuses on the machine learning method and on the selection of the features to use.
6.1 Features
For each sentence of the Italian CoNLL corpus containing a noun-verb pair present in both Gold Standards (remember I'm not using the Living-entity Gold Standard), I extracted, and then converted into csv (comma separated values), a set of features including the dependency relation linking the noun-verb pair, the part of speech of the two terms, the semantic tags they were given and finally all the other dependency relations in which they were involved. So, for example, the sentence "In Italia si costruì il primo violino." (The first violin was built in Italy), which includes the noun-verb pair "violino-costruire" (violin-build), part of the Gold Standard for Artifacts, was represented in the CoNLL corpus as:
1 In in EA 4 comp_loc O
2 Italia Italia SP 1 mod B-noun.location
3 si si PC 4 clit O
4 costruì costruire V 0 ROOT B-verb.cognition
5 il il RD 7 det O
6 primo primo NO 7 mod O
7 violino violino S 4 subj B-noun.artifact
8 . . FS 1 con O
For the noun-verb pair violino-costruire in this sentence my python program extracted the following features:
POS of the Noun: S
POS of the Verb: V
Dependency Relation of the Noun: subj
Dependency Relation of the Verb: ROOT
Semantic label of the Noun: B-noun.artifact
Semantic label of the Verb: B-verb.cognition
Dependency environment of the Noun: det, mod
Dependency environment of the Verb: comp_loc, clit
(Notice that in this example I reduced the CoNLL fields from 12 to 7 (ID, Word form, Lemma, POSTAG, Head, DepRel, Semantic Label), which are those useful for the features. See Chapter 3.3 for more details.)
The last among these features (the dependency environment) needs a little explanation. It is actually represented as a list of features expressing the presence or absence of each of the 26 dependency relations of the CoNLL tagset. So, in the previous example, for the noun violino only the dependency relations det and mod are present, while the remaining 24 are tagged as absent. Those (rare) cases in which a single dependency relation occurs more than once are simply marked as present. This choice was made to avoid the creation of long and almost unique features, which would have decreased the machine learning performance. So, at the end of this process I have 2 sets of instances (one for each semantic class) in csv, each instance being represented like:
796_violino_costruire,S,subj,B-noun.artifact,V,Root,B-verb.cognition,no,yes,no,no..
where the first element is the ID, formed by the number of the extracted sentence and the noun-verb pair. Then the actual features follow: the second, third and fourth are respectively the POS, the dependency relation and the semantic label of the Noun. Similarly, the 5th, 6th and 7th are the same features for the Verb. Finally, 26 features for the Noun and another 26 for the Verb (only 4 of the total 52 are shown in the example) indicate with a yes or a no the dependency environment (as explained previously).
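The extraction can be sketched roughly as follows (a sketch under my own assumptions: the function name and the data structures are mine, and only a few dependency relations are listed where the real program iterates over the full 26-relation tagset):

```python
# A few of the 26 dependency relations of the CoNLL tagset; the real
# feature vector covers all 26 (full list in the corpus documentation).
DEPRELS = ["subj", "obj", "det", "mod", "clit", "comp_loc"]

def extract_features(sentence, noun, verb):
    """Build the feature vector for a noun-verb pair found in a sentence.

    `sentence` is a list of 7-field rows as in the example above:
    (id, word form, lemma, POS, head, deprel, semantic label).
    """
    by_lemma = {row[2]: row for row in sentence}
    n_id, _, _, n_pos, _, n_dep, n_sem = by_lemma[noun]
    v_id, _, _, v_pos, _, v_dep, v_sem = by_lemma[verb]
    # Dependency environment: which relations occur among the *other*
    # dependents of the noun and of the verb (presence/absence only).
    n_env = {row[5] for row in sentence if row[4] == n_id and row[2] != verb}
    v_env = {row[5] for row in sentence if row[4] == v_id and row[2] != noun}
    features = [n_pos, n_dep, n_sem, v_pos, v_dep, v_sem]
    features += ["yes" if d in n_env else "no" for d in DEPRELS]
    features += ["yes" if d in v_env else "no" for d in DEPRELS]
    return features

sentence = [
    (1, "In", "in", "EA", 4, "comp_loc", "O"),
    (2, "Italia", "Italia", "SP", 1, "mod", "B-noun.location"),
    (3, "si", "si", "PC", 4, "clit", "O"),
    (4, "costruì", "costruire", "V", 0, "ROOT", "B-verb.cognition"),
    (5, "il", "il", "RD", 7, "det", "O"),
    (6, "primo", "primo", "NO", 7, "mod", "O"),
    (7, "violino", "violino", "S", 4, "subj", "B-noun.artifact"),
    (8, ".", ".", "FS", 1, "con", "O"),
]
feats = extract_features(sentence, "violino", "costruire")
```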
6.2 Creating the training data
Now that I have extracted all the necessary features, I need to refine the data to obtain positive and negative instances for agentive and telic roles, following the Gold Standard indications. So, for each instance extracted from the CoNLL corpus in csv format, I attached the corresponding (averaged) grade of correlation and the agreement on the role, as expressed in the Gold Standard for that specific noun-verb pair. To give an example, the instance described in the last paragraph of the previous section would become:
796_violino_costruire,S,subj,B-noun.artifact,V,Root,B-verb.cognition,no,.. 6.0,TT
where the last 2 features are taken from the Gold Standard for Artifacts, for the noun-verb pair "violino-costruire". Finally, I can process these data to obtain the final training data I'll use for the machine learning experiments. I follow three steps: Balancing, Positive or Negative, Agentive or Telic.
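The attachment of the gold-standard labels can be sketched as a simple lookup (a hypothetical sketch; the function name and data structures are mine):

```python
def label_instances(instances, gold):
    """Append the averaged grade and the role agreement to each instance.

    `instances` maps an ID like '796_violino_costruire' to its feature
    list; `gold` maps (noun, verb) pairs to a (grade, roles) tuple.
    """
    labelled = {}
    for inst_id, features in instances.items():
        _, noun, verb = inst_id.split("_")
        grade, roles = gold[(noun, verb)]
        labelled[inst_id] = features + [grade, roles]
    return labelled
```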
Balancing: in table 6.1 I compare the mean (of the noun-verb correlation grades) of both Gold Standards with the mean of the two extracted sets of instances. It is easy to notice that the means of the extracted instances are higher than those of the respective Gold Standards.
Gold Standard grade mean Extracted instances mean
Artifact 2.797 5.021
Location 3.012 3.9
Table 6.1: Grade’s mean of Gold Standard and extracted instances (scale 0-6)
This means that in the corpus the frequency of the noun-verb pairs which received a high grade (on the scale from 0 to 6) is higher than the frequency of the pairs which scored less. Going a little deeper, I discovered that a few noun-verb pairs with a high degree of correlation were dominating the extracted instances; for example, the noun-verb pair libro-scrivere (book-write), which scored 6.0, appeared 2761 times in the instances extracted for the Gold Standard Artifacts, out of a total of 14944 extracted instances for that semantic class. To avoid a training set that does not conform at all to the Gold Standard, I eliminated all further instances containing the same noun-verb pair after I had seen it 100 times. Although this operation may seem a little artificial, it is useful to reduce the weight of the noun-verb pairs which were dominating the set of extracted instances. At the end of this process the number of instances in both sets was reduced considerably, as shown in table 6.2.
Num instances before Num instances after
Artifact 14944 6102
Location 9048 6449
Table 6.2: Number of instances before and after Balancing
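The balancing step (capping each noun-verb pair at 100 occurrences) can be sketched as follows (my own sketch; the data structure is hypothetical):

```python
from collections import Counter

def balance(instances, cap=100):
    """Keep at most `cap` instances per noun-verb pair, scanning the
    instances in corpus order and dropping everything past the cap."""
    seen = Counter()
    kept = []
    for noun, verb, features in instances:
        if seen[(noun, verb)] < cap:
            seen[(noun, verb)] += 1
            kept.append((noun, verb, features))
    return kept
```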
Positive or Negative: The goal of my machine learning approach is to automatically understand whether a noun-verb pair represents a good example of a Telic or Agentive role or not. At this step I extracted from the balanced sets of instances only those which clearly represented a good or a bad example of a qualia role, leaving out the instances extracted from noun-verb pairs which proved problematic and uncertain even for the annotators (notice that I'm still not considering the difference between the Agentive and the Telic role; this will be done in the next paragraph). To achieve this, I selected as negative examples the instances with a grade between 0 and 1.5, and as positive those between 4.5 and 6, eliminating all the instances which scored between 2 and 4.
This operation reduced the number of instances in each set (see table 6.3) less dramatically than the previous one.
Num instances before Num instances after
Artifact 6102 4540
Location 6449 4753
Table 6.3: Number of instances before and after Pos or Neg
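The selection of clear positive and negative cases can be sketched as a pair of grade filters (hypothetical names; the thresholds are those given above):

```python
def select_clear_cases(instances):
    """Split instances into positives (grade 4.5-6) and negatives
    (grade 0-1.5), discarding the uncertain middle band (2-4)."""
    positives = [i for i in instances if i["grade"] >= 4.5]
    negatives = [i for i in instances if i["grade"] <= 1.5]
    return positives, negatives
```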
Agentive and Telic: This last step finally takes into account the difference between the Telic and the Agentive role. Each of the two sets of instances (Artifact and Location), reduced by the operations called "Balancing" and "Positive or Negative", is here split into two parts: one containing positive and negative instances for the Agentive role and the other containing positive and negative instances for the Telic role. Only the instances with perfect agreement on the role (both judges marked the noun-verb pair as Agentive or as Telic), and obviously with a grade of at least 4.5 (as said before), were selected as positive instances. The negative instances for each role are those which scored 1.5 or less (notice that these are common to both the Agentive and the Telic set) plus the positive instances of the opposite role. So, for example, the Artifact-Agentive set has as negative instances the low scoring ones and the positive instances of the Artifact-Telic set. The result of this last step is 4 sets of positive and negative instances: 2 sets (one per role) for each of the 2 semantic classes (table 6.4).
Positive instances Negative instances
Artifact-Agentive 723 2882
Artifact-Telic 1897 1708
Location-Agentive 1189 2816
Location-Telic 1201 2804
Table 6.4: Pos and Neg instances for each training set
These 4 sets are the final training data I will use in the machine learning experiments.
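The construction of the four role-specific training sets can be sketched as follows (a sketch with hypothetical field names; `positives` and `negatives` stand for the output of the previous step for one semantic class):

```python
def build_role_sets(positives, negatives):
    """Build the Agentive and Telic training sets for one semantic class.

    Positives for a role are the high-scoring instances with perfect
    agreement ('AA' or 'TT'); negatives for a role are the low-scoring
    instances (shared by both roles) plus the positives of the opposite
    role.
    """
    agentive_pos = [i for i in positives if i["roles"] == "AA"]
    telic_pos = [i for i in positives if i["roles"] == "TT"]
    agentive_set = (agentive_pos, negatives + telic_pos)
    telic_set = (telic_pos, negatives + agentive_pos)
    return agentive_set, telic_set
```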
6.3 Experiments
On the training data obtained I run 2 different classifiers using the Weka software [6], to discover which algorithm gives the best accuracy: a NaiveBayes classifier and a decision tree algorithm. The choice of a NaiveBayes and a decision tree (I'll use an
2