Academic year: 2021

Share "Noun phrase Taxonomies: Lexico-syntactic and Causal methods"

Copied!
53
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst


Noun Phrase Taxonomies:

Lexico-syntactic and Causal

methods


Layout: typeset by the author using LaTeX.


Noun Phrase Taxonomies:

Lexico-syntactic and Causal

methods

Joris A. Galema

11335165

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor

Dhr. prof. dr. ing. R.A.M. van Rooij
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907, 1098 XG Amsterdam


Abstract

With the exponentially growing amount of information that is easily accessible through the world wide web, the need for hierarchical knowledge representations that can encompass the information within certain fields is now greater than ever. Multiple methods exist to automatically construct such knowledge representations, and several state-of-the-art methods make use of lexico-syntactic patterns in text to extract hypernymy relations and build a taxonomy of concepts. Humans tend to represent, and think about, knowledge in terms of causes and effects. However, to this day no method exists that utilizes causal learning methods to construct hierarchical knowledge representations. It is in the interest of the current study to determine whether hierarchical knowledge representations can be constructed through causal learning methods.


Contents

1 Introduction

2 Background
2.1 Natural Language Processing
2.1.1 Part Of Speech (POS) tagging
2.1.2 Noun Phrases
2.2 Lexico-syntactic Hypernymy extraction
2.3 Distributional semantics
2.3.1 Probabilities from text
2.4 Causal learning
2.4.1 Causal models
2.4.2 Algorithms
2.5 Independence
2.5.1 Probabilistic independence
2.5.2 Correlation

3 Method
3.1 Corpus
3.2 Extracting hypernymy relations through Hearst patterns
3.3 Representing distributional semantics
3.4 Finding a causal model
3.4.1 Co-occurrences
3.4.2 Noun phrase correlation

4 Results
4.1 Extracted relations
4.2 Evaluation


Chapter 1

Introduction

Hierarchical knowledge representations such as ontologies and taxonomies can be described as structures that provide organization to the concepts and language used to organize knowledge [21]. They can be utilized as tools for decision-making system development [28], extracting data from text [8] and automatic generation of natural language [17], to name a few applications. Examples include WordNet [11], the Ontology for Biomedical Investigations (OBI) [1] and The Semantic Web [2]. The article by Brewster and O’Hara [5] covers the current state of the usage of an ontology as a knowledge representation and concludes that ontologies are extremely important for structuring and contextualizing knowledge, but that the creation of these ontologies demands an excessive amount of work. Currently, the main method of constructing an ontology is manual construction, which is a costly effort. Therefore automation of this process could be very beneficial for the future. Construction of such a knowledge representation seems more relevant now than ever: with the exponentially growing amount of information available on the internet, structured hierarchical knowledge representations can be very useful for finding information efficiently.

For instance, research by Colace and De Santo [7] addresses the problem of overwhelming amounts of information on the internet available to students and teachers. In this research they propose that the creation of an ontology can solve this problem and improve future academics.

Creation of an ontology consists of multiple stages. These stages can be represented as a layer cake of increasingly complex sub-tasks for the development of an ontology [6]. An illustration of the layer cake can be seen in figure 1.1.

Figure 1.1: Ontology Learning Layer Cake

At the base of the layer cake there are all the terms to be placed in the hierarchy. The second layer is concerned with determining synonymy relations such as "crime" and "infraction". The next layer seeks to take all terms and assign them to concepts. A concept consists of an intensional definition of the concept, a set of instances of the concept and a set of linguistic realizations of the concept. For instance, the concept of a person can be described as an individual human being. Instances of human beings are, for example, your parents, your children and your neighbours. Linguistic realizations would then be the names of these persons. The next sub-task, or layer, is the task of determining a concept hierarchy, or taxonomy. Concept hierarchies consist of "is a" relationships between concepts and can be constructed by uncovering hypernymy relations between concepts. Hypernymy relations relate broad terms (hypernyms) to more specific terms (hyponyms). For instance, a lawyer is a hyponym of the hypernym person.

This research is interested in the construction of these concept hierarchies by means of hyponym/hypernym detection.

There are multiple methods for the automatic detection of hypernymy relations. A well-known method, proposed by Hearst in 1992 [16], relies on lexico-syntactic patterns in a sentence that indicate a relation between a broader term and a more specific term. Sentences like "Buildings such as the Empire State Building" or "Primates such as gorillas" indicate a hypernymy relation through the pattern "NP1 such as NP2", where NP1 and NP2 are noun phrases and NP2 is a hyponym of NP1. Roller et al. have shown that automatic detection of hypernymy relationships through the use of these Hearst patterns consistently outperforms other methods [29]. However, the method that performs best in their study utilizes distributional methods in combination with Hearst patterns, thus not being a purely syntactic method.

Hearst patterns may be very useful; however, they do not always appear in text. Therefore the precision of extracted hypernymy relations may be high, but the recall of these relations is generally low. Another problem that arises when using Hearst patterns is that syntactic patterns alone are not powerful enough to deal with ambiguity in text [32]. Consider the sentence "animals other than dogs such as cats". In this sentence the pattern NP1 such as NP2 will extract "dogs such as cats", resulting in the hypernymy relation "a cat is a dog". This can be dealt with by noun phrase tagging the sentence like "(NP animals other than dogs) such as (NP cats)", but a grammar like this could interfere with the extraction of other hypernymy relations that appear in similar sentences. Consider the sentence "animals other than pets such as dogs", where the precedence should look like "(animals) other than (pets such as dogs)". Using the previously mentioned grammar will extract "a dog is an animal other than a pet". In this case, preliminary knowledge about what a cat, a dog and a pet is, is necessary. Requiring preliminary knowledge about the concepts for which you are constructing a knowledge representation seems counterintuitive.

If syntactic patterns alone prove not powerful enough to deal with ambiguity in text, it would seem appropriate to add semantics to the equation.

The distributional hypothesis by Zellig Harris is a well-known theory that states that there is a positive correlation between distributional similarity and semantic similarity [15]. The meaning of words can be derived from their contexts.

Another aspect to consider is that the way human beings represent knowledge is often causal in nature. We tend to describe, and think about, certain events in cause-and-effect relations to be able to explain them. In recent years the study of causality has been a topic of increasing interest. It is often of interest to determine what would happen if some event were to occur in a different fashion, and how this would affect other events. This question, to what extent it can be predicted that some event may change if another were different, is causal in nature [25]. Causality is concerned with constructing causal models which encompass the underlying mechanisms of certain events. These causal models are determined on the basis of correlation and probabilities, thus being distributional in nature.

If it is the case that a hyponym can be seen as a cause of a hypernym occurring in text (or the other way around), then it may be possible to extract these relations from text using causal learning methods. However, to this day, no such method exists.

Therefore it is in the interest of this study to determine whether this could be achieved through causal learning methods and if this method could hold up against a lexico-syntactic method.


Chapter 2

Background

Currently there are multiple methods to automatically extract hypernymy relations from text [6][3]. Roller et al. have shown that the most effective method to date utilizes lexico-syntactic patterns in text to extract these relations [29]. This method, however, does not rely solely on syntax, as it also utilizes several distributional methods.

The interest of the current research is to determine if the use of causal learning can be of benefit to the automatic extraction of hypernymy relations, and if its performance is comparable to that of a purely lexico-syntactic method.

The use of these methods requires some background knowledge which will be covered in this chapter.

2.1 Natural Language Processing

When dealing with natural language it is important to understand the syntactic structure behind a language. Chapter 23 of the book "Artificial Intelligence: A Modern Approach" [30] covers the theory to do so. To understand the inherent structure behind a sentence in a language it is necessary to assign a certain syntactic functionality to the words in that sentence. This function of a word is called a Part Of Speech (POS) tag. When a sentence has been POS tagged it is possible to parse the sentence and create a tree that shows the syntactic structure of the sentence.

However, there are several aspects to take into account when attempting to define this structure. For instance, the sentence "the dog bit a cat with a hat on" can be interpreted in two different ways: either the dog was wearing a hat while biting a cat, or the cat was wearing the hat. These kinds of ambiguities can be difficult to deal with when analysing a sentence on a purely syntactic basis.


This section covers the current methods for acquiring the structure of a sentence relevant to this study.

2.1.1 Part Of Speech (POS) tagging

The available tags a word can receive depend on the tagset used. The universal POS tagset [27] consists of 12 POS tags. Examples of these tags are "NOUN" for nouns, "ADV" for adverbs and "." for punctuation marks. A full list of the universal tagset can be seen in table 2.1.

The universal tagset is derived from the Penn Treebank POS tagset [22], which is a more elaborate tagset consisting of 36 POS tags. However, in this research the goal is to identify noun phrases and find hypernymy relations between them. The POS tags in the universal tagset proved sufficient for this task.

Universal POS Tagset

Tag   Description
NOUN  nouns
VERB  verbs
ADJ   adjectives
ADV   adverbs
PRON  pronouns
DET   determiners
ADP   prepositions and postpositions
NUM   numerals
CONJ  conjunctions
PRT   particles
.     punctuation marks
X     other

Table 2.1: Universal POS tagset
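As an illustration of how Penn Treebank tags collapse into the universal tagset, the sketch below maps a POS-tagged sentence; the dictionary covers only a small assumed subset of the full 36-to-12 mapping, not the official one:

```python
# a minimal sketch: an assumed subset of the Penn Treebank -> universal mapping
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET", "IN": "ADP",
    "CD": "NUM", "CC": "CONJ", ".": ".",
}

def to_universal(penn_tagged):
    """Map (word, Penn tag) pairs to (word, universal tag); unknown tags become X."""
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in penn_tagged]

print(to_universal([("The", "DT"), ("dog", "NN"), ("runs", "VBZ")]))
# [('The', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB')]
```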

2.1.2 Noun Phrases

To identify a noun phrase (NP) in a sentence it is necessary to define what a noun phrase is. For this, multiple grammars can be defined [10]. The simplest form of a noun phrase would be a single noun:

(1) NP ← NOUN

This can be extended effortlessly by adding an optional determiner and an optional indefinite number of adjectives or nouns (e.g. "The big lumber yard"):

(2) NP ← (DET) (ADJ|NOUN)* NOUN

A noun phrase can be extended further by defining a Prepositional Phrase (PP) as a preposition and a noun phrase, and adding it as optional to the end of the noun phrase:

(3) NP ← (DET) (ADJ|NOUN)* NOUN (PP)

Figure 2.1: Ambiguity of a sentence

However, problems can arise when using a grammar with rule (3). Consider the sentence "I will hit the fly with the newspaper". This can be parsed into two different trees, of which the first suggests that an individual will hit a fly by use of a newspaper. The second would suggest that this individual will hit a fly that has a newspaper, for whatever purpose this newspaper might have to the fly. A graphical representation of the parse trees can be seen in figure 2.1.

This kind of ambiguity can be troublesome when deciding which grammar is to be used. If the first interpretation is desirable (where an individual uses a newspaper to hit a fly), grammar (2) seems suitable. If the alternative is desirable, grammar (3) should be used. However, suppose grammar (2) is to be used, and take a similar sentence, "I will hit the fly with wings". In this case the interpretation of the resulting parse tree will be that an individual will hit a fly using some type of wings to hit with.
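The behaviour of grammar (2) can be sketched with a small tag-sequence matcher. This is an illustrative simplification (a regular-expression scan over the tag string, not a real parser), and it assumes the input has already been tagged with the universal tagset:

```python
import re

def np_chunks(tagged):
    """Extract NPs matching grammar (2): NP <- (DET) (ADJ|NOUN)* NOUN.
    `tagged` is a list of (word, universal_tag) pairs."""
    tags = " ".join(tag for _, tag in tagged) + " "
    chunks = []
    for m in re.finditer(r"(?:DET )?(?:ADJ |NOUN )*NOUN ", tags):
        start = tags[:m.start()].count(" ")  # token index where the match begins
        end = tags[:m.end()].count(" ")      # token index just past the match
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks

sentence = [("The", "DET"), ("big", "ADJ"), ("lumber", "NOUN"), ("yard", "NOUN")]
print(np_chunks(sentence))  # ['The big lumber yard']
```

Because the match is greedy, "The big lumber yard" is extracted as one NP, exactly as grammar (2) intends; a PP such as "with wings" would simply fall outside the chunk, mirroring the interpretation discussed above.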

2.2 Lexico-syntactic Hypernymy extraction

As stated before, the use of lexico-syntactic patterns to detect hypernymy relations in text still holds up to this day. This section will therefore discuss hypernymy relation extraction through the use of lexico-syntactic patterns.

In 1992 Marti Hearst proposed a method to automatically detect hypernymy relations in text [16]. The method relies on lexico-syntactic patterns such as "NP_hyper such as NP_hypo". For example, in the sentence "lexico-syntactic patterns such as Hearst patterns", detecting this pattern will indicate that "Hearst patterns" is a hyponym of the hypernym "lexico-syntactic patterns".

Despite the usefulness of this type of lexico-syntactic method, the sole use of syntax has its problems [32].

One such problem is that syntax alone cannot deal with syntactic ambiguity in text. Consider the sentence "animals other than dogs such as cats". In this sentence the pattern NP_hyper such as NP_hypo will extract "dogs such as cats", indicating that "cats" is a hyponym of "dogs".

Another problem is that these patterns can be so specific that they do not occur very often in text. The result of this is that recall of hypernymy relations that can be found in text is generally low.

Regarding the first problem, it is possible to alter the grammar used to define a noun phrase, making it such that the parsed sentence of the example given looks as follows: "(NP animals other than dogs) such as (NP cats)". However, defining a noun phrase such that the sentence is parsed in this manner could interfere with other extractions in similar sentences.

Consider the sentence "animals other than pets such as dogs", which is to be interpreted as "(NP animals) other than ((NP pets) such as (NP dogs))". If we use the alternative noun phrase definition mentioned before, this sentence will be parsed as "(NP animals other than pets) such as (NP dogs)", indicating that dogs are animals other than pets, which in some cases is true, but not always. This example shows that preliminary knowledge about pets and dogs seems necessary if the goal is to find correct hypernymy relations, implying that syntax alone is not strong enough for this task and that semantics could prove beneficial.
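A deliberately naive version of the "such as" pattern, restricted to single-word noun phrases, reproduces both the intended extraction and the faulty one described above (the function and its single-token restriction are this sketch's assumptions):

```python
import re

def hearst_such_as(sentence):
    """Naive sketch of the "NP_hyper such as NP_hypo" Hearst pattern
    with single-word noun phrases: returns (hyponym, hypernym) pairs."""
    return [(m.group(2), m.group(1))
            for m in re.finditer(r"(\w+) such as (\w+)", sentence)]

print(hearst_such_as("Primates such as gorillas"))
# [('gorillas', 'Primates')] -- the intended relation
print(hearst_such_as("animals other than dogs such as cats"))
# [('cats', 'dogs')] -- the faulty extraction "a cat is a dog"
```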

2.3 Distributional semantics

The distributional hypothesis by Harris states that words that are semantically similar have a tendency to share a similar linguistic distribution [15]. Distributional semantics is based on the distributional hypothesis and seeks to discover semantic similarity from linguistic distribution data.

To this day the most commonly practiced method to discover semantics is by producing a distributional model from text and representing the semantics of a word with a vector representation [4].


These vector spaces can be created through several methods. One such method uses a set of words to be used as context and counts for each word w the amount of co-occurrences of word w with the context words. An example of such a vector space can be seen in figure 2.2.

Figure 2.2: Word vector space. The words "sky", "wings" and "engine" are context words.

Determining the similarity between words is then accomplished through calculation of the distance between word vectors. A small distance between word vectors would imply that the words are semantically similar. For instance, in figure 2.2 the word "helicopter" and the word "drone" are more similar than the word "helicopter" and the word "rocket".

Another way to represent distributional semantics is from a more probabilistic perspective. On the assumption that frequency correlates with plausibility, a method to determine the plausibility of a word pair is to count the frequency with which the pair co-occurs [23]. Determining the co-occurrence of words can be achieved by defining a window in which words can co-occur. Skip-gram models [14] can then be used to count all co-occurrences. In essence a skip-gram model is an n-gram model that is allowed to "skip" words in a sequence of words to include words further along in the sentence. For a sequence S = w_1, ..., w_m the skip-gram model is defined as the set:

skip-gram(S)_n^k = {(w_{i_1}, w_{i_2}, ..., w_{i_n}) | Σ_{j=2}^{n} (i_j − i_{j−1}) < k}

where i_j is the index of the j-th selected word w in the sequence S, n is the number of words that co-occur together and k is the range of the window.

Take for instance a 2-skip-bigram model, where a "bigram" model is defined as an n-gram model with n = 2, and the sentence "Insurgents killed in ongoing fighting".

A bigram model for this sentence would consist of "insurgents killed", "killed in", "in ongoing" and "ongoing fighting".

A 2-skip-bigram model, however, would consist of "insurgents killed", "insurgents in", "insurgents ongoing", "killed in", "killed ongoing", "killed fighting", "in ongoing", "in fighting" and "ongoing fighting", thus containing all co-occurrences of n words within a window k.
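The enumeration above can be reproduced in a few lines. The sketch below uses the common "at most k skipped words in total" reading of a k-skip-n-gram, which matches the listed example; the function name and this exact reading are assumptions of the sketch:

```python
from itertools import combinations

def skip_ngrams(tokens, n, k):
    """All n-grams that skip at most k words of the sequence in total."""
    grams = []
    for idx in combinations(range(len(tokens)), n):
        skipped = (idx[-1] - idx[0]) - (n - 1)  # words jumped over inside the gram
        if skipped <= k:
            grams.append(tuple(tokens[i] for i in idx))
    return grams

sentence = "insurgents killed in ongoing fighting".split()
print(skip_ngrams(sentence, n=2, k=0))  # the four plain bigrams
print(skip_ngrams(sentence, n=2, k=2))  # the nine 2-skip-bigrams listed above
```

Note that "insurgents fighting" is correctly absent from the 2-skip-bigrams, since it would require skipping three words.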

We can define the number of co-occurrences of w_1 and w_2 in a window of k, denoted (w_1, w_2)_k, as the number of times w_1 and w_2 are encountered together in a k-skip-bigram model. In a more general sense, the number of co-occurrences (w_1, w_2, ..., w_n)_k of n words within a window of k in a sequence of words S is defined as:

(w_1, w_2, ..., w_n)_k = |{C ∈ skip-gram(S)_n^k | (w_1, w_2, ..., w_n) ∈ C}|

2.3.1 Probabilities from text

With the definition of a skip-gram model it is possible to calculate the probability of encountering a co-occurrence of words in a corpus within a window of k. This probability of words w_1, ..., w_n co-occurring within a window of k is the likelihood that a co-occurrence of n words in the corpus consists of the words w_1, ..., w_n.

Now suppose n = 1. Then the probability of word w comes down to the likelihood that a word from the corpus is word w if a random word were to be taken out of the corpus:

P(w)_k = (w)_k / Σ_i (w_i)_k = #w / Σ_i #w_i

where #w denotes the number of occurrences of word w in the corpus.

This is often referred to as a bag of words model [30], because the corpus is in some sense represented as a bag of words where each word is taken at random independent of other words in the corpus.
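For n = 1 the bag-of-words probabilities are simply normalized counts, which can be sketched as:

```python
from collections import Counter

def unigram_probs(tokens):
    """Bag-of-words model: P(w) = #w / sum of #w_i over all words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = unigram_probs("the dog bit the cat".split())
print(probs["the"])  # 0.4 (2 occurrences out of 5 words)
```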

However, if n = 2 the calculation changes dramatically. The probability of a word pair w_1, w_2 in the corpus co-occurring within a window of k words is:

P(w_1, w_2)_k = (w_1, w_2)_k / Σ_i Σ_j (w_i, w_j)_k

where (w_1, w_2)_k denotes the number of co-occurrences of word pair w_1, w_2 within a window of k words.

For this calculation it is necessary to know all the word pair co-occurrences within a window of k words. This can become a problem when a corpus has a high number of unique words. Consider for example a corpus that has 1,000,000 unique words. To know the total number of co-occurrences of 2 words, every possible combination of 2 words has to be checked for co-occurrences, resulting in 499,999,500,000 word pairs that have to be checked for co-occurrence.

Increasing n further to n = 3 results in 166,666,166,667,000,000 word triples to be checked. Increasing n to n = 4 would then result in 41,666,416,667,124,999,750,000 word quadruples to be checked.

Therefore this calculation quickly turns intractable with high orders of n.

These calculations can be circumvented through some assumptions discussed in section 3.4.1.

In statistics it is often of interest to determine conditional probabilities. A conditional probability of X given Y can be seen as the probability that X will be encountered when Y is known to be encountered and is defined as:

P(X|Y) = P(X, Y) / P(Y)

And the probability of X and Y being encountered given Z is:

P(X, Y|Z) = P(X, Y, Z) / P(Z)

Conditional probabilities can be used to determine independence between variables, or events, X and Y. X and Y are said to be independent, denoted as X ⊥⊥ Y, when knowledge about the truth value of X adds no information to the supposed truth value of Y (and vice versa). Section 2.5 elaborates further upon the definition of independence, which is utilized in this research to extract hypernymy relations with a distributional approach.

Relating conditional probabilities back to word co-occurrence probabilities, the conditional probability of a word w1 given the word w2 would then be:

P(w_1|w_2)_k = P(w_1, w_2)_k / P(w_2)

And for w1 co-occurring with w2 given w3:

P(w_1, w_2|w_3)_k = P(w_1, w_2, w_3)_k / P(w_3)

In layman’s terms, this can be described as the likelihood that w_1 is encountered within a window of k words next to w_2 if w_2 were to be encountered, and the likelihood that w_1, w_2 and w_3 are encountered together if w_3 were to be encountered, respectively.

However, the possibility exists, and it is highly likely, that some sets of words will not co-occur within a specified window. Suppose the condition is not a single word w_3 but a set of words w_3, w_4, ..., w_i to condition on, and suppose this set of words never co-occurs within a specified window. This will result in the probability of the condition set being zero, which results in a division by zero, which is incalculable.

This can be solved by adding a value to the probability such that divisions by zero never occur. This process of adjusting the probability of low-frequency occurrences is called smoothing [30]. There are several methods of smoothing. In its simplest form, the probability of a co-occurrence is altered by adding one to its frequency. This is called add-one smoothing, or Laplace smoothing.
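Add-one smoothing can be sketched as a one-line adjustment, where the number of distinct events is added to the denominator so the smoothed probabilities still sum to one (the function below is an illustrative sketch, not taken from the thesis):

```python
def laplace_prob(count, total, vocab_size):
    """Add-one (Laplace) smoothed probability: never zero,
    even for co-occurrences that were never observed."""
    return (count + 1) / (total + vocab_size)

# an unseen co-occurrence still gets a small non-zero probability
print(laplace_prob(0, total=1000, vocab_size=500))   # 1/1500, about 0.000667
print(laplace_prob(10, total=1000, vocab_size=500))  # 11/1500, about 0.00733
```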

If semantics can be discovered through linguistic distributions and the use of semantics can prove beneficial in the extraction of hypernymy relations, then utilization of these probability functions could prove beneficial towards the extraction of hypernymy relations from text.


2.4 Causal learning

The use of distributional information is applicable in many fields of Artificial Intelligence. One such field that has been gaining interest is the study of causality.

The way human beings represent knowledge is often causal of nature. We tend to describe, and think about, certain events in cause and effect relations to be able to explain them.

The study of causality seeks to uncover cause-and-effect relationships between certain events, and how to find these relationships in order to develop causal models that encompass the underlying mechanisms of these events. The idea behind these causal models is that events can be described in a directional structure that shows an asymmetric cause-and-effect relation between them. These types of cause-and-effect relations can be found through statistical analysis by means of determining dependencies [26].

The importance of causality has been questioned greatly throughout history [9]. It has been argued that we can only observe correlation and that the study of causality is useless because causality may not even exist. However, consider a situation where a group of people has been examined on the amount of exercise they perform and the levels of cholesterol in their blood. The graph in figure 2.3a shows the results of the examination. In this case there is a significant correlation between the amount of exercise and the levels of cholesterol; however, the implication of this correlation is in contradiction with what you would expect, namely that high levels of cholesterol are paired with high amounts of exercise.

Now consider the graph in figure 2.3b. In this graph it is observable that age actually influences both cholesterol levels and the amount of exercise. Moreover, there is actually a negative correlation between cholesterol levels and the amount of exercise, as expected.

What this shows is that there can be certain causal aspects to a situation. Ignoring such cause-and-effect relationships may lead to faulty conclusions. Thus the study of causality does seem necessary and useful.

Currently the topic of causality has seen a rise in interest in the areas of computer science, epidemiology and economics, due to the need for cause-and-effect relationships of events. The question to what extent it can be predicted that some event may change if another were different is causal in nature [25]. It is in the interest of the current study to discover whether causal learning can aid the process of extracting hypernymy relations from text.


Figure 2.3: (a) Correlation between cholesterol and exercise; (b) correlation between cholesterol and exercise given age.

2.4.1 Causal models

A causal model can be seen as a structure that shows the causal story, or the mechanisms, behind certain events [26]. It consists of nodes that are connected in an asymmetric manner, such that a cause has an arrow pointing to the events it causes, also referred to as its effects. An example of a causal model can be seen in figure 2.4. This causal model shows the causal relations between being a smoker, pollution in a city, having lung cancer, requiring an X-ray and having dyspnoea.

Figure 2.4: Causal story regarding lung cancer with its causes and effects

In this model, having dyspnoea and requiring an X-ray are independent of pollution and being a smoker. They are, however, dependent on having cancer, which in turn is dependent on pollution and being a smoker. While being a smoker and/or living in a polluted area does not directly cause dyspnoea, it does directly cause lung cancer, which is a direct cause of dyspnoea. The fact that being a smoker and living in a polluted area have an effect on dyspnoea and requiring an X-ray can be derived from the probability tables in the model. The probability of having lung cancer when being a smoker is true (S=T) and pollution is high (P=H) is 0.05, which is higher than the probability of having lung cancer when either pollution is low (P=L) or being a smoker is false (S=F). In turn this affects dyspnoea, for the probability of having dyspnoea when you have lung cancer (C=T) is higher than the probability of having dyspnoea when you do not have lung cancer (C=F).

This type of causal model that contains direct causal relations with probability tables is called a Bayesian network [24].

Figure 2.5: The three structures in a causal model: (a) chain, (b) fork and (c) collider.

There are three types of structures of which a causal model can consist. The three structures can be seen in figure 2.5. Take variables X, Y and Z. If it were discovered that X causes Y and Y causes Z, then X, Y and Z would be in a causal relation called a chain (figure 2.5a). If it were discovered that X causes both Y and Z, the structure is called a fork (figure 2.5b). The final structure is called a collider, in which X is caused by both Y and Z (figure 2.5c).

Discovery of these structures is achieved through determining (in)dependencies and conditional (in)dependencies. Determination of independence will be further elaborated on in section 2.5; automatic discovery of a causal model will be discussed in section 2.4.2.

Suppose a chain X → Y → Z is discovered and suppose Y is observed. Then it is the case that changing the value of X does not affect Z because Y is the direct cause of Z. Then we have that X and Z are independent when Y is observed (X ⊥⊥ Z|Y ). This phenomenon is called d-separation. In this case Y d-separates X and Z.


For forks d-separation behaves in a similar fashion.

Take a fork X ← Y → Z. If Y is not known, then observing X will change the expected value of Z, because observing X tells something about what Y should be, which in turn tells something about Z. However, because Y causes both X and Z, observing Y causes the observation of X to have no effect on the expectation of Z (X ⊥⊥ Z | Y).

For colliders however, d-separation behaves differently.

Take a collider X → Y ← Z. If Y is unknown, then observing X does not affect the expectation of Z. However, if Y were observed, then observing X tells something about what Z should be, because X and Z are both direct causes of Y. This causes Y to make X and Z dependent (X ̸⊥⊥ Z | Y).
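The collider behaviour can be verified numerically on a toy joint distribution; the particular distribution (X and Z fair coins, Y = X XOR Z) is an illustrative assumption:

```python
from itertools import product

# toy collider X -> Y <- Z: X and Z are independent fair coins, Y = X XOR Z
joint = {(x, z, x ^ z): 0.25 for x, z in product([0, 1], repeat=2)}

def p(pred):
    """Probability mass of all outcomes (x, z, y) satisfying pred."""
    return sum(mass for (x, z, y), mass in joint.items() if pred(x, z, y))

# marginally: P(X=1, Z=1) equals P(X=1) * P(Z=1), so X and Z are independent
print(p(lambda x, z, y: x == 1 and z == 1),
      p(lambda x, z, y: x == 1) * p(lambda x, z, y: z == 1))  # 0.25 0.25

# conditioned on the collider Y=0, the factorization breaks: X and Z are dependent
py0 = p(lambda x, z, y: y == 0)
lhs = p(lambda x, z, y: x == 1 and z == 1 and y == 0) / py0
rhs = (p(lambda x, z, y: x == 1 and y == 0) / py0) * \
      (p(lambda x, z, y: z == 1 and y == 0) / py0)
print(lhs, rhs)  # 0.5 0.25
```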

Identification of colliders is of great benefit to the determination of causal models, as shown in section 2.4.2.

2.4.2 Algorithms

Currently, multiple methods within the field of Artificial Intelligence exist to automatically learn causal models [19]. In this section two such algorithms are discussed and elaborated on.

The CI algorithm proposed by Verma and Pearl [31] utilizes conditional dependencies of a set of events to determine whether certain events d-separate other events, thus discovering which events can be seen as the causes of others.

It does so by creating a graphical model and adding every event as a node in this model. For each pair of nodes X and Y, a direct causal connection is made between them if for every set S of nodes that does not contain X and Y, X and Y are not d-separated by the set S (¬(X ⊥⊥ Y | S)). This first step acquires all direct, undirected causal relations between the events.

After all direct causal relations are identified, some of the relation directions, or arc directions, are identified by means of colliders. Because colliders are distinct from chains and forks in that they can make two events dependent when conditioned on their common effect, they can be identified with absolute certainty if they occur in the model.

For each undirected chain X − Y − Z, the chain can be oriented as X → Y ← Z if it is the case that Y does not d-separate X and Z (meaning that the independence of X and Z is not due to Y, so Y has to be a common effect, because there is a direct causal relation between X, Y and Z).

After colliders have been identified, a final step of causal direction identifica-tion is performed. This step checks for each undirected pair Y − Z for one of two conditions.

(1) If Y had occurred as the middle node of an undirected chain X − Y − Z, the previous step failed to indicate that Y is a collider on X and Z, and X and Y are now oriented as X → Y, then X, Y and Z form a chain: X → Y → Z, because this is the only available option that is left.

(2) If the orientation Y ← Z would induce a cycle in the model, Y and Z should be oriented as Y → Z. Because a causal model is not allowed to be cyclic (i.e. a cause cannot also be the effect of its own effect), it is possible to determine some arc directions by testing whether the opposite direction would induce a cycle. This final step is repeated until no more arc directions can be identified.

The CI algorithm thus obtains a causal model from the independencies of events. However, when the CI algorithm has to deal with a large set of events/nodes, every time dependence has to be determined between nodes X and Y, it has to be determined conditioned on each subset of all nodes that does not contain X and Y, resulting in a near-intractable algorithm when dealing with large sets of nodes.

A solution to this problem was proposed with the PC algorithm [18], which also solves the problem of uncertainty regarding the significance of an independence by introducing a significance test.

The first step of the PC algorithm is the initialisation of a causal model as a fully connected graph containing all events as nodes. For each pair of connected nodes X and Y, the PC algorithm determines whether they are unconditionally independent. If this is the case, the connection between X and Y is removed.

If a connection is removed, the PC algorithm repeats this step, but now sets the order of the condition set to 1, indicating that for each pair of connected nodes X and Y independence will be determined conditioned on all events Z such that Z is connected to X and Y. If X and Y are independent given all such events Z, then the connection is removed.

Again, when a connection is removed, the step is repeated and the order of the conditioning set is incremented by 1.

This approach of increasing the order of conditioning set at each iteration and only conditioning on connected nodes decreases the complexity of the algorithm greatly. This is because most connections will be removed during early iterations, causing the number of available sets to be conditioned on to be relatively small when the algorithm reaches higher orders.

The PC algorithm also keeps track of the nodes S that d-separate nodes X and Y (X ⊥⊥ Y |S → DSep(X, Y ) = DSep(X, Y ) ∪ S), which is used in the second step of the PC algorithm.

The second step of the PC algorithm corresponds to the second step of the CI algorithm in that it identifies colliders to acquire some of the arc directions. Because the PC algorithm has kept track of the nodes that d-separate others, this comes down to determining for each undirected chain X − Y − Z whether Y ∈ DSep(X, Z). If this is not the case, then Y is a collider on X and Z: X → Y ← Z.

The final step of the PC algorithm is the same as the final step of the CI algorithm.
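The collider-orientation step just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this thesis: `edges` holds the undirected skeleton and `dsep` the recorded d-separating sets.

```python
# Minimal sketch of collider orientation: for every undirected chain
# X - Y - Z (X and Z non-adjacent), orient X -> Y <- Z whenever Y is not
# in the recorded d-separating set DSep(X, Z).
def orient_colliders(edges, dsep):
    """edges: set of frozenset({X, Y}) undirected edges.
    dsep: dict mapping frozenset({X, Z}) -> set of separating nodes.
    Returns a set of directed arcs (tail, head)."""
    arcs = set()
    nodes = {n for e in edges for n in e}
    for y in nodes:
        neigh = [n for n in nodes if frozenset({n, y}) in edges]
        for i, x in enumerate(neigh):
            for z in neigh[i + 1:]:
                # X and Z must be non-adjacent for X - Y - Z to be a chain
                if frozenset({x, z}) in edges:
                    continue
                if y not in dsep.get(frozenset({x, z}), set()):
                    arcs.add((x, y))
                    arcs.add((z, y))
    return arcs

# Toy example: X - Y - Z with X and Z d-separated by the empty set,
# so Y must be a collider: X -> Y <- Z.
edges = {frozenset({"X", "Y"}), frozenset({"Y", "Z"})}
arcs = orient_colliders(edges, {frozenset({"X", "Z"}): set()})
```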

2.5 Independence

As stated before, it is necessary to determine the independence of two events X and Y, and the conditional (in)dependence of X and Y given a third event or set of events Z, in order to determine whether there exists some causal connection between X and Y.

Determination of this independence can be done in two ways.

In a probabilistic sense it should be the case that X and Y are independent if the observation of Y does not alter the probability of X. This will be covered in section 2.5.1.

Another way to determine independence is to analyse how X and Y correlate. If X and Y can be represented as categorical random variables, their correlation coefficient can be calculated, from which an independence (or lack thereof) can be derived. This will be covered in section 2.5.2.


2.5.1 Probabilistic independence

When two binary variables X and Y are called independent of each other, it means that observing one or the other does not affect the probability of the other occurring. This means that the probability of X is equal to the conditional probability of X given Y:

X ⊥⊥ Y ⇐⇒ P (X|Y ) = P (X)

To determine a conditional independence between X and Y given an event or set of events Z, it has to be the case that the probability of X and Y given Z is equal to the product of the probability of X given Z and the probability of Y given Z:

X ⊥⊥ Y |Z ⇐⇒ P (X, Y |Z) = P (X|Z) ∗ P (Y |Z)
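Both definitions can be checked numerically. The sketch below uses a made-up joint distribution (all probabilities hypothetical) in which a third variable Z is a common cause of X and Y, so that X and Y are dependent unconditionally but independent given Z:

```python
# Made-up joint distribution: Z is a common cause of binary X and Y.
P_z = {0: 0.5, 1: 0.5}
P_x1_given_z = {0: 0.1, 1: 0.9}   # P(X=1 | Z=z)
P_y1_given_z = {0: 0.1, 1: 0.9}   # P(Y=1 | Z=z)

def joint(x, y, z):
    px = P_x1_given_z[z] if x else 1 - P_x1_given_z[z]
    py = P_y1_given_z[z] if y else 1 - P_y1_given_z[z]
    return P_z[z] * px * py

def P_xy(x, y):
    return sum(joint(x, y, z) for z in (0, 1))

def P_x(x):
    return sum(P_xy(x, y) for y in (0, 1))

def P_y(y):
    return sum(P_xy(x, y) for x in (0, 1))

# Unconditional test: is P(X|Y) = P(X)?
p_x_given_y = P_xy(1, 1) / P_y(1)             # 0.82, while P(X=1) = 0.5
x_indep_y = abs(p_x_given_y - P_x(1)) < 1e-9  # False: X and Y are dependent

# Conditional test: is P(X,Y|Z) = P(X|Z) * P(Y|Z)?
p_xy_given_z1 = joint(1, 1, 1) / P_z[1]       # 0.81 = 0.9 * 0.9
x_indep_y_given_z = abs(p_xy_given_z1 - P_x1_given_z[1] * P_y1_given_z[1]) < 1e-9  # True
```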

Furthermore, the axioms of probabilistic independence [13] state that:

(1) X ⊥⊥ ∅ Trivial independence

(2) X ⊥⊥ Y → Y ⊥⊥ X Symmetry

(3) X ⊥⊥ Y ∪ Z → X ⊥⊥ Y Decomposition

(4) X ⊥⊥ Y &X ∪ Y ⊥⊥ Z → X ⊥⊥ Y ∪ Z Mixing

Axiom (1) states that any variable X is independent of the empty set (P (X) = P (X|∅)).

Axiom (2) states that independence is symmetric; X being independent of Y makes Y independent of X.

Axiom (3) states that X is independent of Y if X is independent of both Y and Z.

Axiom (4) states that X is independent of both Y and Z if X is independent of Y and X and Y jointly are independent of Z.

2.5.2 Correlation

Determining independence between events X and Y is also possible by means of correlation [19]. The correlation ρ_{X,Y} between two discrete random variables X and Y describes the degree to which one variable can predict the value of the other in a linear fashion.


Consider a small corpus of 5 articles. Each noun phrase that occurs in the corpus can be represented as the collection of articles it occurs in, together with the number of times it occurs in each article. Suppose NP1 and NP2 occur as described in table 2.2. This can be represented in scatter plots as seen in figure 2.6a.

Article    1    2    3    4    5
NP1        4    3    7    2    9
NP2        5    2    5    1    7
NP3        2    3    5    4    5

Table 2.2: Occurrences of three noun phrases for a corpus of 5 articles

Figure 2.6: Article occurrences represented in scatter plots. (a) occurrences of NP1 (x-axis) and NP2 (y-axis); (b) occurrences of NP1 (x-axis) and NP3 (y-axis)

If noun phrases are represented in such a manner, it is possible to calculate Pearson's sample product moment correlation coefficient r_{NP1,NP2} between noun phrases as:

r_{NP1,NP2} = Σ_i (NP1_i − mean(NP1)) (NP2_i − mean(NP2)) / ( √(Σ_i (NP1_i − mean(NP1))²) · √(Σ_i (NP2_i − mean(NP2))²) )


The correlation coefficient ranges between -1 and 1. The further away from 0, the more the variables are correlated and thus the more they can predict each other.

Therefore, when this correlation coefficient is not significantly different from zero, there is an independence between two noun phrases.

Applying this to the previous example, the correlation coefficient between NP1 and NP2 is:

r_{NP1,NP2} = Σ_i (NP1_i − 5)(NP2_i − 4) / ( √(Σ_i (NP1_i − 5)²) · √(Σ_i (NP2_i − 4)²) ) = 26 / (√34 · √24) ≈ 0.910
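This calculation can be verified with a few lines of Python (a standalone sketch, not the thesis code):

```python
from math import sqrt

# Pearson's sample product moment correlation coefficient,
# applied to the occurrence counts of table 2.2.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)) * sqrt(sum((y - my) ** 2 for y in ys))
    return num / den

np1 = [4, 3, 7, 2, 9]
np2 = [5, 2, 5, 1, 7]
r = pearson(np1, np2)   # ≈ 0.910
```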

Now consider NP1 and NP3. Figure 2.6b shows in what manner they occur within the corpus.

Calculating the correlation coefficient of NP1 and NP3 in the same way yields r_{NP1,NP3} ≈ 0.658, which shows that there is less of a correlation between NP3 and NP1 than there is between NP2 and NP1, as can be seen in figure 2.7.

Partial correlation  Partial correlation can be seen as determining the correlation of two variables X and Y while holding a third variable or set of variables Z constant, denoted as ρ_{X,Y·Z}. Holding Z constant means that the value of Z is not allowed to vary across the observations. This is similar to determining a conditional probability.

Take two noun phrases NP1 and NP2 and a third noun phrase NP3 to be held constant. The sample partial correlation coefficient is then defined as:

r_{NP1,NP2·NP3} = ( r_{NP1,NP2} − r_{NP1,NP3} · r_{NP2,NP3} ) / ( √(1 − r²_{NP1,NP3}) · √(1 − r²_{NP2,NP3}) )


Figure 2.7: Correlation between noun phrases regarding their occurrences in articles. (a) correlation between NP1 and NP2; (b) correlation between NP1 and NP3

As an example, take NP1, NP2 and NP3 in the corpus described in table 2.2. Calculating r_{NP1,NP3·NP2} results in:

r_{NP1,NP3·NP2} = ( r_{NP1,NP3} − r_{NP1,NP2} · r_{NP3,NP2} ) / ( √(1 − r²_{NP1,NP2}) · √(1 − r²_{NP3,NP2}) ) = (0.658 − 0.910 · 0.313) / ( √(1 − 0.910²) · √(1 − 0.313²) ) ≈ 0.948

This shows that NP1 has a higher correlation with NP3 when holding NP2 constant than in the previous case, where no noun phrase was held constant.

When the condition, i.e. the set of variables to be held constant, contains multiple variables, the sample partial correlation coefficient can be calculated by recursively partialling out a variable C0 of the condition set C:

r_{NP1,NP2·C} = ( r_{NP1,NP2·C\{C0}} − r_{NP1,C0·C\{C0}} · r_{NP2,C0·C\{C0}} ) / ( √(1 − r²_{NP1,C0·C\{C0}}) · √(1 − r²_{NP2,C0·C\{C0}}) )

Again, when this correlation coefficient is not significantly different from zero, there is an independence between N P1 and N P2 given the set C.
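The recursive definition can be implemented directly. The sketch below (illustrative names, not the thesis implementation) reproduces the partial correlation of the running example:

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return num / (sqrt(sum((a - mx) ** 2 for a in xs)) *
                  sqrt(sum((b - my) ** 2 for b in ys)))

def partial(x, y, cond, data):
    """Sample partial correlation r_{x,y.cond}, recursively partialling
    out the last variable of the condition set. `data` maps variable
    names to lists of observed values."""
    if not cond:
        return pearson(data[x], data[y])
    c0, rest = cond[-1], cond[:-1]
    rxy = partial(x, y, rest, data)
    rxc = partial(x, c0, rest, data)
    ryc = partial(y, c0, rest, data)
    return (rxy - rxc * ryc) / (sqrt(1 - rxc ** 2) * sqrt(1 - ryc ** 2))

data = {"NP1": [4, 3, 7, 2, 9], "NP2": [5, 2, 5, 1, 7], "NP3": [2, 3, 5, 4, 5]}
r = partial("NP1", "NP3", ["NP2"], data)   # ≈ 0.948
```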


Significance test  To determine whether a correlation between variables X and Y can be considered vanishing, meaning that r_{XY} = 0, a t-test can be performed. The t-test requires a significance level α and the degrees of freedom of the problem, from which the critical t-value can be looked up in a t-table such as table 2.8. When the calculated t-value is higher than this critical t-value, the hypothesis that there is no correlation between X and Y is rejected, meaning that X and Y are dependent.

The standard t-test for the product moment correlation coefficient between two variables X and Y is defined as:

t = r_{XY} · √( (n − 2) / (1 − r²_{XY}) )

where n is the sample size and n − 2 the degrees of freedom.

For the sample partial correlation between X and Y given a set Z the t-test is calculated as:

t = r_{XY·Z} · √( (n − 2 − |Z|) / (1 − r²_{XY·Z}) )

with n − 2 − |Z| degrees of freedom.

Take the example corpus from before again, described in table 2.2, and suppose α = 0.05. We have 5 samples, so n = 5. Then:

t = r_{NP1,NP2} · √( (n − 2) / (1 − r²_{NP1,NP2}) ) = 0.910 · √( 3 / (1 − 0.910²) ) ≈ 3.80

with 3 degrees of freedom. Looking at the t-table, the critical t-value with α = 0.05 and 3 degrees of freedom is 3.182, meaning that the hypothesis that NP1 and NP2 are independent is rejected.


Figure 2.8: t-test table

For NP1 and NP3 we have:

t = r_{NP1,NP3} · √( (n − 2) / (1 − r²_{NP1,NP3}) ) = 0.658 · √( 3 / (1 − 0.658²) ) ≈ 1.51

with 3 degrees of freedom. This t-value is lower than the critical t-value for α = 0.05 and 3 degrees of freedom, meaning that r_{NP1,NP3} is not significantly different from 0 and therefore NP1 ⊥⊥ NP3.

Furthermore, for the partial correlation r_{NP1,NP3·NP2} we have:

t = r_{NP1,NP3·NP2} · √( (n − 2 − |Z|) / (1 − r²_{NP1,NP3·NP2}) ) = 0.948 · √( 2 / (1 − 0.948²) ) ≈ 4.21

with 2 degrees of freedom. The critical t-value for α = 0.05 and 2 degrees of freedom is 4.303, so this t-value falls just short of significance at α = 0.05. At a slightly weaker level, such as α = 0.1 (critical t-value 2.920), the partial correlation would be significant, giving NP3 ⊥̸⊥ NP1 | NP2. Together with the fact that NP1 ⊥⊥ NP3, this would suggest that NP2 is a collider on NP1 and NP3.
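The t statistics of this example can be reproduced as follows. This sketch uses the standard t statistic t = r·√((n−2)/(1−r²)); the critical values 3.182 and 4.303 are assumed to come from a two-sided t-table at α = 0.05 for 3 and 2 degrees of freedom.

```python
from math import sqrt

def t_plain(r, n):
    """t statistic for a plain correlation; df = n - 2."""
    return r * sqrt((n - 2) / (1 - r ** 2))

def t_partial(r, n, k):
    """t statistic for a partial correlation with |Z| = k; df = n - 2 - k."""
    return r * sqrt((n - 2 - k) / (1 - r ** 2))

n = 5
t12 = t_plain(0.910, n)          # ≈ 3.80 > 3.182: NP1 and NP2 dependent
t13 = t_plain(0.658, n)          # ≈ 1.51 < 3.182: NP1 and NP3 independent
t13_2 = t_partial(0.948, n, 1)   # ≈ 4.21, just under the critical 4.303
```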


Chapter 3

Method

3.1 Corpus

In this study the Brown corpus has been used for the hypernymy detection task. The Brown corpus consists of 500 English articles of more than 2,000 words each [12]. The total number of words in the corpus is 1,014,312, of which ... are unique words. The articles are divided into 18 diverse categories, which in turn consist of several subcategories.

3.2 Extracting hypernymy relations through Hearst patterns

For the extraction of noun hypernymy relations it is necessary to tag the text with Part-Of-Speech (POS) tags and identify Noun Phrases. A preprocessing file was written in Python 3.8.5 using the universal tagset of NLTK [20] to tag the sentences with POS tags. After POS tagging of a sentence the NLTK regex parser was applied to identify Noun phrases in the sentence. In this research a noun phrase was specified as an optional determiner, one or more optional adjectives or nouns and a noun at the end of the noun phrase. Sentences were then altered by adding noun phrase tags (NP). An example of a sentence before preprocessing and after preprocessing can be seen in figure 3.1.

After identification of noun phrases in a sentence, extraction of hypernymy relations through the use of Hearst patterns could be initiated. Using the regex module in Python, these patterns were identified in each sentence and the corresponding hyponym/hypernym relation was extracted. A detailed list of the Hearst patterns used in this research can be found in figure 3.2.

Original sentence:   A work place like a big lumber yard
POS tagged:          (’A’, ’DET’), (’work’, ’NOUN’), (’place’, ’NOUN’), (’like’, ’ADP’), (’a’, ’DET’), (’big’, ’ADJ’), (’lumber’, ’NOUN’), (’yard’, ’NOUN’)
Altered sentence:    (NP A work place ) like (NP a big lumber yard )

Figure 3.1: The process of altering a plain text sentence to a noun phrase tagged sentence

The extracted hypernymy relations were then lemmatized using the WordNet lemmatizer from NLTK and saved in a dictionary for later examination.
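As an illustration of this extraction step, the sketch below matches a single simplified Hearst pattern ("NPhyper such as NPhypo") against an NP-tagged sentence. The "(NP ... )" tag format follows figure 3.1; the single pattern is a hypothetical stand-in for the full list of figure 3.2, not the thesis implementation.

```python
import re

# One noun phrase wrapped in the "(NP ... )" markers of figure 3.1.
NP = r"\(NP ([^)]+?) \)"

def hearst_such_as(tagged_sentence):
    """Extract (hyponym, hypernym) pairs from 'NP_hyper such as NP_hypo'."""
    pairs = []
    for m in re.finditer(NP + r" such as " + NP, tagged_sentence):
        hyper, hypo = m.group(1), m.group(2)
        pairs.append((hypo, hyper))
    return pairs

sentence = "(NP metals ) such as (NP iron ) were found"
pairs = hearst_such_as(sentence)   # [('iron', 'metals')]
```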

3.3 Representing distributional semantics

Aside from the extraction of hypernymy relations through the use of Hearst patterns, the construction of two more dictionaries was necessary to account for distributional semantics and its different representations. The first dictionary contains all positions of each noun phrase in the text. When the similarity of two noun phrases is based on their proximity to each other, skip-gram models have to be derived from the text; therefore it is necessary to keep track of the specific positions of noun phrases. The second dictionary keeps track of the number of occurrences of each noun phrase in each document and is used when calculating the correlation of two noun phrases based on their occurrences in articles.

3.4 Finding a causal model

For the discovery of a causal model an implementation of the PC algorithm, as described in [19], was built in Python for this research. The pseudocode for the algorithm can be seen in Algorithm 1.

Hearst patterns used for hypernymy extraction:

NPhypo which is a type|example|class|kind|sort of NPhyper
NPhypo1 (,| and| or NPhypo2 ...) and|or (any|some) other NPhyper
NPhypo(,) which is called NPhyper
NPhypo is a|are special case(s) of NPhyper
NPhyper such as|including|like NPhypo1 (,| and| or NPhypo2 ...)
(un)like most|any|all|other|some( other) NPhyper, NPhypo ....

Figure 3.2: Hearst patterns used for the extraction of hypernymy relations. NPhyper indicates a hypernym noun phrase, NPhypo indicates a hyponym noun phrase. Parentheses indicate parts of a pattern that are optional for the extraction. "|" indicates that either the word before or after the bar has to occur.

Because most corpora contain a large number of different noun phrases (the Brown corpus contains about 200,000 different noun phrases), it would be intractable to use all noun phrases for the PC algorithm. For an initial model of 200,000 nodes the first pass of calculating independence between nodes alone would consist of roughly 20 billion calculations.

Therefore, in this research the initial model of the PC algorithm consists only of the noun phrases extracted via Hearst patterns. The goal is to determine whether the PC algorithm can find a causal model that is structured in the same manner as the hierarchy obtained through Hearst patterns, and possibly improve upon it, so restricting the initial nodes in this way is a suitable solution to the intractability problem.


Algorithm 1 PC algorithm
Input: set of nodes N, independence function ⊥⊥
Output: causal graph G

 1: G ← {(X, Y) ∈ N × N | X ≠ Y}
 2: k ← 0
 3: for all (X, Y) ∈ G do
 4:     DSep(X, Y) ← ∅
 5: rem ← 1
 6: while rem = 1 do
 7:     rem ← 0
 8:     for all (X, Y) ∈ G do
 9:         adj ← {Z | (X, Z) ∈ G ∧ Z ≠ Y}
10:         if ∀S ∈ adj^k (X ⊥⊥ Y | S) then
11:             G ← G \ {(X, Y), (Y, X)}
12:             rem ← 1
13:             DSep(X, Y) ← DSep(X, Y) ∪ adj
14:     k ← k + 1
15: T ← {(X, Y, Z) ∈ N³ | (X, Y) ∈ G ∧ (Y, Z) ∈ G ∧ (X, Z) ∉ G}
16: for all (X, Y, Z) ∈ T do
17:     if Y ∉ DSep(X, Z) then
18:         G ← G \ {(Y, X), (Y, Z)}
19: rem ← 1
20: while rem = 1 do
21:     rem ← 0
22:     D ← {(X, Y) ∈ N² | (X, Y) ∈ G ∧ (Y, X) ∈ G}
23:     for all (Y, Z) ∈ D do
24:         if ∃X ∈ N ((X, Y) ∈ G ∧ (Y, X) ∉ G) then
25:             G ← G \ {(Z, Y)}
26:             rem ← 1
27:         if G \ {(Y, Z)} makes G cyclic then
28:             G ← G \ {(Z, Y)}
29:             rem ← 1
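The first, edge-removal phase of the PC algorithm can also be sketched compactly in Python. This is an illustrative re-implementation, not the code used in this research; note that it removes an edge as soon as *some* conditioning set renders the pair independent, which is the usual PC formulation, and `indep` stands in for any of the independence tests described in this chapter.

```python
from itertools import combinations

def skeleton(nodes, indep):
    """PC phase 1: prune a complete undirected graph. An edge X - Y is
    removed once some conditioning set S of size k (drawn from X's current
    neighbours, k growing each pass) makes X and Y independent."""
    edges = {frozenset(p) for p in combinations(nodes, 2)}
    dsep = {frozenset(p): set() for p in combinations(nodes, 2)}
    for k in range(len(nodes) - 1):        # k = size of the conditioning set
        for e in list(edges):
            x, y = tuple(e)
            adj = [z for z in nodes if frozenset({x, z}) in edges and z != y]
            for s in combinations(adj, k):
                if indep(x, y, set(s)):
                    edges.discard(e)
                    dsep[e] |= set(s)      # remember the separating set
                    break
    return edges, dsep

# Oracle for a toy chain A -> B -> C: A and C are independent given {B}.
def chain_oracle(x, y, s):
    return frozenset({x, y}) == frozenset({"A", "C"}) and "B" in s

edges, dsep = skeleton(["A", "B", "C"], chain_oracle)
# edges keeps A - B and B - C; the A - C edge is removed with DSep = {B}
```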

For the PC algorithm it is necessary to find probabilistic independences and conditional independences. The calculation of independence between two nodes depends on the chosen representation of distributional semantics.


3.4.1 Co-occurrences

The first representation of distributional semantics relies on the co-occurrence of noun phrases.

As described in section 2.3.1 the probability of a word occurring in an entire corpus is defined as:

P(w) = #w / Σ_i #w_i

whereas the probability of a set of n words co-occurring within a window of k words is defined as:

P(w1...wn)^k = (w1, ..., wn)^k / Σ_{i1} ... Σ_{in} (w_{i1}, ..., w_{in})^k

As discussed in section 2.3.1, the calculation of Σ_{i1} ... Σ_{in} (w_{i1}, ..., w_{in})^k quickly turns intractable for large n. The need for this calculation can be circumvented, however.

Suppose n = 2. To calculate the probability of a word pair w1, w2 co-occurring within a window of k it is then necessary to calculate Σ_i Σ_j (w_i, w_j)^k.

Because the PC algorithm requires conditional probabilities, and not necessarily co-occurrence probabilities, the necessity for this calculation can be negated through an assumption.

If the conditional probability of w1 given w2 is defined as P(w1, w2)^k / P(w2) we have:

P(w1|w2)^k = P(w1, w2)^k / P(w2) = [ (w1, w2)^k / Σ_i Σ_j (w_i, w_j)^k ] / [ #w2 / Σ_i #w_i ]

Here all the occurrences of w2 where w2 occurs alone are also included. It is, however, fairly unlikely for a word to occur alone in a sentence. Therefore, assuming that a word w always occurs with at least one other word in a window of k (i.e. #w = Σ_i (w, w_i)^k) results in:

P(w) = #w / Σ_i #w_i = Σ_i (w, w_i)^k / Σ_i Σ_j (w_i, w_j)^k = Σ_i P(w, w_i)^k


Resulting in:

P(w1|w2)^k = P(w1, w2)^k / P(w2) = P(w1, w2)^k / Σ_i P(w2, w_i)^k = (w1, w2)^k / Σ_i (w2, w_i)^k = (w1, w2)^k / #w2

which is a trivial calculation.

It is also important to consider the conditional probability P(w1, w2|w3)^k. If this probability has to be calculated in the same manner as P(w1|w2)^k, the assumption has to be slightly altered. If it is assumed that every word occurs with at least one other word, then we have:

P(w1, w2|w3)^k = P(w1, w2, w3)^k / P(w3) = P(w1, w2, w3)^k / Σ_i P(w3, w_i)^k = [ (w1, w2, w3)^k / Σ_i Σ_j Σ_l (w_i, w_j, w_l)^k ] / [ Σ_i (w3, w_i)^k / Σ_i Σ_j (w_i, w_j)^k ]

which does not remove the necessity for the computationally expensive calculations.

However, assuming that words always occur with at least two other words within the specified window results in:

P(w1, w2|w3)^k = P(w1, w2, w3)^k / P(w3) = P(w1, w2, w3)^k / Σ_i Σ_j P(w3, w_i, w_j)^k = (w1, w2, w3)^k / Σ_i Σ_j (w3, w_i, w_j)^k = (w1, w2, w3)^k / #w3

which, again, is a trivial calculation.

Because this research is interested in noun phrase hypernymy relations, only noun phrases will be considered relevant. Furthermore, every noun phrase in the corpus will be treated as a single word.

The probability of a noun phrase occurring is then defined as:

P(NP) = #NP / Σ_i #NP_i

Following the assumption that a word never occurs alone in a specified window of k words, this is applied to noun phrases as well.

Therefore the conditional probability of a noun phrase NP1 given another noun phrase or set of noun phrases C is calculated as:

P(NP1|C)^k = (NP1, C1, C2, ..., Cn)^k / (C1, ..., Cn)^k

And for noun phrases NP1, NP2 given the set C:

P(NP1, NP2|C)^k = (NP1, NP2, C1, C2, ..., Cn)^k / (C1, C2, ..., Cn)^k

where (NP1, NP2)^k is the number of co-occurrences of NP1 and NP2 in a window of size k. The counts have been smoothed using Laplace smoothing to negate the effect of zero co-occurrences causing problems in the calculations.

Following section 2.5.1, independence of two noun phrases NP1 and NP2, unconditionally, is determined through the equivalence:

NP1 ⊥⊥ NP2 ⇐⇒ P(NP1|NP2) = P(NP1)

And independence of two noun phrases NP1 and NP2, conditioned on a set of noun phrases C:

NP1 ⊥⊥ NP2 | C ⇐⇒ P(NP1, NP2|C) = P(NP1|C) · P(NP2|C)

Furthermore, when the probability of two noun phrases co-occurring is zero, i.e. they never co-occur in a window of size k, they are said to be independent. Following these definitions of probabilities and independence the PC algorithm as discussed in section 2.4.2 and described in Algorithm 1 can predict a causal model. The PC algorithm has been executed with co-occurrences in a window of size k = 5, k = 10, k = 20 and k = 30.
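A minimal sketch of these window-based counts and the resulting conditional probability P(NP1|NP2)^k = (NP1, NP2)^k / #NP2 is shown below. The token sequence is a toy example in which every token is treated as a noun phrase, and the Laplace smoothing constant is an assumption:

```python
from collections import Counter

def cooccurrences(tokens, k):
    """Count single occurrences and unordered co-occurrences of tokens
    within a window of k tokens (the token itself plus k - 1 followers)."""
    pairs, singles = Counter(), Counter(tokens)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + k]:
            if v != w:
                pairs[frozenset({w, v})] += 1
    return pairs, singles

def p_cond(np1, np2, tokens, k, alpha=1):
    """Laplace-smoothed P(np1 | np2)^k = ((np1, np2)^k + alpha) / (#np2 + alpha * V)."""
    pairs, singles = cooccurrences(tokens, k)
    vocab = len(singles)
    return (pairs[frozenset({np1, np2})] + alpha) / (singles[np2] + alpha * vocab)

tokens = ["iron", "metal", "ore", "metal", "iron", "steel"]
p = p_cond("iron", "metal", tokens, k=3)   # (2 + 1) / (2 + 4) = 0.5
```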

3.4.2 Noun phrase correlation

The second approach towards distributional semantics is to analyse the context of each noun phrase. Here independence is determined through analysis of the (partial) correlation of two noun phrases in terms of the number of times they occur in articles.

As described in section 2.5.2, independence of two variables can be found by determining that the sample (partial) correlation coefficient is not significantly different from 0.

The standard t-test described in 2.5.2 has been utilized to determine if the corre-lation coefficient can be regarded as vanishing.

For this, the performance of hypernymy relation detection using the PC algo-rithm was tested with significance level values α = 0.05, α = 0.005, α = 0.0005, α = 0.00005.


Chapter 4

Results

In the following section the results of both the PC algorithm and the lexico-syntactic method are presented. The section starts with a collection of noteworthy examples of hypernymy relations extracted by the lexico-syntactic and the causal learning method. The relations are represented as graphs.

The section ends with an evaluation of the causal learning method with respect to the lexico-syntactic method and evaluation of both methods separately.

4.1 Extracted relations

What follows is a series of extracted hypernymy relations that show where both methods perform well and where they fall short. Each relation is represented as a graph, where the relations extracted through Hearst patterns are represented as "hyponym" → "hypernym", and for each example a short description is given of what it shows.

Figure 4.1: (a) Hearst; (b) Window k=5

Figure 4.1 shows the extracted hypernymy relation between "iron", "metal" and "hard material", which is extracted correctly by the lexico-syntactic method (figure 4.1a). Although the causal method (figure 4.1b) does connect the nodes in the same manner, the directions of the connections are not correct. The directions would state that iron and hard materials are metals if it were the case that hyponyms cause hypernyms, or that metal is an iron and also a hard material if the causal relationship were the other way around.

Figure 4.2: (a) Hearst; (b) Window k=5; (c) Window k=20

Figure 4.2 shows an instance where the extracted hypernymy relations of the Hearst patterns (4.2a) have a slightly more complex structure than those of the previous example, with bank having two hypernyms and corporation having three hyponyms. As can be seen in figures 4.2b and 4.2c, the causal method clusters the same noun phrases together. However, the connections are mostly incorrect. Moreover, an increase in k leads to an increase in connections between noun phrases. Again the directions have inconsistent implications regarding the supposed causal relation between hyponyms and hypernyms.


Figure 4.3: (a) Hearst; (b) Window k=5; (c) Window k=10,20,30

Figure 4.4: (a) Hearst; (b) Window k=5; (c) Window k=10,20,30


Figures 4.3 and 4.4 both show an example where increasing k in the causal method results in more connections. When k is set to 5 the hypernym is only connected with one of its hyponyms (figures 4.3b, 4.4b), whereas the hypernym is connected to all of its hyponyms when k is higher (figures 4.3c, 4.4c). However, the increase of k also leads to more incorrect connections.

Furthermore, it is noteworthy that the resulting model in figure 4.3c (with k > 5) does not contain the noun phrases "mouth", "pebble" and "balled cotton" whilst the model in figure 4.3b (where k = 5) does.

Figure 4.5: (a) Hearst; (b) Window k=5,10; (c) Hearst; (d) Window k=5,10,20,30

Figure 4.5 shows instances where the causal method extracts the hypernymy relation in a correct manner. However, these two examples contradict each other: the model in figure 4.5b suggests that the causal relationship between hypernym and hyponym should be of the sort hyponym → hypernym, whereas the model in figure 4.5d suggests that this relationship should be of the sort hypernym → hyponym.

Figure 4.6

Figure 4.6 shows hypernymy relations extracted through Hearst patterns that contain meaningful relations (e.g. "a house is a building" and "an apartment is a building") in addition to non-meaningful relations (e.g. "widespread contamination is a building") and relations that can be true but are not always the case, or could be interpreted as true (e.g. "a tall structure is a building" or "a hotel is a need").

Figure 4.7: (a) Hearst; (b) Window k=5; (c) Window k=10

Figure 4.7 shows an example of a causal model that includes more noun phrases in its cluster of dependencies when k is increased: the model with k = 10 (figure 4.7c) adds the noun phrase "a fluorinated hydrocarbon", which is not present when k = 5 (figure 4.7b).

Figure 4.8

Figure 4.8 shows an example where the lexico-syntactic method extracts a non-sensible hypernymy relation due to syntactic ambiguity. Consider the Hearst pattern "NP1 like NP2 and NP3" and the sentence "a combination like ham and egg" (ham and egg being a rather frequent combination). A syntactic method leads to the extraction that both ham and egg are hyponyms of a combination, whilst it is actually the case that ham and egg together form a combination.


Figure 4.9: (a) Hearst; (b) Window k=5; (c) Window k=10,20; (d) Window k=30

Figure 4.10: (a) Hearst; (b) Window k=5; (c) Window k=10

Figures 4.9 and 4.10 show examples where an increase of the window size gradually leads to a more correct model with regard to the lexico-syntactic method, if it is the case that the causal relationship between hyponym and hypernym is of the sort hypernym → hyponym.

Figure 4.10c shows a model that exactly represents the relation extracted by the lexico-syntactic method (figure 4.10a).

Figure 4.11: (a) Hearst; (b) Window k=5; (c) Window k=10

Figure 4.11 shows an example where a window of size k = 5 (figure 4.11b) results in the hypernym and its hyponyms being disconnected in the model. Increasing k (figure 4.11c) adds the hypernym to the cluster of connections.


Figure 4.12: (a) Hearst; (b) Window k=5,20,30. Correct connections if hyponyms cause hypernyms

Figure 4.12 shows an example where the causal method very closely resembles the result of the lexico-syntactic method (figure 4.12a). The causal model (figure 4.12b) contains one connection that is not correct according to the lexico-syntactic method. Furthermore, if it is the case that hyponyms are the cause of hypernyms in the causal model then the connections that are correct are also directed in a correct manner.

Figure 4.13: (a) Hearst; (b) Window k=5; (c) Window k=10; (d) Window k=20

Figure 4.13 shows how an increase of k can lead to the hypernym being connected to most of its hyponyms. However, increasing k too far results in a very different model that is less correct.

Figure 4.14: (a) Hearst; (b) Window k=5

Figure 4.14 shows how the causal method can connect too many noun phrases to the cluster, which (ironically, considering the noun phrases in the model) leads to bad connections.

4.2 Evaluation

To determine how well the causal learning method performs relative to the lexico-syntactic method, the extractions of the lexico-syntactic method were used as a gold standard. Performance was determined according to three criteria:

(1) Clusters: How are the noun phrases connected, directly and indirectly?
(2) Connections: How are the noun phrases directly connected?
(3) Direction: How are the connections directed?

Clusters  Criterion (1) was evaluated by comparing each cluster of connected noun phrases from the lexico-syntactic method with the clusters resulting from the PC algorithm that contain the same (and possibly more) noun phrases. For each of the clusters obtained from the causal learning method, precision was calculated by dividing the number of correct noun phrases in the cluster of connected noun phrases by the total number of noun phrases in the cluster.

Recall was calculated by dividing this number of correct noun phrases by the number of noun phrases in the reference cluster of the lexico-syntactic method. With precision and recall the F1 measure was calculated as:

F1 = 2 · P · R / (P + R)


Where P is precision and R is recall.

For each comparison, precision, recall and the F1 measure were calculated, and for each of these the average over all comparisons was taken.
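The three scores for a single cluster comparison can be computed as follows (the noun phrases in the example are illustrative):

```python
def cluster_scores(found, reference):
    """Precision, recall and F1 of a found cluster against a reference cluster."""
    correct = len(found & reference)
    precision = correct / len(found)
    recall = correct / len(reference)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

reference = {"iron", "metal", "hard material"}
found = {"iron", "metal", "hard material", "bank"}   # one extra, wrong node
p, r, f1 = cluster_scores(found, reference)          # p = 0.75, r = 1.0, f1 ≈ 0.857
```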

Connections To evaluate criterion (2) each connection between noun phrases in the causal model was considered correct if the lexico-syntactic method found that one is a hypernym of the other, regardless of direction.

Precision of the causal learning method was calculated by dividing the number of correct connections by the total number of connections in the causal model. Recall was calculated by dividing this number of correct connections by the total number of hypernymy relations extracted by the lexico-syntactic method.

Again, with precision and recall F1 measure was calculated in the same manner as stated before.

Direction  Evaluation of criterion (3) was conducted in the same manner as the evaluation of criterion (2), except that here the direction of the connection between noun phrases does matter. There are two possibilities regarding the correct direction of hypernymy relationships in a causal model: either a hyponym is a cause of a hypernym, or the other way around. Both directions have been evaluated. In addition to the performance of direction regarding all connections in the causal model, performance has also been determined separately for the connections that were correct according to criterion (2).

The results of the skip-gram window-based approach can be seen in Table 4.1 and the results of the article-based approach can be seen in table 4.2.


Skip-gram window results

K    Cluster                   Connection            Direction (all)       Direction (correct connections)
     avg Prec avg Rec avg F1   Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1
5    0.739    0.937   0.723    0.654  0.783  0.713   0.328  0.477  0.388   0.384  0.477  0.425
                                                     0.326  0.322  0.324   0.374  0.322  0.346
10   0.588    0.965   0.581    0.537  0.880  0.666   0.288  0.606  0.391   0.311  0.606  0.411
                                                     0.249  0.343  0.288   0.277  0.343  0.307
20   0.505    0.961   0.496    0.456  0.848  0.593   0.243  0.571  0.341   0.279  0.571  0.375
                                                     0.212  0.333  0.259   0.262  0.332  0.293
30   0.469    0.969   0.463    0.421  0.861  0.566   0.227  0.583  0.327   0.194  0.333  0.246
                                                     0.247  0.583  0.347   0.227  0.333  0.270

Table 4.1: Results regarding the skip-gram window-based method. For each criterion precision, recall and F1 score are given (the Cluster criterion contains averages of these scores). The Direction criterion is split into the score if hyponyms are the cause of hypernyms (top row) and the other way around (bottom row).

Article correlation results

α     Cluster                   Connection               Direction (all)          Direction (correct connections)
      avg Prec avg Rec avg F1   Prec    Rec     F1       Prec    Rec    F1        Prec   Rec    F1
0.05  0.003    1.0     0.006    0.0011  0.3333  0.0022   0.0006  0.172  0.001     0.001  0.172  0.002
                                                         0.0005  0.139  0.001     0.001  0.139  0.002
0.1   0.003    1.0     0.006    0.0011  0.3120  0.0022   0.0006  0.186  0.0013    0.001  0.186  0.003
                                                         0.0004  0.112  0.0009    0.001  0.111  0.002
0.2   0.003    1.0     0.006    0.0012  0.344   0.0023   0.0006  0.176  0.0012    0.001  0.176  0.002
                                                         0.0006  0.144  0.0011    0.001  0.144  0.002
0.3   0.003    1.0     0.006    0.0013  0.399   0.0026   0.0006  0.195  0.0012    0.001  0.195  0.002
                                                         0.0007  0.169  0.0013    0.001  0.169  0.002

Table 4.2: Results regarding the article-based method. For each criterion precision, recall and F1 score are given (the Cluster criterion contains averages of these scores). The Direction criterion is split into the score if hyponyms are the cause of hypernyms (top row) and the other way around (bottom row).


To determine the performance of the causal learning method and the lexico-syntactic method, precision was calculated by comparing the extracted hypernymy relations to those in WordNet [11]. WordNet is a lexical database containing relations between words in a hierarchical manner; in essence, WordNet is a taxonomy that captures hypernymy, synonymy and meronymy relations of 155,327 words.

WordNet evaluation: precision

                          Single Noun     Full Noun Phrase   Both
Lexico-syntactic          0.176           0.110              0.134
skip-gram window k
  k = 5                   0.083 / 0.046   0.060 / 0.062      0.071 / 0.058
  k = 10                  0.074 / 0.042   0.053 / 0.040      0.066 / 0.051
  k = 20                  0.062 / 0.045   0.053 / 0.037      0.060 / 0.047
  k = 30                  0.054 / 0.042   0.049 / 0.036      0.053 / 0.048
Article occurrence α
  α = 0.05                0.003 / 0.010   0.015 / 0.014      0.015 / 0.013
  α = 0.1                 0.003 / 0.011   0.015 / 0.014      0.015 / 0.013
  α = 0.2                 0.003 / 0.010   0.015 / 0.014      0.015 / 0.013
  α = 0.3                 0.003 / 0.009   0.015 / 0.014      0.015 / 0.013

Table 4.3: For the causal methods, each cell lists precision for both tested directions (hyponym → hypernym / hypernym → hyponym).

To determine the quality of the extractions, for each extraction (Hypo, Hyper) it has been checked whether Hyper is a direct or inherited hypernym of Hypo in WordNet. If that is the case, the extraction is considered a true positive; otherwise it is a false positive.
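The direct-or-inherited check amounts to a transitive-closure lookup over the hypernym relation. A minimal sketch of this check follows, using a hypothetical toy taxonomy as a stand-in for WordNet (in practice, NLTK's WordNet interface exposes the hypernym relation for this purpose); all names in the toy data are made up for illustration.

```python
# Sketch of the WordNet-style evaluation: an extraction (hypo, hyper) counts as a
# true positive when `hyper` is a direct or inherited (transitive) hypernym of `hypo`.
# TOY_TAXONOMY is a hypothetical stand-in for WordNet's hypernym relation.

TOY_TAXONOMY = {            # child -> list of direct hypernyms
    "poodle": ["dog"],
    "dog": ["canine"],
    "canine": ["mammal"],
    "mammal": ["animal"],
    "cat": ["feline"],
    "feline": ["mammal"],
}

def is_hypernym(hypo: str, hyper: str) -> bool:
    """Return True if `hyper` is a direct or inherited hypernym of `hypo`."""
    frontier = list(TOY_TAXONOMY.get(hypo, []))
    seen = set()
    while frontier:
        node = frontier.pop()
        if node == hyper:
            return True
        if node not in seen:
            seen.add(node)
            frontier.extend(TOY_TAXONOMY.get(node, []))
    return False

if __name__ == "__main__":
    print(is_hypernym("poodle", "animal"))  # inherited hypernym -> True
    print(is_hypernym("dog", "cat"))        # unrelated -> False
```

The `seen` set guards against cycles, which cannot occur in a well-formed taxonomy but can appear in noisy extracted relations.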

For the causal networks resulting from the PC algorithm, the hypernymy relations can be directed in two ways: either hyponyms depend on hypernyms (hyper → hypo), or hypernyms depend on hyponyms (hypo → hyper). Both directions have been tested. Precision was calculated as the number of true positives divided by the total number of extracted hypernymy relations.
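Evaluating both directions only requires scoring the edge set twice, once as extracted and once with every edge flipped. A minimal sketch, with made-up gold and extracted pairs for illustration:

```python
# Sketch: precision of directed extractions under both orientation conventions.
# `edges` stands for directed pairs from a learned causal network; `gold` holds
# gold-standard (hyponym, hypernym) pairs. All data here is made up.

def precision(extracted, gold):
    """True positives / total extractions (0.0 if nothing was extracted)."""
    if not extracted:
        return 0.0
    tp = sum(1 for pair in extracted if pair in gold)
    return tp / len(extracted)

gold = {("dog", "animal"), ("cat", "animal")}
edges = [("dog", "animal"), ("dog", "plant")]

# Convention 1: an edge (a, b) is read as hyponym -> hypernym.
p_hypo_to_hyper = precision(edges, gold)
# Convention 2: the reverse reading (hyper -> hypo), so flip each edge first.
p_hyper_to_hypo = precision([(b, a) for a, b in edges], gold)

print(p_hypo_to_hyper, p_hyper_to_hypo)  # -> 0.5 0.0
```

Since flipping the edges changes which pairs match the gold set, the two conventions generally yield different precision values, as in the tables above.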

Furthermore, precision was calculated separately for noun phrases that consist of single nouns, noun phrases that consist of nouns and adjectives, and both combined, to test whether adjectives significantly alter the results. The results can be seen in table 4.3.


Chapter 5

Discussion & conclusion

The results show that the PC algorithm does not extract the same hypernymy relations as the lexico-syntactic method. Even though the PC algorithm, when using a skip-gram-window-based approach, generally puts the correct noun phrases together in clusters of dependencies, with average recall over clusters ranging between 0.937 and 0.969 (see table 4.1), these clusters often contain more noun-phrase dependencies than they should according to the lexico-syntactic method: average precision over all clusters ranges between 0.739 and 0.469 (see table 4.1). Moreover, an increased window size results in higher recall, because more noun phrases can co-occur within a larger window, which yields more dependencies. However, this increase in dependencies also causes nodes that should be considered independent of the nodes in the cluster to be added to the cluster, drastically decreasing precision as the window size grows (dropping from 0.739 to 0.588 when the window is increased by 5; see table 4.1). A clear example of high recall and low precision, and of how increasing the window size k can decrease precision further, is observable in figure 4.7.

While the clusters of dependencies generally contain the same noun phrases as the lexico-syntactic method, the cluster of hypernymy relations itself is often not structured as the lexico-syntactic method dictates. Take figure 4.3 and figure 4.4, for instance: when the window is set to 5, both examples show the hypernym connected to only one hyponym, while the remaining hyponyms are connected to each other.

Furthermore, from the directions of these connections it is difficult to determine whether a hyponym should be the cause of a hypernym or the other way around. Take for instance figures 4.5b and 4.5d: both models show clusters where the nodes are connected as the lexico-syntactic method dictates. However, if it were the case that a hyponym should be the cause of a hypernym, then the model in
