Improving a part-of-speech tag set for Dutch

Martijn Wardenaar (10348190)
Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors
Dr. Stella Frank and Dr. Ivan Titov
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

This project aims to find methods for automatically merging tags from a fine-grained tag set. Although the main focus was on the Dutch language, we have not used any language-specific tweaks, meaning that the methods can be used for other languages as well. To ensure that the testing environment remains constant while the tag set changes with every test, we have used a mapping to universal tags (Petrov et al., 2012) during evaluation. Two methods for clustering are tested and evaluated using two kinds of taggers. We have implemented a hierarchical clustering algorithm and a merge/split algorithm and evaluated the methods using a simple HMM tagger and the TnT tagger (Brants, 2000). We have found that the hierarchical clustering algorithm is able to increase the performance of both taggers; however, the results were better for the HMM tagger (0.81 percentage points) than for the TnT tagger (0.15 percentage points). The results of the merge/split algorithm were even better for the HMM tagger (1.82 percentage points), but did not offer any improvement for the TnT tagger. We have found that the optimal method and parameters depend on the tagger which is used. We have also found that some tags, such as nouns and verbs, are better suited for merging than others, such as determiners.


Contents

1 Acknowledgements
2 Introduction
3 Related Literature
4 Method
   4.1 Preprocessing The Alpino Treebank
   4.2 The Taggers
      4.2.1 Simple HMM tagger
      4.2.2 TnT tagger
5 Hierarchical Clustering
   5.1 The Context Vectors
   5.2 Metrics
   5.3 Methods
   5.4 Merging
6 Merge/Split
   6.1 Merge
   6.2 Split
   6.3 Parameters
   6.4 Likelihood
7 Results
   7.1 Evaluation
   7.2 Hierarchical Clustering
   7.3 Merge/Split
   7.4 Tag sets
8 Discussion
9 Conclusion
10 Future Work


1 Acknowledgements

Before we start this paper, there are some people I want to thank for making this project possible and helping me throughout the thesis.

I will start with Dr. Stella Frank, who has supervised this project. She has guided me through this research both with her knowledge and with her support during our regular meetings. Her ideas have helped me to carry out my research and to get the thoughts in my head into the computer. Aside from her ideas and knowledge, her optimism and patience have helped me even when I failed to see how to proceed. I have no doubt that I have tested her patience with the enormous amount of emails filled with questions and results, but her feedback has always been informative and kind in tone.

I would also like to thank my faculty supervisor Dr. Ivan Titov, without whom this research could not have taken place at all, and whose course at the University of Amsterdam inspired my interest in this field. I would also like to thank Dr. Raquel Fernandez for her help in finding supervisors.

Also many thanks to my parents and my girlfriend for supporting me throughout my studies, and during the thesis in particular. Their words of encouragement and their patience (which I must have challenged on more than one occasion with my rambling) have been very helpful.

Last but not least I thank my friends and fellow students at the university. Our group is always there to help each other, be it with infuriating bugs, mind-boggling errors or just having somebody to complain to when things do not go as we hoped. The three years of studying for my bachelor's have shown me that collaboration with colleagues can teach you at least as much as any book.


2 Introduction

Computer programs which have the ability to understand natural language, or even to communicate using natural language, will have a great impact on the way computers are used in everyday life. For years, companies such as Google and Apple have been working on smart assistants which help their customers by answering queries posed in natural language by means of a keyboard or even voice recognition (Apple's Siri, for instance). However, natural language processing has more applications than just the personal assistants on phones, tablets and computers. Other tasks, such as machine translation, also require the computer to have a certain amount of knowledge about natural language.

Part-of-speech tagging, a method which is often used in natural language processing, is able to learn certain aspects of language by itself and to use this information for processing new text. The idea of part-of-speech tagging is to assign part-of-speech tags (annotations which hold information about a word and its place in the sentence) to the words in a piece of text. This information often consists not only of the part of speech, but also of other properties such as the cardinality, gender or person of the word. Examples of the statistics such algorithms use to understand language and annotate text are transition probabilities (the probability of a tag given the previous tags) and emission probabilities (the probability of a word given a tag). The tagger learns these probabilities by iterating over the training set, storing the transition probabilities it derives from the order of the tags it observes and the emission probabilities it derives from the combinations of words and tags. The collection of part-of-speech tags used in a corpus (a large text annotated with part-of-speech tags) is called the tag set. This tag set can be as detailed or fine-grained as the creator of the corpus desires, ranging from merely a dozen to hundreds of tags. The program which is trained on the data to build a model and performs the task of annotating the text is called a tagger.
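To make this concrete, the following sketch (our illustration, not part of any particular tagger) counts tag trigrams and word/tag pairs in a toy annotated corpus and turns them into relative frequencies:

from collections import Counter

# Toy annotated corpus: each sentence is a list of (word, tag) pairs.
corpus = [[("de", "Det"), ("kat", "N"), ("slaapt", "V")],
          [("een", "Det"), ("hond", "N"), ("blaft", "V")]]

trigrams, bigrams = Counter(), Counter()
emissions, tag_counts = Counter(), Counter()

for sentence in corpus:
    tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
    for word, tag in sentence:
        emissions[(word, tag)] += 1
        tag_counts[tag] += 1
    for t1, t2, t3 in zip(tags, tags[1:], tags[2:]):
        trigrams[(t1, t2, t3)] += 1
        bigrams[(t1, t2)] += 1

# Transition probability P(t3 | t1, t2) and emission probability P(w | t).
p_trans = {t: n / bigrams[t[:2]] for t, n in trigrams.items()}
p_emit = {wt: n / tag_counts[wt[1]] for wt, n in emissions.items()}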

This paper describes a project conducted with the aim of optimizing a tag set for Dutch. We propose two methods for optimizing the tag set used in the Alpino Treebank (see section 4.1) by merging tags from a very fine-grained tag set. The first approach uses the information from the model created from the training set and merges tags based on the frequencies with which they are found with other tags. This method uses agglomerative hierarchical clustering to find tag sets which increase the overall accuracy of the tagger. Section 5 explains the agglomerative hierarchical clustering algorithm and its implementation in this research in more detail. The second approach is based on the idea that a global maximum can be found by taking enough random steps while keeping the possibility to also take steps back. This idea was inspired by, though in many ways different from, the methods used by Petrov and Klein (2007). The general idea is to merge random tags and see whether or not the new tag set leads to a better fit, keeping the merges which do while discarding the ones which do not. A detailed description of this approach can be found in section 6.

The accuracy of tagging Dutch texts is already high (see section 4.2); nevertheless, we expect to be able to increase this accuracy even further without resorting to any language-specific tweaks. Since no language-specific information is used, the concepts described in this paper may be applied to any other language as well, provided that a corpus annotated with a fine-grained tag set is available for that language. Finding an ideal tag set is unachievable in itself, since the composition of an ideal tag set depends on the tagger which is used (as will be shown in this paper) as well as on the task the annotation is used for. For some tasks, being able to identify whether a word is a noun may be sufficient, whereas for other tasks knowledge about what specific kind of noun a certain word is might be crucial. Therefore, the results of this paper do not consist of a definitive tag set which is "ideal" for the Dutch language, but rather of a model for optimizing a tag set for annotating text with different taggers.

3 Related Literature

Part-of-speech tagging has played a large role in the field of natural language processing for quite some time, and much research has been conducted and many papers written on the subject. We use two different kinds of taggers during this project to ensure that the results of the approaches are not specific to a single tagger. One tagger which is used often is the TnT tagger (Brants, 2000). This tagger uses interpolation and a beam search to find the best tag sequence given an unlabeled sentence and a model trained on an annotated corpus. It will be used during this research and is discussed in more detail in section 4.2.2. The main objective of this research is to find an algorithm which can automatically improve a tag set by making it fit the data better. The other tagger used during this research is a simple trigram tagger without interpolation or any form of the Viterbi algorithm. This tagger (described in detail in section 4.2.1) uses the same suffix tagger which is used in the TnT tagger. The HMM tagger uses simplified Good-Turing smoothing (Gale and Sampson, 1995), which is based on the work of I. J. Good. This smoothing technique is used to smooth the probabilities of events with a very small frequency, and to estimate a tag for an unknown word when the suffix tagger fails to do so.


Petrov and Klein (2007) conducted research on improving a probabilistic context-free grammar (PCFG) by finding latent tags using a split/merge algorithm, and Huang et al. (2009) conducted research on improving a simple bigram HMM tagger by finding latent tags. Latent tags are tags which are not in the tag set, but are obtained by splitting tags (the resulting tag set will therefore be larger than the initial tag set). In both papers the EM algorithm was used to estimate the new transition and emission probabilities for the newly found tags. In the research described in this paper the initial tag set is already very detailed; the goal is therefore to merge tags to obtain a smaller tag set which provides better performance. However, the methods described by Petrov and Klein (2007) and Huang et al. (2009) are interesting to study, since knowledge of how tags can be split may prove helpful in finding methods to automatically merge them. The merge/split algorithm described in this research, for instance, is inspired by the split/merge algorithm of Petrov and Klein.

Petrov et al. (2012) presented a universal part-of-speech tag set. This tag set contains tags which are supposed to be present in almost every language, and could thus be an important element of machine-translation-related tasks such as cross-lingual parsing. They found that training on a language-specific tag set, tagging a corpus and then mapping it to universal tags improved the accuracy on the test set considerably. Using the Alpino treebank annotated with the coarse tags (twelve tags) and the TnT tagger, they managed to increase the accuracy from 93% to 95% on a Dutch train and test set. In this research the Alpino treebank is used as well (see section 4.1); however, the fine-grained tag set is used (203 tags), which will be the set used for training and eventually merging. After a new tag set has been found, the tagger tags a test set to obtain a predicted test set, which is then mapped to the universal tag set using the mapping described by Petrov et al. The accuracy on this mapped (universal) test set is then used for evaluation by comparing it to a test set which was annotated with universal tags by hand. This evaluation method ensures that the testing situation is the same throughout all tests, making the results more reliable.

4 Method

4.1 Preprocessing The Alpino Treebank

The Alpino treebank is a collection of sentences, divided into a training set and a test set, which are annotated with part-of-speech tags as well as information about the sentence tree. The corpus was created at the Rijksuniversiteit Groningen and introduced by Van der Beek et al. (2002); it is part of the CoNLL-X shared task (Buchholz and Marsi, 2006). Each line consists of one word and all information about this word: the position of the word in the sentence, the word itself, the lemma, the part-of-speech tag, the features belonging to the part-of-speech tag and some information about the sentence tree. The information about the sentence tree is not important for this research and is therefore removed from the file. The tags and features are merged into one new tag, and since there are multiple feature values for each tag, this results in 203 tags instead of the 13 base tags (including a tag for start and end symbols). The words are placed in sentences, removing all information apart from the tags concatenated with the features. The entire corpus is then converted to lowercase, and words which were concatenated by means of an underscore are separated. These concatenated words are annotated with individual tags and features, which makes the separation an easy task. The last step is to add start and end symbols, which are used by the HMM tagger. After preprocessing, the training set consists of roughly 13,500 sentences and the test set of 380 sentences. We found that the ambiguity in the data is very low: approximately 1.1 tags per word in the training set and 1.15 tags per word in the test set.

An example of a sentence in the original Alpino Treebank:

1 In in Prep Prep voor 8 mod
2 dat dat Pron Pron aanw|neut|attr 3 det
3 platform platform N N soort|ev|neut 1 obj1
4 zullen zal V V hulp|ott|1of2of3|mv 0 ROOT
5 alle alle Pron Pron onbep|neut|attr 6 det
6 JGZ-disciplines JGZ-disciplines N N soort|mv|neut 8 su
7 vertegenwoordigd vertegenwoordigd V V trans|verldw|onverv 8 predc
8 zijn ben V V hulpofkopp|ott|1of2of3|mv 4 vc
9 . . Punc Punc punt 8 punct

The same sentence after preprocessing:

<@S in@Prep(voor) dat@Pron(aanw|neut|attr) platform@N(soort|ev|neut) zullen@V(hulp|ott|1of2of3|mv) alle@Pron(onbep|neut|attr) jgz-disciplines@N(soort|mv|neut) vertegenwoordigd@V(trans|verldw|onverv) zijn@V(hulpofkopp|ott|1of2of3|mv) .@Punc(punt)
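A minimal sketch of this preprocessing step, assuming CoNLL-X-style tab-separated columns (index, word, lemma, coarse tag, fine tag, features, head, relation); the function name, the exact column indices and the simplified handling of underscore-concatenated words are our assumptions:

def preprocess_sentence(conll_lines):
    """Turn CoNLL-X lines into a lowercased word@Tag(features) string."""
    tokens = []
    for line in conll_lines:
        cols = line.rstrip("\n").split("\t")
        word, tag, feats = cols[1].lower(), cols[3], cols[5]
        # Underscore-concatenated words carry individual annotations in the
        # corpus, so they can be split; here we simply reuse the same tag
        # for every part as a simplification.
        for part in word.split("_"):
            tokens.append(f"{part}@{tag}({feats})")
    return "<@S " + " ".join(tokens)  # start symbol as in the example above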


4.2 The Taggers

During this project we compare two taggers to test our methods. The first tagger is a very simple second-order Hidden Markov Model tagger, extended only with Good-Turing smoothing and a suffix tagger for unknown words. The other tagger is the TnT tagger (Brants, 2000) as implemented in the NLTK toolkit for Python.

4.2.1 Simple HMM tagger

The HMM tagger which was implemented for this project is a second-order Hidden Markov Model tagger. It does not make use of the Viterbi algorithm or a beam search during annotation, nor does it interpolate between orders of the Hidden Markov Model. It calculates the emission probabilities, which are the relative frequencies of word/POS pairs, and the transition probabilities, which are the relative frequencies of trigrams found in the training corpus. It uses Good-Turing smoothing (Gale and Sampson, 1995) to estimate a probability for rare and unseen events. The idea behind this smoothing method is to estimate the probability of events (in this case word/POS pairs and n-grams) which were seen N times by looking at the events which were seen N + 1 times. The probability for unseen words, for instance, is calculated as N1 / N, where N1 is the count of the singletons (events which occurred only once) and N is the total number of words. This Good-Turing discounting was used for events seen up to 4 times for the transition probabilities. The emission probabilities are smoothed separately for every tag, for events which occurred never or only once.

The tagger was extended with a suffix tagger, which tries to predict a tag for an unknown word. If it fails to do so, either because the word is too short or because the suffix is also unknown, the tag is estimated using the regular method of this tagger, with the use of Good-Turing smoothing.

To calculate a tag for a certain word in the test data, the algorithm loops over all possible tags and calculates a probability. This calculation is simply a multiplication of the emission probability (the probability of the word given this particular tag) with the transition probability (the probability of this particular tag following the previous tags):

P(ti | wi, ti−2, ti−1) = P(wi | ti) × P(ti | ti−2, ti−1)

The probabilities for every tag are stored in a dictionary, and in the end the tag with the highest probability is chosen.
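A sketch of this scoring loop; p_emit and p_trans stand for the smoothed probability tables obtained during training and are assumed to exist:

def best_tag(word, t_prev2, t_prev1, tags, p_emit, p_trans):
    """Pick the tag maximizing P(w | t) * P(t | t_prev2, t_prev1)."""
    scores = {}
    for t in tags:
        emission = p_emit.get((word, t), 0.0)             # smoothed in practice
        transition = p_trans.get((t_prev2, t_prev1, t), 0.0)
        scores[t] = emission * transition
    return max(scores, key=scores.get)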

The reason this tagger is used even though it is very (in some cases too) simplistic, is that it is incredibly fast. The lack of a forward-backward algorithm such as the Viterbi algorithm makes this tagger very fast, and therefore well suited for tasks where the tagger is run many times, as we will see in section 5. Though the tagger performs very well on the data used during this project, it might not be a suitable tagger for other data sets and other languages due to its simplicity.

Tag set                  Accuracy
Fine-grained (203 tags)  91.90
Coarse (13 tags)         91.24

Table 1: Accuracy of the simple HMM tagger on the Alpino test set using the original fine-grained tag set and the coarse tag set (see section 7.1 for the evaluation method)

4.2.2 TnT tagger

The TnT tagger is based on the work of Brants (2000). For this research we used the implementation from the NLTK package for Python. This tagger uses interpolation of multiple orders of the Hidden Markov Model to calculate the probability of the tags. The formula for the interpolation is:

P (t3|t1, t2) = λ1P (t3) + λ2P (t3|t2) + λ3P (t3|t1, t2)

where the λ's are calculated as follows:

for each trigram (t1, t2, t3) in the training data:
    tri = (f(t1, t2, t3) − 1) / (f(t1, t2) − 1)
    bi  = (f(t2, t3) − 1) / (f(t2) − 1)
    uni = (f(t3) − 1) / (N − 1)
    if tri > uni and tri > bi: λ3 += 1
    if bi > uni and bi > tri: λ2 += 1
    if uni > bi and uni > tri: λ1 += 1
normalize the λ's
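A runnable version of this deleted-interpolation procedure might look as follows; the dictionary-based frequency tables are our simplification, and ties are ignored exactly as in the pseudocode above:

def estimate_lambdas(f_tri, f_bi, f_uni, n_tokens):
    """f_tri/f_bi/f_uni map tag trigrams/bigrams/unigrams to counts."""
    lam1 = lam2 = lam3 = 0.0
    for (t1, t2, t3), count in f_tri.items():
        tri = (count - 1) / (f_bi[(t1, t2)] - 1) if f_bi[(t1, t2)] > 1 else 0.0
        bi = (f_bi[(t2, t3)] - 1) / (f_uni[t2] - 1) if f_uni[t2] > 1 else 0.0
        uni = (f_uni[t3] - 1) / (n_tokens - 1)
        if tri > uni and tri > bi:
            lam3 += 1
        elif bi > uni and bi > tri:
            lam2 += 1
        elif uni > bi and uni > tri:
            lam1 += 1
    total = (lam1 + lam2 + lam3) or 1.0  # normalize the lambdas
    return lam1 / total, lam2 / total, lam3 / total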

The TnT implementation in the NLTK package for Python does not handle unknown words at all. It is possible to supply a separate tagger to handle unknown words, for which we chose the suffix tagger which is usually used with the TnT tagger. This suffix tagger is not always able to produce a tag, so a backoff tagger was used to handle the remaining words. This backoff tagger is a default tagger, meaning that it always returns the same tag, namely the tag found most often with words which occur only once. The tagger uses a beam search, which is a faster alternative to the Viterbi algorithm. The tagger used in this research differs from the original TnT tagger in one way: the tag of an unknown word is predicted by the suffix tagger (or default tagger) without any transition probabilities. Tests revealed that this did not lead to any decrease in performance, which is further explained in the discussion at the end of this paper.

The reason this tagger is used is that it is a very standard tagger, used very often due to its good performance and ease of use. It is also used by Petrov et al. (2012), making it possible to compare the results from this paper to their research. This tagger is considerably slower than the HMM tagger discussed previously, but its accuracy is also higher. Notice how the accuracy over a corpus annotated with the fine-grained tag set is higher than the accuracy over the corpus annotated with the coarse tag set.

Tag set                  Accuracy
Fine-grained (203 tags)  95.04
Coarse (13 tags)         92.24

Table 2: Accuracy of the TnT tagger on the Alpino test set using the original fine-grained tag set and the coarse tag set (see section 7.1 for the evaluation method)
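For reference, the tagger combination described above can be assembled from NLTK roughly as follows; the toy training data, the affix length and the default tag are placeholders rather than the exact values used in this project:

from nltk.tag import AffixTagger, DefaultTagger, tnt

# Toy stand-in for the preprocessed training data.
train_sents = [[("de", "Art(bep|zijdofmv|neut)"),
                ("kat", "N(soort|ev|neut)"),
                ("slaapt", "V(intrans|ott|3|ev)")]]

default = DefaultTagger("N(soort|ev|neut)")         # placeholder default tag
suffix = AffixTagger(train_sents, affix_length=-3, backoff=default)
tagger = tnt.TnT(unk=suffix, Trained=True)          # unk tagger is pre-trained
tagger.train(train_sents)
print(tagger.tag(["de", "kat", "slaapt"]))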

5 Hierarchical Clustering

The first method for merging tags described in this paper is hierarchical clustering. The goal is to find a tag set which offers optimal accuracy by merging tags from a fine-grained tag set into a smaller tag set which results in better performance. Hierarchical clustering is a method of analysis which looks at the distance between points or vectors and creates clusters based on this analysis (Johnson, 1967). In this setting, each tag is represented by a context vector, and there are three ways of calculating these vectors, which are described in section 5.1. The hierarchical clustering method used here is an agglomerative algorithm, meaning it works bottom-up. To cluster the tags, a distance matrix is created from the vectors, which is later used to merge the tags based on the distance between them. There are multiple metrics for calculating the distance between the vectors and multiple ways of finding clusters from the distance matrix; a total of six combinations are tested. During this project the hierarchical clustering implemented in the scipy package for Python was used. The resulting tag sets depend on the tagger used during the algorithm, since the results of the taggers on a development set are used to find the optimal threshold (section 5.4).

5.1 The Context Vectors

In this research, three methods of creating the vectors have been used, which are represented in table 3. The first vector representing a tag consists of all permutations of three tags with the represented tag as the last tag. The second vector consists of all permutations of three tags with the represented tag as the middle tag. The last vector is twice as long as the previous vectors, and consists of all permutations of three tags with the represented tag as the last tag together with all permutations of three tags with the represented tag as the first tag.

(a) Vector 1      (b) Vector 2      (c) Vector 3

Tag1 Tag1 Tag     Tag1 Tag Tag1     Tag1 Tag1 Tag
Tag1 Tag2 Tag     Tag1 Tag Tag2     Tag Tag1 Tag1
...               ...               ...
Tagn Tagn Tag     Tagn Tag Tagn     Tagn Tagn Tag
                                    Tag Tagn Tagn

Table 3: (a) Vector with the tag it represents as the last tag in a trigram, (b) vector with the tag it represents as the middle tag in a trigram, (c) vector with the tag it represents as the first and the last tag in a trigram

Each entry holds a relative frequency: the number of times that trigram is found with the tag the vector describes, calculated by the following formula:

f = Count(trigram) / Count(tag)

In this formula, Count(tag) is equal to the sum of all counts in the vector and represents the number of times the tag has been seen overall. These vectors are combined into a matrix, which is then used for the hierarchical clustering described in the rest of this section.
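A sketch of how the first vector type might be built; trigram_counts is assumed to be a dictionary of tag-trigram counts as in section 4.2.1:

import numpy as np

def context_vectors_last(tags, trigram_counts):
    """One row per tag: relative frequencies of trigrams (t1, t2, tag)."""
    contexts = [(a, b) for a in tags for b in tags]
    index = {pair: i for i, pair in enumerate(contexts)}
    matrix = np.zeros((len(tags), len(contexts)))
    for row, tag in enumerate(tags):
        for (t1, t2, t3), count in trigram_counts.items():
            if t3 == tag and (t1, t2) in index:
                matrix[row, index[(t1, t2)]] = count
        total = matrix[row].sum()
        if total > 0:
            matrix[row] /= total   # f = Count(trigram) / Count(tag)
    return matrix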


5.2 Metrics

The two metrics which are used for calculating the distance matrix are the euclidean distance and the cosine distance.

The euclidean distance between vectors V and W is calculated by:

√( Σ i=1..n (Vi − Wi)² )

which is equal to the vector representation:

√( (V − W) · (V − W) )

The cosine distance is measured by the formula:

cos(θ) = (V · W) / (||V|| ||W||)

The difference between these two ways of calculating distances is best explained by picturing the vectors as actual vectors in an n-dimensional space. The cosine distance measures the difference in angle between two vectors, but does not take the length of the vectors into account at all. The euclidean distance treats the vectors as points and calculates the distance between these points, which means that vectors which have the same angle but different lengths would not be matched correctly. To prevent this error, all vectors are normalized to unit length.

5.3 Methods

The three methods which are used to build clusters from the distance matrix are single, average and complete linkage. The difference between these methods is the way the new distance is calculated when clusters are merged: the single method chooses the smallest distance, the average method calculates the average, and the complete method chooses the maximum distance. The difference between the methods is illustrated in figure 1.


Figure 1: Different clustering methods
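With scipy, one of the six tested combinations (here complete linkage with the cosine distance) can be run roughly as follows; the small random matrix stands in for the real context-vector matrix:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

matrix = np.random.rand(10, 50)          # placeholder for the context vectors

# Normalize rows to unit length (see section 5.2).
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
unit = matrix / np.where(norms == 0.0, 1.0, norms)

# Agglomerative clustering; 'method' and 'metric' select one combination.
tree = linkage(unit, method="complete", metric="cosine")

# Cut the tree at a threshold from the grid described in section 5.4.
clusters = fcluster(tree, t=0.3, criterion="distance")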

5.4 Merging

The result of the hierarchical clustering is a tree in which tags are clustered until only one cluster (containing all tags) is left. This tree consists of the clustered tags and the level at which two tags are clustered, which represents the similarity of the tags. To create a new tag set from this tree, a threshold is needed which determines up to what level tags may be merged. To limit the merging, and to make sure that the algorithm does not merge two completely different tags because of a coincidental similarity in the development set, we chose to merge tags only if their coarse tags are the same. The algorithm used two methods for choosing the optimal threshold on a grid of thresholds. The first method calculates an accuracy for all possible thresholds and chooses the highest (argmax). The second method iterates through the grid, increasing the threshold and calculating an accuracy until the accuracy drops or stays the same. The grid which was used is: [0.05, 0.1, 0.15, ..., 0.9].

To calculate the accuracy, the algorithm first determines the new tag set using a threshold from the grid. It then creates a mapping from the old tag set to the new tag set; this mapping is a dictionary containing every tag which ought to be mapped and the tag it is to be mapped to. When the best threshold has been found, the training set and test set are mapped to the new tag set. The new tag set is evaluated by evaluating the tagger on the new train and test sets.
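The argmax variant of the threshold search might be sketched like this; evaluate is a stand-in for mapping the corpus, retraining the tagger and measuring development accuracy:

from scipy.cluster.hierarchy import fcluster

def best_threshold(tree, evaluate):
    """Return the cut level with the highest development accuracy."""
    grid = [round(0.05 * i, 2) for i in range(1, 19)]   # 0.05, 0.10, ..., 0.90
    scored = []
    for t in grid:
        clusters = fcluster(tree, t=t, criterion="distance")
        scored.append((evaluate(clusters), t))
    return max(scored)[1]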


6 Merge/Split

The aim of the merge/split algorithm is to merge tags randomly to find an optimal tag set, while having the possibility to split tags again to escape local maxima. The merge/split algorithm iterates until a flag is raised telling it to stop. At each iteration the algorithm attempts to merge two tags within a coarse tag (concatenating their features), and after a certain number of successful merges it splits one random tag into two tags. The split step is meant to reduce the possibility of the algorithm getting stuck in a local maximum. In this section the merge and split steps are explained in more detail, and the parameters of the algorithm are explained along with the choices made during the project. The tag sets resulting from the merge/split algorithm are independent of the taggers, since the method does not use the accuracy on a development set during the merging.

6.1 Merge

The merging of tags, which happens in every iteration of the algorithm, relies on two random steps. First, one random coarse tag is selected from a list containing all coarse tags present in the training corpus. This coarse tag is then passed to another function which picks two random combinations of feature values from a dictionary holding all combinations of feature values for every coarse tag. This dictionary is updated after every successful merge and split, to make sure that newly merged tags can be merged with other tags and that tags which have been merged cannot be merged again under their old names. After the tags have been merged, the training corpus is mapped to the new tag set (a temporary mapping containing all accepted merges plus the merge which has just been made) to be evaluated. To evaluate the new tag set the algorithm calculates the log likelihood (see section 6.4) of the training corpus and compares it to the best log likelihood seen so far. If the new log likelihood is acceptable (see section 6.3), the merge is accepted and the temporary mapping is kept for the remainder of the iterations.

6.2 Split

The method for splitting tags is, like the merging, a random process which is evaluated with the log likelihood. When the program has decided it is time to do a split (see section 6.3), a random mapping is chosen from the set of mappings made so far, and one feature combination is removed from this mapping at random. If the original mapping consists of a merged tag which has only two feature combinations, the two new tags both get their original feature combination back. If the merged tag has more than two feature combinations (the result of a merged tag being merged again), only one feature combination is extracted from the merged tag, resulting in two tags: the first with only one feature combination and the second with the remaining feature combinations.

6.3 Parameters

This algorithm has far fewer parameters than the hierarchical clustering algorithm; in fact there are only three. The first parameter is the offset, which decides whether the change in log likelihood achieved by a merge is sufficient to accept it. The offsets which were tested are -50, -10, 20, 50, 100 and 150. A negative offset means that the log likelihood must have improved by at least the absolute value of the offset, while a positive value means that the merge may have led to a decrease within that margin. The second parameter is the stop parameter, which decides when the algorithm has performed enough iterations and may stop to calculate the results (we have tested 15, 30 and 60). After each iteration the stop parameter is compared to a stop value, which is composed of a weak-stop and a strong-stop count and is calculated by the following formula:

Stop = (0.5 × weak stop) + strong stop

The weak-stop value is incremented when a merge does not lead to an increase in the log likelihood but leads to a decrease which is within the margin given by the offset. The strong-stop value is incremented every time a merge leads to a decrease larger than the offset and is therefore discarded, so the minimum number of times the algorithm tries to merge tags is the value of the stop parameter. When a merge has been successful and has been accepted, both the weak-stop and the strong-stop value are reset to 0. The last parameter is the split parameter, which decides when a random tag needs to be split; we used the value 5 during this project. Every time a merge is successful a counter is incremented; when the counter equals the split parameter, a split is performed and the counter is reset to 0.
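The acceptance and stopping criteria can be written down compactly; these two helpers are our reading of the description above:

def merge_accepted(new_ll, best_ll, offset):
    """offset < 0 demands an improvement of at least |offset|;
    offset > 0 tolerates a decrease of at most offset."""
    return new_ll - best_ll >= -offset

def should_stop(weak_stop, strong_stop, stop_param):
    """Stop once (0.5 * weak_stop) + strong_stop reaches the stop parameter."""
    return 0.5 * weak_stop + strong_stop >= stop_param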

6.4 Likelihood

To evaluate a merge or a split, the entire training corpus is mapped to the new tag set. A model is then trained on the mapped training set, and the log likelihood is calculated by going over the entire training set and computing the probability of each tag given the word and the previous tags. The log likelihood is calculated by the following formula:

Σ i=0..N [ log P(ti | ti−2, ti−1) + log P(ti | wi) ]

where N is the number of words in the training set. Logarithms are summed rather than probabilities multiplied, since multiplying the probabilities may lead to underflow due to their small size.
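A sketch of this computation, assuming smoothed lookup tables p_trans and p_cond trained on the mapped corpus; the probability floor avoids taking the log of zero and is our addition:

import math

def log_likelihood(tagged_tokens, p_trans, p_cond, floor=1e-12):
    """Sum log P(ti | ti-2, ti-1) + log P(ti | wi) over the training set."""
    total = 0.0
    t_prev2, t_prev1 = "<s>", "<s>"
    for word, tag in tagged_tokens:
        trans = p_trans.get((t_prev2, t_prev1, tag), floor)
        cond = p_cond.get((tag, word), floor)
        total += math.log(trans) + math.log(cond)
        t_prev2, t_prev1 = t_prev1, tag
    return total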

7 Results

This section presents the results of the tests conducted for both algorithms. The criterion is the increase in accuracy over the test set mapped to universal tags, as explained in the next section. Since taggers with very different time complexities are used, the time it takes to train and test is not used as a criterion.

7.1 Evaluation

To evaluate the taggers and the tag set, the tagger is first trained on the corpus with the tag set which is to be evaluated. The tagger then tags the test set using the model it has learned, which results in a predicted test set. Instead of calculating an accuracy directly on this tagging, the evaluation uses a method described by Petrov et al. (2012) where the predicted test set is mapped to the universal tag set. The tags in this tag set are Determiner, Noun, Verb, Adjective, Adverb, Pronoun, Adp (prepositions and postpositions), Numerals, Conjunctions, Particles, Punctuation and X (a category for all other words, such as foreign words). This mapped predicted test set is then compared to the mapped test set to evaluate the tagger and the tag set. We use this method to ensure that the testing situation is the same for every test even though the tag sets differ. Table 4 shows the increase this method yields for both taggers used in this research.

Evaluation                    HMM Tagger   TnT Tagger
Normal Evaluation (202 tags)  86.98        87.5
Universal POS Evaluation      91.9         95.04

Table 4: Difference between the usual evaluation (where the training set and test set are annotated with the same tag set) and evaluation with a universal tag set
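The mapping-based evaluation can be sketched as follows; fine_to_universal stands for the Petrov et al. (2012) mapping from fine-grained Alpino tags to the twelve universal tags and is assumed to be loaded elsewhere:

def universal_accuracy(predicted, gold_universal, fine_to_universal):
    """Map predicted fine-grained tags to universal tags and compare them
    against the hand-annotated universal test set."""
    correct = total = 0
    for pred_sent, gold_sent in zip(predicted, gold_universal):
        for (_, fine_tag), (_, gold_tag) in zip(pred_sent, gold_sent):
            correct += fine_to_universal.get(fine_tag, "X") == gold_tag
            total += 1
    return correct / total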


7.2 Hierarchical Clustering

As explained in section 5, there are many parameters to take into account when using hierarchical clustering. The results for the TnT tagger are very different from those for the HMM tagger, which is not surprising since the baseline of the TnT tagger is higher than that of the HMM tagger (see section 4.2). Tables 5 and 6 show the largest increases obtained using the different methods for creating the vectors.

Vectors    Accuracy   Increase   Tags
Baseline   91.90      -          203
Vector 1   92.71      0.81       183
Vector 2   92.66      0.76       176
Vector 3   92.66      0.76       175

Table 5: Increase for the HMM tagger with different methods for creating the vectors; for the different vectors refer to section 5.1

For the HMM tagger, the vectors where the represented tag is the last tag in the trigram give the best result. The other methods for creating vectors also perform quite well and, depending on the task, may be more desirable because they may result in smaller tag sets. The error reduction, the relative decrease in the number of errors made by the tagger before and after the clustering, is 10 percent for the HMM tagger (calculated using the highest increase).

Vectors    Accuracy   Increase   Tags
Baseline   95.04      -          203
Vector 1   95.14      0.1        188
Vector 2   95.19      0.15       197
Vector 3   95.07      0.03       186

Table 6: Increase for the TnT tagger with different methods for creating the vectors; for the different vectors refer to section 5.1

The hierarchical clustering algorithm does not seem to work as well for the TnT tagger as it does for the HMM tagger. Although a maximum increase of 0.15 does not sound very impressive, the baseline of the TnT tagger is much higher than the baseline of the HMM tagger. The error reduction (calculated using the highest increase) for the TnT tagger is 3.02 percent; the gap between the taggers is thus significantly smaller in terms of error reduction than the difference in raw accuracy increase suggests.

Finding the optimal parameters for the clustering appears not to be a simple task, as can be seen in table 7. The three highest increases (over all three methods for creating the vectors) have different parameters, which means that the optimal parameters do not just depend on the data, but also on the method for creating the vectors. Since the optimal method for creating the vectors depends on the tagger used, as explained earlier, the optimal parameters also depend on the tagger. With the TnT tagger it is easier to choose the optimal methods and metrics, since the two highest increases are both obtained using the cosine distance metric, as can be seen in table 8.

Method/Metric                  Accuracy   Increase
Baseline                       91.90      -
Complete/Euclidean (Vector 1)  92.71      0.81
Single/Euclidean (Vector 2)    92.66      0.76
Complete/Cosine (Vector 3)     92.66      0.76

Table 7: Three largest increases with the HMM tagger; for the different vectors refer to section 5.1

Method/Metric                  Accuracy   Increase
Baseline                       95.04      -
Average/Cosine (Vector 2)      95.19      0.15
Average/Cosine (Vector 2)      95.16      0.12
Complete/Cosine (Vector 1)     95.14      0.1

Table 8: Three largest increases with the TnT tagger; for the different vectors refer to section 5.1

Figure 2 shows the increase in accuracy against the size of the tag set. Apart from the dip at a tag set of 179 tags, a tag set with fewer than 184 tags tends to lead to a higher accuracy than a larger tag set. It is, however, not the case that a smaller tag set always leads to better accuracy. While we have seen that a smaller tag set (to some extent) tends to lead to a higher accuracy for the HMM tagger, figure 3 shows that the TnT tagger prefers a larger tag set on average. Note that the baseline for the TnT tagger was higher than for the HMM tagger, so smaller increases were to be expected. The differences in tag set size between tests are due to the different threshold parameter (see section 5.4) and to the fact that some combinations of methods may result in more clustering across coarse tags, causing fewer merges to be performed after the clustering.


Figure 2: Increase in accuracy over the size of the tag set for the HMM tagger

Figure 3: Increase in accuracy over the size of the tag set for the TnT tagger

7.3 Merge/Split

The first thing which is very noticeable about the results of the merge/split algorithm is the incredible variety of the resulting tag sets. Depending on the parameter which decides when to stop (see section 6.3), the algorithm can be much more aggressive than the hierarchical clustering algorithm, creating tag sets with as few as 14 tags. However, the algorithm also showed less aggressive merging resulting in larger tag sets (comparable to the hierarchical clustering algorithm), which shows there is much greater variance than with hierarchical clustering. The resulting accuracies over the new tag sets are also much more varied than the results of the hierarchical clustering algorithm, for the HMM tagger as well as the TnT tagger.

The algorithm seems to work very well for the HMM tagger, with increases in accuracy of up to 1.82, more than twice the increase obtained using hierarchical clustering. The TnT tagger, on the other hand, showed decreases in all tests, aside from two tests where the difference was 0.

Tagger       Minimum   Maximum   Average   Standard deviation
HMM tagger   0.06      1.82      0.89      0.58
TnT tagger   -1        0         -0.43     0.29

Table 9: Minimum, maximum and average scores and standard deviations for both taggers using the merge/split algorithm; each combination of parameters was tested three times

Table 9 shows the results obtained using the merge/split algorithm for both taggers. It is clear that this method of optimization is not suited for the TnT tagger, but appears to work very well for the HMM tagger. The best score obtained by the HMM tagger (1.82) was achieved with a tag set of only 30 tags. The TnT tagger, however, obtained its maximum score with tag sets of 200 and 191 tags (this result was obtained twice). The worst score for the HMM tagger came from a tag set of 193 tags, while the worst score for the TnT tagger came from a tag set of 37 tags. Note that the tests were not run individually for each tagger; after the algorithm completed, both taggers were tested with the same tag set. We will see later on what the correlation between the two taggers on the same tag set is.

Tagger       Minimum   Maximum   Average   Standard deviation
HMM tagger   0.11      1.82      1.03      0.6
TnT tagger   -1        0         -0.50     0.3

Table 10: Minimum, maximum and average scores and standard deviations for both taggers using the merge/split algorithm with only positive offsets

Tables 10 and 11 show the minimum, maximum and average increase and the standard deviation for both taggers with only positive offsets and only negative offsets respectively. It should be noted that fewer tests were run with negative offsets than with positive offsets, because an offset which is too low leads to no merges at all: the criterion for merging becomes too strict.

Tagger       Minimum   Maximum   Average   Standard deviation
HMM tagger   0.06      1.36      0.58      0.37
TnT tagger   -0.57     0         -0.25     0.16

Table 11: Minimum, maximum and average scores and standard deviations for both taggers using the merge/split algorithm with only negative offsets (every merge has to increase the likelihood by at least the absolute value of the offset)

It is clear from the tables that choosing a negative offset, and thereby having a very strict criterion for merging tags, leads to slightly better results for the TnT tagger and worse results for the HMM tagger. Table 12 shows the maximum, minimum and average number of tags and the standard deviation over all tests, over only the tests with a positive offset, and over only the tests with a negative offset.

Tests             Minimum   Maximum   Average   Standard deviation
All tests         14        202       119.78    72.52
Positive offset   14        193       95.38     71.60
Negative offset   126       202       180.78    19.86

Table 12: Minimum, maximum and average number of tags and standard deviations over all tests, over the tests with only positive offsets, and over the tests with only negative offsets

Figure 4 shows the increase or decrease in accuracy against the number of tags in the tag set. This shows again that while the HMM tagger profits from a small, coarse tag set, the TnT tagger performs better with larger tag sets.

Figure 4: Increase in accuracy over the size of the tag set for the HMM tagger using the merge/split algorithm

7.4 Tag sets

The previous section already showed that the impact of the size of the tag set heavily depends on the tagger used. Another important question is which merges actually contribute to the increase in a significant way. To investigate this, the merges performed by the merge/split algorithm have been recorded along with the change in log likelihood of the training set they produced. Keep in mind that a better log likelihood does not necessarily mean a higher accuracy. The highest log likelihoods were achieved in the tests which ended with the smallest number of tags, and these tests usually led to a significant increase for the HMM tagger and a significant decrease in accuracy for the TnT tagger (see the previous section).

Almost all tests with a negative offset showed that nouns were best suited for merging. Other tags which performed well with a negative offset are articles and punctuation. Nouns also showed significant increases in the log likelihood in tests with a positive offset, although the difference with other tags was smaller than with the negative offset. Tests with a positive offset also showed high increases when merging conjunctions and, in some cases, pronouns. Verbs also caused some increase, but not nearly as much as nouns and articles. Determiners and the symbol X (which only had two features) were rarely merged and did not lead to any significant increase or decrease.

8 Discussion

The TnT tagger used during this project is not implemented exactly as described by Brants (2000). The suffix tagger described in his paper, used for estimating unknown words, returns a probability distribution over all tags, which is then used in the beam search just as for known words. In this project a suffix tagger was also used for unknown words; however, it estimated the tag on its own, without the transition probabilities. If this suffix tagger failed, a back-off tagger estimated the tag by assigning the tag found most often with words that occur only once. The TnT tagger with the suffix tagger as Brants described it was also implemented and tested against the TnT implementation used in this project, but the results argued in favor of the implementation used in this project. The TnT tagger as Brants described it scored between 95.02% and 95.05% depending on the data the suffix tagger was trained on, and it was considerably slower than the TnT tagger used in this project. Since both algorithms for merging tags take very long to test and the results did not improve considerably, we chose not to use the TnT tagger as described by Brants, but the TnT tagger implemented with the suffix tagger from the NLTK package for Python. It might also have been interesting to see whether adding emission information to the vectors would help the hierarchical clustering to cluster the tags better. This was, however, not achievable during this project due to the many tests which already needed to be conducted.

We are unsure why the performance of these algorithms differs so much between the HMM tagger and the TnT tagger. We expected to find tag sets which would improve both taggers, but instead we found that tag sets which work very well for the HMM tagger often decrease the performance of the TnT tagger.

The log likelihood turned out not to be as good a predictor of high accuracy as we had hoped. The log likelihood is calculated using a standard HMM model, and the results for the TnT tagger might have been better if the calculation of the log likelihood had resembled the way the TnT tagger calculates its tags more closely.

9 Conclusion

This paper has investigated two methods for increasing the overall performance of part-of-speech taggers by optimizing the tag set of a given corpus. Both methods merge tags from a very fine-grained tag set (203 tags, including a start/end symbol) into a coarser tag set. The emphasis of the project was on the Dutch language, although it was important to us not to use any language- or data-specific tweaks, so that the methods can be used for other data sets and languages as well. We used the Alpino treebank for Dutch and edited it to make it more suitable for our needs (section 4.1). For testing we mapped the tagger's output on the test corpus to universal tags and compared this mapped test corpus to a gold test corpus; this increased the measured accuracy significantly (section 4.2) and ensured a constant testing environment throughout the project. We used two taggers: a variant of the TnT tagger by Brants (2000) and a simple second-order HMM-style tagger.


We have shown that agglomerative hierarchical clustering and a merge/split algorithm can be used to improve the accuracy of taggers, but also that, depending on the tagger, some methods work better than others. First we demonstrated the effectiveness of hierarchical clustering for both taggers, increasing the maximum accuracy of the TnT tagger by 0.15 and the accuracy of the HMM tagger by 0.81 (section 7.2). We have seen that the optimal parameters for the TnT tagger are very different from the optimal parameters for the HMM tagger. This is likely because the TnT tagger achieves higher scores on data sets annotated with a very fine-grained tag set, while the HMM tagger actually prefers coarser tag sets. This was made even clearer by the results of the second algorithm used during this project, the merge/split algorithm. This algorithm (section 6) is capable of much more aggressive merging, which yields coarser tag sets. As can be seen in section 7.3, this causes significant increases for the HMM tagger, while causing severe decreases for the TnT tagger.

In our analysis we showed that some tags, such as nouns, articles and conjunctions, are well suited for merging. Other tags, such as determiners and miscellaneous tags (e.g. foreign words), are not very suitable, since merging them does not increase the log likelihood of the data very much (section 7.4).

In conclusion, we have found that it is possible to increase the accuracy of a tagger by optimizing the tag set. However, the optimal tag set is highly dependent on the tagger which is used, mainly because certain taggers prefer smaller tag sets whereas others prefer more detailed tag sets. Using these techniques on different taggers and data sets will therefore require a certain amount of experimentation to find the ideal combination of method and parameters. The methods described in this paper could be used in many NLP applications, but research into these and similar methods is far from complete.

10 Future Work

One of the aspects we have not covered during the research, due to limited time, is the renaming of merged tags. Unfortunately, the features of the tags in the Alpino treebank are not binary, and even within a coarse tag the number of features may vary: a pronoun, for instance, can have 4 or 5 features. This makes the creation of new tags quite complex, since the system would need to know what every feature actually means in order to find a proper name for the merged tag. During this research this problem has not been handled; a merge simply attaches all features of the merged tags to the new tag. For example, the merging of:

Pron(per|3|evofmv|nom) and Pron(bez|3|ev|neut|attr)

would be:

Pron(per|3|evofmv|nom/bez|3|ev|neut|attr)

References

Brants, T. (2000). TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC '00, pages 224–231, Stroudsburg, PA, USA. Association for Computational Linguistics.

Buchholz, S. and Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, CoNLL-X '06, pages 149–164, Stroudsburg, PA, USA. Association for Computational Linguistics.

Gale, W. and Sampson, G. (1995). Good-Turing smoothing without tears. Journal of Quantitative Linguistics, 2(3):217–237.

Huang, Z., Eidelman, V., and Harper, M. (2009). Improving a simple bigram HMM part-of-speech tagger by latent annotation and self-training. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short '09, pages 213–216, Stroudsburg, PA, USA. Association for Computational Linguistics.

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3):241–254.

Petrov, S., Das, D., and McDonald, R. (2012). A universal part-of-speech tagset. In Proceedings of LREC.

Petrov, S. and Klein, D. (2007). Learning and inference for hierarchically split PCFGs. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1663. AAAI Press.

Van der Beek, L., Bouma, G., Malouf, R., and van Noord, G. (2002). The Alpino Dependency Treebank. Language and Computers, 45(1):8–22.
