
Inflecting Verbs with Word Embeddings: A Systematic Investigation of Morphological Information Captured by German Verb Embeddings



MSc Artificial Intelligence

Track: Natural Language Processing

Master Thesis

Inflecting Verbs with Word Embeddings:

A Systematic Investigation of Morphological

Information Captured by German Verb

Embeddings

by

S.A. Gieske

6167667

June 30, 2017

42 EC, January 2016 - June 2017

Supervisors:

Dr C. Monz

Dr A. Bisazza

Assessor:

Dr W.H. Zuidema


Abstract

This study presents a generation task and a classification task to systematically evaluate the information on morphological inflections of verbs captured in word embeddings for the morphologically rich German language. We create morphological representations by subtracting a base form representation from word embeddings in order to assess whether the embeddings capture inflectional information. Results confirm that embeddings contain inflectional information, which can successfully be leveraged for the inflection of verb forms in half of the cases. Furthermore, this information can also successfully be used for the classification of inflectional categories of inflected verbs. Morphology representations of similar categories are clustered together in the solution space but can also be divided into sub-clusters. We highlight that average performance gives a misleading view of individual performance, as results show several categories achieve a significantly lower performance than the average. Moreover, results show a significant correlation between performance and word frequency.


Acknowledgments

I would like to thank my supervisors, Christof Monz and Arianna Bisazza, for their guidance and feedback throughout my thesis, and for their patience while I worked on the thesis part-time. I would like to thank Jelle Zuidema for agreeing to sit on my defense committee. My thanks also go out to Marlies van der Wees for supplying me with several data sets needed for my thesis. Moreover, I would like to heartily thank my parents, Bernard and Sham Gieske, for their extensive support in my choice to do a second master programme. Also, I would like to thank Shalien Gieske, Elise van der Pol, Marleen Vos and Bram van den Akker for their support and encouragement. Finally, my thanks go out to Elise van der Pol, Cassandra Loor, Sander Nugteren and Jonas Lodewegen for proofreading my thesis.


Contents

1 Introduction
  1.1 Research Questions
  1.2 Outline

2 Background
  2.1 Morphology
  2.2 Word Representations
  2.3 Neural Networks
  2.4 Word2Vec
  2.5 k-Nearest Neighbors

3 Previous Work
  3.1 Morphological Inflection Generation
  3.2 Analogy Tasks

4 Datasets
  4.1 Inflection Morphology Dataset
    4.1.1 Grammatical Categories
    4.1.2 Ambiguity
  4.2 Embedding Corpora

5 Word Representation
  5.1 Representation Models
  5.2 Representation Ambiguity
    5.2.1 Purity
    5.2.2 Shannon Entropy

6 Evaluation Setup
  6.1 Cross-Validation
  6.2 Evaluation Metrics

7 Generation of Inflected Forms
  7.1 Generation Task
    7.1.1 Centroid Generation Method
    7.1.2 Centroid Weights
  7.2 Experimental Results
    7.2.1 Experiments
    7.2.2 Results
  7.3 Main Findings

8 Classification of Inflectional Categories
  8.1 Classification Task
    8.1.1 Centroid Classification Method
    8.1.2 k-Nearest Neighbors Classification Method
  8.2 Experimental Results
    8.2.1 Experiments
    8.2.2 Results
  8.3 Main Findings

9 Contrastive Experiments
  9.1 Analogy Task
    9.1.1 Experiments
    9.1.2 Results
  9.2 Classification and Generation Tasks for Spanish

10 Conclusion

A Inflectional Category Abbreviations


1 Introduction

In natural language, words are created from combinations of the smallest linguistic units, known as morphemes. For instance, take the following sentence, and specifically the word ‘unpacked’:

‘We unpacked the boxes in the living room.’

The word ‘unpacked’ is built from the morphemes un-, pack and -ed. The concatenation of morphemes can change the semantics of a word, e.g. the prefix un- changes the meaning of pack. In addition, it can also change grammatical categories, e.g. the suffix -ed changes the category of the verb pack to a past tense form. The field in linguistics concerned with the internal structure of words and how words relate to one another is referred to as morphology. Derivational morphology regards semantic changes, whereas inflectional morphology relates to changes in grammatical categories.

Morphology can be an important cue to determine the role of a word in a particular context. Therefore, the field of morphology is widely explored in Natural Language Processing (NLP). For instance, the grammatical categories of a word can be connected to its context. In addition, understanding the word structure can help the semantic analysis of text. As an example, take the following sentence:

‘She smiles when she sees he bought her chocolate.’

In this sentence, the morphological structure of the verb ‘smiles’ shows that it is in the third person singular, present tense. Thus, its subject must also be in the third person singular form. Furthermore, from the verb ‘bought’, we can infer that this action took place in the past. Languages which rely on morphology to convey many grammatical features are called morphologically rich languages (MRLs). In contrast, languages with limited morphology convey these features through syntactic and contextual cues; these are called morphologically poor languages.

Traditionally, rule-based and finite-state modules were used for morphological analysis (Koskenniemi, 1984; Beesley, 1996; Détrez and Ranta, 2012). However, these methods require hand-crafted rules, which makes them expensive and time-consuming. Recently, research has focused on learning transformation rules through analysis of the internal structure of words (Durrett and DeNero, 2013; Ahlberg et al., 2014; Nicolai et al., 2015a; Faruqui et al., 2015). These studies mainly represented words at the character level and have shown high accuracy in tasks regarding morphology.

A different method to represent words is to learn distributional representations, also known as word embeddings, from raw text. This method builds upon the Distributional Hypothesis (Harris, 1954), which states that words which occur in similar contexts often carry a similar meaning and should therefore be similar in their embeddings. With the arrival of neural network models such as Word2Vec (Mikolov et al., 2013a), accurate word embeddings can be learned efficiently from large corpora and have been shown to perform well on NLP tasks (Mikolov et al., 2013b; Rocktäschel et al., 2015; Baroni et al., 2014; Demir and Özgür, 2014). Research (Yarowsky and Wicentowski, 2000; Schone and Jurafsky, 2001) has shown that distributional similarity, in addition to internal structure, can be an important cue for morphology.

Overall, there is an increased popularity of using distributional representations for NLP tasks. Thus, the question arises whether these word embeddings can be used for morphological analysis. Most research on morphological analysis with word embeddings has focused on the English language. However, English is known to be a morphologically poor language; in English, a verb can have 8 different inflected forms, whereas in the more morphologically rich German language a verb can belong to one of 27 different inflection categories. There has been limited research on morphology in word embeddings for MRLs (Nicolai et al., 2015b; Avraham and Goldberg, 2017; Köper et al., 2015). Moreover, these studies did not look into the different inflectional categories or only studied the subset of categories found in English. Due to the larger solution space, it is expected that morphological analysis with word embeddings is more difficult for MRLs.

Currently, relations in word embeddings are often evaluated with analogy tasks. These tasks consist of two word pairs, each pair with the same relation, where one word is left out to be predicted. This task lends itself to the evaluation of different relations, including morphological relations. The following question shows an analogy relation of inflectional morphology, more specifically between the present tense and the past participle, for the word pairs (‘go’, ‘gone’) and (‘buy’, ‘bought’):

‘go is to gone as buy is to – ?’ (answer: ‘bought’)

In fact, the analogy task for inflectional morphology involves two sub-tasks: classification of inflectional categories and generation of the left-out word form. In the example, the inflectional category of ‘gone’ is first identified by comparing it to the word ‘go’. Then, this inflectional category is used to generate the inflected verb ‘bought’.
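Such analogy questions are commonly answered with the vector-offset method over embeddings (Mikolov et al., 2013c). A minimal pure-Python sketch; the toy 2-dimensional embeddings below are entirely hypothetical and chosen only so that the present/participle offset lines up:

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def solve_analogy(emb, a, b, c):
    """'a is to b as c is to ?': return the word closest to emb[b] - emb[a] + emb[c]."""
    target = [bi - ai + ci for ai, bi, ci in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Hypothetical toy embeddings where the present/past-participle offset is (0, 1).
emb = {
    'go':     [1.0, 0.0],
    'gone':   [1.0, 1.0],
    'buy':    [3.0, 0.0],
    'bought': [3.0, 1.0],
    'eat':    [0.0, 3.0],
}

print(solve_analogy(emb, 'go', 'gone', 'buy'))  # → bought
```

Note that the query words themselves are excluded from the candidates, which is the standard convention in analogy evaluation.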

However, the analogy task only considers one analogy at a time and often lacks a detailed comparison of the performance of individual morphological categories. Moreover, many analogy datasets focus on semantic relations and only have a small portion of morphological relation analogies. Above all, data on MRLs are often reduced to a subset of morphological relations which also occurs in English. In contrast, this thesis introduces a more systematic approach to the analysis of inflectional morphology in word embeddings by evaluating the classification and generation task separately.

The goal of this thesis is to analyse the extent to which morphological inflections are captured in word embeddings for MRLs. In addition, a more systematic approach is introduced for full analysis of embedding models with the use of generation and classification tasks regarding inflectional categories.

1.1 Research Questions

The increased popularity of word embeddings gives rise to the question whether word embeddings capture morphological information. Because the embeddings are trained with neural networks, they do not contain fixed values or positions for features such as inflectional categories. Consequently, it is unclear what and how much morphological information is captured in embeddings. Therefore, this information is more difficult to identify than through analysis of the internal structure of words.

Although research has shown that morphological relations can be identified between embeddings, these studies have mostly been performed on the morphologically poor English language. MRLs hold a larger solution space than English; thus, it is expected that the identification of morphological relations will be more difficult.


To what extent is information on morphological inflections captured in word embeddings for morphologically rich languages?

Inflectional morphology occurs for different word classes, such as nouns (‘cat’ and ‘cats’), adjectives (‘closer’ and ‘closest’), verbs (‘swim’ and ‘swam’) and many more. Verbs are particularly complex in morphologically rich languages as they can display many different grammatical categories. For this reason, this thesis focuses on inflectional morphology in verb forms.

In order to explore the above question, we make the following assumption: if embeddings hold information on morphological inflection, this information is dependent on the inflectional category and less dependent on the semantics of the words. Thus, this information will be similar across all words of a category and can lend itself to represent inflectional categories. Under this assumption, we approach the question above by answering the three research questions described below.
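Under this assumption, an inflectional category could be represented by averaging the offset between each inflected form and its base form over many verbs. The sketch below is our own illustration of that idea, not the exact method developed later in this thesis; the German verb pairs and their 2-dimensional vectors are hypothetical:

```python
def category_offset(pairs, emb):
    """Average of emb[inflected] - emb[lemma] over (lemma, inflected) pairs."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for lemma, inflected in pairs:
        for i in range(dim):
            total[i] += emb[inflected][i] - emb[lemma][i]
    return [t / len(pairs) for t in total]

# Hypothetical 2-dimensional embeddings for two lemma/3sg-present pairs.
emb = {
    'gehen': [1.0, 0.0], 'geht': [1.1, 1.0],
    'kaufen': [3.0, 0.0], 'kauft': [2.9, 1.0],
}

print(category_offset([('gehen', 'geht'), ('kaufen', 'kauft')], emb))  # → [0.0, 1.0]
```

If the assumption holds, this averaged offset is roughly the same whichever verbs it is computed from, so it can stand in for the category itself.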

The first research question examines whether the morphological information captured in embeddings lends itself to generating inflected word forms:

Q1 To what extent can we inflect verbs with the use of morphological information captured in embeddings?

We attempt to answer question Q1 by creating a generation task in which inflected verb forms are generated as discussed in Section 7.

Not only could the morphological information be used to generate an inflected verb form, it could also be useful for classifying the inflectional categories of encountered verb forms. This raises the following question:

Q2 To what extent can we classify the inflectional category of verb forms with the use of morphological information captured in embeddings?

To assess question Q2, a classification task is formed for the identification of inflectional categories of verb forms; it is discussed in Section 8.

So far, research on inflectional categories in MRLs suffers from the fact that it has only focused on a small subset of frequent words or only studied the subset of categories found in English (Nicolai et al., 2015b; Köper et al., 2015). Moreover, previous studies have mostly considered average results (Soricut and Och, 2015; Avraham and Goldberg, 2017), devoting little attention to the performance of individual inflectional categories. Because embeddings are trained on large, unannotated corpora of raw text, it is expected that words from different categories are not represented with the same accuracy. Hence, average results could give a misleading picture of the morphological information captured for individual categories. In contrast, this thesis aims to give a deep analysis of inflectional morphology in word embeddings regarding individual inflectional categories by systematically using all inflection forms. For this purpose, this thesis follows the next research question:

Q3 Is there a difference in the extent of the morphological inflection information captured among inflectional categories?

We aim to answer research question Q3 by analysing individual inflectional categories for both the generation task and the classification task, in Section 7.2.2 and Section 8.2.2, respectively.

The contribution of this thesis is five-fold:

1. We introduce a systematic approach for analysing inflectional information in word embeddings through separate generation and classification tasks (Section 7 and Section 8, respectively).

2. We introduce a novel approach to represent the base form of a word as an embedding (Section 5).

3. We confirm that inflectional information is captured in embeddings by leveraging embeddings for inflecting verb forms (Section 7) as well as for classification of inflectional categories (Section 8). In both tasks, we achieve satisfactory accuracy scores.

4. We find major differences in the extent to which inflectional information is captured for different inflectional categories. Inflectional categories corresponding to infrequent words result in lower accuracy for the generation and classification task (Section 7 and Section 8, respectively).

5. We illustrate that findings for one MRL cannot immediately be inferred for other MRLs. Large differences in performance are found between German and Spanish for the generation and classification tasks (Section 9).

1.2 Outline

The remaining part of this thesis is structured as follows: Section 2 introduces the necessary background on morphology and word embeddings. Section 3 surveys previous work on morphological inflection generation and analogy tasks for word embeddings. Section 4 outlines the datasets used in this thesis.

Section 5 describes the representation models and introduces a novel approach to represent the base form of verbs. Section 6 describes the evaluation setup used throughout this thesis. Section 7 introduces the generation task and presents its results. Section 8 introduces the classification task and gives an analysis of its results. Section 9 describes additional contrastive experiments, such as the analogy task for inflectional relations in German as well as generation and classification tasks for the morphologically rich Spanish language. Section 10 concludes this thesis and discusses open issues.


2 Background

This section describes necessary background knowledge for the conducted experiments. First, morphology is described, more specifically inflectional morphology. Furthermore, the Word2Vec and k-Nearest Neighbors algorithms are described, and the necessary background knowledge on word representations is introduced.

2.1 Morphology

In the field of linguistics, morphology regards the structure of words and how these words relate to others in the same language. Words are formed from one or more morphemes, i.e. the smallest grammatical units in a language. Morphemes can stand alone, in which case they are known as roots, or can be joined through affixation.

A morpheme that can stand alone is a free morpheme: it is a word with meaning on its own, e.g. the morpheme ‘well’. In contrast, bound morphemes can only be attached to another word, e.g. the morpheme ‘un-’ finds its meaning when attached to the earlier mentioned free morpheme in the word ‘unwell’. A special case of morphemes are allomorphs, which differ in pronunciation but are semantically identical. Examples are the morphemes which can indicate plurality in English, as seen in the words ‘cats’, ‘buses’ and ‘children’. It can also occur that the meaning of a word changes without a visible difference in structure. Here, a null morpheme is added that changes a grammatical category which is to be derived from context, e.g. ‘sheep’ can be of the singular and the plural form, and the verb ‘eat’ marks the present tense in all forms but the third person singular.

It is important to note that many words cannot simply be split into morphemes. For example, the word ‘relate’ is not coherent in meaning when it is pulled apart into the bound morpheme ‘re-’ and the root ‘late’. Therefore, in many languages the segmentation of words into morphemes is a non-trivial task.

When a morpheme is joined to a word it can have a derivational or inflectional function1.

Derivational morphemes change the core meaning of the affected word and, sometimes but not always, its part-of-speech, for example ‘un-’ in ‘unwell’ or ‘-er’ in ‘swimmer’. In contrast, inflectional morphemes do not affect the core meaning of a word but change its grammatical features, e.g. the morpheme ‘-s’ joined with ‘cat’ changes the word from singular to plural.

This thesis focuses on inflectional morphology. When there are many grammatical categories in a language, the number of word forms per lexeme can increase significantly. Moreover, there can exist ambiguity of grammatical categories of inflected forms. Consequently, morphological analysis increases in difficulty.

1In the literature, a third category can be identified as compounding (Anderson, 1992), where joining two words creates a new word, e.g. the German word ‘Orangensaft’ (orange juice) is a compound noun from ‘Orangen’ (oranges) + ‘Saft’ (juice), and is distinguished from the derivational function. However, it is arguable whether a different formal account can be given for derivation and compounding (Booij, 2005). This topic is beyond the scope of this thesis and thus will not be taken into account.

Inflectional Morphology A base form of a word (as found in a dictionary), also known as the lemma, can be inflected into different word forms to express different grammatical features. For example, in English the word ‘says’ is the third person, singular, present form of the lemma ‘say’. Examples of these features are: number, with values such as singular and plural; person, which can have first, second and third person as values; and tense, with values such as present and past. Sylak-Glassman et al. (2015) identified 23 attribute dimensions of meaning with 212 features; however, not all dimensions or all features within a dimension are present among all word classes or for all languages. In this thesis, we use the term inflectional category to denote a particular combination of grammatical features found in a language, such as third person, singular, present indicative. Words can belong to different inflectional categories without visible difference in syntax or sound by adding a null morpheme, e.g. ‘played’ can be of the participle past form as well as the simple past. These words are called syncretic to one another.

Inflectional Paradigm The collection of inflected words which correspond to the same lemma is called an inflectional paradigm. It can be expressed with rules or formulas for construction of the inflection forms from the lemma. In practice2, words which belong to the same inflectional category are often constructed with similar rules. For example, in the English language, verbs in the third person, singular, present tense are often inflected from the lemma with ‘-s’, e.g. ‘walks’. In addition, subsets of rules within paradigms can be regular within a language. For example, in Spanish, if the lemma ends with ‘-ar’, e.g. ‘hablar’ (to speak), the second person, singular, present form ends in ‘-as’, e.g. ‘hablas’. Similarly, its third person, singular, present form ends with ‘-a’, e.g. ‘habla’. Consequently, the word form of one inflectional category can be predictive across lemmas, for instance for regular verbs. Moreover, the word form of one inflectional category can also be predictive for a different inflectional category within the same lemma. For example, if in Spanish the first person, plural, present tense ends with ‘-emos’, e.g. ‘comemos’, there is a high probability the second person, plural, present tense ends with ‘-éis’, e.g. ‘coméis’.
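This regularity can be made concrete with a toy rule for regular Spanish -ar verbs. The function and its rule table are illustrative only and cover just a few cells of the present-indicative paradigm:

```python
# Suffixes for a few cells of the regular Spanish -ar paradigm (present indicative).
AR_SUFFIXES = {'1sg': 'o', '2sg': 'as', '3sg': 'a', '1pl': 'amos'}

def inflect_ar(lemma, cell):
    """Inflect a regular -ar verb by stripping '-ar' and attaching the cell's suffix."""
    assert lemma.endswith('ar'), 'rule only covers -ar verbs'
    return lemma[:-2] + AR_SUFFIXES[cell]

print(inflect_ar('hablar', '2sg'))  # → hablas
print(inflect_ar('hablar', '3sg'))  # → habla
```

Irregular verbs, of course, break such rules, which is exactly why data-driven approaches are attractive.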

Inflectional Languages Languages which show some degree of inflectional morphology are called synthetic languages. The set of obligatory morphological properties for word classes and how they are affected through inflection is language dependent. For example, in English the noun only has the property number, whereas a noun in Spanish holds both the number and gender properties in its inflection. A language is morphologically rich when many grammatical features are expressed through morphology. For instance, English is a morphologically poor language, whereas Spanish is an MRL.

Additionally, if a sentence can consist of one single highly inflected word, the language is called polysynthetic. If a language does not hold any inflections, such as Mandarin Chinese, it is called analytic. Within synthetic languages, a division is made between fusional languages, where a single inflection can convey multiple roles, such as German; and agglutinative languages, where each inflection conveys a single grammatical role, such as Finnish.

Both synthetic types form obstacles for NLP: fusional languages hold ambiguity with respect to the inflection form, as there exist syncretic pairs, whereas agglutinative languages have higher numbers of word forms per lemma, which increases data sparsity. Agglutinative languages tend to be very regular, with each affix serving just one purpose, which makes morphological segmentation easier; segmentation for fusional languages is a harder task and could thereby benefit from taking context into account with word embeddings instead of internal structure. Therefore, this thesis focuses on highly inflected fusional languages.


2.2 Word Representations

Computers can extract little information from the symbolic representation of a word itself. Therefore, it is common in NLP tasks to represent a word by its features in a continuous space. In other words, words are mapped to arrays of real-valued numbers where each dimension represents a feature.

A popular method to represent words in a continuous space is through Distributional Semantic Models (Turney and Pantel, 2010). These models are built upon the Distributional Hypothesis (Harris, 1954): ‘words that occur in similar context tend to have a similar meaning’. The models approximate the meaning of words by incorporating patterns of co-occurrence in corpora. Representations that are close in the vector space are thus more likely to be close in meaning.

These distributional representations can be derived through context-counting or context-predicting methods. Context-counting methods use the frequency of words in certain contexts and transform the vectors into more informative values with weighting functions. A well-known example of this method is Latent Semantic Analysis (Deerwester et al., 1990), which uses Singular Value Decomposition to reduce dimensions while capturing the strongest relationships for word co-occurrence. In contrast, context-predicting methods directly set weighting functions to maximize the probability of a word appearing in its context. These models were developed for neural language models (Bengio et al., 2003), among which is the Word2Vec algorithm, further described in Section 2.4. Word representations in context-predicting models are also known as word embeddings. Context-predicting methods have been shown to outperform context-counting methods on several semantic benchmarks (Baroni et al., 2014).

One of the most popular metrics in NLP to measure the similarity between two word embeddings is the cosine similarity, which measures the angle between two embeddings in the vector space and is described below.

Cosine Similarity The cosine similarity metric indicates how closely two vectors point in the same direction. The metric measures the cosine of the angle between two vectors and normalizes by the vectors' lengths, thereby ignoring their magnitude. The cosine similarity between vectors $\vec{x}$ and $\vec{y}$, both of size $N$, is calculated as follows:

$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x}^{\top}\vec{y}}{\|\vec{x}\|\,\|\vec{y}\|} = \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2}\,\sqrt{\sum_{i=1}^{N} y_i^2}} \tag{1}$$
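Equation 1 translates directly into code; a minimal pure-Python sketch:

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors (Equation 1)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: similarity of 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: similarity of 0.0
```

Because the lengths are normalized away, scaling a vector does not change its similarity to any other vector.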

2.3 Neural Networks

A neural network is a machine learning model which maps a $K$-dimensional input $\vec{x} \in \mathbb{R}^K$ to an $M$-dimensional output $\vec{y} \in \mathbb{R}^M$ through interconnected layers of activation functions (Bishop, 2006). An activation over $\vec{x}$ is computed by multiplying with a weight matrix $W$ and adding a bias term $b$. In order to introduce non-linearity into the network, this activation is then mapped through a nonlinear activation function $h$:

$$\vec{y} = h(W\vec{x} + b) \tag{2}$$

The output can be used as the input for a new layer. Creating a deeper layered network allows the network to approximate more complex functions. An additional layer can be added as follows:

$$\vec{z} = h_0(W_0\vec{x} + b_0) \tag{3}$$

$$\vec{y} = h_1(W_1\vec{z} + b_1) \tag{4}$$

where indices 0 and 1 indicate the model parameters for the first and second layer, respectively, and the $D$-dimensional $\vec{z} \in \mathbb{R}^D$ is the hidden layer. Figure 1 illustrates a single-hidden-layer network as described above.

Figure 1: Example illustration of a single-hidden-layer neural network.

The neural network is trained by updating the weight and bias parameters during training steps to minimize a specified loss function.
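The two-layer computation of Equations 3 and 4 can be sketched in pure Python. Here tanh is used as a stand-in for the activation function $h_0$ and the identity for $h_1$; the toy weights are arbitrary illustration values:

```python
import math

def dense(W, x, b):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, W0, b0, W1, b1):
    z = [math.tanh(a) for a in dense(W0, x, b0)]  # hidden layer, Equation 3 (h0 = tanh)
    return dense(W1, z, b1)                       # output layer, Equation 4 (h1 = identity)

# Toy 2-3-1 network with arbitrary weights.
W0 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b0 = [0.0, 0.1, -0.1]
W1 = [[1.0, -1.0, 0.5]]
b1 = [0.2]

print(forward([1.0, 2.0], W0, b0, W1, b1))
```

Training would then adjust `W0`, `b0`, `W1`, `b1` by gradient descent on a loss; that step is omitted here.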

2.4 Word2Vec

The Word2Vec3 model by Mikolov et al. (2013a) generates high-quality embeddings for target words. The embeddings learned with the Word2Vec model are constructed so that semantically similar words appear close in the vector space. Patterns in these embeddings can be found as linear translations: a well-known example is the vector calculation vec(‘king’) - vec(‘man’) + vec(‘woman’), where vec(‘queen’) is the word vector closest to the outcome (Mikolov et al., 2013c).

The underlying models of Word2Vec are the skip-gram and the continuous bag-of-words (CBOW) models (Mikolov et al., 2013a). These are single-hidden-layer neural networks trained to predict neighbouring words for target words. The main difference between the models is their task: the CBOW model predicts a word given its context words, whereas the skip-gram model predicts the context words given a word. The CBOW model is faster to train and shows better accuracy for frequent words. In contrast, the skip-gram model works well with smaller data sets and finds better representations for rare words. Due to the similarity in network architecture between the two, the remainder of this section only describes the skip-gram model.


The Word2Vec algorithm allows for computationally efficient training. It is trained on unannotated text and scales well to very large corpora, unlike most neural network architectures for learning word vectors, as it does not involve dense matrix multiplications. After training, the Word2Vec software does not employ the trained networks but extracts the embeddings from its model parameters.

Skip-gram Model The skip-gram model (Mikolov et al., 2013a) is a fully connected neural network4 with a single hidden layer, designed for computing distributed representations of words. The training objective of this network is to predict, for each word in the vocabulary, the probability of it being in the vicinity of a given input word.

The model parameters $\theta$ are set to maximize the probability of the corpus $T$ given each word $w_t$ and its surrounding context words $w_{t+j}$ as follows:

$$\underset{\theta}{\operatorname{argmax}} \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} p(w_{t+j} \mid w_t; \theta) \tag{5}$$

where $m$ is the window size for the context words surrounding $w_t$ and $p(w_{t+j} \mid w_t; \theta)$ is the probability of the context word occurring within the window surrounding $w_t$.

Note that $w_{t+j}$ indicates a word at position $t+j$ in the corpus; however, a word from the vocabulary can occur at multiple positions. Therefore, take $w_{t+j}$ as the output word $w_O$ and $w_t$ as the input word $w_I$ from the vocabulary $V$. The probability of the output word given the input word is calculated with a softmax function:

$$p(w_O \mid w_I) = \frac{\exp(\vec{u}_{w_O}^{\top}\vec{v}_{w_I})}{\sum_{w=1}^{V} \exp(\vec{u}_{w}^{\top}\vec{v}_{w_I})} \tag{6}$$

where the softmax is calculated with the vector representations, or embeddings, $\vec{v}_{w_I}$ and $\vec{u}_{w_O}$ for the word and context, respectively. The embeddings are part of the model parameters $\theta$ and are iteratively updated during training time.

Figure 2: The Skip-gram model architecture. Source: Rong (2014)


Figure 2 illustrates the structure of the skip-gram model. The neural network receives as input a one-hot encoded vector $\vec{x}$ of the size of the vocabulary $V$, with $x_{w_I} = 1$ and $x_i = 0$ for all $i \neq w_I$. The input vector is mapped to the hidden layer $\vec{h}$ of $N$ dimensions by matrix multiplication with the weight matrix $W$ (Equation 7). Then, the hidden layer is mapped by matrix multiplication with the weight matrix $W'$ and a softmax function to a probability distribution over $V$:

$$\vec{h} = W^{\top}\vec{x} \tag{7}$$

$$\vec{y} = \operatorname{softmax}(W'\vec{h}) \tag{8}$$

As the input is a one-hot vector, Equation 7 can be reduced so that $\vec{h}$ corresponds to the input embedding $\vec{v}_{w_I}$, as shown in Equation 9:

$$\vec{h} = W_{(w_I,\cdot)} := \vec{v}_{w_I} \tag{9}$$

Subsequently, the result of the matrix multiplication of the hidden layer $\vec{h}$ with the weight matrix $W'$ is a $1 \times |V|$-dimensional vector, where the output for word $w_O$ is captured by the embedding $\vec{u}_{w_O}$:

$$\vec{y} = \operatorname{softmax}(W'\vec{v}_{w_I}) = \operatorname{softmax}(\vec{u}_{w_O}^{\top}\vec{v}_{w_I}) \tag{10}$$

After the weight matrices are optimized, the word embeddings are extracted from the weight matrix $W$. As the network receives a one-hot encoded vector as input, only row $j$ of this matrix is updated when $w_j$ is the input word, and only this row needs to be extracted for the embedding of $w_j$.

The intuition behind using the embeddings is that if two different words appear in similar context, the output given by the model for these two words should be similar. This directs the neural network to train the words with similar weights, thus creating similar representation vectors.
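The forward pass of Equations 6-10 can be sketched as follows. `W` and `W_prime` are tiny hypothetical weight matrices: rows of `W` are the input embeddings $\vec{v}$, rows of `W_prime` are the output embeddings $\vec{u}$:

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skipgram_distribution(w_in, W, W_prime):
    """p(w | w_in) for every word in the vocabulary."""
    v = W[w_in]                                                  # Equation 9: h = v_{w_I}
    scores = [sum(u_i * v_i for u_i, v_i in zip(u, v)) for u in W_prime]
    return softmax(scores)                                       # Equations 8 and 10

# Hypothetical 3-word vocabulary with 2-dimensional embeddings.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # input embeddings (rows of W)
W_prime = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]  # output embeddings (rows of W')

print(skipgram_distribution(0, W, W_prime))  # a probability distribution over the 3 words
```

In practice the softmax over the full vocabulary is too expensive, which is why the Word2Vec software approximates it (e.g. with negative sampling); the sketch above shows only the exact objective.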

2.5 k-Nearest Neighbors

The k-Nearest Neighbors (kNN) algorithm (Cover and Hart, 1967) is a simple supervised classification method where the label of an unlabeled data point is predicted by the majority label of its k nearest neighbors.

The algorithm assumes the data is in feature space. Consequently, distance can be measured between data points. In addition, each training example has a label5. The algorithm finds the

k data points closest to the test data point through a specified distance metric, such as the cosine similarity metric described in Section 2.2. The labels of the neighbors are retrieved and the predicted label is given by the majority vote.

An illustration of the kNN classification is given in Figure 3. In this example, the label for the grey data point is to be predicted. The algorithm retrieves the three nearest neighbors (k = 3) to the grey data point: two nearest neighbors have the blue circle label, whereas one data point has the red square label. By using majority voting among the labels of the neighbors, the predicted label is the blue circle label.


Figure 3: Example of binary kNN classification with k=3. Data points can have the blue circle label or the red square label. The label of the grey data point is predicted with the majority vote of three neighbors. The dashed circle shows the nearest neighbors.

Note that the best number of neighbors k depends on the data. When k is set too low, the algorithm can over-fit on the data and noise will have a large influence on the classification. In contrast, when k increases the effect of noise is reduced; however, boundaries between classes become less distinct and the algorithm can under-fit on the data.
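The classification procedure above can be sketched in a few lines of Python, using the cosine similarity of Section 2.2 as the distance metric; the data points and labels below are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two vectors (higher means closer)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn_predict(query, examples, k=3):
    # examples: list of (vector, label) pairs; rank by similarity to the query
    ranked = sorted(examples, key=lambda e: cosine(query, e[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]  # majority vote among the k neighbors

data = [([1.0, 0.1], "circle"), ([0.9, 0.2], "circle"),
        ([0.1, 1.0], "square"), ([0.2, 0.9], "square")]
print(knn_predict([0.95, 0.15], data, k=3))  # prints "circle"
```

Because cosine is a similarity rather than a distance, the "nearest" neighbors are the examples with the highest score.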


3

Previous Work

This section describes work related to morphological analysis through inflection generation and analogy tasks with word embeddings.

3.1

Morphological Inflection Generation

One direction for research in morphological analysis is inflection generation. The recent challenge of the SIGMORPHON 2016 Shared Task (Cotterell et al., 2016a) gives the task of morphological inflection for a diverse set of synthetic languages. In addition to inflection from a lemma to an inflected form, inflections are to be created from an already inflected form. Overall, approaches to this task were found in three different directions: (i) a pipelined approach which first applies unsupervised alignment to extract edit operations after which a discriminative model applies changes, inspired by Durrett and DeNero (2013); (ii) neural approaches such as recurrent neural networks, inspired by Faruqui et al. (2015); and (iii) methods relying on linguistically-inspired heuristics to reduce the task to multi-way classification, inspired by Eskander et al. (2013) and Ahlberg et al. (2014). The systems gained high accuracy scores, where the neural systems significantly outperformed the other methods. However, little attention was devoted to individual inflectional categories. In addition, all methods were applied on the character level. In contrast, this thesis has a word-level focus because the embeddings are word representations influenced by their context words. Moreover, all systems require annotated data to be trained, whereas word embeddings are trained from raw text. In comparison, this thesis only uses annotated data to evaluate word embeddings.

More related to this thesis is recent work by Soricut and Och (2015), who induced morphological transformations with the use of word embeddings. They extracted direction vectors from the embeddings of word pairs in support of lexical rules in the vocabulary. An example of a lexical rule is suffix:ed:ing; this rule denotes the suffix change for word pairs such as (‘bored’, ‘boring’) and (‘stopped’, ‘stopping’). Together with these rules, the direction vectors were processed into a weighted, directed graph of morphological transformations. This graph can be used to map rare or unseen words back to a related, more frequent variant. In contrast to our work, they guided the direction vectors by lexicalized rules requiring a lexical analysis. In addition, the lexicalized rules only focused on prefix and suffix substitution. In other words, they only handle concatenative morphology, where the prefix or suffix is attached to the base form without modifying it. Moreover, as the word mappings were produced by favoring frequency instead of linguistic rules, the graph root did not have to be equal to the lemma. Likewise, inflected words were not necessarily directly connected to the root; therefore, an edge did not represent a clear inflectional category. For example, in their model the word ‘recreates’ is connected to its lemma ‘recreate’ by an intermediate node ‘recreating’. Finally, the model scored high on word similarity; however, the datasets held few to no word pairs of verbs related by inflectional transformations.

Other work by Cotterell and Schütze (2015) guided embeddings to capture morphological relations from the input. They augmented a log-bilinear model (LBL) to predict the next word and its morphological tag. They trained the word embeddings on a corpus annotated with morphological tags to encourage the vectors to encode the morphology of the word. In contrast, in this thesis we look at embeddings trained from unannotated corpora to identify the morphological information already captured in these embeddings.


3.2

Analogy Tasks

Analysis of relationships between word embeddings is often performed by analogy tasks. These tasks consist of two word pairs with the same relation, where one word is left out to be predicted. This task lends itself to the evaluation of morphological relations and has been shown to be a promising approach for morphological reasoning (Lavallée and Langlais, 2009). A wide variety of studies has focused on analogy tasks for the morphologically poor English language (Gladkova et al., 2016; Mikolov et al., 2013c; Levy et al., 2014); however, few studies have focused on morphological relations in MRLs.

Köper et al. (2015) gave a deeper analysis of relations encoded in embeddings for English and German. Their focus was on semantic relatedness and only touched lightly on morphological relations such as verb inflections. Moreover, the test data for German and English was made comparable, thereby merging different inflectional categories of German, which resulted in some inflectional categories not being present in the test data. In general, they showed that overall performance is lower for the MRL German.

In recent work, Nicolai et al. (2015b) conducted a multilingual study of morphological analogies including MRLs. Specifically, the study focused on all morphological inflections for English, German, Dutch, Spanish and French. They stated that embeddings can be trained which preserve morphological information. Moreover, they found that the accuracy of embeddings negatively correlates with the morphological complexity of the language. Their study gave average accuracy scores over all possible inflections of a language, as well as accuracy scores over a subset including only the inflectional forms found in English. However, no insight was given into the performance of individual inflectional categories. In addition, their experiments were only performed on analogies between the top 100 most frequent verbal infinitives and their inflected forms.

Furthermore, research by Cotterell et al. (2016b) proposed a method to exploit morphological relations to extrapolate new vectors. They trained a latent variable Gaussian graphical model where the latent variables represent embeddings of morphemes. In turn, the morpheme embeddings were combined to generate representations for unseen inflected forms. Their model was tested on a morphological analogy task and showed higher performance on several MRLs when incorporating information from morphologically related forms. In contrast, this thesis uses out-of-the-box embeddings with minor modifications.

Other work by Avraham and Goldberg (2017) analysed the interplay of semantics and morphology in word embeddings for the MRL Modern Hebrew. They composed word embeddings from the sum of embeddings of their linguistic properties. Separate embeddings were trained for the lemma, the surface form and its morphological tag. Moreover, they showed that excluding some of the properties decreases performance in a morphological similarity task.


4

Datasets

This section describes the datasets which were used in this thesis. An inflectional morphology dataset is used for identification of inflectional categories. In addition, multiple corpora are used for training word embeddings.

4.1

Inflection Morphology Dataset

This research focuses on the MRL German. German is a West Germanic language, both morphologically rich, i.e. it has many different morphological classes, and fusional, i.e. single morphemes capture multiple morphological features. These characteristics are especially present in verb forms, and for this reason verb inflections are the focus of this thesis. In addition, contrastive experiments are performed for the MRL Spanish in Section 9.2. For ease of comparison to the German language, the statistics for the Spanish datasets are also described in this section.

The inflected verb forms are constructed from Wiktionary (Durrett and DeNero, 2013). Table 1 shows an excerpt from the verb inflection dataset for German. Each row in the data set contains an inflected verb form, its corresponding lemma, and inflectional category.

Inflected Form Lemma Inflectional Category

. . . .

lesen     lesen   type=infinitive
lesend    lesen   type=participle:tense=present
gelesen   lesen   type=participle:tense=past
lese      lesen   person=1st:number=singular:tense=present:mood=indicative
liest     lesen   person=2nd:number=singular:tense=present:mood=indicative
liest     lesen   person=3rd:number=singular:tense=present:mood=indicative

. . . .

Table 1: Excerpt from the German verb inflection dataset for lemma lesen.
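A row of this dataset can be parsed into its three fields as sketched below; the tab-separated layout and the helper name parse_row are illustrative assumptions, not the actual file format.

```python
def parse_row(line):
    # Assumed layout: inflected form, lemma and feature string, tab-separated
    form, lemma, category = line.rstrip("\n").split("\t")
    # The feature string packs attribute=value pairs separated by colons
    features = dict(pair.split("=") for pair in category.split(":"))
    return form, lemma, features
```

For the row for ‘liest’ in Table 1, this yields the lemma ‘lesen’ and a feature dictionary with person, number, tense and mood.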

Table 2 shows the statistics of the verb inflections in the dataset. German is morphologically rich with regard to verbal inflection since it distinguishes 27 different inflection categories per lemma. Although a distinction is made for many different categories, only approximately half of the total number of inflected verb forms in the dataset are unique word forms. On average, an inflected verb form belongs to roughly 2 inflection categories; however, a form can belong to up to 9 inflectional categories, as in the case of ‘rasten’ (to rest). For this reason, German is very ambiguous with respect to verb inflections.

Language                       DE       ES

Lemmas                         2,027    4,055
Inflections per lemma          27       57
Total words                    54,621   231,135
Unique words                   23,890   202,768
Words with single inflection   9,586    174,497
Average inflections per word   2.29     1.14

Table 2: Statistics of the verb inflection datasets for German (DE) and Spanish (ES).


Table 3 displays the number of words which belong to a specific number of inflectional categories. Although there are many words which have a single inflectional category, these words occur significantly less frequently in the corpus than words with three, four or five inflectional categories.

# Categories   1        2       3      4      5      6    7    8    9

DE             9,588    5,745   2,806  3,702  2,003  35   10   0    1
ES             175,053  28,041

Table 3: Number of inflected verb forms belonging to a given number of inflectional categories.
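Counts such as those in Table 3 can be derived by grouping the dataset rows by surface form; a minimal sketch, assuming the rows are available as (form, lemma, category) triples:

```python
from collections import defaultdict

def categories_per_form(rows):
    # rows: (inflected_form, lemma, category) triples from the inflection dataset
    cats = defaultdict(set)
    for form, _, cat in rows:
        cats[form].add(cat)
    # Number of distinct inflectional categories each surface form belongs to
    return {form: len(c) for form, c in cats.items()}

rows = [("liest", "lesen", "3,s,pr,ind"),
        ("liest", "lesen", "2,s,pr,ind"),
        ("lesend", "lesen", "part,pr")]
print(categories_per_form(rows))  # 'liest' maps to 2 categories, 'lesend' to 1
```

Tallying these per-form counts over the full dataset reproduces the distribution in Table 3.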

As the inflectional categories have multiple features which results in long labels, abbreviations are used throughout this thesis. The abbreviations can be found in Appendix A.

4.1.1 Grammatical Categories

Inflectional verb forms can hold multiple grammatical attributes. These attributes are in agreement with other words in the sentence to form a grammatically correct sentence. For example, in the phrase ‘She walks towards the city’, the verb ‘walks’ is of the third person singular form because of the singular subject ‘she’. Below, the grammatical attributes found for German verbs are described: person, number, tense, mood and type.

Person The grammatical attribute person refers to the participant in the conversation. First person (1) refers to the speaker, second person (2) to the one spoken to, and third person (3) refers to all others.

Number The number corresponds to the quantity: singular person (s) indicates only one, whereas plural person (p) indicates more than one.

Tense The tense corresponds to a location in time relative to the moment of utterance. The most common tenses are present (pr) and past (ps).

Mood In German two different moods are used: the indicative mood (ind) and subjunctive mood (subj). The indicative mood is used for stating facts. For instance, what is happening or has happened. The phrase ‘I go’ is translated to ‘Ich gehe’. In contrast, the subjunctive mood is used for stating possibilities, beliefs or thoughts. For example, the subjunctive version of the previous phrase is translated to ‘Ich ginge’ (I would go).

Type The type feature can have the value infinitive (infinitive) or participle (part). In many languages6, among which German, the infinitive is used as the lemma of a verb. The

participle types7 are combined with a present or past tense. The participle present describes an active action whereas the participle past is passive. For example, for the lemma ‘geben’ (to give), the participle present form is ‘gebend’ (giving) and the participle past is ‘gegeben’ (given).

6There are languages which use another form as the lemma of a verb, or can have multiple infinitive forms. For example, in Greek the first person, singular, present indicative is used as the lemma.

7Participle forms can also be used as adjectives. For example, in the phrase ‘the loving mother’, the adjective ‘loving’ is used. Thus, the participle present verb is the same in the adjective use of the word. However, in German, participle forms used as adjectives take an adjective ending, i.e. the word must fit the grammatical context. In the phrase ‘die liebende Mutter’ (the loving mother), ‘liebend’ is adjusted to correspond to the feminine noun Mutter. In contrast to English, these adjective cases do not influence the word embeddings of the participle present verb in German because these word forms are not included in the morphology dataset.


4.1.2 Ambiguity

German verbs can be ambiguous with respect to their inflectional category. For example, the verb ‘liebte’ (loved) corresponds to both the first person and the third person of the singular past tense. In addition, both can be of an indicative or subjunctive mood. In other words, this verb corresponds to four different inflectional categories. As can be seen in Table 2, German verbs belong to two different inflectional categories on average.

As far as inflectional categories are concerned, they can display certain patterns with respect to ambiguity. For example, words of the participle present such as ‘lesend’ (reading) are never ambiguous and only belong to this category. In contrast, words in the second person, singular past tense such as ‘antwortetest’ (answered) or ‘bedachtest’ (thought) are often the same in their indicative and subjunctive moods.

Inflectional ambiguity is not handled by standard word embeddings, which are learned from unannotated text. Therefore, they cannot distinguish between different inflectional categories of the same word form. For example, the same embedding is used to represent both the indicative and subjunctive mood of ‘antwortetest’ (second person, singular, past tense).

We will conduct a quantitative analysis of the inflectional ambiguity in Section 5.2 after having introduced the word representations used in this thesis.

4.2

Embedding Corpora

The word embeddings are trained on the Wikipedia corpus8. Plain text is extracted from the

Wikipedia Database dumps9 as seen in Figure 4.

‘1802 traf er den aus Ägypten zurückgekehrten und zum Präfekten der Isère ernannten Mathematiker Joseph Fourier. Dieser zeigte ihm Teile seiner ägyptischen Sammlung und weckte mit der Erklärung, dass niemand diese Schriftzeichen lesen könne, in Champollion das lebenslange Streben nach der Entzifferung der Hieroglyphen.’

Figure 4: Excerpt from plain text from the Wikipedia Database dump.

The Wikipedia corpus does not contain many verb forms in the first person, i.e. it does not contain many sentences from the perspective of the author. Neither does it contain many verb forms in the second person, i.e. it does not contain many sentences directed to the reader. Therefore, an additional corpus of subtitles10 is added. These subtitles belong to movies and TV shows; this genre is known to contain frequent occurrences of first and second person verb forms. Figure 5 shows an excerpt from the subtitle corpus.

8Generated by the Wikimedia Foundation on February 03, 2016.
9https://github.com/attardi/wikiextractor


‘Vater , wo seid Ihr ?
Bitte , lass mich meinen Vater sehen .
Vater wo bist du ?
Bitte lass sie das nicht machen .
Bitte , hört auf !
Lass mich meinen Vater sehen !
- Es ist , was der Herr möchte .
- Es ist eine gute Sache . ’

Figure 5: Excerpt from German subtitles corpus.

Table 4 shows the statistics on the corpora used for training the word embeddings. The vocabulary is constructed with a frequency threshold of 5, i.e. only words that have at least 5 occurrences in the corpus are taken into account. When both corpora are used, the size of the vocabulary increases significantly.

Corpus                      Tokens   |V|

German
  Wikipedia Database Dump   618M     1,885,420
  OpenSubtitles             79M      167,325
  DE Merged Corpus          697M     1,937,677
Spanish
  Wikipedia Database Dump   491M     256,887
  OpenSubtitles             139M     209,782
  ES Merged Corpus          630M     937,306

Table 4: Statistics on the number of tokens and vocabulary V of the German and Spanish datasets.


5

Word Representation

This section describes the word representations used throughout this thesis to extract morphological information from embeddings. Furthermore, a quantitative analysis is given of the inflectional ambiguity present in the representation models.

Inflected word forms are represented with word embeddings retrieved by the Word2Vec algorithm as described in Section 2.4. These embeddings capture different degrees of word information. However, we are only interested in the morphological inflection information captured in word embeddings, henceforth referred to as inflectional information. For this reason, we manipulate the embeddings to extract morphology representations.

We aim to extract inflectional information by removing information which is also captured in the lemma form of a word. This can be done by subtracting embeddings of lemma forms (i.e. dictionary forms), but this is an arbitrary choice from the point of view of Word2Vec training. Alternatively, we can construct the embedding of a special base form that is updated every time an inflected form of the same lexeme is encountered during training. Contrastive experiments are performed with the original embeddings, without manipulation, for the extraction of inflectional information.

Section 5.1 describes the three different representation models which are used for inflectional analysis:

1. Original model: this model uses original embeddings and is used for contrastive experiments.

2. Lemma model: we make minor modifications to the original model to extract morphology representations.

3. Extended model: we introduce a novel approach to represent base forms, which is used to extract morphology representations.

In addition, Section 5.2 gives a quantitative analysis of the inflectional ambiguity found in the models.

5.1

Representation Models

Three different representation models are used throughout this thesis: the original model, the lemma model and the extended model. Word representations in these models are trained with the Word2Vec algorithm as described in Section 2.4. In all these models, the hyper-parameters for the window-size for context words, negative sampling, and count cut-off are set to their default values. The window-size for context words, L, is set to 5, i.e. the model takes into account 5 words on the left and 5 on the right of the reference word as demonstrated below:

c−5, . . . , c−1, w, c+1, . . . , c+5 (11)

where ci are the context words surrounding the reference word w. The negative sampling parameter, i.e. the number of words used as negative context examples for training, is set to 5. In addition, the models are trained with a count cut-off of 5, i.e. words with a lower frequency than the cut-off are ignored.

The models learn word embeddings v(w) for all words w in the vocabulary V. Furthermore, morphology representations v∗(w) are created which aim to capture only the inflectional information of w. The morphology representations are trained differently in each of the models and are described in more detail in the corresponding model sections below.

Original Model The original model is a plain skip-gram model (Section 2.4) as used in literature (Mikolov et al., 2013a) and serves as a baseline. We do not apply special modifications to the embedding for it to represent inflectional information, that is:

v∗(w) = v(w) (12)

Lemma Model From a linguistic view, the lemma represents inflected word forms with the same meaning. Thus, the lemma of a word has a large influence on the kinds of context where the word appears. For example, the lemma ‘eat’ will likely steer the context to food that is eaten or places where you can eat. By removing the vector influence of the lemma, the influence of the semantic context in the embedding will be reduced thereby emphasizing the morphological information.

The lemma model uses the same word embeddings as the original model; however, the mor-phology representation for w is obtained by subtracting the word embedding of its lemma `(w) from the embedding of w:

v∗(w) = v(w) − v(`(w)) (13)

In the case that the lemma word form is the same as the inflected word form, the morphological representation is a zero-vector:

v∗(w) = ~0 if w = `(w) (14)

Note that some words can have multiple lemmas, as illustrated by Table 5. In the case of multiple lemmas, the morphology representation uses the lemma which is most frequent in the corpus, since word forms of higher frequency result in more accurate embeddings.

Inflected form   Lemma        Inflection category

betrüge          betrügen     person=1st:number=singular:tense=past:mood=subjunctive
betrüge          betragen     person=3rd:number=singular:tense=past:mood=subjunctive
gebraucht        gebrauchen   type=participle:tense=past
gebraucht        brauchen     type=participle:tense=past

Table 5: Examples of inflected forms that may have multiple lemmas.
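The lemma model's morphology representation (Equations 13 and 14), including the most-frequent-lemma tie-breaker for ambiguous forms, can be sketched as follows; the dictionaries and helper names are hypothetical:

```python
def pick_lemma(word, lemmas_of, freq):
    # When a form has several candidate lemmas, keep the most frequent one in the corpus
    return max(lemmas_of[word], key=lambda l: freq.get(l, 0))

def morphology_representation(word, lemma, emb):
    # Equation 13: subtract the lemma embedding from the word embedding.
    # Equation 14: when the surface form equals its lemma, this yields the zero vector.
    return [a - b for a, b in zip(emb[word], emb[lemma])]

# Toy two-dimensional embeddings, invented for illustration
emb = {"gelesen": [1.0, 2.0], "lesen": [0.5, 1.5]}
print(morphology_representation("gelesen", "lesen", emb))  # [0.5, 0.5]
print(morphology_representation("lesen", "lesen", emb))    # zero vector
```

The zero vector produced for forms equal to their lemma is exactly the case that makes cosine similarity undefined, as discussed for the extended model below.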

Extended Model A disadvantage of the lemma model is that some inflected forms in our inflection dataset are the same as their lemmas. For example, the lemma form ‘lesen’ corresponds to the infinitive but also, among others, to the first person plural present indicative form (1,p,pr,ind). Because embeddings are equal for the same surface forms regardless of their inflectional category, subtracting the lemma representation will result in zero vectors, as shown in Equation 14. As all vector elements are equal to zero, the vector length is also zero and thus the cosine similarity with other vectors cannot be calculated. Moreover, the embedding for the lemma is only trained whenever its exact word form is encountered in the corpus. If the lemma occurs infrequently, the representation will less accurately represent the contexts encountered for its inflected forms. Thus the morphology representation can be less accurate if the lemma embedding is used. Moreover, it can be the case that this form does not appear at all in the corpus, whereby the inflected forms cannot be used for the generation or classification task.

For this reason, we propose a novel approach to represent a lexeme by using a special base form token β(w) associated with all inflected forms of each lemma `(w). For this purpose, we extend the Word2Vec algorithm to update the embedding of β(w) every time an inflected word form w of a specific lemma `(w) is encountered during training:

c−5, . . . , c−1, β(w), c+1, . . . , c+5 (15)

Intuitively, this representation will more accurately represent the lemma as it is trained every time an inflected form is encountered in the corpus compared to the embedding of the lemma word form. This method represents a useful alternative to the lemma model as it does not need to encounter an exact occurrence of the lemma. The model trained with this extension is hereafter called the extended model.
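The training extension of Equation 15 can be approximated as a preprocessing step that, for every known inflected form, emits the same context window a second time for the shared base-form token; the token format and function names below are illustrative assumptions, not the actual implementation:

```python
def base_token(lemma):
    # Hypothetical marker for the shared base-form token beta(w)
    return "<BASE:" + lemma + ">"

def training_pairs(tokens, lemma_of, L=5):
    # For every position, yield (center, context) pairs for the surface form and,
    # when the token is a known inflected verb, the same context again for its
    # base-form token (the extension sketched in Equation 15).
    pairs = []
    for i, w in enumerate(tokens):
        ctx = tokens[max(0, i - L):i] + tokens[i + 1:i + 1 + L]
        pairs.append((w, ctx))
        if w in lemma_of:
            pairs.append((base_token(lemma_of[w]), ctx))
    return pairs

pairs = training_pairs(["sie", "liest", "das", "Buch"], {"liest": "lesen"}, L=2)
```

In this way the base-form token accumulates updates from the contexts of all inflected forms of a lemma, rather than only from exact occurrences of the lemma itself.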

In our extended model, the morphology representation for w is created by subtracting the special base form embedding:

v∗(w) = v(w) − v(β(w)) (16)

In contrast to the lemma model, this token is trained as a separate word and is not equal to inflected word forms. Therefore, there is an insignificant probability11 that it is equal to an inflected verb and that its morphological representation will be a zero-vector.

Note that due to the adjustment in the training method, our extended model uses different embeddings than the original and lemma models. Since embeddings are analysed within one model, this should not significantly affect the relation between embeddings, and thereby should not influence the inflectional information captured in embeddings among the different models.

Model Details The vector dimensionality can have an influence on the representation of words. Higher dimensional vectors may capture more morphological information, but may also incorporate more semantic knowledge or lead to over-fitting. As contrastive experiments, the models are analysed with three different vector dimensionalities: 200; 500, as done by Soricut and Och (2015); and 640, as done by Nicolai et al. (2015b).

Depending on the training data, our representation models may not provide embeddings for all words in the inflection dataset. For instance, embeddings are not created for words which occur below the cut-off threshold - or do not occur at all - in the corpus. Compared to the other models, the lemma model has even fewer usable morphology representations. For instance, the model holds undefined morphology representations in the case that no representation is found for the lemma of a word. In addition, when morphology representations are zero-vectors - which occurs when the inflected word is the same as its lemma form - the representations are not usable in some tasks because cosine similarity cannot be calculated.

11Only if one single inflected form for a lemma is encountered in the corpora, the embedding for the base form will be the same as the embedding for the inflected form since it is trained with the exact same contexts.


The numbers of non-zero morphology representations in each model are displayed in Table 6. Almost all categories have fewer non-zero morphology representations in the lemma model, but five categories show an extreme decrease. These categories are: the first person, plural, indicative, present tense (1,p,pr,ind); the third person, plural, indicative, present tense (3,p,pr,ind); the first person, plural, subjunctive, present tense (1,p,pr,subj); the third person, plural, subjunctive, present tense (3,p,pr,subj); and the infinitive (infinitive); highlighted in Table 612. This large decrease is the result of the ambiguity of the inflected word. More specifically, the lemma form is almost always of the same form as the inflected word in these categories. In the subsection below, a more detailed quantitative analysis is given of the ambiguity encountered in the representation models.

Inflection    Orig/Ext   Lemma

1,p,ps,ind    1667       1637
1,p,ps,subj   1592       1562
1,p,pr,ind    1892       6
1,p,pr,subj   1892       7
1,s,ps,ind    1775       1746
1,s,ps,subj   1657       1629
1,s,pr,ind    1310       1307
1,s,pr,subj   1384       1381
2,p,ps,ind    101        101
2,p,ps,subj   50         50
2,p,pr,ind    1889       1839
2,p,pr,subj   347        346
2,s,ps,ind    160        160
2,s,ps,subj   96         96
2,s,pr,ind    1026       1024
2,s,pr,subj   145        145
3,p,ps,ind    1667       1637
3,p,ps,subj   1592       1562
3,p,pr,ind    1892       6
3,p,pr,subj   1892       7
3,s,ps,ind    1775       1746
3,s,ps,subj   1657       1629
3,s,pr,ind    1906       1856
3,s,pr,subj   1384       1381
infinitive    1892       5
part,ps       1855       1767
part,pr       860        859

Total         35,355     25,491
Average       1,309      944

Table 6: Number of non-zero morphology representations for each inflection category for the original model (Orig), the lemma model (Lemma) and the extended model (Ext).

Care is taken, as described in Sections 6.2 and 7.1.2, to evaluate all models on the same number of points. However, under these circumstances the number of zero-vectors or undefined morphology representations in the lemma model will influence the performance, as many representations are not usable.

5.2

Representation Ambiguity

Only 25% of the words for which embeddings are found (3,014 of 12,038) are unambiguous with respect to their inflectional category. The large proportion of ambiguous words in the dataset will make it more difficult to identify inflectional features in the embeddings. For this reason, we introduce ambiguity as an explanatory variable for the extent to which inflectional information is captured in word embeddings.

12The original model and extended model have the same number of non-zero morphological representations. In the original model, the morphological representation of a word is equal to the embedding of the word itself. Therefore, it will not be a zero-vector and will only be undefined when no embedding is found for this word. In contrast, in the extended model the morphological representation is created by subtracting the embedding for a special base form token. In this model, a zero-vector only occurs when the embedding for the word form is the same as the embedding for the base form. However, this does not occur in the training data. Thus, the number of data points for these two models are equal.


This subsection gives a quantitative analysis of the ambiguity in inflectional categories by two metrics: purity and Shannon Entropy. The purity score corresponds to how many words are unambiguous in their category, whereas the Shannon Entropy displays the level of uncertainty if words are ambiguous.

5.2.1 Purity

The purity score (A) of an inflectional category ϕ is the percentage of words in this category which do not belong to any other category. For example, the word ‘gelesen’ solely belongs to the participle past tense (part,ps). If a category has a high purity score, it is expected to have higher performance since it contains fewer ambiguous words. The calculation of the purity score of a category is shown in Equation 17.

A(ϕ) = (1 / |Vϕ|) ∑_{w ∈ Vϕ} u(w) (17)

where Vϕ is the set of words with inflection category ϕ. Whether w is unambiguous with respect to its inflectional category is defined as u(w):

u(w) = 1 if w ∈ Vϕ ∧ w ∉ Vϕ′, ∀ϕ′ ≠ ϕ; 0 otherwise (18)

where u(w) is one if the word w does not belong to the word set of any category other than ϕ, and 0 otherwise.
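Equations 17 and 18 translate directly into code; a small sketch, assuming the words of each category are available as Python sets:

```python
def purity(category, vocab_by_cat):
    # Fraction of words in the category that belong to no other category (Eqs. 17-18)
    words = vocab_by_cat[category]
    unambiguous = sum(
        1 for w in words
        if not any(w in others for cat, others in vocab_by_cat.items() if cat != category)
    )
    return unambiguous / len(words)

# Toy category vocabularies, invented for illustration
vocab_by_cat = {
    "part,pr": {"lesend", "gebend"},
    "part,ps": {"gelesen", "gebraucht"},
    "3,s,pr,ind": {"gebraucht", "liest"},
}
print(purity("part,pr", vocab_by_cat))  # 1.0: no word appears in another category
print(purity("part,ps", vocab_by_cat))  # 0.5: 'gebraucht' is ambiguous
```

A purity of 1.0 corresponds to a fully unambiguous category, such as the participle present in Table 7.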

Table 7 contains the purity scores for the inflectional categories of the German verbs. Analysis shows that there are only a few categories which contain a large number of unambiguous words, i.e. words which belong only to this specific inflectional category. More specifically, the participle present tense (part,pr) is the only category in which all words are unambiguous. Inflectional categories regarding the first and third person show significantly low to zero purity, whereas inflectional categories in the second person vary in purity scores from 3.39% (2,p,pr,ind) to 73.59% (2,s,pr,ind). Performance of highly unambiguous categories in the tasks described in this thesis is expected to be higher than for categories which contain ambiguity.

Inflection    A(ϕ)
1,p,ps,ind    0%
1,p,ps,subj   0%
1,p,pr,ind    0%
1,p,pr,subj   0%
1,s,ps,ind    0%
1,s,ps,subj   0%
1,s,pr,ind    6.11%
1,s,pr,subj   0%
2,p,ps,ind    46.53%
2,p,ps,subj   22.00%
2,p,pr,ind    3.39%
2,p,pr,subj   40.35%
2,s,ps,ind    41.88%
2,s,ps,subj   18.75%
2,s,pr,ind    73.59%
2,s,pr,subj   12.41%
3,p,ps,ind    0%
3,p,ps,subj   0%
3,p,pr,ind    0%
3,p,pr,subj   0%
3,s,ps,ind    0%
3,s,ps,subj   0%
3,s,pr,ind    3.83%
3,s,pr,subj   0%
infinitive    0.05%
part,ps       47.44%
part,pr       100%

Table 7: Purity scores for each inflectional category in the representation model.

5.2.2 Shannon Entropy

As described previously, words can be ambiguous with respect to their inflectional category. The number of inflectional categories a word belongs to can vary. For example, ‘gehört’ belongs to three categories: the participle past tense; the second person, plural, indicative present; and the third person, singular, indicative present. The word ‘belastet’ belongs to the same categories, but in addition also corresponds to the second person, plural, subjunctive present. Under these circumstances, there is a level of uncertainty regarding which inflectional category is intended when a word is encountered. This level of uncertainty can be measured with a Shannon Entropy value.

Let Φ be the set of all inflectional categories for a language. The normalized Shannon Entropy (H) of an inflectional category ϕ ∈ Φ is calculated as shown in Equation 19. Intuitively, a high entropy value indicates that the words in this category can belong to many different categories. A low entropy value corresponds to the scenario where words only belong to a restricted set of other categories. The Shannon Entropy also takes into account the probability that words in ϕ correspond to another category ϕ′ ∈ Φ. For example, if 80% of the words in ϕ also belong to ϕ′, the corresponding term receives a lower score than if only 20% do, as there is less uncertainty about whether a word belongs to ϕ′ as well.

H(ϕ) = − (Σ_{ϕ′ ∈ Φ} p(ϕ, ϕ′) log₂ p(ϕ, ϕ′)) / log₂(|Φ|)    (19)

where the probability¹³ p(ϕ, ϕ′) that words with inflectional category ϕ also belong to inflectional category ϕ′ is calculated as in Equation 20.

p(ϕ, ϕ′) = |Vϕ ∩ Vϕ′| / |Vϕ|    (20)

where Vϕ and Vϕ′ are the sets of words with category ϕ and ϕ′, respectively.
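A minimal sketch of the normalized entropy of Equations 19 and 20, again over a hypothetical `vocab` mapping from categories to word sets (the category names and words are illustrative):

```python
import math

def normalized_entropy(category, vocab):
    """Normalized Shannon Entropy of an inflectional category.
    Sums -p * log2(p) over all categories (Equation 19), where p is the
    overlap fraction of Equation 20, and normalizes by log2(|Phi|)."""
    words = vocab[category]
    h = 0.0
    for other in vocab:
        p = len(words & vocab[other]) / len(words)  # Equation 20
        if p > 0:  # the term 0 * log2(0) is taken as 0
            h -= p * math.log2(p)
    return h / math.log2(len(vocab))

# Toy vocabulary: every word of "B" also occurs in "A" (p = 1, no
# uncertainty), half of "A" overlaps with "B", and "C" is fully pure.
vocab = {
    "A": {"w1", "w2"},
    "B": {"w2"},
    "C": {"w3"},
}
print(normalized_entropy("C", vocab))  # 0.0, like part,pr in Table 8
```

Note that the self-term p(ϕ, ϕ) is always 1 and therefore contributes zero to the sum.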

Analysis of the Shannon Entropy scores displayed in Table 8 shows there is uncertainty in all categories except for the participle present (part,pr). Its zero entropy score follows naturally from being a completely pure category, i.e. there is no uncertainty about which categories the word belongs to as it only belongs to this category. The highest uncertainty is found in participle past tense (part,ps).

Categories with low entropy scores are expected to achieve higher performance than categories with high entropy scores, as there is less uncertainty about which categories a word belongs to.

Inflection    H(ϕ)     Inflection    H(ϕ)     Inflection    H(ϕ)     Inflection    H(ϕ)
1,p,ps,ind    0.352    2,p,ps,ind    0.389    3,p,ps,ind    0.352    infinitive    0.434
1,p,ps,subj   0.356    2,p,ps,subj   0.254    3,p,ps,subj   0.356    part,ps       0.502
1,p,pr,ind    0.433    2,p,pr,ind    0.326    3,p,pr,ind    0.433    part,pr       0.000
1,p,pr,subj   0.433    2,p,pr,subj   0.406    3,p,pr,subj   0.433
1,s,ps,ind    0.335    2,s,ps,ind    0.314    3,s,ps,ind    0.335
1,s,ps,subj   0.344    2,s,ps,subj   0.190    3,s,ps,subj   0.344
1,s,pr,ind    0.265    2,s,pr,ind    0.344    3,s,pr,ind    0.333
1,s,pr,subj   0.223    2,s,pr,subj   0.146    3,s,pr,subj   0.223

Table 8: Normalized Shannon Entropy scores for each inflectional category in the representation model.

¹³Note that the probability is asymmetric: p(ϕ, ϕ′) ≠ p(ϕ′, ϕ) when |Vϕ| ≠ |Vϕ′|. In other words, p(ϕ, ϕ′) is normalized by the size of Vϕ.


Purity versus Shannon Entropy A low purity score does not automatically result in a high entropy score, or vice versa. For example, the first person, singular, subjunctive, present tense (1,s,pr,subj) has a zero purity score (A(1,s,pr,subj) = 0%) and a low entropy score (H(1,s,pr,subj) = 0.223). Further analysis showed that approximately 90% of the words in this category also belong to the first person, singular, indicative, present tense (1,s,pr,ind) and the third person, singular, subjunctive, present tense (3,s,pr,subj), with only a few words belonging to only the latter, and even fewer belonging to both plus a few other categories. The word ‘schreibe’ (write) is a good illustration of a word with the three categories: ‘Ich schreibe den Brief ’ (I write the letter; 1,s,pr,ind), ‘Er sagte, ich schreibe den Brief ’ (He said I am writing the letter; 1,s,pr,subj) and ‘Er sagte, er schreibe den Brief ’ (He said he is writing the letter; 3,s,pr,subj).

Comparatively, a high purity score does not automatically result in a low entropy score. In the case of the participle past (part,ps) (A(part,ps) = 47.44%, H(part,ps) = 0.502), the words which are ambiguous in this category have 9 different inflectional category combinations. Thus far, this thesis has quantified ambiguity in inflectional categories and identified it as an obstacle to the representation of morphological information by unsupervised word embeddings. The generation and classification tasks presented in Section 7 and 8, respectively, will illustrate in detail the extent to which ambiguity affects morphology representations. Before proceeding to the tasks, we introduce the evaluation metrics used throughout this thesis in Section 6.


6 Evaluation Setup

This section describes the setup for evaluation used throughout the experiments. All models in this thesis are evaluated on the same data sets with the use of cross-validation. Different measurements are used for the generation, classification and analogy tasks. As will be explained in the corresponding sections for each task, the tasks have different characteristics and therefore require different evaluation metrics.

First, the cross-validation technique is described, after which the different evaluation metrics are explained.

6.1 Cross-Validation

The generation and classification tasks are evaluated using k-fold cross-validation. Cross-validation is a model evaluation technique in which a subset of the data is held out during training and used as a test set to measure performance. When data on which the model was trained is also used to measure performance, the model can overfit and show high performance even though its performance on unseen data is low. Holding out a test set reduces overfitting and gives insight into how the model will generalize to an independent dataset.

The k-fold cross-validation technique randomly partitions the data into k subsets. The experiments are run for k iterations, where k − 1 subsets are combined into the training set and the remaining subset is used as the test set; on each iteration a different subset is held out. Model performance is averaged over all iterations. The advantage of k-fold cross-validation is that the manner in which the data is split has less influence on the overall performance score. For the experiments in this thesis, k is set to 10, i.e. the inflection morphology data set is divided into 10 subsets of lemmas. Each lemma corresponds to multiple (ℓ(w), ϕ)-pairs. This data splitting technique simulates the scenario where a linguist has partially annotated a subset of lemmas with the inflected words for each category, and the inflected words for the remaining lemmas are to be generated.
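The lemma-level splitting scheme described above can be sketched as follows. The `pairs` records and the `lemma_kfold` helper are illustrative assumptions, not the thesis's actual implementation; what matters is that folds are formed over lemmas, so all inflected forms of a lemma land in the same fold:

```python
import random

def lemma_kfold(pairs, k=10, seed=0):
    """Yield (train, test) splits of (lemma, category, form) records,
    assigning whole lemmas to folds so a lemma never spans both sets."""
    lemmas = sorted({lemma for lemma, _, _ in pairs})
    rng = random.Random(seed)       # fixed seed for reproducible folds
    rng.shuffle(lemmas)
    fold_of = {lemma: i % k for i, lemma in enumerate(lemmas)}
    for test_fold in range(k):
        train = [p for p in pairs if fold_of[p[0]] != test_fold]
        test = [p for p in pairs if fold_of[p[0]] == test_fold]
        yield train, test

# Toy example with three lemmas and k = 3.
pairs = [
    ("lesen", "part,ps", "gelesen"),
    ("lesen", "3,s,pr,ind", "liest"),
    ("schreiben", "1,s,pr,ind", "schreibe"),
    ("gehen", "infinitive", "gehen"),
]
for train, test in lemma_kfold(pairs, k=3):
    # no lemma appears on both sides of a split
    assert not ({l for l, _, _ in train} & {l for l, _, _ in test})
```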

6.2 Evaluation Metrics

The experiments in this thesis require different evaluation metrics, as the tasks have different characteristics. All experiments return a list of items ranked by similarity: the generation and analogy tasks return a list of words, whereas the classification task returns a list of inflection categories. In addition, the tasks differ in the number of correct answers: the generation task has only a single correct answer, while the analogy and classification tasks can have multiple correct answers. In addition to overall model performance, performance for individual inflectional categories is evaluated for each task.

Table 9 gives an overview of the evaluation metrics used for each task. The output of each task is further described in its corresponding section.
