Synthesizing Data to Improve Neural Morphological Inflection

Sanne Eggengoor 10729895

Bachelor thesis
Credits: 18 EC

Bachelor's programme Artificial Intelligence (Kunstmatige Intelligentie)
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
A. Bisazza, PhD
Information and Language Processing Systems
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Summary

The process of inflection consists of the modification of words to express grammatical categories and can be modelled using a sequence-to-sequence model based on Recurrent Neural Networks (RNNs). However, results with these models are near, but not yet at, the state of the art. Because of the similarity of this task with Neural Machine Translation (NMT), the performance of these models might increase when unlabeled data is added to the original data, as this increases the performance of NMT. In this project an attempt is made to improve the performance of Morphological Inflection for Latin and Turkish by adding unlabeled data from Wikipedia. This data is completed with labels by synthesizing the lemmas and morphological tags using a similar sequence-to-sequence model, which is trained on the same data but in the opposite direction. Subsequently this data is added, in different ratios, to training sets of different sizes for both languages. The first two configurations gave promising results, as there is a significant increase in performance after adding synthetic data. However, when moving the configuration closer to the state of the art, the increase fades away. To be able to draw more reliable conclusions on whether adding synthetic data improves the task of morphological inflection, more research needs to be done.

Acknowledgements

I would like to thank my supervisor, Arianna Bisazza, for all her good advice and guidance during this project.


Contents

1 Introduction
2 Theoretical foundation
2.1 Neural Machine Translation
2.2 Morphological Encoder-Decoder of Kann and Schütze
2.3 Unlabeled Corpora
3 Method
3.1 Tokenizing
3.2 Data
3.3 Baseline
3.4 Adding Unlabeled Data
4 Experiments
4.1 Evaluation
4.2 Convergence and the Number of Epochs
4.3 Parameter Settings
4.4 Languages
4.4.1 Latin
4.4.2 Turkish
4.5 First Configuration
4.5.1 Baseline and Synthesizing Results
4.5.2 Results with Additional Synthetic Data
4.6 Second Configuration
4.6.1 Baseline and Synthesizing Results
4.6.2 Results with Additional Synthetic Data
4.7 Final Configuration
4.7.1 Baseline and Synthesizing Results
4.7.2 Results with Additional Synthetic Data

1 Introduction

The process of morphological inflection exists in many languages of the world. This process is responsible for the modification of words to express grammatical categories. For example, the suffix ‘-ed’ can be added to the verb ‘walk’ to express that the action happened in the past. In languages such as English and Dutch this process is not very complicated, as there are, for example, only four possible inflections per regular English verb. This is in contrast to languages such as Polish and Turkish, where up to 100 inflections per verb can exist. The differences between these types of languages are clearly visible in the following examples:

English:

work.past+Plural→worked

work.present+3rd+Singular→works

Turkish:

dermek.ind+1st+Pl+Prs+Prog+Pos+Intr→deriyor muyuz?
dermek.ind+1st+Pl+Pst+Neg+Decl→dermedik

In the Turkish example much more grammatical information is added to the verb. As a result, the process of inflection is much more complicated in this type of language. In addition, the similarity between two inflections of the same word is harder to recognize for non-speakers of the language and computer systems, which leads to data sparsity in various tasks of language processing (Cotterell et al., 2016). For this reason, the automation of the inflection process might increase the amount of data available for those tasks. This might lead to an improvement in performance of certain tasks, such as Machine Translation.

In 2016 a shared task was released (Cotterell et al., 2016), whose goal was to create a system that can be used for the generation of inflections in ten languages. Multiple teams took part in this shared task, but the approach used by Kann and Schütze (2016b) was the most successful. This approach is similar to the one used by current state-of-the-art Machine Translation, namely sequence-to-sequence Neural Machine Translation with Recurrent Neural Networks (RNNs). These systems will be further explained in the Theoretical Foundation section of this thesis.

The task defined by Cotterell et al. consisted of three subtasks. In this project the focus will be on one of these, namely the conversion from a lemma to a target form. An example of this task is:

play + V;3;SG;PRS→plays

The approach of Kann and Schütze obtained an average accuracy of 95% on this task when enough data is provided (Kann & Schütze, 2016b). However, there is still ample room for improvement in some languages, especially when limited training data is available (the low-resource setting). Because of the similarity between the model used for reinflection and state-of-the-art Neural Machine Translation models, it is interesting to try a technique that is responsible for improvements in Neural Machine Translation. One of these possible approaches is the generation of synthetic training data by using unlabeled data, i.e. unannotated text, which is typically easy and inexpensive to obtain (Sennrich et al., 2015). This data is generated by running the system the other way around, with the target forms as input and the lemmas and target tags as output. As a result, new training data is created, which is noisy but potentially useful when only small amounts of gold training data are available. This leads to the research question of this thesis:

To what extent can a state-of-the-art morphological inflection model based on sequence-to-sequence RNN be improved by the addition of synthetic training data?

Because of the similarity between the task of inflection and Neural Machine Translation, the hypothesis is that adding unlabeled data will improve performance on the task of inflection. In this thesis the research question will be answered as follows. First, a Theoretical Foundation is given, in which the systems and theories used in both previous work and this thesis are introduced and explained. Secondly, there is a section about the data and method used. This part is followed by one about the results and evaluation of this research. Lastly, there is a section with the conclusion, discussion and recommendations for further research.

2 Theoretical foundation

Traditional inflection programs are mostly hard-coded linguistic finite state machines, which have as disadvantages that they lack robustness and that they have to be written manually by linguistic experts (Faruqui et al., 2015). Therefore a new approach might be preferable.

In 2016 a shared task (SIGMORPHON-2016) was released (Cotterell et al., 2016). This task concerned the successful reinflection of words in 10 languages, of which two were not included in the original development data and thus functioned as surprise languages. The task had essentially three levels, of which the first level only included the inflection of a lemma. In this task the input is a lemma and a target tag and the output is the correctly inflected form of the lemma. An example is:

[run + target tag: PresPart]→[running]

In the second task, the input not only contains a target tag and a source word, but instead of the lemma, it contains an already inflected form and a source tag. An example is:

[ran + source tag:Past + target tag:PresPart]→[running]

The input of the third and hardest task does not include the source tag. Instead, all the information necessary should be retrieved by the system from only the source word. An example of this task is:

[ran + target tag:PresPart]→[running]

In this thesis the focus will be on the first subtask.

The best results (95.56% accuracy on the first task and 90.95% on the third task) were obtained by a research group from LMU Munich (Kann & Schütze, 2016b) and are near state-of-the-art. For these results, an approach similar to approaches in modern Machine Translation was used by the team of Kann, namely an RNN with an Encoder-Decoder structure (Bahdanau et al., 2014).

However, solving Morphological Reinflection problems with Neural Networks is a very new approach and is therefore not near perfection. For this reason, further research needs to be done, which is one of the main goals of this project. In addition, to be able to analyze and improve current results, it is very important to have a clear understanding of how the current systems work.

2.1 Neural Machine Translation

While for years the most frequently used form of Machine Translation was Phrase-Based Machine Translation, where statistics and large amounts of data are used to generate translations, Neural Machine Translation is quickly emerging. Generally these systems have an Encoder-Decoder architecture, which means that the original input is first encoded into a vector of fixed length. Thereafter a decoder retrieves the target sentence from this vector. This Encoder-Decoder pair is trained as a single system, so that both parts are perfectly attuned to each other. Both parts are Recurrent Neural Networks (RNNs), which means that they make use of previously processed input. For example, when the encoder is processing the third word of the source sentence, the first and second word are also considered. Moreover, the decoder considers not only the source words but also the first and second word of the translated sentence when predicting the third translated word. Figure 1 is a schematic view of an RNN with an Encoder-Decoder structure used for translation.


Figure 1: Schematic overview of a Deep RNN used for Machine Translation (Luong et al., 2016)
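To make this architecture concrete, the following minimal sketch shows a character-level encoder-decoder without attention in PyTorch. The vocabulary size and dimensions are arbitrary illustrations, and this is not the exact system evaluated later in this thesis.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size=100, hidden_size=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.LSTM(emb_size, hidden_size, batch_first=True)

    def forward(self, src):                     # src: (batch, src_len) of symbol indices
        emb = self.embed(src)
        outputs, state = self.rnn(emb)          # state summarizes the whole input sequence
        return outputs, state

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_size=100, hidden_size=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, state):              # tgt: (batch, tgt_len), teacher forcing
        emb = self.embed(tgt)
        outputs, state = self.rnn(emb, state)
        return self.out(outputs), state         # logits over target symbols

# Usage: encode the source once, then decode conditioned on the encoder's final state.
enc, dec = Encoder(60), Decoder(60)
src = torch.randint(0, 60, (8, 15))
tgt = torch.randint(0, 60, (8, 12))
_, state = enc(src)
logits, _ = dec(tgt, state)                     # shape (8, 12, 60)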

The model used by Kann and Schütze is mostly based on the models by Bahdanau, Cho, and Bengio (2014); Sutskever, Vinyals, and Le (2014); and Cho, Van Merriënboer, Bahdanau, and Bengio (2014), which introduce a few changes with respect to figure 1. Whereas a system such as the one in the previous figure turned out to show quickly decreasing performance as the sentences got longer, this problem was solved by Bahdanau, Cho, and Bengio (2014) with the introduction of an attention mechanism, where, at each time step, all input word representations are weighted and provided to the decoder for the prediction of the next output word. This way the right word gets predicted at the right spot in the sentence. This not only improves computational tractability, but also keeps the performance of Neural Machine Translation high for longer sentences.

2.2 Morphological Encoder-Decoder of Kann and Schütze

The system proposed by Kann and Schütze (2016) is an adaptation of the system by Bahdanau, Cho, and Bengio (2014). Because Morphological Reinflection is not as complex as machine translation, fewer hidden units are used (100). Furthermore, the input of the system is a single sequence that contains all the information. Another difference with the previously mentioned Deep Recurrent network (figure 1) is that this system is a Bidirectional RNN, which means that not only the previous symbols are processed, but also the following symbols in the sequence. The sequence starts with a start symbol, which is followed by the source tag, target tag and source word, and is concluded with an end symbol. Similarly, the output of the system is also a single sequence, but contains only a start symbol, the target word and an end symbol. Below is an example of an input and output sequence:


Input: S_start + Present.3rd.Singular + Past.3rd.Plural + writes + S_end
Output: S_start + wrote + S_end

As this research only covers the first level of the shared task, that is, going from a lemma and a target morphological tag to the corresponding inflected form, the input and output for this task would be similar to the following example:

Input: S_start + Past.3rd.Plural + write + S_end
Output: S_start + wrote + S_end

Figure 2: Schematic overview of a Bidirectional RNN as used by Kann and Schütze

The encoder of the system of Kann and Schütze consists of an RNN, which encodes the information of the input into a vector c by

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j    (1)

with hidden units h that are LSTM units:

h_j = [\overrightarrow{h}_j^T ; \overleftarrow{h}_j^T]^T    (2)

where \overrightarrow{h}_j and \overleftarrow{h}_j are vectors that contain a summary of every input character/tag. The attention weights \alpha are given by

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}    (3)

with

e_{ij} = a(s_{i-1}, h_j)    (4)

where a is a neural network that is jointly trained with the rest of the components and s_t is an RNN hidden state for time t (Kann & Schütze, 2016a). The decoder then predicts the conditional probability of the outputs y by

p(y) = \prod_{t=1}^{T_y} g(y_{t-1}, s_t, c)    (5)

with g being a nonlinear function. The Bidirectional RNN as used by Kann and Schütze is shown in figure 2.
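To make the attention computation of equations (1), (3) and (4) concrete, the following small numpy sketch computes the attention weights and the context vector for one decoder step. The dot-product score used here is a simplification standing in for the jointly trained network a(·), so this is an illustration rather than the actual model.

import numpy as np

def attention_context(decoder_state, encoder_states):
    """decoder_state s_{i-1}: shape (d,); encoder_states h_1..h_Tx: shape (T_x, d)."""
    e = encoder_states @ decoder_state        # e_ij = a(s_{i-1}, h_j), here a plain dot product
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()               # equation (3): softmax over the input positions
    return alpha @ encoder_states             # equation (1): weighted sum of the annotations

# Usage with random vectors standing in for the hidden states:
h = np.random.randn(5, 100)                   # annotations h_1 .. h_5
s = np.random.randn(100)                      # previous decoder state s_{i-1}
c = attention_context(s, h)                   # context vector provided to the decoder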

The results of this system are very good, with an average of 95.56% for the ten different languages released by SIGMORPHON in the first task and an average of 90.95% on the third task. Furthermore, the results stay high when the amount of training data for a single tag pair combination is decreased. A tag pair combination consists of the relationship between two grammatical categories, namely the source tags and the target tags. In cases where, for example, few samples are available that go from ‘familiar second person’ to ‘past plural’, the prediction on new samples of that combination does not decrease in performance. The researchers conclude with the suggestion of adding additional sources of information to the system to improve accuracy, such as unlabeled corpora.

2.3 Unlabeled Corpora

Sennrich et al. (2015) have successfully used unlabeled data to improve their Neural Machine Translation system. A system similar to the one created by Bahdanau, Cho, and Bengio (2014) was used, with which they tried two separate tactics for including this data in the model. The first tactic was to use pairs of an empty source string and the unlabeled target string. More successful was their second idea, which was to synthesize a source string by generating it with a baseline neural machine translation system that works exactly the other way around. For example, if in the original system the translation task is from French to German, new data is generated by training a second model from German to French. With this model additional French source data is created from untranslated German sentences. As a result many more sentence pairs can be added to the original data set, which improves performance on the French to German task. Another advantage of this tactic is that these pairs can be treated like any other training example, so the parameter settings in the encoder do not need to be frozen. Adding these pairs in a ratio of 1:1 to the model results in a great improvement.

It would be very interesting to combine this technique with the available Neural Networks on Morphological Reinflection.

3 Method

3.1 Tokenizing

Before it is possible to train the model, the data needs to be pre-processed. One of the first steps of pre-processing is converting the data to Unicode, which makes sure that all characters are correctly represented. Thereafter the data needs to be tokenized, because the model needs to work on character level instead of word level. The first method of tokenizing used in this project is splitting all characters, including the characters in the tags, which is done using a self-written Python script. This method is used in the first exploratory experiments and is later replaced by another method of pre-processing. An example of this method of tokenizing is given in figure 3.

giving V;V.PTCP;PRS
↓
g i v i n g V V P T C P P R S

Figure 3: Tokenizing with separating tags

However, this method of tokenizing is not exactly the same as the one used by Kann and Schütze (2016a). Instead of splitting all characters, only the characters in the original word are split. The tags, in contrast, are treated as single tokens and separated by spaces. This process is also done using a self-written Python script. After tokenizing, the model processes the input per character and per tag. An example of this tokenizing process is shown in figure 4.

giving V;V.PTCP;PRS
↓
g i v i n g V V.PTCP PRS

Figure 4: Tokenizing without separating tags
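The two tokenization schemes of figures 3 and 4 could be implemented roughly as follows. This is a sketch of the self-written scripts mentioned above, assuming the tags are separated by ';' as in the examples.

def tokenize_split_tags(word, tags):
    """First configuration: every character, including those in the tags, is a token."""
    tag_chars = [c for tag in tags.split(";") for c in tag if c != "."]
    return " ".join(list(word) + tag_chars)

def tokenize_whole_tags(word, tags):
    """Later configurations: characters of the word plus each tag as a single token."""
    return " ".join(list(word) + tags.split(";"))

print(tokenize_split_tags("giving", "V;V.PTCP;PRS"))   # g i v i n g V V P T C P P R S
print(tokenize_whole_tags("giving", "V;V.PTCP;PRS"))   # g i v i n g V V.PTCP PRS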

3.2 Data

The data used in this project is provided by this year's SIGMORPHON shared task (https://github.com/sigmorphon/conll2017/tree/master/all/task1), which contains data for 52 languages in different data set sizes. For every language there is a development set (1000 samples) and three training sets (100, 1000 and 10,000 samples), where the two smallest sets are subsets of the largest one. Furthermore, there is a test set, but the correct solutions had not yet been uploaded at the time of writing. Consequently, the development set is used as a test set. As a result, the data used in this project consists only of the largest training set and the development set. Because one of the subjects covered by this research is the influence of the size of the training set, multiple new training sets of different sizes (1000, 3000 and 5000) are derived from the largest one. Subsequently, 2% of each training set is used as a validation set, which is needed by the system.
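A sketch of how the derived training sets and validation splits could be produced is given below. The file names are illustrative, and whether the subsets are drawn randomly or taken as the first lines of the largest set is an assumption.

import random

def make_subsets(path, sizes=(1000, 3000, 5000), valid_fraction=0.02, seed=1):
    with open(path, encoding="utf-8") as f:
        lines = [l.rstrip("\n") for l in f if l.strip()]
    random.Random(seed).shuffle(lines)                       # assumption: random subsets
    for size in sizes:
        subset = lines[:size]
        n_valid = int(valid_fraction * size)                 # 2% held out as validation data
        train, valid = subset[n_valid:], subset[:n_valid]
        with open(f"train_{size}.txt", "w", encoding="utf-8") as f:
            f.write("\n".join(train))
        with open(f"valid_{size}.txt", "w", encoding="utf-8") as f:
            f.write("\n".join(valid))

# make_subsets("latin-train-high")   # hypothetical name of the 10,000-sample file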

3.3 Baseline

After tokenizing, the model is initialized with 100 hidden units for both the encoder and the decoder, the same as in the model used by Kann and Schütze. The embedding size is set to 100 and the chosen optimization technique is ‘adadelta’. An important difference with the research by Kann and Schütze is that in this project both a normal RNN and a Bidirectional RNN are used. Another important decision is the number of epochs used to train the models. In one epoch the model processes every data pair exactly once. For example, every source-target pair is processed 20 times by a model that is trained for 20 epochs.

In general, models that are trained for more epochs perform better, but this also results in long training times and a greater risk of overfitting. To find a suitable number of epochs, the performance is compared for models trained for different numbers of epochs. Using this information, an appropriate number of epochs can be determined.

The model is implemented using OpenNMT (Klein et al., n.d.), which is an open-source system for Neural Machine Translation models based, among others, on the research by Bahdanau, Cho, & Bengio (2014).

As a result of these steps, a baseline is created. An example of the input and output of the model is in figure 5.

Figure 5: Input and output of morphological generation system

3.4 Adding Unlabeled Data

The unlabeled data that will be added to the original data is provided by Wikipedia dumps (dumps.wikimedia.org). These dumps were allowed as bonus data in the original SIGMORPHON task from 2016 (Cotterell et al., 2016). The Wikipedia dumps cannot be added to the data directly, so some pre-processing is required here as well.

In the first step of pre-processing the unlabeled data, a large part (7000 articles) of the textual data is extracted from the XML dump. Subsequently, the occurrences of every word are counted, which results in a list of all the words from the articles, sorted by frequency. This is important because words that appear very rarely are most likely not representative of the language, and words that appear very often are mostly so-called stopwords. These are the most common words in a language; in many cases they are not inflected and therefore they are not representative for this task. Thereafter the length of every word in this list is computed, as many 2-, 3- or 4-character words are abbreviations or acronyms. Lastly, a set of 10,000 words is obtained from this list; these words are more than 4 characters long and have a medium frequency of occurrence.

To add this data to the data set, source words and target tags are also needed. These are synthesized using the same system as is used to calculate the baseline. An example of the synthesizing process is shown in figure 6.

Figure 6: Example of synthesizing source data from unlabeled data

In order to do this, a new model needs to be trained, with the target forms as training input and the source forms as training output. This model is trained using the same parameters as the baseline model. After training, this model is applied to the unlabeled words to synthesize the source words and tags that belong to them.
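The selection of Wikipedia words described above could look roughly like the following sketch. The exact frequency thresholds used to exclude rare words and stopword-like words are illustrative assumptions; only the minimum length of more than 4 characters and the target size of 10,000 words are fixed by the description above.

import re
from collections import Counter

def select_words(text, n_words=10000, min_length=5):
    counts = Counter(re.findall(r"\w+", text.lower()))
    candidates = [(freq, w) for w, freq in counts.items()
                  if len(w) >= min_length and freq > 1]   # drop very short and very rare words
    candidates.sort(reverse=True)                         # most frequent first
    skip = len(candidates) // 10                          # skip the most frequent (stopword-like) types
    return [w for _, w in candidates[skip:skip + n_words]]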

These synthetic data pairs are added to the original train set in different ratios (golden:synthetic), namely 2:1, 1:1, 2:3 and 1:2. For an original training set of size 5000 this results in four extended sets of size 7500, 10,000, 12,500 and 15,000. These sets are then used to train new models, again using the same parameters as in the previous models. The validation sets remain the same as in the baseline.
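Mixing the gold and synthetic pairs in a given ratio then amounts to something like the following sketch (the function name is illustrative):

def extend_training_set(gold_pairs, synthetic_pairs, ratio=(2, 1)):
    """Add synthetic pairs in the given golden:synthetic ratio, e.g. 2:1 adds half as
    many synthetic pairs as there are gold pairs."""
    gold_part, synth_part = ratio
    n_synthetic = len(gold_pairs) * synth_part // gold_part
    return gold_pairs + synthetic_pairs[:n_synthetic]

# 5000 gold pairs in ratio 1:2 -> 10,000 synthetic pairs added, 15,000 pairs in total.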

4 Experiments

4.1 Evaluation

Evaluation of results is necessary for drawing valid conclusions. Therefore two methods of evaluation are used in this project, namely Accuracy and Levenshtein Distance.

Accuracy: The proportion of correct predictions is given by the accuracy as a percentage, which can be calculated using equation 6. Hence, the maximal accuracy is 100%, the minimal accuracy is 0%, and the highest accuracy is assigned to the model with the best performance.

\text{Accuracy} = \frac{\#\text{ right predictions}}{\#\text{ total predictions}} \cdot 100\%    (6)

Levenshtein Distance: There is a great difference in how bad a mistake is. For example, a prediction with not a single character matching the correct word and a prediction with only one character different from the correct word are not equally wrong. However, the accuracy measure treats them equally. Consequently, another measure is also interesting to look into. A suitable measure for this is the Levenshtein Distance, which calculates the minimal number of operations necessary for changing one word into another. Possible operations are deletion, insertion and substitution. For example, the Levenshtein Distance between ‘notebook’ and ‘looking’ is 7 (see figure 7).

First word    n o t e b o o k
Operation     s   s s s s s d
Second word   l o o k i n g -

Figure 7: Seven operations are needed to transform ‘notebook’ into ‘looking’, namely six substitutions (s) and one deletion (d)

When calculating the Levenshtein distance over a set of predictions, the average is calculated using equation 7. An average Levenshtein Distance of 0 is the lowest possible value and means every prediction is equal to the correct word; the maximal value is equal to the average number of characters of the words in the set. In general, the lowest Levenshtein Distance belongs to the model with the best performance.

\text{Average Levenshtein Distance} = \frac{\sum_{i=1}^{n} \text{Levenshtein}(i)}{n}, \quad \text{with } n = \text{number of total predictions}    (7)
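Both measures can be computed with a short script. The following sketch implements equations (6) and (7) with a standard dynamic-programming Levenshtein distance; the input is a list of (prediction, target) string pairs.

def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = cur
    return prev[-1]

def evaluate(pairs):
    accuracy = 100.0 * sum(p == t for p, t in pairs) / len(pairs)     # equation (6)
    avg_lev = sum(levenshtein(p, t) for p, t in pairs) / len(pairs)   # equation (7)
    return accuracy, avg_lev

print(levenshtein("worked", "works"))   # 2: one substitution and one deletion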

4.2 Convergence and the Number of Epochs

An important feature of the models used in this project is the number of epochs chosen for training. As stated before, a large number of epochs might lead to optimal results, but it also leads to very long training times and a serious risk of overfitting. On the other hand, a small number of epochs might result in non-representative and poor results. It is therefore very important to choose the right number of epochs. This decision is made using results on a validation set after different numbers of epochs. In figures 8 and 9 the performances of two models, with different data set sizes, are plotted against the number of epochs. In these figures, the blue lines represent the accuracy and the orange lines represent the Levenshtein Distance.

Figure 8 shows that the performance of a model with 1000 golden samples is fairly constant after 100 epochs and still improves only slightly up to 200 epochs. However, this increase is not very big, and for reasons of time the training of models with 1000 golden samples is stopped after 200 epochs.

Figure 9 shows that the performance of a model with 3000 golden samples is fairly constant after around 60 epochs. To be more certain that this convergence is achieved in all the models, the training of models with 3000 golden samples or more is stopped after 100 epochs.

Figure 8: Performance on baseline with 1000 golden samples. The model used here is a Bidirectional RNN

Figure 9: Performance on baseline with 3000 golden samples. The model used here is an RNN

4.3 Parameter Settings

With the number of epochs determined all the parameter settings are known. These are described in table 1.


Parameter                       Value/Type
Optimization Method             Adadelta
RNN size                        100 hidden units
Number of Epochs                200 (models with 1000 golden samples) or 100 (other models)
Embedding Size                  100
RNN type                        RNN (where the type is BRNN (bidirectional RNN), this is explicitly mentioned)
Type of Recurrent Cell          LSTM
Number of Recurrent Layers      2
Max source and target length    None
Attention                       Global

Table 1: The standard parameter settings for all models.

4.4 Languages

The new data released by SIGMORPHON this year consists of data for 52 languages. As a result, it is not possible to use every language in this project. At first five languages were selected, namely Polish, Danish, Turkish, Arabic and Latin. These languages all belong to very different language families and therefore give the most representative results. The results of the first experiments with these five languages can be found in table 2. Latin clearly has the lowest results to start with, so this language was selected for further research. The other language chosen for further research was Turkish, as a representative of agglutinative languages. Agglutinative languages are well known for their complex and rich morphology and are thus interesting to include in this research.

Language Accuracy (%) Average Levenshtein Distance

Polish 85.5 0.47

Danish 93.9 0.156

Turkish 96.9 0.131

Arabic 93.4 0.526

Latin 69.9 1.1

Table 2: Results of the first experiments on these five languages. The models were run for 200 epochs and are RNNs. These results were obtained with separated tags and a normal RNN (first configuration).

4.4.1 Latin

The first language chosen is Latin, an extinct language once spoken in the Roman Empire. This language belongs to the Indo-European language family and is known as a highly inflective language. This means that there are many possible inflections per word, which complicates the process of morphological inflection. In Latin a vowel can be either ‘long’ or ‘short’; in many cases this is determined by position, but some vowels are long by definition. However, in some cases there are multiple words with the same spelling but with a different distribution of vowel length. An example of this is the difference between ‘lī-ber’ (= ‘free’) and ‘li-ber’ (= ‘book’). In almost all cases this vowel length does not affect the inflections of words.

In the original data set provided by SIGMORPHON the diacritics for vowel length are included and displayed using a bar (macron) above long vowels; ‘Ovillī’ is an example where the last ‘i’ is long. However, in the unlabeled data from the Latin Wikipedia these diacritics are not included. Adding this data to the original data set would confuse the model, and thus the bars are deleted from the original set. This means that not all the information provided is used, but as the vowel length does not affect the inflections in most cases, this is only a minor simplification of the problem.
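Removing the vowel-length bars can be done with a few lines of Python, assuming the macrons are encoded as combining characters after Unicode decomposition:

import unicodedata

def strip_macrons(text):
    decomposed = unicodedata.normalize("NFD", text)
    without = "".join(ch for ch in decomposed if ch != "\u0304")   # U+0304 = combining macron
    return unicodedata.normalize("NFC", without)

print(strip_macrons("Ovillī"))   # Ovilli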

4.4.2 Turkish

The second language chosen is Turkish, which is spoken in Turkey and is part of the Altaic language family. Turkish is a so-called ‘agglutinative’ language, which means that many small parts can be added to words to express many grammatical categories. As a result, this language has a fairly complex inflectional system, which might complicate the task of Morphological Inflection.

4.5 First Configuration

Results are obtained in different configurations and are therefore given in separate sections. All these sections follow a similar structure: first, the baseline and synthesizing results are given, and thereafter the results after adding the synthetic data are shown.

In the first configuration the tags are separated as shown in figure 3 and a non-bidirectional RNN is used. This method is tested on Latin.

4.5.1 Baseline and Synthesizing Results

The results of the baseline models in the first configuration are shown in figure 10, where the blue lines represent the accuracy and the orange lines the Levenshtein distance. Performance in these models decreases when the amount of available training data is reduced and does not yet seem to have stabilized at 10,000 samples.

Before moving to the results of the models with added synthetic data, the performance of the synthesizing itself is given in figure 11. The task of synthesizing consists of generating the lemma and grammatical tags that belong to an unlabeled word; these pairs are then used to extend the training sets. Therefore, a high performance on the synthesizing task results in ‘better’ synthetic data to be added to the train sets. This performance is calculated on the same test set, but uses the target words as input and the source word plus target tags as output. As visible in figure 11, this task also suffers from data sparsity, as the performance on small data sets is lower. An important remark on the performance of the synthesizing task is that there might be multiple correct answers; however, only one of these solutions is considered correct in the evaluation.


Figure 10: Results of Baseline Models in Latin, with separated tags.

Figure 11: Results of Synthesizing Models in Latin, with separated tags.

4.5.2 Results with Additional Synthetic Data

Thereafter the unlabeled words obtained from Wikipedia are added to the train set in different ratios (2:1, 1:1 and 2:3). The results of the inflection models after this extension of the train set are given in table 3 and figures 12, 13 and 14. In the figures, the dashed line represents the performance when no synthetic data is added. As a result, the performance is higher than the baseline when the line in the same color is higher than the dashed line.

In the models with 3000 and 5000 Golden samples the performance increased substantially, with significant improvements in the models with 3000 samples (p<0.01, with 4500 added samples) according to a binomial test on the accuracy. In the models with 10,000 Golden samples there is only a decrease visible.

Ratio:            No added data    2:1             1:1             2:3
#golden samples   Acc     Lev      Acc     Lev     Acc     Lev     Acc     Lev
3000              71.7    0.849    72.6    0.765   74.7    0.712   76.4    0.612
5000              79.6    0.574    80.4    0.492   80.6    0.504   -       -
10000             91.4    0.218    88.3    0.315   88.3    0.303   -       -

Table 3: Performances on Latin models with different amounts of added synthesized data. Italic numbers represent models which are an improvement on the baseline. This model uses separate tags and is a normal RNN.
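The binomial test mentioned above could be carried out roughly as follows; treating the baseline accuracy as the null proportion is an assumption about the exact test setup, and scipy is used for the computation.

from scipy.stats import binomtest

def significant_improvement(n_correct_extended, n_test, baseline_accuracy, alpha=0.01):
    """One-sided binomial test of the extended model against the baseline accuracy."""
    result = binomtest(n_correct_extended, n_test, p=baseline_accuracy, alternative="greater")
    return result.pvalue < alpha, result.pvalue

# Illustration with the Latin 3000-sample setting of table 3: baseline accuracy 71.7%,
# extended model (ratio 2:3) 76.4% correct on a 1000-item test set.
print(significant_improvement(764, 1000, 0.717))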


Figure 12: Results of Inflection Models in Latin, with 3000 Golden Samples, separated tags and different amounts of added synthetic samples.

Figure 13: Results of Inflection Models in Latin, with 5000 Golden Samples, separated tags and different amounts of added synthetic samples.

Figure 14: Results of Inflection Models in Latin, with 10,000 Golden Samples, separated tags and different amounts of added synthetic samples.

4.6 Second Configuration

In the second configuration the data is tokenized without separating tags, comparable to figure 4. This method of tokenizing is the same as the one used by Kann and Schütze. The language used in this section is Turkish.

4.6.1 Baseline and Synthesizing Results

The results of the baseline models in the second configuration are shown in figure 15. As in figure 10, the colors represent the two performance measures. The performance in these models decreases when the amount of available training data is reduced and seems to achieve (sub-)optimal results above 3000 samples.

Before moving to the results of the models with added synthetic data, the performance of the synthesizing itself is given in figure 16. This performance is calculated on the same test set, but uses the target words as input and the source word and target tags as output. As visible in this figure, this task also suffers from data sparsity, especially in the models with only 1000 samples.

Figure 15: Results of Baseline Models in Turkish, without separated tags.

Figure 16: Results of Synthesizing Models in Turkish, without separated tags.

4.6.2 Results with Additional Synthetic Data

After synthesizing lemmas and grammatical tags, the synthetic data is added to train sets of different sizes (1000, 3000 and 5000 samples) in different ratios (2:1, 1:1, 2:3 and 1:2). The performances of the resulting models are given in table 4 and figures 17, 18 and 19. As in the figures from the first configuration, the dashed lines represent the baseline. Only in the models with 1000 Golden samples is there an increase in performance, which is significant in all three cases (p<0.01). However, this increase is somewhat unpredictable, as there is also a configuration with an enormous decrease in performance.

In general, the results after adding more synthetic samples than Golden samples seem to be unreliable and unpredictable.

Ratio:            No added data    2:1             1:1             2:3             1:2
#golden samples   Acc     Lev      Acc     Lev     Acc     Lev     Acc     Lev     Acc     Lev
1000              53.9    2.980    69.1    1.560   63.5    1.888   25.2    5.105   64.3    1.943
3000              91.2    0.342    69.5    1.894   66.7    2.095   69.4    1.811   83.0    0.730
5000              95.4    0.154    92.6    0.205   84.3    0.920   84.1    0.790   89.9    0.316

Table 4: Performances on Turkish models with different amounts of added synthesized data. Italic numbers represent models which are an improvement on the baseline. This model is a normal RNN.


Figure 17: Results of Inflection Models in Turkish, with 1000 Golden Samples and different amounts of added synthetic samples.

Figure 18: Results of Inflection Models in Turkish, with 3000 Golden Samples and different amounts of added synthetic samples.

Figure 19: Results of Inflection Models in Turkish, with 5000 Golden Samples and different amounts of added synthetic samples.

4.7 Final Configuration

The final configuration is most similar to the current state of the art and consists of a Bidirectional RNN, identical to the one shown in figure 2. In addition, the method of tokenizing is without separating the tags. In this configuration, both languages are addressed.

4.7.1 Baseline and Synthesizing Results

The baseline results of Morphological Inflection in Turkish and Latin are shown in figures 20 and 21. In this configuration the baseline is already substantially higher than in the previous configurations, which are shown in figures 15 and 10, respectively. Still, these models do suffer from data sparsity, although the differences between the highest and lowest performances are not as large as in the previous models.


Figure 20: Results of Baseline Models in Turkish, without separated tags and in a BRNN.

Figure 21: Results of Baseline Models in Latin, without separated tags and in a BRNN.

The performances of lemmatization and generating grammatical tags are shown in figures 22 and 23. These are also considerably higher than those of the previous configurations (figures 16 and 11).

Figure 22: Results of Synthesizing Models in Turkish, without separated tags and in a BRNN.

Figure 23: Results of Synthesizing Models in Latin, without separated tags and in a BRNN.

4.7.2 Results with Additional Synthetic Data

Likewise, the synthesizing models are used to generate source data for the unlabeled data from Wikipedia. After adding these synthetic samples in different proportions to the train data, the performances are calculated again. These are shown in tables 5 and 6 and figures 24 to 29.

In only one case there is a slight increase in performance, namely in Latin with 5000 Golden Samples and synthetic samples added in a ratio of 2:1. This increase, however, is not significant and might therefore also be explained by chance.


Ratio:            No added data    2:1             1:1             2:3             1:2
#golden samples   Acc     Lev      Acc     Lev     Acc     Lev     Acc     Lev     Acc     Lev
1000              80.9    0.779    71      1.381   75.6    1.143   76.7    1.056   72.8    1.729
3000              92      0.283    89.7    0.376   89.5    0.418   89.3    0.369   88.7    0.434
5000              95.1    0.137    92.2    0.233   94.6    0.171   93      0.277   92.5    0.242

Table 5: Performances on Turkish models with different amounts of added synthesized data. This model is a BRNN.

Ratio:            No added data    2:1             1:1             2:3             1:2
#golden samples   Acc     Lev      Acc     Lev     Acc     Lev     Acc     Lev     Acc     Lev
1000              68.6    0.847    65.3    0.954   60.9    1.056   62.4    1.099   63.2    1.081
3000              81.3    0.469    78.1    0.542   78.3    0.541   78.6    0.519   76.2    0.621
5000              86.2    0.365    86.7    0.297   84.6    0.377   85.4    0.332   82.9    0.399

Table 6: Performances on Latin models with different amounts of added synthesized data. Italic numbers represent models which are an improvement on the baseline. This model is a BRNN.

Figure 24: Results of Inflection Models in Turkish, with 1000 Golden Samples and different amounts of added synthetic samples. The model is a BRNN.

Figure 25: Results of Inflection Models in Turkish, with 3000 Golden Samples and different amounts of added synthetic samples. The model is a BRNN.


Figure 26: Results of Inflection Models in Turkish, with 5000 Golden Samples and different amounts of added synthetic samples. The model is a BRNN.

Figure 27: Results of Inflection Models in Latin, with 1000 Golden Samples and different amounts of added synthetic samples. The model is a BRNN.

Figure 28: Results of Inflection Models in Latin, with 3000 Golden Samples and different amounts of added synthetic samples. The model is a BRNN.

Figure 29: Results of Inflection Models in Latin, with 5000 Golden Samples and different amounts of added synthetic samples. The model is a BRNN.

5 Discussion & Conclusion

In summary, this project investigates whether it is possible to improve performance on the task of morphological inflection by adding synthetic data to the original train sets. The models used are based on research by Kann and Schütze (2016a) and are sequence-to-sequence models with an Encoder-Decoder structure based on a BRNN architecture.

The hypothesis that adding synthetic data to the train set leads to an increase in performance holds only in some cases. This can be seen in the results of the first two configurations, which give promising and significant results; but as the models get closer to the current state of the art, the improvements mostly fade away. This might be due to the increase in baseline performance, as it gets harder for the model to improve on the baseline. Furthermore, RNN models suffer from data sparsity, as the performances decrease heavily when the amount of data is reduced.

An important matter to take into consideration when pursuing this research is the risk of overfitting when choosing a static number of epochs. It might also be interesting to do research on Latin where the synthetic data also contains the diacritics for vowel length, as these diacritics might still contain useful information.

In conclusion, the method of adding unlabeled synthetic data to Morphological Inflection models gives promising results, but in order to draw more reliable conclusions on adding unlabeled data to train sets when using sequence-to-sequence Encoder-Decoder models with a BRNN structure, more research needs to be done. This research could focus on situations with less data available.


References

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Cotterell, R., Kirov, C., Sylak-Glassman, J., Yarowsky, D., Eisner, J., & Hulden, M. (2016). The SIGMORPHON 2016 shared task—morphological reinflection. ACL 2016, 10.

Faruqui, M., Tsvetkov, Y., Neubig, G., & Dyer, C. (2015). Morphological inflection generation using character sequence to sequence learning. arXiv preprint arXiv:1512.06110.

Kann, K., & Schütze, H. (2016a). MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. ACL 2016, 62.

Kann, K., & Schütze, H. (2016b). Single-model encoder-decoder with explicit morphological representation for reinflection. arXiv preprint arXiv:1606.00589.

Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (n.d.). OpenNMT: Open-source toolkit for neural machine translation. arXiv e-prints.

Luong, M.-T., Cho, K., & Manning, C. D. (2016). Neural machine translation. ACL 2016 tutorial. https://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

Sennrich, R., Haddow, B., & Birch, A. (2015). Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104–3112).
