
MSc Artificial Intelligence

Track: Natural Language Processing

Master Thesis

Reconstructing language ancestry by performing word prediction with neural networks

by

Peter Dekker

10820973

Defense date: 25th January 2018

42 ECTS

January 2017 – January 2018

Supervisors:

dr. Jelle Zuidema

prof. dr. Gerhard Jäger

Assessor:

dr. Raquel Fernández

Institute for Logic, Language and Computation – University of Amsterdam

Seminar für Sprachwissenschaft – University of Tübingen


Contents

1 Introduction
  1.1 Historical linguistics
    1.1.1 Historical linguistics: the comparative method and beyond
    1.1.2 Sound changes
    1.1.3 Computational methods in historical linguistics
  1.2 Developments in natural language processing
    1.2.1 Natural language processing
    1.2.2 Machine learning and language
    1.2.3 Deep neural networks
  1.3 Word prediction
    1.3.1 Word prediction
    1.3.2 Model desiderata
  1.4 Summary

2 Method
  2.1 Pairwise word prediction
    2.1.1 Task
    2.1.2 Models
    2.1.3 Data
    2.1.4 Experiments
  2.2 Applications
    2.2.1 Phylogenetic tree reconstruction
    2.2.2 Sound correspondence identification
    2.2.3 Cognate detection
  2.3 Summary

3 Results
  3.1 Word prediction
  3.2 Phylogenetic tree reconstruction
  3.3 Identification of sound correspondences
  3.4 Cognate detection
  3.5 Summary

4 Context vector analysis
  4.1 Extraction of context vectors and input/target words
  4.2 PCA visualization
  4.3 Cluster analysis
    4.3.1 Distance matrices
    4.3.2 Clustering
  4.4 Summary

5 Phylogenetic word prediction
  5.1 Beyond pairwise word prediction
  5.2 Method
    5.2.1 Network architecture
    5.2.2 Weight sharing and protoform inference
    5.2.3 Implementation details
    5.2.4 Training and prediction
    5.2.5 Experiments
  5.3 Results
  5.4 Discussion
  5.5 Summary

6 Conclusion and discussion


Chapter 1

Introduction

How are the languages of the world related and how have they evolved? This is the central question in one of the oldest linguistic disciplines: historical linguistics. Recently, computational methods have been applied to aid historical linguists. Independently, in the computer science and natural language processing community, machine learning methods have become popular: by learning from data, a computer can learn to perform advanced tasks. In this thesis, I will explore the following question:

How can machine learning algorithms be used to predict words between languages and serve as a model of sound change, in order to reconstruct the ancestry of languages?

In this introductory chapter, I will first describe existing methods in historical linguistics. Subsequently, I will cover the machine learning methods used in natural language processing. Finally, I will propose word prediction as a task to apply machine learning to historical linguistics.

1.1 Historical linguistics

1.1.1 Historical linguistics: the comparative method and beyond

In historical linguistics, the task of reconstructing the ancestry of languages is generally performed using the comparative method (Clackson, 2007). The goal is to find genetic relationships between languages, excluding borrowings. Different linguistic levels, like phonology, morphology and syntax, can be taken into account. Mostly, phonological forms of a fixed list of basic vocabulary are taken into account, containing concepts with a low probability of being borrowed.

Durie and Ross (1996, ch. Introduction) describe the general workflow of the comparative method, as it is followed by many historical linguists.

1. Word lists are taken into account for languages which are assumed to be genetically related, based on diagnostic evidence. Diagnostic evidence can be common knowledge: e.g. Nichols (1996) argues that in various historical texts by Slavic speakers, the Slavic languages have already been assumed to be related for ages. Another possibility is morphological or syntactic evidence.

2. Sets of cognates (words that are ancestrally related) are created.

3. Based on these cognates, sound correspondences are identified: which sounds regularly correspond to each other between cognates in the two languages?

4. The sound correspondences are used to reconstruct a common protolanguage. First, protosounds are inferred; from these, protoforms of words are reconstructed.


5. Common innovations of sounds and words between groups of languages are identified.

6. Based on the common innovations, a phylogenetic tree is reconstructed: language B is a child of language A, if B has all the innovations that A has.

7. An etymological dictionary can be constructed, taking into account borrowing and semantic change.

This process can be performed in an iterative fashion: cognate sets (2) are updated by finding new sound correspondences (3), and by further steps, the set of languages taken into account (1) can be altered.

Although the comparative method is regarded as the most reliable method to establish genetic relationships between languages, sometimes less constrained methods are applied, to be able to look further back in time (Campbell, 2003, p. 348). One of the most notable is Greenberg’s mass lexical comparison (Greenberg, 1957), in which genetic relationships are inferred from surface forms, instead of sound correspondences, of words across a large number of languages. This workflow is criticized for not distinguishing between genetically related words (cognates) and other effects which could cause similarity of words, such as chance similarity and borrowing (Campbell, 2013).

In this thesis, we will therefore take the comparative method as a starting point, and look for computational methods that can automate parts of that workflow.

1.1.2 Sound changes

When comparing words in different languages, sound changes between these words are observed. Most sound changes are assumed to be regular: if a change occurs in a word in a certain context, it should also occur in the same context in other words. The Neogrammarian hypothesis of the regularity of sound change states that “sound change takes place according to laws that admit no exception” (Osthoff and Brugmann, 1880). There are, however, also cases of irregular, sporadic sound change (Durie and Ross, 1996). Regular sound change is crucial for the reconstruction of language ancestry using the comparative method.

I will now discuss possible sound changes, divided into three categories: phonemic changes, loss of segments, and insertion/movement of segments. For each sound change, I will mention whether it is regular or sporadic. I will refer to the sound changes and their regularity later when looking at different computer algorithms that should be able to model these sound changes. This is a compilation of the sound changes described in Hock and Joseph (2009), Trask (1996), Beekes (2011) and Campbell (2013).

Phonemic changes

Phonemic changes are sound changes which change the inventory of phonemes. When a phonemic change occurs, a sound changes into another sound in all words of a language; this is thus a regular change. Two general patterns of phonemic change can be distinguished: mergers and splits. A merger is a phonemic change where two distinct sounds in the phoneme inventory are merged into one, existing or new, sound. A split is a phonemic change where one phoneme splits into two phonemes. Based on these two general patterns on the phoneme inventory level, a number of concrete changes in word forms can occur, of which I will highlight a few: assimilation, lenition and vowel changes.

Assimilation (regular) Assimilation is the process where one sound in a word becomes more similar to another sound. For example, in Latin nokte ‘night’ > Italian notte, /k/ changes to /t/, influenced by the neighboring /t/.

Assimilation of sounds located more distantly in a word is also possible. Umlaut in Germanic phonology is an example of this: the first vowel of the word changes to be more similar to the second vowel. For example, the plural of the Proto-Germanic word gast ‘guest’ became gestiz: /a/ changed to /e/ to be more similar to /i/, as both /e/ and /i/ are front vowels. This Proto-Germanic umlaut is still visible in the German plural Gast/Gäste.

Lenition (regular) Lenition is a process which reduces articulatory effort and only affects consonants. Lenitive changes include voicing (voiceless to voiced), degemination (geminate, two of the same sounds, to simplex, one sound) and nasalization (non-nasal to nasal).

An example of voicing is the change from Latin strata ‘road’ to Italian strada. A voiceless /t/ changes to a voiced /d/. Degemination can be observed when comparing the change from Latin gutta ‘drop’ to Spanish gota.

The reduction of articulatory effort can go so far that a sound is completely omitted. For example, Latin regāle > Spanish real.

Vowel changes (regular) There are a number of changes where vowels change into other vowels, some of them caused by the aforementioned processes, like lenition. However, some vowel changes should be mentioned explicitly. There are several changes where a vowel acquires a new phonetic feature (like lowering, fronting or rounding). An example of fronting is Basque dut ‘I have it’ > Zuberoan dʏt, where back /u/ changes to near-front /ʏ/. Coalescence is the effect where two identical vowels change into one. Compensatory lengthening is the lengthening of a vowel after the loss of the following consonant. For example, Old French bɛst ‘beast’ > French bɛ:t.

Loss of segments

In the previous section, I discussed changes of a sound into another sound. Now, I will cover sound changes where segments are completely lost.

Loss (regular) There are several types of regular loss. Aphaeresis is loss at the beginning of a word, as the loss of k in English knee. Loss at the end of a word is called apocope. When a sound in the middle is omitted, this is called syncope, as in English chocolate.

Haplology (sporadic) When two similar segments occur in a word, haplology is the process which removes one of these segments. E.g. combining Basque sagar ‘apple’ with ardo ‘wine’ gives sagardo ‘cider’, dropping an ar segment.

Insertion/movement of segments

Next to change and loss of segments, segments can also be inserted or moved in a word.

Insertion (regular) New sounds can be inserted at different places in a word. Insertion at the start of the word is called prothesis, e.g. Latin scala ‘ladder’ > Spanish escala. Insertion of a sound in the middle is called epenthesis, illustrated by the development of Latin poclum ‘goblet’ > poculum. Addition of a sound to the end of a word is excrescence, e.g. Middle English amonges > English amongst.

Metathesis (sporadic) Metathesis is a sound change which alters the order of sounds in a word.


1.1.3 Computational methods in historical linguistics

Computational methods are applied to automate parts of the workflow in historical linguistics. Reasons to apply computational methods include speeding up the process and having a process following more formal guidelines, instead of expert intuition (Jäger and Sofroniev, 2016). Approaches which received a lot of attention were Gray and Atkinson (2003), which charted the age of Indo-European languages, and Bouckaert et al. (2012), which proposed to map the Indo-European homeland to Anatolia.

Steps in the workflow of the comparative method for which automatic methods are available include the detection of cognates (2), the detection of sound correspondences (3), reconstruction of protoforms (4) and reconstruction of phylogenetic trees (6).

Some computational methods stay conceptually closer to the comparative method than others. List (2012) distinguishes between computational methods which act based on genotypic similarity and methods based on phenotypic similarity. Genotypic methods compare languages based on the language-specific regular sound correspondences that can be established between the languages. Phenotypic methods compare languages based on the surface forms of words. When using surface forms, it is harder to detect ancestral relatedness of words which underwent much phonetic change and it is more challenging to detect borrowings.

I will now describe computational methods for different tasks in historical linguistics. I will refer to these tasks later, when I introduce a model that can be applied to a number of these tasks.

Cognate detection

In cognate detection, the task is to detect ancestrally related words (cognates) in different languages. Inkpen et al. (2005) apply different orthographic features (number of common n-grams, normalized edit distance, common prefix, longest common subsequence, etc.) for the task of cognate detection. This gives good results, even when not training a model, but just applying the features as a similarity measure. List (2012) places phonetic strings of words into sound classes. Then, a matrix of language-pair dependent scores for sound correspondences is extracted. Based on this matrix, distances are assigned to cognate candidates. Finally, they are clustered into cognate classes. Jäger and Sofroniev (2016) compute pairwise probabilities of cognacy of words using an SVM classifier. PMI-based features are used to compare phonetic strings. Probabilities are then converted to distances and words are clustered into cognate clusters. Rama (2016b) applies a siamese convolutional neural network (CNN), trained on pairs of possible cognates. A CNN is a machine learning model that uses a sliding filter over the input to get the output. A siamese CNN runs two parallel versions of the network, one for each input word, but shares the weights. After the two parallel networks, a layer calculating the absolute distance between the outputs of the two networks is applied and the network outputs a cognacy decision.

Sound correspondence detection

Sound correspondence detection is involved with the identification of regular sound correspondences: a sound in one language that always changes into another sound in a different language, given the same context. Kondrak (2002) treats sound correspondences in the same way as translation equivalence is treated in bilingual corpora in machine translation. Both should occur regularly: if they occur in one place, they should also occur in another place. The alignment links that are made between words in machine translation are now made between phonemes. The benefit of this method is that it gives an explicit list of sound correspondences and is also suitable for cognate detection. Hruschka et al. (2015) create a phylogeny of languages using Bayesian MCMC, while at the same time giving a probabilistic description of regular sound correspondences.


Protoform reconstruction

In protoform reconstruction, word forms for an ancestor of known current languages are reconstructed. Bouchard-Côté et al. (2013) perform protoform reconstruction by directly comparing phonetic strings, without manual cognate judgments. Probabilistic string transducers model the sound changes, taking into account context. A tree is postulated, and in an iterative process, candidate protoforms are generated. Parameters are estimated using Expectation Maximization. This approach works for both protolanguage reconstruction and cognate detection. Furthermore, the results support the functional load hypothesis of language change, which states that sounds that are used to distinguish words have a lower probability of changing.

Phylogenetic tree reconstruction

The reconstruction of a phylogenetic (ancestral) tree of languages can be performed using different types of input data. Depending on the type of data, different methods are applied. When a distance matrix between languages is used, distance-based methods are applied. A different type of data is character data, where every language is represented by a string of features, e.g. cognate judgments. In the case of character data, maximum parsimony or likelihood-based methods are used.

Distance-based methods UPGMA (Sokal and Michener, 1958) and neighbor joining (Saitou and Nei, 1987) are methods which hierarchically cluster entities based on their pairwise distances. At every step, the UPGMA method joins the two clusters which are closest to each other. UPGMA implicitly assumes a molecular clock (a term which originates from phylogenetic models in bioinformatics): the rate of change through time is constant. The neighbor joining method uses a Q matrix at every step, in which the distance of a language to a newly created node is based on the distances to all other languages. Neighbor joining does not assume a molecular clock, so different branches can evolve at different rates.

An example of using a distance-based algorithm for tree reconstruction is Jäger (2015). String similarities between alignments of words are directly used as distances between the languages. This enables the use of data without cognate judgments, which is more widely available. Taking into account surface forms, instead of cognates and sound correspondences, resembles Greenberg’s mass lexical comparison, described in 1.1.1. The concerns raised against mass lexical comparison are accommodated by removing words which have a high probability of being a loanword or occurring by chance. The resulting distances are fed as input to a distance-based clustering algorithm, the greedy minimum evolution algorithm, to construct a phylogenetic tree.
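To make the distance-based approach concrete, here is a minimal sketch (my own illustration, not taken from the thesis) of UPGMA-style tree building with SciPy, where average linkage corresponds to UPGMA; the language names and distance values are invented. Neighbor joining is not covered by this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

# Hypothetical pairwise distances between four languages (symmetric, zero diagonal).
languages = ["nld", "deu", "eng", "swe"]
dist = np.array([
    [0.00, 0.35, 0.55, 0.60],
    [0.35, 0.00, 0.50, 0.58],
    [0.55, 0.50, 0.00, 0.62],
    [0.60, 0.58, 0.62, 0.00],
])

# UPGMA corresponds to average linkage on a condensed distance matrix.
Z = linkage(squareform(dist), method="average")

def newick(node, labels):
    """Convert a SciPy cluster tree to a Newick string."""
    if node.is_leaf():
        return labels[node.id]
    left = newick(node.get_left(), labels)
    right = newick(node.get_right(), labels)
    return f"({left},{right})"

tree = to_tree(Z)
print(newick(tree, languages) + ";")   # e.g. (((nld,deu),eng),swe);
```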

Character-based methods Now I will describe methods which operate on character data, where every language is represented by a string of features. Two types of these character-based methods are maximum parsimony methods and likelihood-based methods. Maximum parsimony methods try to create a tree by using a minimum number of evolutionary changes to explain the character data, in most cases cognate judgments. One of the problems of this approach is long branch attraction: long branches, branches with a lot of change, tend to be falsely clustered together. Likelihood-based methods solve this problem by looking at the likelihood: the probability of the data, given a certain tree. Evaluating the likelihood for all possible trees is computationally infeasible. Maximizing the likelihood can be efficiently performed using Bayesian Markov Chain Monte Carlo (Bayesian MCMC) methods. These methods randomly sample from the space of possible trees, in order to find the tree with the highest likelihood (Dunn, 2015). Applications of Bayesian MCMC methods for phylogenetic tree reconstruction include Gray and Atkinson (2003), Bouckaert et al. (2012) and Chang et al. (2015).


1.2 Developments in natural language processing

In the previous section, I have looked at historical linguistics, the comparative method, and computational automation of this method. Now, I will look at a different field, natural language processing, which has a different research goal. It however provides techniques to learn from large amounts of data, which I will try to apply to research in historical linguistics.

1.2.1 Natural language processing

The field of natural language processing (NLP) is involved with “getting computers to perform useful tasks involving human language, tasks like enabling human-machine communication, improving human-human communication, or simply doing useful processing of text or speech” (Jurafsky, 2000, p. 35). Contrary to linguistics, the main objective of NLP is to perform practical tasks involving language; getting a better understanding of language as a system is only a secondary goal. However, the methods employed in natural language processing can be useful to apply to problems in linguistics.

Tasks in natural language processing include the syntactic parsing of sentences, sentiment analysis, machine translation and the creation of dialogue systems. Several approaches exist in natural language processing, including the logical and the statistical approach. In recent years, the statistical approach, using machine learning methods, has shown success in performing a range of tasks.

1.2.2 Machine learning and language

Machine learning “is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories” (Bishop, 2006). In a supervised setting, this is performed by learning from training examples (x, y) during the training phase. During the prediction phase, the algorithm is presented with test examples x, without a label y. The goal of the algorithm is to predict correct labels y* for the test examples. The algorithm is able to do this by its ability to generalize over the training examples.

Language is a sequential phenomenon: when a speaker produces an utterance, this does not happen at one moment in time, but stretches out over time. Furthermore, there are dependencies between the linguistic items at different time steps. For example, in many languages, a speaker has a high probability of using a vowel after two consonants have occurred. It is also likely that a determiner will be followed by a noun or an adjective.

When using machine learning methods for prediction tasks in language, this sequential nature can be exploited. Instead of predicting the linguistic items at different time steps independently, the dependencies between the items at different time steps can be taken into account. To this end, sequential (Dietterich, 2002) or, more generally, structured prediction methods (Daume and Marcu, 2006) are employed. Examples of sequential and structured methods used in NLP are Hidden Markov Models (HMMs) (Baum and Petrie, 1966), Probabilistic Context Free Grammars (PCFGs) (Baker, 1979; Lari and Young, 1990), Conditional Random Fields (CRFs) (Lafferty et al., 2001) and structured perceptrons (Collins, 2002).

1.2.3 Deep neural networks

Neural networks are a class of machine learning algorithms which consist of nodes ordered in layers. Between the different layers, non-linear functions are applied, allowing the network to learn very complex patterns. The architecture of neural networks is loosely inspired by the structure of the brain. The networks are therefore sometimes proposed as cognitive models. However, in many cases, biological plausibility is not claimed, and a neural network is just used to solve a certain machine learning problem.


In recent years, with the availability of more computing power and large amounts of data, deep learning has seen its advent. By employing deep neural networks with many layers and feeding a high volume of data, representation learning becomes possible. In traditional machine learning, feature engineering takes place: the data is structured into a representation that the machine learning algorithm can learn from. In deep learning, data is fed as raw as possible; the successive layers of the network are able to learn the right representation of the data (Goodfellow et al., 2016). Deep learning methods have been applied to areas as diverse as pedestrian detection (Sermanet et al., 2013), exploration of new molecules for drugs (Dahl et al., 2014) and analysis of medical images (Avendi et al., 2016).

As described in section 1.2.2, machine learning methods have to be adjusted to the sequential nature of language. Recurrent neural networks (RNNs) are neural networks designed for sequential input (Rumelhart et al., 1986). When feeding the input at a certain timestep forward through the network, the input from the previous timestep is also taken into account. In order to accommodate long-distance dependencies well, modified network units were developed: the Long Short Term Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al., 2014). LSTM and GRU networks have been successful in natural language processing tasks, such as language modelling and machine translation. In machine translation, encoder-decoder approaches have been adopted: one recurrent network (encoder) encodes the input into a representation, which is decoded by another recurrent network (decoder) (Sutskever et al., 2014; Cho et al., 2014).

1.3 Word prediction

1.3.1 Word prediction

In section 1.1, I have given an overview of the challenges in historical linguistics and the efforts to automate these tasks. In section 1.2, I showed the recent successes of machine learning methods, and specifically deep neural networks, in natural language processing. In this thesis, I propose the task of word prediction, phrasing the reconstruction of language ancestry as a machine learning problem.

For this, a dataset of words for a large number of concepts in a large number of languages is needed. A machine learning model is trained on pairs of word forms denoting the same concept in two languages. Through training, the model learns correspondences between sounds in the two languages. Then, for an unseen concept, the model can predict the word form in one language given the word form in the other language. This task can be performed as pairwise word prediction, where predictions are made per language pair: information from other language pairs is not taken into account. I also explore possibilities to exploit the assumed phylogenetic structure of languages during prediction, which I call phylogenetic word prediction. In this setting, information is shared between language pairs.

Word prediction can be used to automate several tasks in historical linguistics. I assume that language pairs with lower prediction error are more closely related, enabling reconstruction of phylogenetic trees. Languages can be hierarchically clustered, using the prediction error for every language pair as distance. Furthermore, the model learns sound correspondences between input and output. These can be identified by visualizing the learned model parameters or by looking at the substitutions between source and prediction. Finally, cognate detection can be performed, the clustering of words based on their ancestral relatedness. I use the prediction error per word from the model to perform this task.

Earlier work related to word prediction includes Mulloni (2007), Beinborn et al. (2013) and Ciobanu (2016). Although the applied methods differ, these approaches have in common that their input consists solely of cognates. Furthermore, orthographic input is used. In my approach, the algorithm can be trained on data which is not labelled for cognacy. The input is phonetic, reducing the effect of orthographic differences between languages.


Some of the computational methods in historical linguistics, described in 1.1.3, also apply machine learning algorithms. However, in the word prediction task I propose, machine learning is at the core of the method. I try to exploit the analogy between the regularity of sound change and the regularities machine learning algorithms use to learn. Machine learning is aimed at retrieving regularities from data and predicting based on these regularities. Sound changes, a central notion in historical linguistics, are also assumed to be regular. The machine learning algorithm serves as a model of sound change.

1.3.2 Model desiderata

When building a machine learning model for the task of word prediction, one has to ask which phenomena the algorithm should be able to model. The task in this thesis is to predict a word $w_{d,B}$ in language B from a word $w_{d,A}$ in language A. The question is which sound changes can occur between $w_{d,A}$ and $w_{d,B}$ in different languages, for which the model should account. These sound changes are described in section 1.1.2.

Bouchard-Côté et al. (2013) describe that most regular sound changes (e.g. lenition, epenthesis) can be captured by a probabilistic string transducer. This is an algorithm with a relatively simple structure, but sensitive to the context in which a sound change occurs. For other changes (e.g. metathesis, reduplication), more complex models need to be applied. However, these changes are in many cases not regular, but sporadic. Models applied to word prediction should at least have the capabilities of a probabilistic string transducer to model regular sound change.

A phenomenon for which a model should ideally also account is semantic shift. The words for concept c in languages A and B may not be cognate. Therefore, the sound changes learned from this pair can be seen as noise during training. However, the meaning of concept c may have shifted to concept d. The word for another concept d in language B may be cognate with the word for c in language A, so this pair could be used as training data. It would be beneficial if an algorithm could find these cross-concept cognate pairs, or at least be able to give less significance to non-cognate pairs during training.

1.4 Summary

In this introductory chapter, I described the general method in historical linguistics and computational methods applied in the field. Then, I introduced the machine learning and deep learning methods currently applied in natural language processing. Finally, I proposed the word prediction task, using machine learning algorithms as models of sound change, to automate tasks in historical linguistics.

In the next chapter, I will specifically describe which machine learning models and linguistic data I will use in my pairwise word prediction experiments. Furthermore, I will give an overview of the tasks in historical linguistics that can be performed based on word prediction.

In subsequent chapters, I will show the results of the experiments on the pairwise word prediction task and propose an extension of the task, sharing more information between language pairs: phylogenetic word prediction. Finally, I will draw conclusions on the contributions that the methods described in this thesis can make to historical linguistics.


Chapter 2

Method

Now that the task of word prediction has been defined, I will describe the models and data I will use in my experiments. Furthermore, I will show how I will apply word prediction to different tasks in historical linguistics. In this chapter, I will discuss pairwise word prediction: the prediction of words between two languages, without taking into account information from other languages. Phylogenetic word prediction, where information is shared between language pairs, will be covered in a later chapter.

2.1 Pairwise word prediction

2.1.1 Task

I will now more precisely define the task of word prediction, described informally in section 1.3.1. A machine learning model is trained on pairs of phonetic word forms $(w_{c,A}, w_{c,B})$ denoting the same concept c in two languages A and B. By learning the sound correspondences between the two languages, the model can then predict, for an unseen concept d, the word form $w_{d,B}$ given a word form $w_{d,A}$. In this training and prediction process between languages A and B, information from a third language C is not taken into account.

After training a model for a language pair, the prediction distance between the prediction and the real target word is informative. Also, the internal parameters of the model can convey interesting information on the learned correspondences between words in the two languages. In section 2.2, I will further describe how these outcomes of word prediction can be applied to useful tasks in historical linguistics. But first, I will turn to the core of the method, word prediction itself, and specify the used model and data.

2.1.2 Models

As machine learning algorithms to perform word prediction, I apply two neural networks (see section 1.2): a more complex RNN encoder-decoder and a simpler structured perceptron.

RNN encoder-decoder

The first model I apply is a recurrent neural network (RNN), in an encoder-decoder structure. An RNN takes a sequence as input and produces a sequence. RNNs are good at handling sequential information, because the output of a recurrent node depends on both the input at the current time step (the phoneme at the current position) and on the values of the previous recurrent node, carrying a representation of previous phonemes in the word.


A single RNN emits one output phoneme per input phoneme, which assumes that the source and target lengths are the same. I therefore apply an encoder-decoder structure, inspired by models used in machine translation (Sutskever et al., 2014; Cho et al., 2014). An encoder-decoder model consists of two RNNs, see Figure 2.1. The encoder processes the input string and uses the output at the last time step to summarize the input into a fixed-size vector. This fixed-size vector serves as input to the decoder RNN at every time step. The decoder outputs a predicted word. This architecture enables the use of different source and target lengths and outputs each phoneme based on the whole input string.

Instead of normal recurrent network nodes, I use Gated Recurrent Units (GRU) (Cho et al., 2014), which are capable of capturing long-distance dependencies. The GRU is an adaptation of the Long Short Term Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997). Both the encoder and decoder consist of 400 hidden units. I apply a bidirectional encoder, combining the output vectors of a forward and backward encoder. The weights of the network are initialized using Xavier initialization (Glorot and Bengio, 2010). With the right initialization, the network can be trained faster, because the incoming data fits better to the activation functions of the layers. Xavier initialization is designed to keep the variance of the input and output of a layer equal. It works by sampling weights from a normal distribution $\mathcal{N}(0, 1/n_{incoming})$, with $n_{incoming}$ the number of units in the previous layer. I apply dropout, the random disabling of network nodes to prevent overfitting to training data; the dropout factor is 0.1. Data is supplied in batches; the default batch size is 10.

Because I use one-hot output encoding, predicting a phoneme corresponds to single-label classification: only one element of the vector can be 1. Therefore, the output layer of the network is a softmax layer, which outputs a probability distribution over the possible one-hot positions, corresponding to phonemes. The network outputs are compared to the target values using a categorical cross-entropy loss function, which is known to work well together with softmax output. I add an L2 regularization term to the loss function, which penalizes large weight values, to prevent overfitting on the training data.

The applied optimization algorithm is Adagrad (Duchi et al., 2011): this is an algorithm to update the weights with the gradient of the loss, using an adaptive learning rate. The initial learning rate is 0.01. The threshold for gradient clipping is set to 100. In the experiments, the default number of training epochs, the number of times the training set is run through, is 15. The network was implemented using the Lasagne neural network library (Dieleman et al., 2015).
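The actual implementation uses Lasagne; purely as an illustration of the architecture described above, the following is a minimal PyTorch sketch of a bidirectional GRU encoder with a GRU decoder that receives the fixed-size context vector at every time step. The hidden size (400), dropout (0.1) and Adagrad learning rate (0.01) follow the text; the toy dimensions, random data, and the omission of Xavier initialization and L2 regularization are my simplifications.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Bidirectional GRU encoder + GRU decoder; every decoding step is
    conditioned on a fixed-size summary of the whole input word."""

    def __init__(self, n_features, n_phonemes, hidden=400, dropout=0.1):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True,
                              bidirectional=True)
        # Decoder input at every time step is the context vector.
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, n_phonemes)  # softmax is applied inside the loss

    def forward(self, source, target_len):
        _, h = self.encoder(source)                  # h: (2, batch, hidden)
        context = torch.cat([h[0], h[1]], dim=-1)    # (batch, 2*hidden)
        # Feed the same context vector to the decoder at every time step.
        dec_in = context.unsqueeze(1).repeat(1, target_len, 1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(self.dropout(dec_out))       # (batch, T, n_phonemes)

# Toy training step with random data; real inputs would be encoded phoneme vectors.
model = EncoderDecoder(n_features=28, n_phonemes=42)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

source = torch.randn(10, 7, 28)              # batch of 10 words, 7 positions
target = torch.randint(0, 42, (10, 7))       # target phoneme indices
logits = model(source, target_len=7)
loss = loss_fn(logits.reshape(-1, 42), target.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 100.0)  # clipping threshold from the text; norm clipping assumed
optimizer.step()
```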

In the model desiderata (section 1.3.2), I formulated the types of sound changes that a model should account for. Schmidhuber et al. (2006) showed that LSTM models (closely related to the GRU used in the RNN encoder-decoder) are capable of recognizing context-free and context-sensitive languages, up to a certain length. This is even more than my desideratum of modelling sound change described by a regular language.

Cognacy prior As formulated in the desiderata, ideally, the model should learn more from cognate word pairs than from non-cognate word pairs in the training data. I tried to cater for this by including a cognacy prior in the loss function, one of the contributions of this thesis.

The network should learn as little as possible from non-cognate word pairs. The weights are updated using a derivative of the loss. Therefore, I would like to make the loss dependent on an estimation of cognacy of the input. A heuristic for cognacy could be edit distance: words with a small distance can still be deemed cognate, but words with a large edit distance cannot.

I propose a new loss function $L_{new}$, based on the cross-entropy loss $L_{CE}$ and a cognacy prior function $CP$:

$$L_{new}(t, p) = L_{CE}(t, p) \cdot CP(t, p)$$

$$CP(t, p) = \frac{1}{1 + e^{L_{CE}(t, p) - \theta}}, \qquad \theta = \overline{L_{CE}} + v \cdot \sigma_{L_{CE}}$$

where:

$L_{new}$: the new loss function, which takes cognacy into account
$L_{CE}(t, p)$: the original categorical cross-entropy loss between target $t$ and prediction $p$
$CP(t, p)$: the cognacy prior: the estimated score (between 0 and 1) of $t$ and $p$ being cognate
$\theta$: threshold after which the inverse sigmoid function starts to decline steeply
$\overline{L_{CE}}$: mean of all previous cross-entropy loss values
$v$: constant, determining the number of standard deviations added to the mean loss to obtain the threshold

Note that the cross-entropy loss $L_{CE}(t, p)$ occurs twice in the formula: as the “body” of the new loss function and inside the cognacy prior. In its original formulation, a larger distance between $t$ and $p$ gives a higher loss. Inside the cognacy prior, the cross-entropy loss is wrapped in an inverse sigmoid function (see Figure 2.2), where a larger distance suddenly gives a lower loss after a certain threshold. The idea is that not much should be learned from words with a very large distance, which are probably non-cognates. The threshold at which a distance is considered “very large” is determined by taking the mean of all previous distances (losses), plus a constant number of standard deviations.
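A plain-Python sketch of this weighting scheme (my own illustrative reimplementation, not the thesis code), with the threshold θ estimated from the running mean and standard deviation of previously seen losses:

```python
import math

def cognacy_prior_loss(ce_loss, loss_history, v=1.0):
    """Scale a cross-entropy loss by an inverse-sigmoid cognacy prior.

    ce_loss      : cross-entropy loss of the current word pair
    loss_history : list of cross-entropy losses seen so far
    v            : number of standard deviations added to the mean
    """
    if len(loss_history) < 2:
        return ce_loss  # not enough history to estimate a threshold yet
    mean = sum(loss_history) / len(loss_history)
    std = math.sqrt(sum((x - mean) ** 2 for x in loss_history) / len(loss_history))
    theta = mean + v * std
    prior = 1.0 / (1.0 + math.exp(ce_loss - theta))  # inverse sigmoid around theta
    return ce_loss * prior

# Word pairs whose loss is far above the running mean contribute little:
history = [0.8, 0.9, 1.0, 1.1, 0.95]
print(cognacy_prior_loss(0.9, history))   # near the threshold: prior ~0.5
print(cognacy_prior_loss(5.0, history))   # far above it: prior ~0.02, almost ignored
```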

Structured perceptron

The second machine learning model I use in my experiments is a simpler model, the structured perceptron. A structured perceptron (Collins, 2002) is an extension of a perceptron (one-layer neural network) (Rosenblatt, 1958) for performing sequential tasks.

Algorithm The structured perceptron algorithm is run for I iterations. At every iteration, all N data points are processed. For every input sequence (a word, in this case) $x_n$, a sequence $\hat{y}_n$ is predicted, based on the current model parameters $w$:

$$\hat{y}_n = \operatorname*{argmax}_{y \in Y} \; w^T \phi(x_n, y) \qquad (2.1)$$

$w^T \phi(x_n, y)$ is the equation of a perceptron: a one-layer neural network with activation function $\phi$ and weights $w$. Because of the argmax, the perceptron has to be evaluated for all possible values of $y$; the value which gives the highest output is used as prediction $\hat{y}_n$. This argmax is computationally expensive; therefore, the Viterbi algorithm (Viterbi, 1967) can be run to efficiently estimate the best value $\hat{y}_n$.

If the predicted sequence $\hat{y}_n$ is different from the target sequence $y_n$, the weights are updated using the difference between the activation function applied to the target and the activation function applied to the predicted value:

$$w \leftarrow w + \phi(x_n, y_n) - \phi(x_n, \hat{y}_n) \qquad (2.2)$$

After I iterations, the weights $w$ of the last iteration are returned. In practice, the averaged structured perceptron is used, which outputs an average of the weights over all updates. Figure 2.3 shows the pseudocode of the averaged structured perceptron algorithm.
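The update of equation (2.2) can be sketched as follows. This toy version is my own illustration, not the seqlearn implementation: it uses simple per-position feature counts as φ and takes the predicted sequence as given, leaving out the Viterbi search of equation (2.1).

```python
import numpy as np

def phi(x, y, n_features, n_labels):
    """Joint feature representation: co-occurrence counts of input
    feature vectors and output labels, flattened to one vector."""
    feats = np.zeros((n_labels, n_features))
    for x_t, y_t in zip(x, y):
        feats[y_t] += x_t
    return feats.ravel()

def perceptron_update(w, x, y_true, y_pred, n_features, n_labels):
    """Structured perceptron update: w <- w + phi(x, y) - phi(x, y_hat)."""
    if y_true != y_pred:
        w = w + phi(x, y_true, n_features, n_labels) \
              - phi(x, y_pred, n_features, n_labels)
    return w

# Toy example: 3 input positions with 4 features each, 2 output labels.
n_features, n_labels = 4, 2
w = np.zeros(n_features * n_labels)
x = [np.array([1, 0, 0, 0]), np.array([0, 1, 0, 0]), np.array([0, 0, 1, 0])]
y_true, y_pred = [0, 1, 1], [0, 0, 1]
w = perceptron_update(w, x, y_true, y_pred, n_features, n_labels)
print(w.reshape(n_labels, n_features))
```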

Application I formulated in the model desiderata (Section 1.3.2) that a machine learning model should at least be capable of modelling regular languages. The structured perceptron has been successfully applied to POS tagging (Collins, 2002), a task which can be described by a regular language, so a structured perceptron should be powerful enough for the word prediction task.

I use the implementation from the seqlearn library. In the experiments, the structured perceptron algorithm is run for 100 iterations of parameter training.


Figure 2.1: Structure of the RNN encoder-decoder model (encoder, fixed-size context vector, decoder).

Figure 2.2: The cognacy prior of the loss function has an inverse sigmoid shape (p(cog) plotted against E(t, p)): when the distance between target and prediction is below θ, the cognacy prior value is close to 1. The cognacy prior sharply decreases when the distance between target and prediction is greater than or equal to θ.

Figure 2.3: Pseudocode of the averaged structured perceptron algorithm (adapted from Daume and Marcu (2006)).


2.1.3 Data

Data set

Data from many linguistic levels can be used to study language change, including lexical, phonetic or syntactic data. Using word forms (lexical or phonetic) seems suitable for the prediction task. There are many training examples (words) available per language and the prediction algorithm can generalize over the relations between phonemes. Word forms also have a lower probability of being borrowed or being similar by chance than syntactic data (Greenhill et al., 2017). I use word forms in phonetic representation, because this stays close to the actual use of language by speakers. Word forms in orthographic representation are dependent on political conventions: the same sound can be described by different letters in different languages.

I use the NorthEuraLex dataset (Dellert and Jäger, 2017), which consists of phonetic word forms for 1016 concepts in 107 languages of Northern Eurasia. In historical linguistics, generally, only basic vocabulary (e.g. kinship terms, body parts) is used, because this vocabulary is least prone to borrowing (Campbell, 2013, p. 352). However, machine learning algorithms need a large number of examples to train on and a meaningful number of examples to evaluate the algorithm. I hope that the performance increase of the algorithm from using enough training examples compensates for the possible performance decrease due to borrowing. I use a version of the dataset which is formatted in the ASJPcode alphabet (Brown et al., 2008). ASJPcode consists of 41 sound classes, considerably fewer than the number of IPA phonemes, reducing the complexity of the prediction problem. Table 7.1 gives an overview of the ASJP phonemes.

There can be multiple word forms for a concept in one language. Per language pair, I create a dataset by using all combinations of the alternatives for a concept in both languages. The dataset is then split into a training set (80%), a development set (10%) and a test set (10%). The training set is used to train the model. The training and test set should be separated, so the model predicts on different data than it learned from. The development set is used to tune model parameters. Models are run with different parameters and evaluated on the development set. The parameter setting with the highest performance on the development set is used for the real experiments on the test set.

Input encoding

To enable a machine learning algorithm to process the phonetic data, every phoneme is encoded into a numerical vector. I will evaluate three types of encoding: one-hot, phonetic and embedding encoding. The embedding encoding is a new encoding in computational historical linguistics and one of the contributions of this thesis.

One-hot In one-hot encoding, every phoneme is represented by a vector of length $n_{characters}$, with a 1 at the position which corresponds to the current character, and 0 at all other positions. No qualitative information about the phoneme is stored. Table 2.1 gives an example of a one-hot feature matrix.

Phonetic In phonetic encoding, a phoneme is encoded as a vector of its phonetic features (e.g. back, bilabial, voiced), enabling the model to generalize observed sound changes across different phonemes. Rama (2016a), using a siamese convolutional neural network for cognate detection, shows that a phonetic representation gives a better performance for some datasets. I used the phonetic feature matrix for ASJP tokens from Rama (2016a), adding the encoding of the vowels from Brown et al. (2008). Table 2.2 shows an example of a phonetic feature matrix. Table 7.2 shows the full table of phonetic features used in the experiments.


ASJP phoneme
p    1 0 0 0
b    0 1 0 0
f    0 0 1 0
v    0 0 0 1

Table 2.1: Example of feature matrix for one-hot encoding, for an alphabet consisting of four phonemes. Every phoneme is represented by one feature that is turned on, that feature is unique for that phoneme.

ASJP phoneme    Voiced    Labial    Dental    Alveolar    ...
p               0         1         0         0           ...
b               1         1         0         0           ...
f               0         1         1         0           ...
v               1         1         1         0           ...
m               1         1         0         0           ...
8               1         0         1         0           ...

Table 2.2: Example of feature matrix for phonetic encoding: every phoneme can have multiple features turned on.

Embedding Encoding linguistic items as a distribution of the items appearing in their context is called an embedding encoding. Word embeddings are successfully applied in many NLP tasks, where a word is represented by a distribution of the surrounding words (Mikolov et al., 2013; Pennington et al., 2014). The assumption is that “you shall know a word by the company it keeps” (Firth, 1957). If two words have a similar embedding vector, they usually appear in the same context and can thus relatively easily be interchanged. I would like to apply embeddings, and the notion of interchangeability, to a different linguistic level: phonology. I encode a phoneme as a vector of the phonemes occurring in its context. The same interchangeability of word embeddings is assumed: if two phoneme vectors are similar, the phonemes appear in a similar context. This corresponds to language-specific rules in phonotactics (the study of the combination of phonemes), which specify that a certain class of phonemes (e.g. approximants) can follow a certain other class (e.g. voiceless fricatives). It can be expected that embeddings of phonemes inside a certain class are more likely to be similar to each other than to phonemes in other classes. In some respect, the embedding encoding learns the same feature matrix as the phonetic encoding, but inferred from the data, and with more room for language-specific phonotactics.

I created language-specific embedding encodings from the whole NorthEuraLex corpus. For every phoneme, the preceding and following phonemes, over all occurrences of the phoneme in the corpus, are counted. Position is taken into account, i.e. an /a/ appearing before a certain phoneme is counted separately from an /a/ appearing after a certain phoneme. Start and end tokens, for phonemes at the start and end of a word, are also counted. This approach is different from word embedding approaches in NLP, which count a larger window of e.g. 15 surrounding words and do not take position into account. However, I wanted to put more emphasis on the direct neighbours of a phoneme, as most phonotactic rules describe these relations.

After collecting the counts, the values are normalized per row, so all the features for a phoneme sum to 1. Table 2.3 shows an example of an embedding feature matrix.


ASJP phoneme    START    i_LEFT    S_LEFT    p_RIGHT    ...
3               0.004    0.003     0.001     0.002      ...
E               0.024    0.000     0.000     0.003      ...
a               0.050    0.002     0.000     0.012      ...
b               0.388    0.000     0.000     0.004      ...
p               0.152    0.039     0.000     0.000      ...

Table 2.3: Example of feature matrix for embedding encoding: every phoneme is represented by an array of floating point values, which correspond to the probabilities that other phonemes occur before or after this phoneme. The values in a row sum to 1.
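The counting scheme behind Table 2.3 can be sketched as follows (an illustrative reimplementation under the assumptions described above: only the direct left and right neighbours plus word-boundary tokens are counted, and rows are normalized to sum to 1). The toy word list stands in for the full NorthEuraLex corpus.

```python
from collections import defaultdict

def phoneme_embeddings(words):
    """Count, for every phoneme, which phonemes occur directly before
    and after it (plus word boundaries), then normalize rows to sum to 1."""
    counts = defaultdict(lambda: defaultdict(float))
    for word in words:
        padded = ["<"] + list(word) + [">"]   # start/end tokens
        for i in range(1, len(padded) - 1):
            ph = padded[i]
            counts[ph][padded[i - 1] + "_LEFT"] += 1
            counts[ph][padded[i + 1] + "_RIGHT"] += 1
    embeddings = {}
    for ph, ctx in counts.items():
        total = sum(ctx.values())
        embeddings[ph] = {feat: c / total for feat, c in ctx.items()}
    return embeddings

# Toy corpus of ASJP-like strings.
emb = phoneme_embeddings(["kat", "tak", "kata"])
print(emb["a"])   # context distribution of /a/, values summing to 1
```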

To inspect the embedding encoding, I generated embeddings for Dutch from the whole NorthEuraLex corpus. I then reduced these embeddings to two dimensions using Principal Components Analysis (PCA) (Pearson, 1901). For comparison, I also applied PCA to the phonetic feature matrix (Table 7.2). Figure 2.4 shows PCA plots for the embedding and phonetic encoding. It can be seen that both encodings partially show the same representation of the phonetic space, but generally the phonetic representation is more clustered, while the embedding representation is more spread out. In both plots, a cluster of vowels can be observed. Also, in both plots, n, l, r and N are close together. There are also striking differences between the encodings: S and s are, in accordance with phonological theory, close together in the phonetic matrix, but far apart in the embedding encoding. This remoteness does not have to be wrong: it can be a phonotactic pattern present in the NorthEuraLex data, and it can be an effective way to code information. It does however show that the embedding and phonetic encoding do not learn the same representation in all cases.

Target encoding

For the target data in the neural network models, one-hot encoding is used, regardless of the input encoding. This means that target words are encoded in one-hot encoding and the algorithm will output predictions in one-hot encoding. One-hot output encoding facilitates convenient decoding of the predictions. Other output encodings did not show good results in preliminary experiments.

As target for the structured perceptron model, unencoded data is supplied, since this is the format expected by the implementation of the algorithm that I use.

Input normalization

The training data is standardized, in order to fit it better to the activation functions of the neural network nodes. For the training data, the mean and standard deviation per feature are calculated, over the whole training set for this language pair. The mean is subtracted from the data and the resulting value is divided by the standard deviation. After standardization, per feature, the standardized training data has mean 0 and standard deviation 1.

The test data is standardized using the mean and standard deviation of the training data. This transfer of knowledge can be regarded as being part of the training procedure.
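A minimal sketch of this standardization step (illustrative, not the thesis code); the training statistics are reused for the test data:

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    """Z-score features using training-set statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)          # eps guards against zero-variance features
    return (train - mean) / (std + eps), (test - mean) / (std + eps)

train = np.random.rand(100, 28)      # 100 training phoneme vectors, 28 features
test = np.random.rand(20, 28)
train_std, test_std = standardize(train, test)
print(train_std.mean(axis=0).round(6))   # ~0 per feature
print(train_std.std(axis=0).round(6))    # ~1 per feature
```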

2.1.4 Experiments

I specified the machine learning models and data used in the experiments. Now, I will describe how the training and prediction in the experiments will take place.


(a) Embedding matrix

(b) Phonetic matrix

Figure 2.4: PCA plots of the embedding encoding matrix for Dutch, generated from NorthEuraLex, and the phonetic feature matrix (Table 7.2).


Languages                              Description
ces bul rus bel ukr pol slk slv hrv    Slavic
swe isl eng nld deu dan nor            Western Germanic
lat fra ron ita por spa cat            Romance
lav hrv rus bel ukr ces slv            Balto-Slavic
fin krl liv ekk vep olo                Finnic
lit hrv ukr ces slv slk                Balto-Slavic
lit hrv ukr ces slv lav                Balto-Slavic
uzn tur tat bak azj                    Turkic
sjd smj sms smn sme                    Sami
smj sma sms smn sme                    Sami

Table 2.4: Overview of the 10 largest maximal cliques of languages with at least 100 shared cognates. To give an impression of the languages in the cliques, informal descriptions of the language groups are added; these differ in level of grouping. For a mapping from the ISO codes used here to language names, see Table 7.3.

Training

Training is performed on the full training set per language pair, which consists of both cognate and non-cognate word pairs. It would be easier for the model to learn sound correspondences if it only received cognate training examples. However, I want to develop a model that can be applied to problems where no cognate judgments are available.

Evaluation

I evaluate the models by comparing the predictions and targets on the test section of the dataset. During development and tuning of parameters, the development set is used. I only evaluate on cognate pairs of words: if words are not genetically related, the algorithm will not be able to predict the target via regular sound correspondences. Cognate judgments from the IELex dataset are used. For words in NorthEuraLex for which no IELex cognate judgments are available, automatic cognate judgments are generated with LexStat (threshold 0.6).

Languages which are not closely related do not share many cognates. Because I only evaluate on cognate words, the test set for those language pairs would become too small. To alleviate this problem, I evaluate only on groups of more closely related languages. In these groups, every language shares at least n cognates with all other languages in the group. I determine these groups by generating a graph of all languages, where two languages are connected if and only if the number of shared cognates reaches the threshold n. Then, I determine the maximal cliques in this graph: groups of nodes where all nodes are connected to each other and it is not possible to add another node that is connected to all existing nodes. These maximal cliques correspond to my definition of language groups which share at least n cognates. Table 2.4 shows the 10 largest maximal cliques of languages with at least 100 shared cognates.
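The clique construction can be sketched with NetworkX (my own illustration; the cognate counts and language selection are invented):

```python
import networkx as nx

# Hypothetical counts of shared cognates between language pairs.
shared_cognates = {
    ("nld", "deu"): 240, ("nld", "eng"): 150, ("deu", "eng"): 160,
    ("nld", "swe"): 90,  ("deu", "swe"): 110, ("eng", "swe"): 120,
}
n = 100  # threshold on the number of shared cognates

G = nx.Graph()
for (a, b), count in shared_cognates.items():
    if count >= n:
        G.add_edge(a, b)

# Maximal cliques = groups in which every language shares >= n cognates
# with every other language in the group.
cliques = sorted(nx.find_cliques(G), key=len, reverse=True)
print(cliques)   # e.g. [['deu', 'nld', 'eng'], ['deu', 'eng', 'swe']]
```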

The distance metric used between target and prediction is the Levenshtein distance (Levenshtein, 1966) (also called edit distance) divided by the length of the longest sequence. An average of this distance metric, over all words in the test set, is used as the distance between two languages.
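A plain-Python sketch of this length-normalized edit distance (illustrative; the example word forms are invented):

```python
def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the length of the longest sequence."""
    if not a and not b:
        return 0.0
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))

# Two toy ASJP-style word forms for the same concept.
print(normalized_levenshtein("Gest", "gast"))  # 0.5: two of four symbols differ
```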

I use the prediction distances for two goals: to determine the distance of languages to each other and to determine the general accuracy of a model. If a certain model has a lower prediction distance over all language pairs than another model, I consider it to be more accurate.


The prediction results are compared to two baselines, for which the distance between target and baseline is calculated. The first baseline is the trivial source prediction baseline, which predicts exactly the source word. The second baseline is based on Pointwise Mutual Information (PMI). As in Jäger et al. (2017), multiple runs of Needleman-Wunsch alignment (Needleman and Wunsch, 1970) of words are performed on a training set, iteratively optimizing PMI scores and alignments. At prediction time, the alignment of the last training iteration is used. For every source phoneme, the target phoneme with the highest probability of being aligned to the source phoneme is predicted.

2.2 Applications

In the next sections, I will show multiple applications of word prediction in historical linguistics: phylogenetic tree reconstruction, sound correspondence identification and cognate detection. These applications use the outcomes of pairwise word prediction as a basis.

2.2.1 Phylogenetic tree reconstruction

Language pairs with a good prediction score (low edit distance) usually share more cognates, since these are predictable through regular sound correspondences. I regard the prediction score between language pairs as a measure of ancestral relatedness and use these scores to reconstruct a phylogenetic tree. I perform hierarchical clustering on the matrix of edit distances for all language pairs, using the UPGMA (Sokal and Michener, 1958) and neighbor joining (Saitou and Nei, 1987) algorithms (described in section 1.1.3), implemented in the LingPy library (List and Forkel, 2016).

The generated trees are then compared to reference trees from Glottolog (Hammarström et al., 2017), based on current insights in historical linguistics. Evaluation is performed using Generalized Quartet Distance (Pompei et al., 2011), a generalization of Quartet Distance (Bryant et al., 2000) to non-binary trees. I apply the algorithm as implemented in the QDist program (Mailund and Pedersen, 2004).

2.2.2 Sound correspondence identification

To be able to make predictions, the word prediction model has to learn the probabilities of phonemes changing into other phonemes, given a certain context. I would like to extract these correspondences from the model. It is challenging to identify specific neural network nodes that fire when a certain sound correspondence is applied. Instead, I estimate the internal sound correspondences that the network learned by looking at the output: the substitutions made between the source word and the prediction. Pairs of source and prediction words are aligned using the Needleman-Wunsch algorithm. Then, the pairs of substituted phonemes in these source-prediction alignments can be counted. These can be compared to the counts of substituted phonemes between source and target.
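Given already-aligned word pairs, the substitution counts can be collected as in the following sketch (my own illustration; the Needleman-Wunsch alignment itself is assumed to have been done already, with '-' marking gaps):

```python
from collections import Counter

def substitution_counts(aligned_pairs):
    """Count substituted phoneme pairs in pre-aligned word pairs.

    aligned_pairs: list of (source, other) strings of equal length,
    with '-' marking gaps introduced by the alignment.
    """
    counts = Counter()
    for src, other in aligned_pairs:
        for s, o in zip(src, other):
            if s != o and s != "-" and o != "-":
                counts[(s, o)] += 1
    return counts

# Toy alignments of source words with predictions.
aligned = [("Gest", "gast"), ("ha-nt", "haand")]
print(substitution_counts(aligned).most_common())
# [(('G', 'g'), 1), (('e', 'a'), 1), (('t', 'd'), 1)]
```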

2.2.3 Cognate detection

Cognate detection is the detection of word forms in different languages (usually per concept) which derive from the same ancestral word. In order to perform cognate detection based on word prediction, I cluster words for the same concept in different languages based on the prediction distances per word. First, word prediction is performed for all pairs of the languages which we want to evaluate. In the normal word prediction workflow (section 2.1.4), predictions are made only on word pairs which are deemed cognate by existing judgments. When performing cognate detection, the whole point is to make these judgments, so I perform word prediction on the full test set: cognates and non-cognates. I only take into account concepts for which word forms occur in all languages, which vastly reduces the number of concepts. For every concept, I create a distance matrix between the word forms in all languages, based on the prediction distance per word. Next, I apply a flat clustering algorithm to this distance matrix. The applied clustering algorithms are flat UPGMA (Sokal and Michener, 1958), link clustering (Ahn et al., 2010) and MCL (van Dongen, 2000), implemented in the LingPy library. Preliminary experiments show that a threshold of θ = 0.7 gives the best results for MCL and link clustering, and θ = 0.8 gives the best results for flat UPGMA.
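The sketch below illustrates the per-concept clustering step. Instead of the LingPy algorithms used in the thesis, it cuts a SciPy average-linkage tree at a distance threshold, which behaves like a flat UPGMA; the word forms and distance values are toy examples.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_concept(words, distances, threshold=0.8):
    """Cluster the word forms of one concept, given pairwise prediction distances."""
    tree = linkage(squareform(distances), method="average")
    labels = fcluster(tree, t=threshold, criterion="distance")
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(label, []).append(word)
    return list(clusters.values())

# Toy example: three similar forms and one unrelated form for a single concept.
words = ["hont", "hunt", "hund", "pes"]
distances = np.array([[0.0, 0.2, 0.3, 0.9],
                      [0.2, 0.0, 0.1, 0.9],
                      [0.3, 0.1, 0.0, 0.9],
                      [0.9, 0.9, 0.9, 0.0]])
print(cluster_concept(words, distances))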

Conceptually, the cognate detection operation is the same as the phylogenetic tree reconstruction operation, but now I cluster per word instead of per language, and I perform a flat clustering instead of a hierarchical clustering.

For evaluation, I use cognate judgments from IElex (Dunn, 2012). Evaluation is performed using the B-Cubed F measure (Bagga and Baldwin, 1998; Amigó et al., 2009), implemented in the bcubed library.
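To make the evaluation measure explicit, the sketch below computes B-Cubed precision, recall and F from scratch for the simple case in which every word belongs to exactly one predicted cluster and one gold cognate class; the thesis itself uses the bcubed library, and the labels here are illustrative.

# B-Cubed precision and recall, averaged over items, and their harmonic mean.
def bcubed_f(predicted, gold):
    """predicted, gold: dicts mapping item -> cluster/class label."""
    items = list(predicted)
    precisions, recalls = [], []
    for i in items:
        same_pred = [j for j in items if predicted[j] == predicted[i]]
        same_gold = [j for j in items if gold[j] == gold[i]]
        correct = [j for j in same_pred if gold[j] == gold[i]]
        precisions.append(len(correct) / len(same_pred))
        recalls.append(len(correct) / len(same_gold))
    p, r = sum(precisions) / len(items), sum(recalls) / len(items)
    return 2 * p * r / (p + r)

# Toy clustering of four word forms against a gold cognate partition.
predicted = {"hont": 1, "hunt": 1, "hund": 2, "pes": 3}
gold = {"hont": "A", "hunt": "A", "hund": "A", "pes": "B"}
print(round(bcubed_f(predicted, gold), 3))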

2.3 Summary

In this chapter, I described the method of the experiments on pairwise word prediction. After defining the task, models and data, I showed how several tasks in historical linguistics can be performed based on pairwise word prediction. Contributions of the methods described in this section include:

• The proposal of a new cognacy prior loss, enabling a neural network to learn more from some training examples than from others.

• The usage of embedding encoding, inspired by word embeddings in natural language processing, to encode phonemes in historical linguistics.

• Use of clustering algorithms to identify patterns learned by a neural network.

• Inference of cognates from word prediction distances. Earlier cognate detection algorithms did not use prediction as a basis.

In the next chapter, I will show the results of the experiments.


Chapter 3

Results

3.1 Word prediction

Pairwise word prediction was performed for all possible language pairs for the 9 languages from the Slavic group in the Indo-European language family: Czech, Bulgarian, Russian, Belarusian, Ukrainian, Polish, Slovak, Slovene and Croatian. This is the largest group of languages sharing at least 100 cognates, found in section 2.1.4. A smaller number of experiments was performed for the second largest clique, a group of 8 Germanic languages: Swedish, Icelandic, English, Dutch, German, Danish and Norwegian.

The results were evaluated for three free parameters:

• Machine learning model: RNN encoder-decoder/structured perceptron
• Input encoding: one-hot/phonetic/embedding
• Cognacy prior (only for encoder-decoder): off/v = 1.0/v = 2.0

All other parameters were set as described in section 2.1.2.

Table 3.1 shows the output of the word prediction algorithm for a structured perceptron model on the language pair Dutch-German. For every word, a prediction distance is calculated. From these word distances, a mean distance per language pair is calculated. Taking the mean over all language pairs in a family then gives a single score that represents the performance of a model on that language family.
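The sketch below shows one plausible way to compute these distances: the Levenshtein distance between prediction and target, normalized by the length of the longer word (a normalization that is consistent with the distances in Table 3.1, but still an assumption here), averaged over all word pairs of a language pair.

# Normalized edit distance per word and mean distance per language pair.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def prediction_distance(prediction, target):
    return levenshtein(prediction, target) / max(len(prediction), len(target))

def mean_distance(pairs):
    """pairs: list of (prediction, target) for one language pair."""
    return sum(prediction_distance(p, t) for p, t in pairs) / len(pairs)

print(prediction_distance("hak", "hak3n"))                  # 0.4, as in Table 3.1
print(mean_distance([("hak", "hak3n"), ("blut", "blut")]))  # 0.2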

Table 3.2 shows mean word prediction distances for the different conditions. The structured perceptron model outperforms the RNN encoder-decoder for the language groups under consideration; this difference is larger for the Slavic than for the Germanic language family. The differences between conditions (input encoding and cognacy prior) for the RNN model are small, but the embedding encoding generally performs slightly better. For the Slavic language family, the PMI baseline outperforms both prediction models. For the Germanic language family, the structured perceptron slightly outperforms the baseline. This difference may be explained by the fact that the Slavic languages in the dataset are more closely related, so the baseline models can already predict well by making few changes to the source words.

3.2 Phylogenetic tree reconstruction

In section 3.1, word prediction was performed for the Slavic language family. From the prediction results, I now reconstruct phylogenetic trees by hierarchically clustering the matrix of edit distances of all language pairs.


Input      Target     Prediction  Distance
blut       blut       blut        0.00
inslap3    ainSlaf3n  inSlaun     0.33
blot       blat       blat        0.00
wExan      vEge3n     vag3n       0.33
xlot       glat       glat        0.00
warhEit    vaahait    vaahait     0.00
orbEit     aabait     oabait      0.17
mElk3      mElk3n     mEl3n       0.17
vostbind3  anbind3n   fostaiN3n   0.78
hak        hak3n      hak         0.40
stEl3      StEl3n     Staln       0.33
hust3      hust3n     hiSta       0.67
xord3l     giat3l     goad3l      0.33
l3is       laus       laiS        0.50
mont       munt       mant        0.25
ler3       leG3n      le3n        0.20
fEift3x    finfciS    faift3n     0.71
zwEm3      Svim3n     Svam3       0.33
slap       Slaf       Slep        0.50
klop3      klop3n     klaun       0.50
vex3       feg3n      feg3        0.20
tont3      tant3      tanta       0.20
dox        tak        daS         0.67
nevEl      neb3l      nebEl       0.20
ku         ku         kl          0.50
spits      Spic       Spist       0.40
lerar      leGa       leGa        0.00
dot        das        dat         0.33
brot       bGot       bGat        0.25
bind3l     bind3l     b3nd3l      0.17

Table 3.1: Word prediction output for a structured perceptron on language pair Dutch-German, encoded in ASJP phonemes (see Table 7.1 for an overview of ASJP). Prediction is the German word predicted by the model when Input is given as Dutch input. The edit distance between the prediction and the target German word, which is not seen by the model, is calculated. Lower distance is better performance.


Model                       Input encoding  Cognacy prior  Slavic   Germanic
Encoder-decoder             One-hot         None           0.5582   0.5721
Encoder-decoder             Phonetic        None           0.5767   0.5853
Encoder-decoder             Embedding       None           0.5579   0.5710
Encoder-decoder             One-hot         v = 1.0        0.5607   0.5754
Encoder-decoder             Phonetic        v = 1.0        0.5770   0.5824
Encoder-decoder             Embedding       v = 1.0        0.5573   0.5620
Encoder-decoder             One-hot         v = 2.0        0.5620   0.5688
Encoder-decoder             Phonetic        v = 2.0        0.5752   0.5744
Encoder-decoder             Embedding       v = 2.0        0.5543   0.5580
Structured perceptron       One-hot                        0.3436   0.4374
Structured perceptron       Phonetic                       0.3465   0.4497
Structured perceptron       Embedding                      0.3423   0.4375
Source prediction baseline                                 0.3714   0.4933
PMI-based baseline                                         0.3249   0.4520

Table 3.2: Word prediction distance (edit distance between prediction and target) for different test conditions, for two language families: Slavic and Germanic. The distance is the mean of the distance of all language pairs in the family. Lower distance means better prediction.

Table 3.3 shows the Generalized Quartet distance between the generated trees, for the different conditions, and the Glottolog reference tree.

The table shows that the structured perceptron model consistently creates good trees. Some trees generated from the RNN encoder-decoder model reach the same performance, but this is less stable across conditions. The baseline models, especially the PMI model, also create good trees. The performance differences between models are smaller than in Table 3.2. This is not very surprising, given that phylogenetic tree reconstruction is an easier task than word prediction: there are fewer possible branchings in a tree than possible combinations of phonemes in a word. Even a model with lower performance on word prediction can generate a relatively good tree.

Figure 3.1 graphically shows the trees inferred from three different models; the reference tree is added for comparison. Trees (a) and (b) (structured perceptron, UPGMA and NJ) are the same; the branches are only shown in a different order. These trees receive the lowest possible distance to the reference tree, 0.047619: the generated binary trees can never precisely match the multiple-branching reference tree. This is also the reason why no generated tree reaches a Quartet distance of 0 in Table 3.3. Tree (c) comes from one of the worst-performing models (encoder-decoder, one-hot, UPGMA), with a Quartet distance to the reference tree of 0.31746. Compared to the reference tree, Polish, Czech and Slovak are placed in the wrong subtrees.

3.3 Identification of sound correspondences

I identified sound correspondences between Dutch and German, two closely related Germanic languages. Source-prediction substitutions were extracted using a structured perceptron model with the default settings described in section 2.1.2. The test set, on which the model was evaluated, consisted of 93 cognate word pairs. Table 3.4 shows the most frequent sound substitutions for source-prediction and source-target pairs.


[Tree diagrams not reproduced. Panels of Figure 3.1:]
(a) Structured perceptron, one-hot encoding, UPGMA clustering. Quartet distance: 0.047619.
(b) Structured perceptron, one-hot encoding, NJ clustering. Quartet distance: 0.047619.
(c) RNN encoder-decoder, one-hot encoding, UPGMA clustering. Quartet distance: 0.31746.
(d) Glottolog

Figure 3.1: Phylogenetic trees for the Slavic language family, using different models and clustering methods, and the Glottolog reference tree. For the model trees, Quartet distances to the Glottolog reference tree are shown. Lower distance means better correspondence.


Model                       Input encoding  Cognacy prior  UPGMA     Neighbor joining
Encoder-decoder             One-hot         None           0.047619  0.047619
Encoder-decoder             Phonetic        None           0.047619  0.190476
Encoder-decoder             Embedding       None           0.047619  0.047619
Encoder-decoder             One-hot         v = 1.0        0.31746   0.190476
Encoder-decoder             Phonetic        v = 1.0        0.047619  0.047619
Encoder-decoder             Embedding       v = 1.0        0.31746   0.31746
Encoder-decoder             One-hot         v = 2.0        0.31746   0.190476
Encoder-decoder             Phonetic        v = 2.0        0.31746   0.047619
Encoder-decoder             Embedding       v = 2.0        0.190476  0.190476
Structured perceptron       One-hot                        0.047619  0.047619
Structured perceptron       Phonetic                       0.047619  0.047619
Structured perceptron       Embedding                      0.047619  0.047619
Source prediction baseline                                 0.269841  0.047619
PMI-based baseline                                         0.047619  0.047619

Table 3.3: Generalized Quartet distance between trees of the Slavic language family, inferred from word prediction results and the Glottolog reference tree. Lower is better: a generated tree equal to the reference tree will have a theoretical distance of 0. In this case, the lower bound is 0.047619, because the generated binary trees will never precisely match the multiple-branching reference tree.

It can be observed that the most frequent substitutions between source and prediction are also frequent between source and target. This implies that the model has learned meaningful sound correspondences.

3.4 Cognate detection

I perform cognate detection for the Slavic and Germanic language families, by clustering words based on word prediction distances. I evaluate performance for the encoder-decoder and structured perceptron models. The models use the parameter settings described in section 2.1.2. As described in section 2.1.4, during cognate detection, contrary to the default setting, prediction is performed on both cognates and non-cognates. I apply three clustering algorithms, as described in section 2.2.3: MCL (θ = 0.7), link clustering (θ = 0.7) and flat UPGMA (θ = 0.8).

Table 3.5 shows B-Cubed F scores for cognate detection, using different models and clustering algorithms on the Slavic and Germanic language families. For the Slavic language family, the source prediction baseline slightly outperforms the prediction models, and the structured perceptron performs better than the RNN model. For the Germanic language family, both prediction models perform above the baseline; here, the one-hot structured perceptron with MCL clustering performs best. It must be noted that the sample of shared concepts in a language family is small, which makes the results less stable.


Substitution  Source-prediction frequency  Source-target frequency
o a           21                           13
r a           14                           13
s S           14                           8
v f           12                           10
E a           12                           9
3 n           10                           1
r G           9                            12
x g           9                            10
w v           8                            9
- n           7                            28
3 a           4                            1
i n           4
t -           3                            3
r -           3                            2
- 3           3                            7
p u           3
r n           2
e a           2                            3
w -           2                            2
i 3           2
- t           2                            1
e E           2                            1
x S           2                            2
p f           2                            5
u l           2
x v           2
v b           2                            3
k -           2
x n           1
i -           1

Table 3.4: Substitutions between aligned source-prediction pairs and substitutions between aligned source-target pairs for Dutch-German word prediction, using a structured perceptron model, with a test set consisting of 93 cognate pairs, under standard conditions. The list is ordered by frequency of source-prediction substitutions; the 30 most frequent entries are shown.
