
Does Regularity Make Reading a Foreign Language Easier?


University of Groningen


Studying the power of entropy measures when predicting written mutual intelligibility among five Germanic languages.

Anne Kingma

S2405792

anne.s.kingma@gmail.com

March 6, 2015

Program: Language & Cognition

1st supervisor: Charlotte Gooskens


Acknowledgments


Table of Contents

Acknowledgments

1. Introduction

2. Background

2.1 Intelligibility

2.2 Previous research on intelligibility

2.3 Measuring linguistic distance

2.4 The MICReLa project

3. Research questions

4. Languages

4.1 Overview

4.2 Germanic languages

4.2.1 North Germanic: Swedish and Danish

4.2.2 West Germanic: German and Dutch

4.2.3 West Germanic: English

4.3 Germanic orthography

4.3.1 English

4.3.2 Dutch

4.3.3 German

4.3.4 Danish

4.3.5 Swedish

4.3.6 Summary

4.3.7 The North Wind and the Sun

5. Data

6. Methods

6.1 Methods for measuring linguistic distance

6.1.2 Orthographic Levenshtein distance

6.1.3 Measuring conditional entropy

6.2 Measuring intelligibility

6.2.1 Participants

6.2.1 Cloze test

6.2.2 Word translation task

7. Results

7.1 Linguistic measures

7.1.1 Lexical distance

7.1.2 Orthographic Levenshtein distance

7.1.3 Entropy measures

7.1.4 Correlations of the different linguistic measures

7.2 Using linguistic measures to predict intelligibility

7.2.1 Cloze test

7.2.2 Word translation task

8. Discussion

8.1 Linguistic measures

8.2 Research questions

8.3 Future research

9. Conclusion

10. References

Etymological dictionaries

Appendix

Appendix A: Excluded Words


1. Introduction

The Scandinavian languages Swedish, Danish and Norwegian belong to the North Germanic branch of the Indo-European language family. They are so similar to each other that their speakers can (and do) to some extent communicate with each other while each using their own language, a practice known as receptive multilingualism. The success of this type of communication depends on the level of mutual intelligibility that exists between the languages: how well each speaker can understand the other’s language. Three factors are thought to determine this (Gooskens 2007a:446):

1. The listener’s attitude towards the speaker’s language

2. The listener’s contact with the speaker’s language and other language experience

3. Linguistic distance of the speaker’s language to the listener’s language

In this thesis, I will focus on the third of these factors: linguistic distance. The linguistic relations among five Germanic languages (English, Dutch, German, Danish and Swedish) are calculated on the orthographic level in three ways: lexical distance, Levenshtein distance (Heeringa 2004) and conditional entropy (Moberg et al. 2007). These results are correlated with the results of two written intelligibility tasks carried out by Femke Swarte as part of the Micrela project of the University of Groningen (see e.g. Heeringa et al. 2013): a cloze test and a word translation task. The main purpose of this study is to determine the value of entropy calculations in addition to the lexical and Levenshtein distances: entropy between two languages is inherently asymmetrical, unlike lexical and Levenshtein distances. Therefore, it could be a useful way to capture the asymmetry found in mutual intelligibility (Moberg et al. 2007), for example the asymmetry between Swedish and Danish (Schüppert 2011). More on this asymmetry can be found in Chapter 2.


Chapter 6 explains the procedures of these measures and the intelligibility experiments. Chapter 7 shows the results, Chapter 8 discusses these and Chapter 9 is the conclusion of this thesis.

2. Background

The first part of this chapter will outline the background and history of intelligibility research. In the second part, the issue of measuring linguistic distance will be elaborated upon. Finally, in the third section a recent project on linguistic factors influencing intelligibility will be described: the Micrela project.

2.1 Intelligibility


the minority language. The ultimate goal of successful communication is reached, however.

A second possibility is for both speakers to learn a third language, native to neither of them. This frequently happens with lingua francas such as English, Latin, Hindi and Modern Standard Arabic. It could be argued that certain dialect situations fall into this category as well, depending on whether learning the standard language is considered learning a new language. The advantage of this strategy is that both speakers are equal: they both have to learn a new language and they both have to struggle with using a language that is not native to them. At the same time, of course, this is also its major disadvantage. More people having to learn a new language takes more time, effort, and money. More languages run the risk of being endangered, as in the previous situation. Also, the risk of miscommunication is higher when neither speaker is able to use their native language. In addition, even in this situation some speakers might have an advantage over others, for example if their native language is close to the lingua franca, or if they have an aptitude for language learning.

There is a third option, however: receptive multilingualism. In this case, each speaker simply speaks his or her own language, so that both languages are used simultaneously to communicate. The great advantage of this strategy is of course that both speakers have the comfort and ease of being able to express themselves in their native language. Neither of them has to invest the time and effort required to learn to speak another language: they only need to learn to understand it. If the languages are closely related, even this might be barely necessary, or not necessary at all. In this case, the languages are inherently mutually intelligible.

There are several known situations in which this tactic is applied. Serbian and Croatian, for example, are very similar to each other, and for the most part mutually intelligible. The distinction between the languages is more political than linguistic. Another example can be found in Scandinavia. The three main languages spoken there, Norwegian, Swedish and Danish, are so similar to each other that all speakers can understand the other languages quite well without any previous experience or training. When they need to communicate with a speaker of a different language, rather than resorting to a third language as many other Europeans in that situation would, they often use the strategy of receptive multilingualism.

What factors contribute to making receptive multilingualism possible? Gooskens (2007a:446) mentions three factors that could contribute to the level of mutual intelligibility:

1. The listener’s attitude towards the speaker’s language

2. The listener’s contact with the speaker’s language and other language experience

3. Linguistic distance of the speaker’s language to the listener’s language

The first factor, attitude, refers to a listener’s opinion of or feeling towards a certain language or language variety. If a listener dislikes a language and/or its speakers, he will probably be less willing to put effort into trying to understand it. This might result in less successful communication. If the listener likes the language variety he is listening to, however, he might understand more, just by trying harder. Schüppert, Hilton and Gooskens (accepted), for example, find a low but significant positive correlation (r = .19) between attitude and word intelligibility for Danish and Swedish.


however, will immediately recognize the German word (Kartoffel) and translate it to Dutch correctly.

The third factor, linguistic distance, is independent of the specific speaker and listener involved, but refers only to how distant the two languages are from each other. What exactly this means and how it can be measured has a whole history of its own, which will be outlined in section 2.3. First, I will discuss the history of intelligibility research in general.

2.2 Previous research on intelligibility

Research on intelligibility amongst the Scandinavian languages has a long history. Schüppert (2011) summarizes this history in her introduction. One of the first studies on this topic was carried out by Haugen and published in 1953 (in Norwegian, published in English as Haugen 1966). He sent out a questionnaire in Norway, Sweden and Denmark asking people in the first place about their personal experiences with receptive multilingualism: had they ever communicated with people speaking one of the other two languages, how well had they understood each other, and which problems had occurred. These were very important questions, as no research into this had been done up to that point for the Scandinavian languages. Furthermore, the questionnaire asked about the informant’s opinion on the other languages (i.e. attitude) and amount of contact with them:

“He was further asked to indicate the approximate amount of instruction he had received in the languages, how much he read in each, whether he enjoyed inter-Scandinavian radio programs, and whether he listened to broadcasts from the neighbouring countries.”


Haugen’s (1966) study was based on a questionnaire. All of the data was indirect; it consisted of reports by the respondents of their personal experiences. That is, he used the method of ‘asking the informant’ (Voegelin and Harris, 1951): asking speakers how well they think they can understand a certain language variety, or how well they think they can be understood by speakers of a certain language variety. A more extensive version of essentially the same method is to present informants with a sample of the particular language variety, instead of naming it. This helps them to focus more on the actual linguistic information, instead of the non-linguistic connotations they may have with the variety. Tang and Van Heuven (2009) observe that "[l]isteners appear to have reliable (i.e. reproducible) ideas about how much language B differs from their own, even if they know the stimulus language from past exposure, and even if the recording quality of the speech samples may differ substantially" (p. 710). However, these judgments do not necessarily match actual intelligibility.


Maurud (1976) carried out such an experimental study of the same three Scandinavian languages: Danish, Swedish and Norwegian. He had informants who lived in the capital cities (respectively Copenhagen, Stockholm and Oslo) translate texts from both other languages to their own language. His results agreed with Haugen’s (1966) results in that the combination of Swedish and Norwegian is the most successful and the combination of Swedish and Danish the most problematic, but unlike in Haugen’s study, the scores were not symmetrical. This is true for the spoken texts especially: Swedes had a much harder time with Danish (understanding about 23%) than Danes had understanding Swedish (43%; Schüppert, 2011). Maurud himself seems to attribute this result to non-linguistic factors:

“Swedes’ low understanding of the neighbour languages is a sign that the habit of hearing them and the attitude towards the need for understanding them are of major importance for the Scandinavians’ ability to communicate with each other in their respective languages.”

(Maurud 1976:71, translated in Schüppert 2011:5)

One problem with Maurud’s (1976) study is the fact that all his informants lived in the capital cities of each country (Schüppert 2011). The capital of Denmark, Copenhagen, is very close to Sweden. There is likely to be some contact between Danes and Swedes there, and the people living in that region have access to TV and radio programmes in Swedish. Sweden’s capital Stockholm, on the other hand, is quite far away from both Denmark and Norway. The advantage that Danes seem to have over Swedes, then, could be simply due to the geographical location of the particular participants in this study and the amount of contact with the other languages that this implies.


Although this research established that both previous contact with and the listener’s attitude to the speaker’s language influenced the level of intelligibility, it was still unclear to what extent intelligibility is determined by purely linguistic factors. An attempt to fill this gap was made by Gooskens (2007a). She used the results from an extensive set of intelligibility experiments carried out some years earlier (Delsing and Lundin Åkesson 2005). That earlier study included only background questions on attitude and contact (numbers 1 and 2 of the list in 2.1); no attempt was made to study the influence of linguistic factors (number 3). Gooskens correlated their results with objective measures of linguistic distance, both lexical and phonological, and found that phonetic distance was indeed the best predictor for intelligibility between Swedish, Danish and Norwegian (r = -.80).

Probably part of the reason why linguistic factors have been neglected in intelligibility research is the fact that objectively measuring the distance between two languages, like Gooskens (2007a) did, is not quite a straightforward matter. In the next section, I will elaborate on this issue.

2.3 Measuring linguistic distance


specific linguistic distance. A more objective way of determining distance between dialects was needed.

An early criterion to determine language distance was in fact intelligibility itself: if two people can understand each other, they must be speaking the same language (see e.g. Voegelin and Harris, 1951). A problem for this strategy is posed by dialect continua. Going from the west of the Netherlands to the east of Germany, for example, every dialect is mutually intelligible with its neighbouring dialects. Does this mean Dutch and German are one language? Yet the dialects at each end of the spectrum are completely unintelligible to one another. Going from north to south, the situation seems similar: a speaker of a West Flemish dialect (spoken in Belgium in the south of the Dutch language area) will have a hard time communicating with a speaker of a Groningen dialect (spoken in the north) without switching to a standard language. Yet both varieties are considered dialects of the same language: Dutch.

As Wolff pointed out as early as 1959, intelligibility is not a reliable measure for linguistic distance. Too many other factors play a role. Some of these are attitude and previous contact, as also mentioned by Gooskens (2007a, see section 2.1). Languages are not isolated things; they are used in the context of a certain culture. An objective, computational method would be a more reliable way to measure only linguistic distances. The problem with this is formulated by Tang and Van Heuven (2009) as follows:

“In spite of its apparent success and conceptual simplicity, the notion of linguistic distance, i.e. the inverse of similarity shared between languages, has persistently eluded quantification. The problem is that languages do not differ along just one dimension. Languages may differ formally in their lexicon, phonetics and phonology, morphology, and in their syntax. And again, at each of these linguistic levels, the ways in which languages may vary are further subdivided along many different parameters.”


up close, it is often not very clear which differences should be given the most importance when classifying them. A commonly used method is to draw a line on a map representing the border between two particular representations of a linguistic item: isoglosses. A group of isoglosses together is a bundle and signals a possible dialect border (after all, the varieties on either side of the bundle differ on several points). A problem with this is that isoglosses do not always group together nicely into bundles. And even if they do, it is not always clear when this bundle should be considered a border between dialects. Chambers and Trudgill (1980) put the problem as follows:

“It is undeniable that some isoglosses are of greater significance than others (…). It is equally obvious that some bundles are more significant than others (…). Yet, in the entire history of dialectology, no one has succeeded in successfully devising a satisfactory procedure or a set of principles to determine which isoglosses or which bundles would outrank some others. The lack of a theory or even a heuristic that would make this possible constitutes a notable weakness in dialect geography.”

(Chambers and Trudgill 1980:112, quoted in Tang and van Heuven 2009:710)

Two languages being similar in one respect does not entail their being close to each other on the other levels. Moreover, the way to measure and the criterion for closeness to be used are different on every level. We need to determine, then, on which level(s) the distance should be measured, and how to measure it.


be able to understand the text as a whole, even if most of the function words are clear to him (Gooskens, Heeringa and Beijering 2008).

Heeringa (2004) presents a history of computational methods used in dialectology. According to him, the first to use a computational strategy to determine dialect distance were Séguy and his associates, in creating their Atlas linguistique de la Gascogne (published in six volumes between 1954 and 1973). Séguy mapped many different features of French dialects and calculated distances by counting the number of items on which two neighbouring dialects disagreed. These items were taken from all linguistic levels: lexicon, pronunciation, phonology, morphology and syntax. The higher the percentage of differing items, the more distant the two dialects are. When these distances are visualized in a map, separate dialect areas can be distinguished.

Goebl (1982, 1993) took a similar approach to Séguy (although developed independently (Heeringa 2004)) by comparing individual items across dialects. He did not count the items that differed, however, but the items that were the same. His scores do not reflect dialect distance, then, but its opposite, dialect similarity.

Hoppenbrouwers and Hoppenbrouwers (1988, 2001) developed the corpus frequency method in order to calculate dialect distances (Heeringa 2002, 2004). In essence, this method compares two languages based on text corpora. It started with the letter frequency method, in which the frequencies of individual letters in the corpora are compared. An issue with this method is the fact that different languages’ orthographies do not represent those languages in the same way: the same sound can be spelled in different ways, or the same spelling used for different sounds. A more accurate comparison, then, would be on the phone level: the phone frequency method. This is essentially the same as the letter frequency method, but instead of


A more refined version of this method, then, is the feature frequency method. It breaks down the individual phones into phonological features (front/back, rounded/unrounded, plosive/fricative, et cetera). Calculating the frequencies of these features in texts in the different languages results in a more reliable measure for dialect distance. Using this method, Hoppenbrouwers and Hoppenbrouwers (2001) mapped and classified 156 varieties of Dutch as spoken in the Netherlands and Belgium.

A disadvantage of the feature frequency method is that it does not take the order of speech segments into account (Heeringa, 2004). If two corresponding words in two languages contain exactly the same features, but in a different order, the feature frequency method will not be able to take this difference into account. A simplified example, using letters instead of features, is English wart and its Dutch translation wrat, or Dutch drie ‘three’ with its German equivalent drei.
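The order problem can be made concrete with the two word pairs just mentioned. The sketch below is a simplified illustration, not the feature frequency method itself: like the example in the text, it counts letters rather than phonological features, and the function name letter_profile is an assumption introduced here.

```python
from collections import Counter


def letter_profile(word):
    """Return the frequency of each letter in a word, ignoring letter order."""
    return Counter(word)


# Word pairs taken from the text: English/Dutch and Dutch/German.
pairs = [("wart", "wrat"), ("drie", "drei")]

for a, b in pairs:
    same_profile = letter_profile(a) == letter_profile(b)
    print(f"{a} / {b}: identical profiles = {same_profile}, identical words = {a == b}")
```

For both pairs the frequency profiles are identical although the words are not, so a purely frequency-based comparison registers no difference between them.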

Kessler (1995) introduced a more accurate method to measure dialect distances: the Levenshtein distance. Heeringa (2004) refined his method and applied it to Norwegian and Dutch dialects. Its mechanism consists of mapping words of both languages onto each other and counting how many individual elements (e.g. phonemes or graphemes) need to be changed, removed or inserted to get from one language to the other. The method is described in more detail in Chapter 6.
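The counting of changes, removals and insertions can be sketched as follows. This is a minimal, unweighted version operating on letters, with every operation costing 1; the function name levenshtein is an assumption introduced here, and the refinements applied in this thesis (described in Chapter 6) are not reproduced.

```python
def levenshtein(source, target):
    """Minimum number of single-character substitutions, deletions and
    insertions needed to turn source into target (all costs equal to 1)."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting the first i characters of source
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("wart", "wrat"))  # 2
print(levenshtein("drie", "drei"))  # 2
print(levenshtein("tot", "dood"))   # 3
```

Unlike a frequency-based comparison, this measure is sensitive to the order of the segments: wart and wrat contain the same letters but are two operations apart.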

The Levenshtein algorithm has been used to measure distance in intelligibility research (by, amongst others, Gooskens (2007a), described above) and has been shown to be an accurate measure of linguistic distance. In many cases, it predicted intelligibility better than the lexical distance (Gooskens 2007a, 2007b; Beijering, Gooskens and Heeringa 2008; Kürschner, Gooskens and Van Bezooijen 2008).


however, clearly been established in past research. Spoken Danish, for example, is harder to understand for Swedes than spoken Swedish is for Danes (Maurud 1976; Bø 1978; Börestam 1987; Delsing and Lundin Åkesson 2005; Gooskens et al. 2010; Schüppert 2011; Gooskens and Van Bezooijen 2013). Gooskens, Van Bezooijen and Van Heuven (accepted) show a similar asymmetry between German and Dutch: Dutch is harder to understand for Germans than German is for Dutch listeners (while controlling for non-linguistic factors such as previous contact, which would otherwise be the more likely cause for asymmetry). The existence of asymmetry, even when all non-linguistic factors have been accounted for, indicates that the Levenshtein distance cannot be the only explanatory factor of the level of intelligibility. There has to be something that explains the difference, something that takes into account the direction of the communication.

Moberg et al. (2007) attempt to explain the asymmetrical intelligibility by measuring the amount of entropy in each combination. As the languages involved in these intelligibility studies are related and share a history, the differences between them are not completely random. There is a certain regularity to it. Because sound changes tend to be regular, a certain sound in one language can systematically correspond with a certain different sound in another language. This systematicity can aid the listener with understanding the language. The entropy calculations are a way to measure this regularity: given a certain sound (or character) in language A, how predictable is the corresponding sound (or character) in language B? The more predictable this sound is, the lower the entropy. Higher predictability aids intelligibility, therefore the hypothesis is that a low entropy measure corresponds with a high intelligibility score.
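The question "given a certain sound in language A, how predictable is the corresponding sound in language B?" can be written down with the standard information-theoretic definition of conditional entropy. The notation below is an assumption introduced for illustration: A and B stand for the sets of sounds (or characters) of the two languages, and p for relative frequencies observed in a set of aligned word pairs. The exact formulation used in this thesis is the one given in Chapter 6.

```latex
H(B \mid A) = -\sum_{a \in A} \sum_{b \in B} p(a, b) \, \log_2 p(b \mid a)
```

If every a corresponds to exactly one b, then p(b | a) = 1 for all observed pairs and H(B | A) = 0. In general H(B | A) and H(A | B) are not equal, so the measure depends on the direction of the comparison.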


correspondence index, Cheng 1997) and found that it correlated well with the results from their intelligibility experiments on 15 Chinese dialects (r = .772 and r = .769).

One of the strengths of the entropy measurement is the fact that it is naturally asymmetrical. I will demonstrate this using the correspondence between German and Dutch in Table 1. In this particular set, there is no entropy for the vowels. In other words, German <ü> always corresponds to Dutch <u>, German <o> always corresponds to Dutch <oo> and German <u> always corresponds to Dutch <oe>; and vice versa. A speaker of either language reading the words in the other language need, in theory, have no doubt about which sound to look for in his or her own vocabulary.

Looking at the initial consonants, however, a different story unfolds. German <d> always corresponds to Dutch <d>, German <t> always corresponds to Dutch <d>, and German <z> always corresponds to Dutch <t>. So far so good. Dutch people reading German can predict the sounds in their own language with 100% certainty. The other way around, however, this is not the case. Dutch <d> corresponds to German <d> in 50% of the cases and to German <t> in the other 50% of the cases. A German reader encountering a word containing <d> in a Dutch text cannot be sure which character to map this onto in his or her own language. There is then some entropy in the direction from Dutch to German, but no entropy from German to Dutch.

Table 1: A mini corpus consisting of three word pairs in three languages.

German    Dutch    English
dünn      dun      thin
tot       dood     dead
zu        toe      to
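The asymmetry in Table 1 can be checked numerically. The sketch below is an illustration only: it applies the standard definition of conditional entropy to the three initial-consonant correspondences from the table, treats each word pair as one observation, and uses a function name (conditional_entropy) that is an assumption introduced here. It is not the procedure of Chapter 6.

```python
from collections import Counter
from math import log2


def conditional_entropy(pairs):
    """Entropy (in bits) of the second symbol given the first, estimated
    from a list of aligned (given, predicted) symbol pairs."""
    joint_counts = Counter(pairs)
    given_counts = Counter(given for given, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (given, _predicted), count in joint_counts.items():
        p_joint = count / total               # p(given, predicted)
        p_cond = count / given_counts[given]  # p(predicted | given)
        entropy -= p_joint * log2(p_cond)
    return entropy


# Initial consonants of the word pairs in Table 1, as (German, Dutch).
german_dutch = [("d", "d"), ("t", "d"), ("z", "t")]
dutch_german = [(dutch, german) for german, dutch in german_dutch]

print(f"German to Dutch: {conditional_entropy(german_dutch):.3f} bits")
print(f"Dutch to German: {conditional_entropy(dutch_german):.3f} bits")
```

Under these assumptions the entropy from German to Dutch is zero, while from Dutch to German it is two thirds of a bit: two of the three observations involve Dutch <d>, and each of those carries one bit of uncertainty.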


intelligibility. However, because the entropy measure, like intelligibility, is asymmetrical, it might be able to provide some more predictive power in addition to the Levenshtein distance. The main purpose of this thesis is to find evidence for this.

2.4 The MICReLa project

MICReLa stands for: Mutual intelligibility of closely related languages. It is an extensive project at the Center for Language and Cognition Groningen at the University of Groningen. It is funded by the Netherlands Organization for Scientific Research (NWO). This thesis originated in this project and draws on its materials and preliminary results. In this section, a description of the project will be given, in order to show how this thesis fits into the bigger picture. For more information, see the Micrela project description and Heeringa et al. (2013).

The project was started in 2011 and is scheduled to last for five years, until 2016. The project leader is Charlotte Gooskens. The project originated from the intelligibility research described in section 2.2 (such as Gooskens, 2007a). This research focused mostly on Scandinavian languages and showed promising results for these languages. In the Micrela project, the research is extended to the three major groups of closely related languages in Europe: Germanic languages, Romance languages and Slavic languages. The main aim is to “develop a model of intelligibility of closely related languages” (Micrela project description, p. 6). This thesis research takes place exclusively within the Germanic languages group.

One of the things this project focuses on is how to explain the asymmetrical mutual intelligibility found by previous research. One of the research questions is: “What explanations can be found for asymmetric intelligibility?” (Micrela project description, p. 7). This thesis hopes to contribute to finding an answer to this question by determining the effect of the amount of entropy.


Germanic languages, and Danish and Swedish as North-Germanic languages. Intelligibility is tested by means of three experiments: a word translation task, a cloze test and a picture task. An effort is made to find data for every language combination in both directions. The methodology of the experiments included in this thesis, the word translation task and the cloze test for the Germanic languages, can be found in Chapter 6.

3. Research questions

The history described in Chapter 2 has led to the first research question:

Are orthographic entropy measures a useful predictor of written intelligibility in addition to Levenshtein distance?

As is clear from this question, this thesis will be concerned with written intelligibility only, and correspondingly, the orthographic distances between the languages (as opposed to distances based on phonetic transcriptions of the words). Five Germanic languages are included: English, Dutch, German, Danish and Swedish.


reader to decipher the language, as he cannot rely on regular correspondences. Therefore, intelligibility is lower when the entropy is high. The results from Moberg et al. (2007) suggest that this hypothesis is true, but as this study included only three languages, no correlation between entropy and intelligibility could be calculated. In this study, five languages are included, amongst which not only Scandinavian languages, but the West Germanic languages German, Dutch and English as well.

Can lexical distance accurately predict written intelligibility?

This second question focuses only on the relationship between lexical distance and intelligibility. When the lexical distance between two language varieties is high, this means that they share relatively few cognates. Non-cognates are incomprehensible for a reader who has not learned the language in question; therefore a high number of non-cognates means low intelligibility. A negative correlation between lexical distance and intelligibility is therefore expected. Previous research has often found this negative correlation: the higher the lexical distance between two language varieties, the lower the intelligibility. Tang and Van Heuven (2009), for example, found correlations of .78 and .75 for 15 Chinese dialects, and Gooskens, Heeringa and Beijering (2008) found a correlation of -.64 for 18 Scandinavian language varieties. In Gooskens (2007a), investigating six Germanic languages, the correlation with lexical distance was not significant (p = .11), but the tendency was in the same direction. The results of the present study should be in line with previous research, and show a negative correlation between lexical distance and intelligibility.


Figure 1: Map of Northwestern Europe. The standard languages of the five marked countries are included in this study. Starting from the left, counterclockwise: the United Kingdom (English), the Netherlands (Dutch), Germany (German), Denmark (Danish) and Sweden (Swedish).

between intelligibility and Levenshtein distance was -.86, where the correlation between intelligibility and lexical distance was -.64. Gooskens (2007a) found no significant correlation between lexical distance and intelligibility, but she did find a correlation between Levenshtein distance and intelligibility of -.64. The results of the present study should be in line with previous research, and show a negative correlation between orthographic Levenshtein distance and intelligibility. In addition, this correlation should be stronger than that of lexical distance with intelligibility.

4. Languages

4.1 Overview

The languages included in this thesis are five Germanic languages spoken in the northern and western parts of Europe. The map in Figure 1 shows where these languages are spoken. These languages


they are related to each other and how they were influenced by each other and by other languages outside of the Germanic group.

English

English is spoken all over the world by some 335 million people as a native language, and by many more as a second language. In this project, standard British English is used, as spoken in the United Kingdom.

Dutch

Dutch has some 20 million speakers, most of whom (about 16 million) live in the Netherlands. In this thesis, standard Netherlandic Dutch is used.

German

The German language has almost 80 million speakers, the majority of whom (70 million) live in Germany. In this thesis, standard High German is used.

Danish

Danish is spoken by over 5.5 million people, almost all of whom live in Denmark. In this thesis, standard Danish as spoken in Denmark is used.

Swedish


4.2 Germanic languages


4.2.1 North Germanic: Swedish and Danish

Around 500 AD, North Germanic, in turn, split into two varieties as well: east and west (Vikør 2001). From the western dialect developed Norwegian, Icelandic and Faroese, whereas the languages concerned in this study, Danish and Swedish, are both descendants of the eastern branch. The distinction is not as clear-cut as that between North and West Germanic, however. It is more of a continuum with two extremes. Icelandic and Faroese have through their conservatism separated from the others, quite possibly because of their location on islands, but the mainland Scandinavian languages are still very close to each other:

“Rather than viewing Norwegian, Swedish and Danish as units, we should think of these names as loose designations for groups of dialects, arbitrarily distinguished on the basis of linguistic characteristics selected by modern language historians.”

(Vikør 2001:34)

In the Middle Ages, however, a new split occurred, this time between the north and the south (Vikør 2001). This essentially separates Danish from the two other (standard) languages (some dialects in the south of Norway and Sweden show characteristics of the southern group). The main changes separating Danish from the other languages are phonological. First of all, vowels in unstressed inflectional endings were merged into a schwa, just like in the West Germanic languages. Thus Swedish timmar ‘hours’ corresponds to Danish timer (the <e> in this case is pronounced as a schwa), and Swedish stjärnor ‘stars’ corresponds to Danish stjerner. Secondly, unvoiced plosives following long vowels were weakened, leading to correspondences like Swedish gripa ‘to seize’ and bita ‘to bite’ with Danish gribe and bide. Finally, Danish developed the phenomenon of stød, a kind of creaky voice

4.2.2 West Germanic: German and Dutch

West Germanic split into several different varieties as well, but as they were spoken in one area with many possibilities for contact between groups of people, these language varieties kept influencing each other continuously (Harbert 2007). This has resulted in a dialect continuum covering a large area (stretching from the Alps in Austria and Switzerland to the North Sea coast), and classifications into groups can be hard. Newer contact-induced changes have blurred the earlier distinctions caused by dialect splits.

In classifications of language varieties in these areas, the terms ‘High’ and ‘Low’ occur frequently. These refer to geographical locations: ‘High’ varieties originated in the relatively mountainous south of the area, whereas the ‘Low’ varieties originated in the flat, lower-lying north (Harbert 2007). In the Middle Ages, one of the Low varieties (Middle Low German; Harbert 2007) became the lingua franca of the Hanseatic League, heavily influencing the mainland Scandinavian languages. Nowadays, the descendants of this variety are considered mere dialects of the standard language of the country in which they are spoken (either Dutch or German), despite their separate origin (Harbert 2007, see also Figure 2).

Currently, two national standard languages are dominant in this area: Dutch and German.5 They can be considered part of one dialect continuum, together with all the other dialects that are still in use.

Standard German, the official language of Germany today, is based mostly on the higher and middle varieties. In many parts of the country, however, dialects are still in common use, and their speakers can be considered bilingual, even if their native language is generally considered a mere dialect.

Standard Dutch, on the other hand, developed from Low Franconian varieties, which were spoken along the western coast of Belgium and the Netherlands.

5 Both of these languages have more than one local standard. Belgian Dutch and Netherlandic Dutch, for

Although the standard languages of Germany and the Netherlands thus developed from very different varieties, several varieties are still present in both countries, being considered dialects or regional languages.

One of the most salient differences between German and Dutch is the High German Consonant Shift (Figure 3). It occurred around 500 AD (Van Gelderen 2006) and involved the transformation of voiceless plosives [p, t, k] into, depending on position, an affricate or a fricative (see Table 2). The consonant shift is absent in the lower varieties, including Dutch, and complete in the southernmost (i.e. ‘highest’) varieties of German. Several varieties in between have partially completed the shift (see Figure 3). One of these is standard German, which includes all changes except for the shift of [k] to [kχ] (hence the unexpected unaffricated [k] in Kopf ‘head’ and backen ‘to bake’, see the third column of Table 2).

Figure 3: The Rhenish Fan, showing the partial completion of the High German Consonant Shift in the southwest of Germany (Van Gelderen 2006: 39).

Table 2: Some cognates between Dutch (left) and German (right) that demonstrate the effects of the High German Consonant Shift.

p > pf/f                    t > z/s (<z> is pronounced [ts])    k > ch (<ch> is pronounced [χ])

peper – Pfeffer 'pepper'    tien – zehn 'ten'                   maken – machen 'to make'
dapper – tapfer 'brave'     tuin 'garden' – Zaun 'fence'        boek – Buch 'book'
kop – Kopf 'head'           zitten – sitzen 'to sit'            zoeken – suchen 'to search, seek'
schaap – Schaf 'sheep'      laten – lassen 'to let, leave'      kop – Kopf 'head'

4.2.3 West Germanic: English

English originates from one particular sub-group of West Germanic: North Sea Germanic (Harbert 2007, see Figure 2). These varieties were spoken along the North Sea shore. This group still has descendants on the mainland in the north of the Netherlands and Germany, but because contact with the other varieties spoken there has influenced them so strongly, they are generally considered mere dialects of the standard language of the country in which they are spoken. The Frisian languages are the exception, but even these have been heavily influenced by Dutch and German.

Some groups of speakers of a North Sea Germanic language, however, crossed the North Sea and landed in England around 450 AD (Van Gelderen 2006). Over the following centuries they expanded and their languages gradually replaced the Celtic languages spoken on the British Isles before that time. Some of these languages are still very much alive (such as Irish, Welsh, Scottish Gaelic), but English is the dominant language almost everywhere in the area. As it was relatively cut off from the other West Germanic languages, it has had its own independent developments.

First of all, English has been influenced by Celtic more than the other West Germanic languages have, being so close to the area where Celtic languages were spoken. This influence shows mainly in loan words and names, although it is argued that the syntax was influenced as well (Van Gelderen 2006). With the spread of Christianity came some Latin words, as in all of the other Germanic languages, but this was nothing compared to the later influence of Latin during the Renaissance.

simplification of word endings spread from the north to the rest of the island. This was probably caused or enhanced by contact with Scandinavian (Van Gelderen 2006). Even today, English morphology is less extensive than it is in the other West Germanic languages.

In 1066, the Normans, speaking a variety of French, defeated the English king. The English nobility was replaced by Normans and French became the dominant language, although English remained the language of the masses. Because this situation lasted for a few hundred years, French had an extensive influence on English, mainly in the vocabulary: possibly up to 10 000 words (Van Gelderen 2006). Unlike what was the case with the Scandinavian influence, the native English words were not replaced by Germanic words, but by Romance words, setting English apart from the other Germanic languages. Some of the many words borrowed in this period are royal, tax, judge, grammar, art, poet, dinner, confess, mercy, age, damage. In addition to whole words, affixes were borrowed as well. Most of these stick to words of Romance origin (disinterest, solemnity) but there are some hybrids of Germanic words with Romance affixes (disbelief, oddity) or Romance words with Germanic affixes (useless, apprenticeship).

In the Renaissance, English further borrowed many words directly from Latin, as did the other Germanic languages. The same is true for the new words needed for technological advancements in the 19th and 20th centuries (Van Gelderen 2006). As these words were borrowed relatively recently, they are still quite similar in all these languages, especially in their written forms.

4.3 Germanic orthography

As this study concerns only the written version of these languages, it is important to know the background of their orthographies. All Germanic languages are written using the Roman alphabet. This alphabet was originally developed for Latin, the language of writing in the (early) Middle Ages (Molewijk 1992, Scheuringer and Stang 2004). When in the later Middle Ages it became more customary to write in the common languages of the people instead of, or in addition to, Latin, there was no universal spelling standard the writers could adhere to. They had to invent their own way to write these languages, using the alphabet they already knew. The Latin alphabet, however, is not perfectly suited for Germanic languages. These languages contain sounds that are not present in Latin. For example, there were no letters for the sounds /j/ and /w/ (the current letters developed from Latin I and V (= /u/) respectively (Scheuringer and Stang 2004), hence still the English name ‘double u’ for the letter ‘w’). For other sounds, digraphs were established (such as <ng> for /ŋ/) or letters from other alphabets were introduced (such as <þ> (thorn) from the Runic alphabet). Also, there was no universal way to distinguish between short and long vowels, a very important distinction for these languages (Scheuringer and Stang 2004). Every writer had to come up with his own solution to the problems this caused. This, in addition to the fact that every writer based his spelling on his own dialect as there were no standard languages yet, resulted in a wide range of variation. Some of the spellings included in the Oxford English Dictionary (OED) for book, for example, are: boocke, bouke, boock, beuk, buik, bewk, bouck, bouk, bowyk, buike, buk, buyk, bvik, bwck, bwik, bwike, bwk, booke, buick, book, buik, buke, beuk, beuck.

conventions became the most widespread. This is similar to how the dialect of the most prestigious region ends up being the basis of the standard language. Only after an initial standard had been established, people started to consciously influence it. When and how this happened and what the current attitude to the spelling of a language is, differs for each of the languages in this study. Therefore, I will discuss their recent history and current situation one by one below.

4.3.1 English

The spelling of English is notoriously irregular (Van Gelderen 2006). Although the development of its writing system started out similarly to those of the other Germanic languages, several circumstances have contributed to its being irregular nowadays. Its standardization started quite early, in the early 15th century (Van Gelderen 2006). Although many attempts have been made, no real spelling reform happened after the establishment of this standard. The spelling therefore essentially reflects the pronunciation of the language in the 15th and 16th centuries. The Great Vowel Shift (described in section 4.2.3), which changed almost all long vowels in the language, happened after this time. Because of this, the pronunciation of many letters in English no longer matches the way these particular letters are pronounced in the other languages (see Table 3).

Table 3: Some words showing the difference in pronunciation between English and other Germanic languages of some vowel graphemes.

English          Dutch            Swedish

state /stejt/    staat /sta:t/    stat /stɑ:t/
cook /kʊk/       kook /ko:k/
week /wi:k/      week /we:k/
wine /wain/                       vin /vi:n/

Etymological respelling happens when the spelling of a word is changed, not according to its pronunciation, but according to its (supposed) origin. The word debt, for instance, was borrowed from French without the b (as French had already lost it at that point). Learned writers, however, recognizing its connection with the Latin word it derived from, added the b in the written form of the word, to show this connection more clearly. Doing this, however, moves the spelling away from the pronunciation. In addition, this did not happen consistently: for some words, the respelling became standard, and for some it did not. The word receipt, for example, is spelled with a silent p, but conceit is spelled without one. In some cases, these respellings were based on a mistaken etymology. The s in island, for example, originates from its supposed connection to the French loan word isle, when in fact the first part of the word is a Germanic root that never contained an s (Old English íg, íeg (OED)).

Loan words for which the pronunciation has been adapted to English, but the original spelling has been retained (Van Gelderen 2006), cause further irregularities. This spelling then does not match the pronunciation of the word in English. Examples of this are suite, glacier, phoenix. For words like this, spelling and pronunciation need to be learned separately. The other Germanic languages have the same problem, but to a lesser extent: the spelling of words is adapted to the language’s own spelling system more easily. Dutch, for example, has words like foto ‘photo’, orthografie ‘orthography’ and kwarts ‘quartz’, and Swedish has byrå ‘bureau’ and buljong ‘bouillon’.

4.3.2 Dutch

The first bible translation into Dutch was published in 1637 (the Statenvertaling, State’s translation, because it was funded by the state). As this translation was meant to be used throughout the Dutch language area, an effort was made to use a somewhat ‘neutral’ Dutch, with elements from different dialects (Molewijk 1992). Because of the wide influence of this bible translation, it has become the basis of modern Dutch. The spelling, like the spoken language, consists mainly of characteristics of the (south)western varieties of Dutch, because this was the economic centre. The last big spelling reform happened in 1946-1947, uniting Netherlandic Dutch and Belgian Dutch spelling (Molewijk 1992).

Spelling changes made after this are minor, concerning mostly the spelling of foreign loan words and the spelling of compound words. Proposals for more phonetic spellings have been made, but they received so much opposition from the public that they were never carried through. The spelling, then, is not completely regular and phonetic: words of foreign origin in particular are exceptions. In loans from French (crèche ‘day care’, garage ‘garage’, comité ‘committee’), German (überhaupt ‘at all, anyway’, sowieso ‘anyway’, föhn ‘hair dryer’) and more recently English (computer, race, cake, poster) the original spelling is retained, even when this does not match the pronunciation in Dutch.

Dutch uses diacritics when they are present in loan words (unlike English, where the diacritic is usually dropped), but has not made any additions to the basic 26-letter alphabet for the spelling of native words.

4.3.3 German

century, but like in Dutch, proposals for bigger changes meet with heavy resistance (Scheuringer & Stang 2004). At the end of the 20th century, proposals were made that in effect would make the spelling more regular, such as spelling all <ai> as <ei> (they are pronounced the same). This would have affected a great many words, and the general public was so strongly opposed to it that it was never carried through. A proposal in the 90s to write all common nouns in lower case (instead of the current practice of capitalizing them) was resisted as well. Eventually, only minor changes came into effect, having to do with punctuation, word separation, and the spelling of foreign loan words.

German spelling, as mentioned, is characterized by the practice of capitalizing all nouns. In addition, it has four extra letters compared to the English 26-letter alphabet: ä, ö, ü and ß (pronounced /s/).

4.3.4 Danish

After World War II, an idealistic movement promoting Scandinavian unity grew strong in the Scandinavian countries (Vikør 2001). In Denmark, which lies so close to Germany and where the war in which Germany had been ‘the enemy’ was still recent, this movement was strongest, and it showed itself in a move away from German. A spelling change was adopted in 1948 (Vikør 2001). Among other things, this involved the decapitalization of nouns (which up to that point had been written with a capital, as is still the case in German) and spelling <aa> as <å>, conforming with Swedish and Norwegian. This reform initially met with opposition, but after some time was nevertheless accepted everywhere. Since then, however, no serious reforms have been made or even attempted, except for some notes on how to handle foreign loan words.

the other Scandinavian languages than the spoken version is. Moreover, a more phonological orthography would, like in English, involve so many changes it is hardly feasible:

“[A] completely phonological orthography would have to be so totally different from the present one that it would be unreadable for the entire Danish population – to learn it would be almost like learning a new language. […] By such drastic reform, the Danes would exclude themselves from their own literary heritage as well as from inter-Nordic written communication.”

(Vikør 2001:190)

Apart from the <å>, which it shares with Swedish (and Norwegian), Danish has two more letters not present in the other languages included in this study: <æ> and <ø>.

4.3.5 Swedish

Swedish, like Danish, strives towards Scandinavian unity. The last big spelling reform for Swedish took place over a hundred years ago (Vikør 2001), in 1906. The changes in this reform resulted mostly in making the spelling more phonetic, that is, more representative of how the words are actually pronounced. haf /ha:v/ ‘ocean’, for example, became hav, and rödt /røt/ ‘red (adv.)’ became rött. Some of these changes made it more similar to the other Scandinavian languages, and some made it more distinct. Later attempts to make the spelling even more phonetic have not been adopted into the spelling standard. Swedish is a bit more prone than the other languages, however, to adapt foreign loan words to its own spelling system: English hike became hajk, French directeur became direktör (Vikør 2001).

standard language. Thus, for example, drottning ‘queen’ went from /drɔniŋ/ to /drɔtniŋ/ and till ‘to’ went from /te/ to /til/.

The Swedish alphabet has the additional letters å, ä and ö, where <ä> corresponds to Danish <æ> and <ö> corresponds to Danish <ø>.

4.3.6 Summary

All of these languages, then, developed a standard spelling, solving the issues that arose from using the Latin alphabet. Some issues are solved in different ways in the different languages, however. A long vowel, for example, is in English often signalled by a silent ‘e’ following the consonant: cape, make, cake, duke, grape, rope, etc. In Dutch, however, the vowel is doubled: kaap ‘cape’, maak ‘make’, leek ‘layman’, meen ‘mean’, vuur ‘fire’, rood ‘red’. In German, a common strategy is to add an <h>

4.3.7 The North Wind and the Sun

Below, the short fable of The North Wind and the Sun is printed in all five languages, in order to give an impression of the orthographies of these languages.

English

The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak. They agreed that the one who first succeeded in making the traveler take his cloak off should be considered stronger than the other. Then the North Wind blew as hard as he could, but the more he blew the more closely did the traveler fold his cloak around him; and at last the North Wind gave up the attempt. Then the Sun shone out warmly, and immediately the traveler took off his cloak. And so the North Wind was obliged to confess that the Sun was the stronger of the two.

(Ladefoged 1999)

Dutch

De noordenwind en de zon hadden een discussie over de vraag wie van hun tweeën de sterkste was, toen er juist iemand voorbij kwam die een dikke, warme jas aanhad. Ze spraken af dat wie de voorbijganger ertoe zou krijgen zijn jas uit te trekken de sterkste zou zijn. De noordenwind begon uit alle macht te blazen, maar hoe harder hij blies, des te dichter de voorbijganger zijn jas om zich heen trok. Tenslotte gaf de noordenwind het maar op. Vervolgens begon de zon krachtig te stralen, en onmiddellijk daarop trok de voorbijganger zijn jas uit. De noordenwind kon toen slechts beamen dat de zon de sterkste was.

(Gussenhoven 1999)

German

Wanderer zwingen würde, seinen Mantel abzunehmen. Der Nordwind blies mit aller Macht, aber je mehr er blies, desto fester hüllte sich der Wanderer in seinen Mantel ein. Endlich gab der Nordwind den Kampf auf. Nun erwärmte die Sonne die Luft mit ihren freundlichen Strahlen, und schon nach wenigen Augenblicken zog der Wanderer seinen Mantel aus. Da mußte der Nordwind zugeben, daß die Sonne von ihnen beiden der Stärkere war.

(Kohler 1999)

Danish

Nordenvinden og solen kom engang i strid om, hvem af dem der var den stærkeste. Da så de en vandringsmand, der kom gående, svøbt i en varm kappe. Og de enedes om, at den der først kunne få kappen af ham skulle anses for den stærkeste. Først tog nordenvinden fat, og han blæste og blæste, men jo mere han blæste, des tættere holdt manden kappen sammen om sig. Til sidst måtte nordenvinden give fortabt. Så tog solen fat. Og han skinnede og skinnede, og til sidst fik manden det for varmt og måtte tage kappen af. Da måtte nordenvinden indrømme, at solen var den stærkeste af de to.

(Grønnum 1998)

Swedish

Nordanvinden och solen tvistade en gång om vem av dom som var starkast. Just då kom en vandrare vägen fram, insvept i en varm kappa. Dom kom då överens om, att den som först kunde få vandraren att ta av sig kappan, han skulle anses vara starkare än den andra. Då blåste nordanvinden så hårt han nånsin kunde, men ju hårdare han blåste desto tätare svepte vandraren kappan om sig, och till sist gav nordanvinden upp försöket. Då lät solen sina strålar skina helt varmt och genast tog vandraren av sig kappan, och så var nordanvinden tvungen att erkänna att solen var den starkaste av dom två.

5. Data

In order to calculate the linguistic distances, a corpus of parallel word lists in the five languages included in the study is needed. In the MICReLa project, a word list of a hundred nouns is used to collect data on the mutual intelligibility of the languages in the project. These nouns are taken from a list of all the words contained in the British National Corpus6 (BNC) ordered by frequency. Roughly, they are simply the 100 most frequent nouns in the corpus. These words were translated into the other languages in the project, creating parallel word lists for all languages (see e.g. Heeringa et al. (2013) for more details on the creation of these lists). The lists are being used to calculate lexical distances and Levenshtein distances in publications of the project (Heeringa et al. 2013). In many other publications involving lexical and Levenshtein distance, these distances have been calculated with relatively short word lists as well. Gooskens, Heeringa and Beijering (2008), for example, used the words from the text The North Wind and the Sun which they used in their experiment (about 100 words, depending on the language variety). Gooskens (2007a) likewise used the words from the text in her experiment: a news item consisting of about 250 to 290 words, depending on the language. These lists are too short to reliably calculate entropy measures, however (Moberg et al. 2007; see also Chapter 6). Therefore, I created new word lists consisting of 1500 words and used them to calculate not only the entropy, but the lexical and Levenshtein distances as well. The word list size should not make a significant difference for these distance calculations, but in order to be certain of this, I will correlate these results with the lexical and Levenshtein distances calculated by Heeringa et al. (2013), based on the much smaller set of 100 words.
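The comparison between distances based on the two list sizes can be sketched as follows. This is a minimal illustration that assumes Pearson's correlation coefficient is the measure used; the function name pearson_r and all numbers in it are hypothetical placeholders, not results from this study or from Heeringa et al. (2013).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values only (one value per language pair); NOT thesis data.
distances_100_words = [0.10, 0.25, 0.40, 0.55]
distances_1500_words = [0.12, 0.24, 0.43, 0.52]

print(round(pearson_r(distances_100_words, distances_1500_words), 3))
```

A value close to 1 would indicate that the short and the long lists rank the language pairs in nearly the same way.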

and German. During this process, some words were removed from the list because they proved too hard to translate reliably. These were usually words from the original English list which simply do not exist, or at least do not exist in the same form, in one or more of the other languages. A word like ‘whatever’, for example, does not have a clear translation in any of the other languages, and where it does have one, it can only be translated by a multi-word expression. In Dutch, for example, the ‘translation’ consists of three words which are separated by other words in the sentence (see the example sentence below). Another example of a problematic word is the verb ‘to face’, which does not have a clear translation covering its meaning in the other four languages. It can be translated by many different verbs and expressions, depending on subtle differences in the context.

EN  Paint your house in whatever colour you like
DU  Verf je huis in wat voor kleur je maar wilt
    paint your house in what for colour you just want

Another translation issue is caused by certain function words which might not even exist in the other languages. The translation of English modals, for example (such as ‘could’, ‘might’, ‘should’), depends highly on the context. Translating them as a separate word, as is necessary for this list, is difficult. Therefore, these were removed as well. In total, 51 words were removed from the list. In order to replace them, new words were added at the end of the list (simply the next words in the BNC frequency list). Because a margin was taken to anticipate words being possibly excluded at a later step in the process, the final list ended up containing 1510 words.
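The replacement procedure described above (drop untranslatable words, then top the list up with the next words in the frequency ranking) can be sketched as follows. This is only an illustration under assumptions: the function name build_word_list is hypothetical, and the toy ranking below is invented for the example and is not the actual BNC frequency order.

```python
def build_word_list(frequency_ranked, excluded, target_size):
    """Return the first `target_size` words of `frequency_ranked`
    that are not in `excluded`, preserving frequency order."""
    excluded = set(excluded)
    selected = []
    for word in frequency_ranked:
        if word in excluded:
            # Skipped words are implicitly replaced by later, less frequent ones.
            continue
        selected.append(word)
        if len(selected) == target_size:
            break
    return selected


# Toy ranking (hypothetical order); 'whatever' and 'could' stand for removed words.
ranked = ["time", "whatever", "year", "could", "people", "way"]
print(build_word_list(ranked, {"whatever", "could"}, 4))
# prints ['time', 'year', 'people', 'way']
```

Walking down the ranking in this way keeps the list at the intended length while still drawing only on the most frequent translatable words.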

In many cases, the English original word is ambiguous, and its meanings are covered by several different words in one or more of the other languages. In this case, one of these meanings was chosen and used consistently for the other languages. The noun practice, for example, can mean (amongst others) the following things:

1. The carrying out or exercise of a profession, esp. that of medicine or law

2. The actual application or use of an idea, belief, or method, as opposed to the theory or principles of it

3. The habitual doing or carrying on of something

4. Repeated exercise in or performance of an activity so as to acquire, improve, or maintain proficiency in it

(Oxford English Dictionary online edition,7 entry for ‘practice’, meanings 1-4)

In the other languages, different words are used to express these meanings. In Dutch, for example, meanings 1 and 2 are covered by praktijk, meaning 3 translates as gewoonte and the fourth meaning is expressed by oefening. In this case, the fourth meaning was chosen for all languages. The choosing of one meaning was not done systematically: in many cases it was simply the meaning that first emerged in the translator’s head or the first translation given by the dictionary used. Care was taken, however, to always choose one of the most common meanings of the word, and not one of the more obscure ones.

Once one of the meanings of the English word was decided upon, it was translated by the most common word in the target language that accurately represents this meaning. If there were two or more alternatives that are both common (i.e. not considered jargon), and one of these was a cognate to the word in English or in one or more of the other languages, that word was chosen.

with ‘practice’ above. This should not be considered a problem, however. The goal of this part of the research is to create a list of words with corresponding meanings in the five languages that are included. Which words these are exactly is not of importance, as long as they are randomly chosen and are good representations of the languages. I believe that in this case, these conditions have been met.

A list of the words that were excluded and the full word list in the five languages can be found in appendices A and B respectively.

6. Methods

As elaborated on in Chapter 2, there are several ways to go about measuring the linguistic similarity between two language varieties. In this study, three methods were used: lexical distance, Levenshtein distance, and conditional entropy measuring. As this thesis focuses on written language, these were applied only on the orthographic level, on the data described in Chapter 5. In the first part of this chapter, I will describe these methods in detail. In the second part of this chapter, the methods used in measuring intelligibility will be described. The experiments mentioned there were carried out as a part of the MICReLa project described in section 2.4.

6.1 Methods for measuring linguistic distance

6.1.1 Lexical distance

A computationally simple way to measure linguistic distance is by measuring the lexical distance. This has been used many times in the past. An example of this is the

relationships to each other. Lexical distance, as can be expected from its name, consists of measuring distance on a lexical level.

When two language varieties are related to each other, they usually share many cognates, but there will also be a part of the lexicon that consists of non-cognates. This happens, for example, when one language has borrowed a word from a third language, whereas the other language has maintained the inherited word. English has many examples of this phenomenon, where mainly Latin and French words have replaced the Germanic words. Compare for example to contribute with its translations in the four other languages in this study: bijdragen, beitragen, bidrage, bidra. It can also be the result of semantic shift, however: the cognate word is in fact still present in both languages, but no longer has the same meaning. This results in false friends, such as English queen with Swedish kvinna ‘woman’, or English town with Dutch tuin ‘garden’ and German Zaun ‘fence’.

The idea behind measuring lexical distance is this: The more cognates two language varieties share, the closer they are to each other.8 Lexical distance is then simply the percentage of non-cognates between a given language pair. This is to be measured for the 1500-word samples from each of the languages in this study.
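As a minimal sketch of this definition, lexical distance can be computed from a list of cognate judgements, one per word pair. The function name lexical_distance and the four-pair toy sample are assumptions made for illustration; the judgements themselves still have to be made by hand, as described in this section.

```python
def lexical_distance(cognate_judgements):
    """Percentage of word pairs judged to be non-cognates.

    `cognate_judgements` holds one boolean per parallel word pair:
    True if the two words were judged cognates, False otherwise.
    """
    if not cognate_judgements:
        raise ValueError("at least one word pair is needed")
    non_cognates = sum(1 for is_cognate in cognate_judgements if not is_cognate)
    return 100.0 * non_cognates / len(cognate_judgements)


# Toy English-Dutch sample (illustrative only, not the 1500-word list).
pairs = [
    ("book", "boek", True),
    ("make", "maken", True),
    ("information", "informatie", True),
    ("contribute", "bijdragen", False),
]
print(lexical_distance([is_cognate for _, _, is_cognate in pairs]))
# prints 25.0
```

Because the measure is a simple proportion, it treats every non-cognate pair alike, regardless of how frequent the word is.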

In order to measure this, it has to be determined whether two corresponding words are cognates. The traditional definition of cognate words stresses the shared origin of the words in an older form of the languages, as in this definition from the Oxford English Dictionary: “Coming naturally from the same root, or representing the same original word, with differences due to subsequent separate phonetic development”. For this research, however, a broader definition was used. In the situation in which a speaker of one language is trying to understand the words of another language, he or she does not see the etymological history of a word. The only thing that matters to the reader is the fact that there is some kind of similarity to the corresponding word in their own language. Therefore, any two words of which the stems are related were considered cognates. This of course includes cognates in the traditional sense, but it also includes loan words sharing a common source, such as German Party and its English equivalent which it is derived from, and words such as information, which occurs in all five languages and has a common source outside the Germanic family. Words which share a base form but have different affixes were considered cognates as well, such as Dutch betalen (‘pay’, be- + talen) with German zahlen (‘pay’, lacking the be- prefix). When a word consists of multiple lexical items, however, and one of them is not related, the complete words were not considered cognates. Take, for example, the compounds buitenlands (Dutch) and udenlandsk (Danish, ‘foreign’, literally roughly ‘out-landish’). The second parts of these words, lands and landsk, are cognates, but the first parts derive from different root words. The word pair as a whole is therefore not considered a cognate pair.

When there was doubt about whether two words shared the same origin, etymological dictionaries were consulted. In addition, because the data consisted of parallel word lists, only word pairs with the same meaning in both languages were considered. False friends are therefore not part of this study.
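The calculation described in this section can be illustrated with a minimal sketch. It assumes that the cognate judgements for a parallel word list have already been made by hand and are stored as one boolean per word pair. The function name `lexical_distance` and the example judgements are hypothetical and serve only as an illustration; they are not the data or the tooling used in this study.

```python
def lexical_distance(cognate_flags):
    """Return the percentage of non-cognate pairs in a parallel word list.

    cognate_flags: one boolean per word pair, True if the pair was judged
    to be a cognate pair (in the broad sense used in this section).
    """
    if not cognate_flags:
        raise ValueError("the word list is empty")
    non_cognates = sum(1 for is_cognate in cognate_flags if not is_cognate)
    return 100.0 * non_cognates / len(cognate_flags)


# Hypothetical judgements for four word pairs of one language pair:
# three cognate pairs and one non-cognate pair.
judgements = [True, True, False, True]
print(lexical_distance(judgements))  # 25.0
```

A value of 0 would mean that every word pair in the list is a cognate pair, and a value of 100 that none is.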

6.1.2 Orthographic Levenshtein distance

Lexical distance captures the percentage of non-cognate words in a language pair, but it says nothing about how similar the cognates are to each other. When two language varieties began to diverge a long time ago, sound changes may have altered both words in a pair beyond recognition, even if they stem from the same root. Levenshtein distance (Heeringa 2004) is a computational way of measuring the distance between two cognates on a phonetic or orthographic level.9 In this study, the distance was calculated on the basis of orthography only, because the focus is on written intelligibility.
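As a rough illustration of the idea, the sketch below computes the plain Levenshtein distance between two spellings: the minimum number of single-letter insertions, deletions and substitutions needed to turn one string into the other. It assumes unit costs for all three operations and applies no length normalization; the function name `levenshtein` is hypothetical, and the exact weighting and normalization used in this study may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs (an assumption here)."""
    # previous[j] holds the distance between the first i-1 letters of a
    # and the first j letters of b; it starts as the distance from "".
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning the first i letters of a into "" costs i
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Orthographic distance between the Dutch and German cognates for 'pay'.
print(levenshtein("betalen", "zahlen"))
```

Because longer words allow more edits, raw distances of this kind are often divided by a length measure before word pairs are compared; whether and how to do that is a design choice left open in this sketch.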

9 Although it is technically possible to calculate the distance between two non-cognates, it does not make a
