From Jasnil to Hometwoli: a weighted alternative to the straight lexicostatistics of Yokuts lects


Content

0. Introduction
About the Yokuts people
About the Yokuts languages
About some of the lects
1. Recap of BA thesis
1.1 Yokuts-specific pros and cons of lexicostatistics
1.2 Lexicostatistics: general pros and cons
1.3 Nonetheless: the lexicostatistical results
1.4 Glottochronology
2. Newer alternatives
3. New methods
3.0 The treatment of partial matches
3.1 Lexicostatistics without the one-variant words
3.2 One eleventh: number of variants divided by highest number of variants
3.3 Chances: Positive only
3.4 Chances: Positive and negative
3.5 Heggarty-inspired
4. New results
4.1 Lexicostatistics without the one-variant words
4.2 One eleventh
4.3 Chances: positive only
4.4 Chances: positive and negative
4.5 Heggarty-inspired
5. Conclusion
Data tab
Swadesh tab
Sources
Missing words
Sources
Unpublished sources
Sources for the map
Sources for the raw data


0. Introduction

This thesis continues the project that I once started for my bachelor thesis: a lexicostatistical analysis of 34 Yokuts lects (a lect can be either a language or a dialect). Such an analysis had previously been done by Smith (s.a.) for 28 lects, but that was before a wordlist of Telamni by Clinton Hart Merriam had been discovered. Doing a new lexicostatistical analysis to confirm to which language Telamni belonged became the core of my bachelor thesis. I also added some data from other lects.

The introduction on the Yokuts people and the languages has been modified from my BA thesis, as has the database with the raw data and the source list of those data (except for the addition of Toltichi).

About the Yokuts people

The Yokuts Amerindians lived in present-day California, in an area whose northern border is about fifteen kilometres north of Stockton and whose southern border is some thirty kilometres south of Bakersfield. The valley is about 400 kilometres long from the northwest to the southeast. The town Chowchilla in the northwest is one of the few places in the area that is named after the people who used to live there. The name is, via the local river name, derived from the people whose dialect is referred to as Chawchila in this thesis.

Because Telamni played such a key role in my BA thesis, I chose the history of this people to illustrate the circumstances under which (most of) the Yokuts became extinct (also see Cook 1960, as cited in Smith 2010). Chukchansi, Yawlamni and Wikchamni are the only lects of which there are still speakers alive today (p.c. Smith).

The large Tulare Lake used to lie between the present-day towns of Corcoran and Kettleman City. This lake got its water from the Tule and Kaweah rivers and the Kings River; all of them were surrounded by swamps.


Part of this swampy area, in the Southern Valley, was the home of the Telamni Amerindians, who lived near the current town of Visalia (see map). It was in the year 1776 that the first white man arrived in their village: a Spanish missionary, Father Francisco Garcés, who noted their existence, calling them Telam or Toram (Coues 1900, as cited in Smith 2010). The next observation of them was made in 1804, and two years after that there was an expedition led by Moraga. He describes the large oak forest that he and a couple of soldiers passed through. They found a village with about 600 people and about 800 Telamni in total. But after this visit, the number of Telamni plummeted. In 1815 a Spanish soldier, Juan Ortega, mentioned that the population of the village had scattered as a consequence of high mortality and famine. That was also the year the Spaniards first started to bring Telamni people to the missions; at first a couple, three years later another one, and around 1830 another 28 (Milliken 2009). According to a report by Barbour, a negotiator of the failed Californian Indian treaties, there were only 280 Telamni left in 1851 (Hoopes 1937, as cited in Smith 2010). Eight years later a count found fifty men and forty-five women, described as being in bad health (Phillips 2004, as cited in Smith 2010). Around 1862 there was a measles outbreak, which proved fatal for many Telamni. Some of the survivors then moved out of the area, to an abandoned Nutunutu village (the Nutunutu being another Southern Valley tribe) near the Kings River in the north-west (Latta 1930, as cited by Smith 2010).

When roughly 48-year-old Tilly Wilcox, the Telamni informant, helped Merriam fill in a wordlist of her lect in 1903, she was one of the last of her people.


Figure 1: Map of the Yokuts area and the lects spoken here, after the map Yokuts and neighbouring languages by Kenneth Whistler (November 15, 1984). In the top corner we see Delta Yokuts: Yachikamni, Jasnil and Coconoon; Northern Valley Yokuts: Chawchila, Hewchi-Eyulahua, Ta-kin (exact location unknown), “Kings River” and Nopthrinthre; Northern Hill Yokuts: Chukchansi, Dumna and, if legit, Toltichi (in black); Gashowu; Kings River Yokuts: Choynimni, Michahay, Ā-te-pitch, Ayticha, Chukaymina and Kocheyali; Southern Valley Yokuts: Wechihit, Tachi, Nutunutu, Telamni, Wo’lasi, Chunut, Choynok, Koyeti, Wowol, Yawlamni and Tinlinne; Tule-Kaweah Yokuts: Gawia, Wikchamni and Yawdanchi; Poso Creek Yokuts: Palewyami; Buena Vista Yokuts: Tulamni and Hometwoli.


About the Yokuts languages

The Telamni are just one of the roughly 80 tribes that must have existed (Smith 2010). For about half of them some linguistic material has been recorded: a text, a wordlist, or just a few words.

In this thesis, I compared the vocabularies of thirty-five Yokuts lects. Every tribe had its own dialect; the differences between neighbouring dialects were often limited to minor differences in the lexicon and morphosyntax. In each area, these dialects form a language, which in turn is a member of the bigger Yokuts language family (Kroeber 1907, Newman 1944, Golla 2011).

Because some of the ‘dialects’ are in fact languages of their own, I use the word ‘lect’ when talking about any of my thirty-five Yokuts varieties.

The syntax of all of these lects was fairly similar – though not much syntactic analysis has been done – and the phonological differences were limited (Weigel 2005). Some exceptions that I have come across are the Tule-Kaweah lects, which replaced /l/ by /t/ (sometimes transcribed as d), and Jasnil and Coconoon, which have /b/ instead of /m/ and /d/ for /n/. A morphosyntactic variation in my data is the suffix –al, which the Buena Vista lects sometimes use for body parts; other examples are primarily found in the verbs (p.c. Smith).

About some of the lects

Some lects need further explanation. Most lects are fairly well documented, by which I mean that there is no confusion about which lects are on which lists, who the informants were and which lect or lects they spoke, and which names refer to which tribes/lects. Unfortunately this isn’t true for every lect, so I’ll give a short explanation for those lects that need it (Smith 2010, p.c. Smith if there’s no other source mentioned):


Jasnil: This Delta Yokuts lect is also known under a few different names, such as Chalostaca, Latrūdud, Lower San Joaquin and some variations of Jasnil, such as ‘Atsnil’. This is how there could be two lists from what seemed to be different informants, yet taken at the same place. It turned out these informants were most likely one and the same woman – a Jasnil woman – which made it possible to responsibly combine the lists of Chalostaca and Latrūdud into one Jasnil list (p.c. Milliken via Smith).

“Kings River”: This lect is a bit of a mystery, because there is no additional information on the informant. The name itself is strange as well: it’s called ‘Kings River’, but the words themselves show this must have been a Northern Valley lect. It’s possible this list was made with the help of an informant who, after years in one of the Spanish missions, had ‘returned’ to a settlement (Amerindians weren’t allowed to settle just anywhere), probably the Fresno Indian Farm much further south. If it was Fresno Indian Farm it may have been Hewchi. There certainly is no connection to the language called Kings River, hence the quotation marks.

Ā-te-pitch: Also known as Che’osh-she-shoo and Yunab’be, this lect was first recorded from the “Drumm Valley band” (Cutler s.a.); apparently these Amerindians were related to the western Mono Amerindians that were known as the Entimbich – indeed the same name as Ā-te-pitch. Gayton thought they were Yokuts, Merriam thought they were Mono, but they might have simply been bilingual. Although this lect is generally considered to be a Kings River lect (the language Kings River), there is good reason to consider it a mixed lect, since its affinity with Tule-Kaweah (especially Gawia) is also very high.

Gashowu: This lect is closely related to the Kings River lects and to a lesser extent Northern Hill and Northern Valley – which makes it a mixed lect – but in addition has some unique – qualitative – features that make it a separate language.


Hewchi-Eyulahua: This is a combination of two lects, spoken by tribes that lived next to each other (the Eyulahua slightly further west than the Hewchi). The informant is a child from a marriage between an Eyulahua father and a Hewchi mother, who sadly died when he was only six. Because of this, he wasn’t aware that there were different – although probably very similar – lects, nor did he know what his tribe was (Eyulahua) (Smith, in progress).

Tinlinne: Considered a dialect of Yawlamni, this lect was added as a separate lect because there were some differences. The name was derived from the very southern Amerindian township Tinliu, close to the ranch Tejon Viejo, where Merriam recorded this list.

Kechayi: Although this lect was recorded by Kroeber, it turned out that his informant is known to have been Dumna by others who also used him as an informant. We must therefore conclude that the lect given as Kechayi by Kroeber is Dumna as well. For this reason, I have had to exclude Kechayi; for Dumna I primarily used Latta’s list.

Ta-kin: This lect was recorded at the most northern location of all Northern Valley lects, Knights Ferry, on the Stanislaus River, but it’s clear the tribe itself must have lived closer to the other Northern Valley tribes. We just don’t know exactly where.

Toltichi: New in this thesis is Toltichi. According to Kroeber, the tribe of the same name lived the furthest up the San Joaquin River. The last person to actually speak this lect was said to have died thirty years prior to the recording of this list, so the transfer is rather indirect (and the differences in pronunciation certainly very exaggerated). In addition to that, this lect provokes even more questions: the words are clearly northern Yokuts, but there have been some rather extraordinary sound changes and, more importantly for this thesis, Toltichi has words for one and two that aren’t related to the forms all of the other lects have (something like yet and ponoi, respectively). Gamble (1980) proposes an alternative explanation for the deviant numerals. Although Toltichi’s nās and bis are not related to the modern Yokuts numerals, an informant of Chunut told John Harrington that there was also an “old time” numeral system in his language, taught to him by his grandmother and brother. The words for one and two in this system were nasa and pesu, respectively – clearly related to the Toltichi forms! This “old time” origin makes it justifiable to leave the numerals out of the data.

The doubtful status of Toltichi does have some consequences. For my alternative comparative methods, all forms play a role when computing the weight factor for that particular word – including Toltichi’s, if there is one. For this reason I was forced to make two versions of each of these methods: one that doesn’t include Toltichi when computing the results, and one that does. In most cases the differences are minimal – certainly without the numerals – but if this lect does not belong here, I don’t want it to influence my results.


1. Recap of BA thesis

My BA thesis began, as I mentioned above, as a lexicostatistical analysis of the various lects of Yokuts. In this chapter I will describe what exactly a lexicostatistical analysis is and what the advantages and disadvantages of this method are, and I will discuss the difference between lexicostatistics and glottochronology.

1.1 Yokuts-specific pros and cons of lexicostatistics

This takes us to the issues with lexicostatistics for Yokuts specifically. As I described in the introduction, the circumstances for the Yokuts Amerindians were rough: their numbers quickly declined due to the diseases Europeans had brought with them, and they were often taken out of their original villages to live in the Spanish missions, where they lived mixed with people of multiple tribes. Women and children were expected to speak the lect of their husband or father, although children often learned both lects (Smith 2010, p.c. Smith).

These were some of the informants upon whom the makers of the lists depended to record the vocabulary of the Yokuts lects when, decades later, interest in them finally arose.

Then there were the list makers themselves. They often weren’t linguists, but missionaries, biologists, soldiers and anthropologists who visited the area for various reasons and were, out of personal or professional interest, willing to make a list and fill it in with the help of informants. Their native language was Spanish, French or English, and the non-linguists were mostly unskilled in phonetic transcription, which caused much inaccuracy when filling in the lists.

Their knowledge of Yokuts was extremely varied – Merriam didn’t even speak Spanish – and they could only hope their informant, or the interpreter if they used one, correctly understood which word they meant to translate (round, like a sphere?).

Their native language then influenced how they perceived the string of unfamiliar speech sounds they heard and how they tried to transcribe it: the spelling of the words in such lists often clearly show the native tongue of the author.

Stanley Newman was an exception: he was a linguist with a good ear for the foreign sounds, who made detailed transcriptions. A.L. Kroeber was an anthropologist who, hindered by his own German dialect, made transcriptions of the vocabulary of 21 lects (Whistler & Golla 1986; p.c. Smith).

When I set up my database, I didn’t change the notation of my various sources (sometimes I’ve included two different ones), except for the fact that I couldn’t replicate all the details of Newman’s and Kroeber’s notations (superscript is a problem, as are some diacritical marks). I added notes on exact notations, form and meaning of Yokuts words to the source list of the raw data (see attachments).

Besides this, when there were multiple sources for the same lect, I have, after the first one, often only looked for words that were still missing, so it’s possible I have missed occasional synonyms.

Kroeber had his material typed up and published, but the other wordlists, such as Merriam’s and unfortunately also Latta’s, are often still only handwritten.


Nowadays, some hundred to hundred and fifty years later, a part has been scanned, a part is on microfiches – but cannot be searched digitally – and a part may still be uncatalogued in the Bancroft Library in Berkeley, California.

The extensive archive of the brilliant but peculiar linguistic fieldworker Harrington is also located there and has been published (p.c. Smith), but unfortunately I did not have access to it.

The unavoidable consequence of all this is that only two out of thirty-five lects have a complete sample of a hundred words: Yawlamni and Wikchamni. Eight lects have between 90 and 99 words; on the other hand, there are six lects of which I have fewer than 40 words available. The average is only 68 words (rounded up); in total, 1056 words are missing out of 3500 – over 30%. There is some regularity in the gaps – some words are nearly always absent, others rarely – but a lack of overlap between the words that are available means that the number of words available for the comparison is frighteningly low in some lectduos (any combination of two lects). Although Nopthrinthre has 48 words and Kocheyali 22, there are only 13 words they both have. The average number of words used for the comparisons (i.e. that are available in both lects in any random lectduo) is therefore only 51 (rounded off).
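(A side note on reading such counts off the spreadsheet itself: since a missing word ends up as #N/B in a lectduo’s comparison row – see paragraph 3.0 – something as simple as

=COUNT(D5:GZ5)

returns the number of words available to that duo, because COUNT only counts the numeric cells and skips the error values. The cell references are illustrative only.)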

When this is the basis of a lexicostatistical analysis with a Swadesh list of 100 words, numbers of missing words as high as these can skew the results enormously. As a result it’s possible that a lect of which we know where it was spoken and of which language it was a dialect is – according to the lexicostatistics – barely more closely related to its neighbouring lects than to a random other Yokuts lect, or even seems to be more closely related to a lect that is part of a different language.

To an important extent, that is caused by the fact that some words are the same in all lects – water, for example, is always something like ilik – whereas others have a great number of different forms. If many of the 13 words in the lectduo Nopthrinthre & Kocheyali are words like water, the percentage of matches will easily skyrocket by several dozens of percentage points.

Where many of the available words have many different forms, on the other hand, the comparison will produce a result much lower than it should be.

In this case, the comparison is based on the numerals one and two and eleven body parts. In six of those words, the lects (almost) unanimously have the same form; for tongue, 26 of them agree, and for hair 22 have the same form.

For the last five words, sixteen or seventeen lects have the same form, although the total number of different forms is sometimes as high as ten, which might explain why the ultimate lexicostatistical result is still a proportionate 53%.

To gain some insight into the differences between the samples that exist in this area, I had Excel calculate the average number of forms per Swadesh word, over the Swadesh words each lect has (see the Notes on the attachments; also see the Excel attachment, Swadesh tab, for a summary): the Discrimination Index.
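(A minimal sketch of how such an average can be computed per lect – all cell references here are hypothetical: a helper row, say Row 700, holding the number of different forms per Swadesh word, and the lect’s own symbols in, say, Row 3, with missing words simply left blank:

=AVERAGEIF(D3:GZ3;"<>";D$700:GZ$700)

which averages the forms-per-word counts over exactly those Swadesh words the lect has.)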

Yawlamni and Wikchamni have all 100 words, so they get to set the norm: 4,03.

The average is slightly above this – 4,06 – which indicates that, strikingly, the more basic words (those with fewer forms) indeed tend to be missing more often.


The lower this number, the higher the proportion of common basic words that are the same in (almost) all lects and the higher the comparison with other lects will turn out to be. Examples are Ā-te-pitch with 3,81 and Chawchila with 3,86.

The higher this number, the lower the proportion of basic Yokuts words and the lower the comparison with other lects. Examples are Nopthrinthre with 4,23 and especially Ta-kin with 4,57.

The more complete a sample is, the closer this Discrimination Index will get to the aforementioned 4,03, but that doesn’t mean that a small sample will by definition deviate far from that norm: Gawia only has 39 words, yet that small sample is very well proportioned: 4,05. That is even better than Chunut, which with 97 words still scores 4,07!

These effects are inherent to the quality of the samples and will be very hard to completely erase by using alternative computing methods.

1.2 Lexicostatistics: general pros and cons

We’ve seen how data gathering was done for Yokuts; this paragraph is about the general pros and cons of the lexicostatistical method itself. The method was invented by Morris Swadesh (after whom the word list used is named), who in the 1950s compared the vocabularies of various (mainly European) language groups to find out if there are certain words – concepts – that tend to remain the same – that is, they don’t tend to be replaced by a completely different form – over the course of centuries (sound changes are permitted).

He indeed found such words, so he went on to make a list of usable words to compare languages: at first one of 500 words, which he then kept fine-tuning and shortening, until he had the list of 100 words that I use (Oswalt 1971, Campbell 2004).


In principle, that means one has a number of lists of 100 words, which one compares to each other and gives symbols to, after which one counts 1 for every match and 0 for every difference. For example:

all    A    al(le)    A    1
claw   A    klauw     A    1
heart  A    hart      A    1
leaf   A    blad      B    0
sand   A    zand      A    1
tail   A    staart    B    0

Four matches out of six comparisons gives a percentage of 66% between English and Dutch: simple, straight-forward, uncomplicated; just a pleasantly easy method to compare the vocabularies of related lects, partly because it gives one single number that tells one to what extent the lects are related.

However, there are many objections. (I’ll loosely follow Campbell’s (2004) list of them.)

The list itself, with its basic concepts, is straight away the first basic assumption Campbell (2004) criticizes in his article: the assumption that there is a universal, culture-free basic vocabulary, while every single concept on the list turns out to have been borrowed by at least one language on the planet.

In the small sample above, I could have included mountain:

mountain A berg B 0

The word mountain, however, was borrowed from French and therefore cannot be included.

Borrowings in the data are a serious problem. What one tries to do when comparing vocabularies is reconstruct the protoform of all forms within a concept that are cognates, or at least assume that a protoform can be reconstructed. If a language with such a cognate borrows another word for that concept, one loses that information. Even worse: the false information one gains from a borrowing can give confusing results.

Imagine you are doing some genealogical research into a family with a rare genetic disorder and you want to know which branch of the family the current generation inherited it from. Of course in that case you’re not interested in the DNA of a Chinese adopted child – but it’s much harder to find out that the great-grandfather wasn’t actually a biological son of his father.

In my Yokuts database, there are some words for seed that I omitted because they turned out to be loanwords from Spanish, though I have to admit I have little overview of this matter. We’re talking about a collection of lists, made in the nineteenth or early twentieth century, of words of mostly extinct Amerindian lects whose tribes often lived closely together and later on often with other tribes on Spanish missions – it’s entirely possible that words have been borrowed between lects.

The next problem is that the list was only made in one language: English. That means the first assumptions have already been made, because the concepts aren’t specified any further than that, even though that is necessary. Even within English, to lie is already ambiguous, Dutch has two subtly different translations of to know, and there are many languages that have two variants of the concept you: a formal and an informal one.

In Yokuts, we see that there are several partial concepts within the concepts louse, where Yokuts distinguishes between head louse and body louse; round, for which Yokuts specifies ‘round like a ball’ and ‘round like a circle’; and we, which has four distinctions, namely whether it’s dual or plural and whether it’s inclusive or exclusive (is the listener included?).

Within these concepts, I therefore had to choose which exact concept I wanted to add to my database, and I will readily admit I let the first few lects that I put in there decide which partial concept I’d use, which are Telamni, Chunut, Wowol and Yawlamni, in that order. The only version of we I had available for Telamni was the dual exclusive, and because that was the – already small – wordlist I wanted to compare to the other, better-known wordlists, I chose that one.


The Telamni sample doesn’t include words for louse or round, but the Chunut/Wowol list only asked for ‘round (like a ball)’, making the choice for me. The words for louse were either both there or both missing in those first few lects, so I simply picked one and went with it. Looking back, I might have had a more complete database if I had picked (one of) the other partial concept(s).

In the lists that specifically distinguish the various partial concepts, you know you found the right translation, but in less specific lists the only way to know that is when that new form is a cognate of a form you’ve already seen before. But if that new form is something completely different, there is no way to know which partial concept it is, so you have to omit it. As a result, lects with more specific lists and lects with cognates of the forms on those specific lists have an advantage over the lects with non-specific lists and non-cognates: the cognate being there means they will have more in common with other lects, whereas lects with non-cognates do not get that opportunity.

Incidentally, these are the concepts of which it’s clear that there are multiple meanings within the concept, but there are more Swadesh words with many different forms that don’t adhere to language borders.

Still, I’ve chosen to allow myself some liberties when filling in my database: for path, I also used road and trail (Yokuts seems to only have one concept for all three English words); for not I also used no; to find to die I also checked dead, and there are numerous individual cases for which the meaning given in the source list isn’t exactly the same as the word on the Swadesh list, for example because it is a plural, an imperative or a sentence containing the intended verb. That is no problem as long as it’s clear which part is the lexical stem.

The opposite problem exists, too: languages that have the same word for two different items on the Swadesh list.

Many Yokuts lects have the same word for black and smoke, and as Campbell (2004) points out, that wrongly causes higher comparison results: when two lects have the same word for black and smoke, this one match now counts for two.

If there are also lects that do have a different form for one of the two concepts, it might be defensible, but if all lects have the same word for both Swadesh concepts, then I’d say the comparison is essentially done with only (the information gathered by comparing) 99 words and should be computed as such, though this is open for debate.

A problem with the method itself is that it assumes that forms are either related or not, without any synonyms and without the ability to specify for certain features, such as a specific sound that only appears in one specific closely related group of lects, but not the other lects, while the rest of the cognate is identical.

In my example of English and Dutch I could have included neck, but that particular example word didn’t fit in the straight-forward sample I wanted to use, because Dutch has two different translations: nek is primarily for ‘back of the neck’, hals is primarily for the front and sides.

neck A nek, hals AB 0,5

Needless to say, in practice things aren’t this clear-cut. Heggarty (2010), too, notices that working with merely ones and zeros is polarising rather than quantifying, which only skews reality. He mentions this more precise way of counting matches as one way to improve the traditional lexicostatistical method.

In my BA thesis I had already refined this “binary straight-jacket”, as Heggarty (2010) calls it, by including the possibility of half matches. In this thesis the possibilities are 1, 0,75, 0,66, 0,5, 0,33, 0,25 and finally 0 – also see paragraph 3.0 on partial matches for the exact calculations. Yokuts, too, has an abundance of examples of lects with two forms for one concept, which, like in the example, get two symbols instead of one.

Another reason to give two symbols for one concept is when a concept involves a large group of cognates, and within that group there is a smaller group that has a compound of which the first part is the cognate, plus an affix.

I could give all of them the same symbol, but then the distinction would be lost. I could give them different symbols, but then I’d fail to acknowledge the fact that they’re partial cognates. So to do justice to the relation on one side and the difference (and even closer relation within the group) on the other, I have given them a double symbol, AZ: A for the broader cognate, Z for the derivational suffix.

One final note before we move on to the results: lexicostatistics is meant to find the degree to which languages are related. It is not a suitable method to demonstrate if they are related in the first place (Heggarty 2010). Poser (2004) puts it like this: “[…] we've learned that massive borrowing does occur, that grammatical structures can be borrowed, and that borrowing of basic vocabulary is more common than we thought. This means that we have to be more concerned than we used to be about the possibility that non-chance similarities between languages are due to borrowing rather than common descent. This problem [becomes] more severe the farther back we go both because the total amount of evidence becomes smaller and because the farther back we go the less likely we are to know anything about the external history of the languages, that is, who was in contact with whom and what the nature of the contact situation was. As a result, at great time depth we are in a poor position to distinguish genetic affiliation from diffusion. Of course, borrowing can also skew subgrouping, so our improved knowledge of language contact phenomena poses a problem there too.”

1.3 Nonetheless: the lexicostatistical results

These issues don’t necessarily mean that lexicostatistics cannot be useful for comparing groups of lects with mainly lexical variation – like Yokuts.


Because the Yokuts lects differ mainly in their vocabulary – Silverstein (1978, p.446) describes them as ‘remarkably homogenous’ – this is despite everything a viable basis for a comparison which seeks to find the closely related groups within the larger group of lects. Moreover, cognates in Yokuts are not difficult to identify and the cultural differences between the Yokuts tribes are small, and partially of a (probably) recent nature (p.c. Smith).

Therefore, I present below the results of my lexicostatistical analysis of, including Toltichi, thirty-five Yokuts lects (excuse the small image with hashtags; the actual table with readable percentages is in the Excel attachment, tab Swadesh).

The first thing that stands out, are the two big yellow-orange blocks: the Northern and Southern Valley lects cling together, and so do the Tule-Kaweah and Kings River lects plus Gashowu. In addition to those, there are the smaller orange blocks of Delta Yokuts (dark green), Northern Hill and Toltichi (pink) and Buena Vista (purple).

I put some thicker lines around the lects that belong to the same languages to show right away what this method gets right and where it fails.

For example, compared to neighbouring lects, it’s immediately clear that Jasnil, Coconoon and Yachikamni form a unit, but even so, the percentage between Yachikamni and Jasnil is only 67%. Likewise, although the lect “Kings River” is close to its fellow Northern Valley lect Nopthrinthre, it only has 60%, 69% and 62% in common with the other Northern Valley lects.

Table 1: Lower left corner: the results of the lexicostatistic comparison, Toltichi included. Top right corner: (a quick indication of) the number of words available per comparison.


Yet many lects from very different languages score percentages in the 60s as well (see Dumna & Gashowu with 60%, or Choynok & Kocheyali with 65%). There are even out-of-place blocks of over 70%, even though I considered that the border between dialects and languages: above 70% two lects would be dialects of the same language, under 70% they would not. But here we see Hewchi-Eyulahua & Ā-te-pitch with 71%, and the same for Chunut and Gawia – not to mention the shocking 92% match between Ā-te-pitch and Gawia, though that is partly due to Ā-te-pitch being a mixed lect.

Qualitative analysis shows that Gashowu is a language of its own, but here we see it’s close to Kings River Yokuts, with percentages often over 70. Even worse: Toltichi, which has the most in common with Dumna (84%), also shares a lot with Nutunutu, Wechihit and Wo’lasi: 80% – just as much as with Chukchansi!

Partially these unwanted results are the effect of low numbers of available words when comparing lectduos, and partially they’re caused (or worsened) by lexicostatistics’ lack of nuance.

The important question is whether I can find a simple alternative that uses the same data, yet gives a more nuanced result.


1.4 Glottochronology

Then there is the matter of glottochronology. Comparing languages by comparing their vocabularies wasn’t the only thing Morris Swadesh came up with. He also believed he had discovered that the rate at which these Swadesh words were replaced by a different form was the same in all languages, so one could now compute around which point in history two languages had split up. This is what he called glottochronology (Campbell 2004).

According to Campbell (2004), the terms lexicostatistics and glottochronology are usually used interchangeably. I do not do that. To me, lexicostatistics is the part where one compares wordlists to get percentages as a result, whereas glottochronology uses those percentages to compute when each lectduo ceased to be the same lect.

The only thing I do, is group the lects as well as possible based on their vocabulary – there is no depth, I’m not trying to infer a tree from my data, much less compute dates of divergence or say anything about when Proto-Yokuts was spoken.

Glottochronology was first received with enthusiasm (by archaeologists), and then heavily criticized (by linguists). In paragraph 1.2, I mentioned the basically wrong assumption that there is a universal, culture-free basic vocabulary, but Campbell (2004, p.202-203) identifies three more such assumptions. These specifically concern glottochronology:

2) the rate at which words are retained is the same through time

3) the rate at which words are replaced by a different form is the same for all languages

4) these data can be used to calculate how many millennia ago two languages split up, using only the variables C (the percentage of matches) and r (every 1000 years, only 14% of the lexicon is replaced: assumptions 2 & 3) – the standard formula is sketched below
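For reference – the formula itself is not spelled out here – the standard glottochronological calculation combines these two variables as

t = log C / (2 log r)

with t the time of separation in millennia. A quick worked example with the assumed retention rate r = 0,86: two lects sharing C = 0,74 of the list would be dated to log(0,74) / (2 log(0,86)) ≈ 1 millennium, i.e. roughly a thousand years, of separate development.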


Campbell (2004) points out that even though sound changes are very predictable – not surprising, given that speech sounds are more limited in their variation possibilities, by the lips, tongues and teeth that form them as well as by the ears that hear and distinguish them – nothing gives us any reason to suspect the same is true for the lexicon. And indeed: Icelandic retains 97,3% of its vocabulary over the period of which we have written Icelandic texts, but for English the same percentage is only 67,8%. The assumption that on average 86% of words remain the same for 1000 years turns out to be meaningless.

In addition to that, Campbell (2004) notes, languages don’t split up one day, never to have contact with each other again: there is often still a lot of contact between speakers of the two new different lects, so even after that moment, innovations can spread from one language to the next, making it seem like they split up (much) later than they actually did. Some linguists still think glottochronology is a convenient way to gain some first insight into the subgrouping of language families with a large number of lects, but Campbell (2004) emphasizes that glottochronology is not suitable for this goal and other – qualitative – analyses are still necessary.

There are also people who use glottochronology to get a preliminary idea of which languages split off from the common ancestor first and which ones did so later, but Campbell (2004) points out again that the data glottochronology produces are unreliable.

For what it’s worth, my own data, too, show that neither glottochronology nor lexicostatistics are suitable to produce any reliable order in which languages split off from Proto-Yokuts or the larger subgroups later on: we’ve already seen how closely Gashowu seems to be related to Kings River Yokuts, even though, as I mentioned above, qualitative analysis shows it must have split off at an earlier stage (Whistler & Golla 1986) (also see the note on Gashowu in the Introduction).


Atkinson & Gray (2003, p.436) add this to the discussion: “the clustering methods used [in glottochronology] tend to produce inaccurate trees when lineages evolve at different rates, grouping together languages that evolve slowly rather than languages that share a recent common ancestor.”

Summarizing, glottochronology has lost any credibility as a scientific method (which doesn’t mean it isn’t still being used…).


2. Newer alternatives

In chapter 1 I explained that Swadesh’s glottochronology had some important downsides. In this chapter I discuss some modern attempts to show the degree to which lects are related and to calculate dates of divergence while still using the lexicon.

In 2003, Atkinson & Gray came up with a new method. Their starting point was the similarity in development between languages and biological species: evolution, but with languages.

The biology field has developed computational methods to reconstruct which species are in which way related to each other, by using DNA. The idea is to use these same methods for the vocabularies of languages, or, as Pagel, one of the proponents of this idea, put it: “Darwin asserted that languages, like biological species, evolve by a process of descent with modification. If correct, we can expect human languages to form into family trees, known as phylogenies, which chart the history of their evolution in a manner analogous to that for biological species. (…) This raises the possibility that (…) we can use the combination of phylogenetic trees of language along with statistical models of how languages evolve to detect and characterize the signature of these historical processes.” (Pagel 2009, p. 406)

These are the main disadvantages of lexicostatistics, paraphrased from Atkinson & Gray’s article (2003):

1) by summarising the data as percentages, much information is lost

2) the old methods don’t account for the fact that different branches can evolve at different rates, causing them to group the lects that evolve slowly, rather than the ones that are actually the most closely related

3) a high number of borrowings between lects means trees aren’t the best way to approach the data

4) there rarely exists a strict ‘glottoclock’, and therefore dates of divergence are rarely correct


But, Atkinson & Gray claim, recent developments in the computational phylogenetic methods could avoid these issues:

1) trees can be inferred from the data directly, instead of working with percentages

2) models of evolution are better at grouping lects; if different branches evolve at different rates, it’s better to use maximum-likelihood models, or better yet, Bayesian Markov chain Monte Carlo (MCMC), instead of distance models (in any case a set of differences between lectduos) and parsimony models (least number of steps).

(If I understand MCMC correctly, it’s like seeing a chessboard some time mid-game, and then using the computer to compute in which ways the players could have reached the situation currently on the board: which pieces have been moved when, which pieces captured which, et cetera. That results in thousands of different scenarios, which MCMC then groups based on common traits, and summarises in a proportional sample. The most likely moves are also more common in these scenarios, and all previous scenarios are considered in the analysis, so all in all an ever more refined idea of the chess game so far arises from the data.

The Bayesian element in this analogy is that the computer omits the scenarios with very dumb moves, even when those give the lowest number of moves until the current board, or other advantages of a more mathematical nature.)

3) Borrowing can be omitted, and there are computational methods that don’t force the data into a tree, so one can still see if there have been any influences in the data that suggest a different development than one of evolution, with an ancestor and offspring.

4) The assumption of a strict glottoclock (Campbell 2004’s rate of retention through time) can be relaxed by using rate-smoothing algorithms (Atkinson & Gray 2003)

Of course, this too was criticized by many people in the actual field of linguistics; one of them is Bill Poser. Poser (2003) admits that Atkinson & Gray (2003) have managed to circumvent a number of the basic problems, but points out the unreliability of any method that is based on merely lexical changes, since these, unlike sound changes that are systematic, are far more likely to be influenced by cultural factors (this was also mentioned by Campbell 2004). Also, the biological version of this method uses DNA as its basis, which in principle contains all information, while the lexicon is just one part of a language.

Poser (2003) is not satisfied with the rather summary explanation about the rate-smoothing algorithms used to soften the assumption of the strict glottoclock. After all, the differences in rates can be enormous – it’s actually too soon to tell exactly how much – and Atkinson & Gray are not very clear about what exactly their algorithm is (Poser 2003).

Atkinson & Gray then use the aforementioned methods to analyse the data of 87 languages – an existing database made by linguist Isidore Dyen – which results in a tree that in terms of subgrouping matches what we already know of Indo-European and the most likely development of its various branches. Since MCMC presents multiple scenarios in proportions relative to their respective likelihood, they write, they can estimate more accurately when the various split-up moments must have occurred, with a probability interval. Also, they can incorporate pre-existing knowledge by using it as a filter that only lets through the scenarios that match these already known relations between individual branches.

For the next variant, Atkinson and Gray (2003) omit all uncertain cognates in their data, to eliminate any contaminating influence of cognates that are potentially not actually cognates, and of borrowings that aren’t recognized as such.

Finally, they vary their assumptions about whether or not Hittite was the root of the tree and about missing words in their Hittite and Tocharian A and B samples, but in short, they present with a fair degree of certainty their conclusion that the age of the Indo-European languages must be between 8700 and 10100 years ‘before present’, meaning before 1950, the year of reference for carbon dating (Atkinson & Gray 2003).


Poser’s (2003) next objection is about Atkinson & Gray’s (2003) analysis of 2449 cognate sets, without an explanation of how they arrived at that number. Poser (2003) guesses they mean all groups of cognates and can see the logic in that, but, as he says: “you can’t use binary characters based on such cognate sets as the input for clustering algorithms because characters like “has a cognate of ursus as its word for ‘bear’” are not independent. If, for example, a language has a cognate of ursus as its word for ‘bear’, it doesn’t have a cognate of medved.”

(I’m not so sure about that – simply the fact that synonyms exist shows that languages can have two words for the same meaning.)

His last point is rather metaconceptual, since it concerns the way it has been published: why would they send a letter to a journal specializing in biology instead of writing a full-length paper and sending it to a journal that specializes in historical linguistics, so the people in that field fully understand what exactly they did and can ‘kick it around’? (Poser 2003)

Additionally, Poser (2004) points out just how little we know about the circumstances in which languages change – see also his quote at the end of paragraph 1.2.

Mark Liberman (2003) puts the same argument in these words: “if two languages A & B share 80% of the Swadesh 200 list, then if the underlying rate is .8 retention per millennium, A and B probably separated about 1000 years ago; but if the underlying rate is the .34 retention per millennium documented for East Greenlandic, then A and B most likely separated about 210 years ago; and if the underlying is the .976 retention per millennium documented for rural Icelandic, then the most likely time for the separation of A and B is about 9200 years ago. These look like pretty big uncertainties; and the number of cases for which we have good calibration of historical "glottoclock rates" is not very large; and there are almost certainly significant effects of speech community size, extent of contact, type of social organization and so on, which are not very well varied in our sample of calibration cases.”


Evans, Ringe & Warnow (2004) are a statistician, a linguist and a computer scientist who wonder if statistics and computational methods of the kind Atkinson & Gray (2003) use are at all suitable to approach this kind of linguistic problem. Their conclusion is that, before going on to calculate dates of divergence, one should first find out more about the change process: “Therefore we propose that rather than attempting at this time to estimate times at internal nodes, it might be better for the historical linguistics community to seek to characterise evolutionary processes that operate on linguistic characters. Once we are able to work with good stochastic models that reflect this understanding of the evolutionary dynamics, we will be in a much better position to address the question of whether it is reasonable to try to estimate times at nodes. More generally, if we can formulate these models, then we will begin to understand what can be estimated with some level of accuracy and what seems beyond our reach. We will then have at least a rough idea of what we still don't know.” (Evans, Ringe & Warnow 2004, p.19)

Mark Pagel continues on the path that Atkinson & Gray (2003) proposed (see for example Pagel 2009), but additionally does exactly what Evans, Ringe & Warnow suggested: he concerns himself with lexical replacement and attempts to find a way to better predict which words are replaced often and which ones aren’t. He also argues that languages don’t have a constant rate of change in the first place, but rather have short phases in which they evolve quickly (Pagel 2008). According to him, languages, like biological species, are more common near coast lines and live there – at least their speakers do – closely together. That may cause people to choose to change the way they speak to distinguish themselves from other groups in their vicinity and bind their own group. As a result, languages that have split up often differ far more from their ancestor than languages that remained one for centuries. His way is to analyse data about word frequencies from languages from all over the world, and to conclude from those that the words that are used most often are also the ones that are least likely to be replaced (Calude & Pagel 2011), or, as Pagel (2007) puts it: “[…] we suggest that the frequency with which different meanings are used in everyday language may affect the rate at which new words arise and become adopted in populations of speakers. If frequency of meaning use is a shared and stable feature of human languages, then this could provide a general mechanism to explain the large differences across meanings in observed rates of lexical replacement.”

By doing that, he circumvents the old assumption that all words are replaced at the same rate throughout history, but instead he assumes that the frequency with which words are used nowadays is the same as in, for instance, 5000 B.C., which is difficult to prove, and makes additional assumptions about cultural factors.

On the other hand, the methods I introduce in the next chapter do no such thing: I assume that if a word has changed little or nothing over time, it must be frequent and always have been, and if there are ten different forms for one concept, it’s probably not as frequently used. Then again, I don’t attempt to calculate dates of divergence, making this point slightly moot.


3. New methods

I have tried out various ideas in search of an alternative to the oversimplistic lexicostatistical method, and I will discuss them in the upcoming paragraphs. But before I move on to the new methods, there is the issue of partial matches. As an inherent part of the data, these recur in every method; therefore I will first explain how they’re calculated.

3.0 The treatment of partial matches

The traditional method assumes the result of any comparison between two words is always either 1 or 0. As we’ve seen in chapter 1, this isn’t in fact the case. In my BA thesis, I solved the problem of the half-matches rather amateurishly, by adding the additional halves to each lectduo manually. For this thesis, I dug deeper into the available Excel formulas until I discovered a combination that worked. Ideally, I would have liked to find one formula that works for every possible combination of symbols, but alas, I didn’t get that far: I’ve kept the old, simple formula for comparisons between one symbol per lect (i.e. A & A, or A & B), for example:

=IF(GL$3=GL468;1;0)

Additionally, I have put together several formulas for situations in which at least one of the lects in a lectduo has multiple symbols. (Unfortunately, that does mean I have had to add these formulas specifically in those places where they applied, which makes it rather tricky to ever add more words to the database, should the occasion arise!)

For situations with two symbols, I’ve combined the scenario with two synonyms (i.e. AB, CF) with that of a root-affix type of situation (AZ, BY). This formula first compares the two Excel cells as such (AB & AB then results in 1 right away). If there is no match, it then continues to compare the first symbol of the Row 81-lect (Nopthrinthre) to both symbols in the first lect (Row 3, Telamni). If that’s a match, the result is 0,5. Added to this, also for half a point, is the comparison between the second symbol of Nopthrinthre and both symbols of Telamni (that way, a comparison between AB & BA would still result in 1):

=IF(J$3=J81;1;IF(OR(MID(J81;1;1)=MID(J$3;1;1);MID(J81;1;1)=MID(J$3;2;1)=TRUE);0,5;0)+IF(OR(MID(J81;2;1)=MID(J$3;1;1);MID(J81;2;1)=MID(J$3;2;1)=TRUE);0,5;0))

This formula also works for combinations like AZ & A, but not for A & B, because this formula considers the empty second space as a symbol of its own – two empty places then wrongly result in half a match.
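(For what it’s worth, a possible guard against that – not used in this thesis – would be to demand that the second symbol being compared is not empty, for instance:

=IF(J$3=J81;1;IF(OR(MID(J81;1;1)=MID(J$3;1;1);AND(MID(J$3;2;1)<>"";MID(J81;1;1)=MID(J$3;2;1)));0,5;0)+IF(AND(MID(J81;2;1)<>"";OR(MID(J81;2;1)=MID(J$3;1;1);MID(J81;2;1)=MID(J$3;2;1)));0,5;0))

This behaves like the formula above for genuine two-symbol cells, but lets a comparison of two single symbols such as A & B fall through to 0.)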

With a lect with 3 (on one occasion even 4!) forms, a more extensive version is needed:

=IF(D$4=D328;1;(IF(OR(MID(D328;1;1)=MID(D$4;1;1);MID(D328;2;1)=MID(D$4;1;1);MID(D328;3;1)=MID(D$4;1;1))=TRUE;1/3;0)+IF(OR(MID(D328;1;1)=MID(D$4;2;1);MID(D328;2;1)=MID(D$4;2;1);MID(D328;3;1)=MID(D$4;2;1))=TRUE;1/3;0)))

Again, a full match gives 1 right away. The second stage now consists of three parts: first, all three symbols of the Row 328-lect (Chukaymina) are compared to the first symbol of Row 4 (Chunut), then to the second Chunut symbol (if there is one) – each match gives 1/3 point.

This formula works when comparing three symbols in one lect to one or two symbols in another (there was no situation where two lects had three symbols for the same word; therefore I haven’t extended the formula unnecessarily to include that option). Because of the asymmetry it is important to make sure the right cell reference leads to the right lect, or the third symbol will not be considered at all!

The last situation was that of synonyms, one of which is a simple symbol and the other is a double symbol, such as A AZ. I needed a formula that could compare these with A AY, or A, or AY:


=IF(GP4=GP$3;1;IF(OR(MID(GP4;1;1)=GP$3;MID(GP4;1;3)=MID(GP$3;1;3))=TRUE;0,75;IF(MID(GP4;3;2)=GP$3;0,5;IF(MID(GP4;3;1)=MID(GP$3;1;1);0,25;0))))

In these cases I have retained the space between the synonyms, as not doing so only made things unclear. After the initial simple comparison of the two cells (for instance A AZ & A AZ), this formula checks if the first symbol of Row 4 (Chunut again) matches Row 3 (Telamni again) (A AZ & A), or if the first three symbols (including the space!) are a match to the first three of Telamni (A AZ & A AY) – this results in a three-quarter match (half, plus half of the other half). If not, there might still be half a match between the last two symbols of Chunut (AZ) and Telamni. Finally, the formula compares the third symbol of Chunut to the first symbol of Telamni. At this point, the matches that get a numerically larger result have already been filtered out, so this is just for cases like A AZ & AY.

The treatment of partial matches in the calculation of the weighted methods

For the traditional lexicostatistical method, I now no longer needed to add points manually. All I had to do was sum the values in the row that were higher than 0:

=SUMIF(D5:GZ5;">0")

The new weighted methods were more difficult. In the old situation, these formulas checked whether there was a match; if so, that 1 or 0 was the number that was summed up. This is manageable with a simple SUMIF formula, in which the weight factor is in the sum range. But for a partial match, of 0,5 for example, the weight factor needs to count for only 0,5 as well, or the difference between partial and whole matches is lost again. To do that, ideally you would like to be able to multiply the match – whether it’s 1, 0,75, 2/3, 0,5, 1/3 or 0,25 – with the corresponding weight factor, and then sum all hundred outcomes of that, all in one cell. Excel has the SUMPRODUCT formula for this, which would have worked just fine, except that I’ve had to keep the missing words from undergoing the comparison calculations by using #N/B, and just like those comparison calculations, SUMPRODUCT does not know what to do with them (which is to say, it just replicates the error instead of pretending it’s a 0).

Fortunately, the number of different partial matches is very limited, so I simply added five extra rows under each row of weight factors, containing the same weight factor multiplied by 0,75, 2/3, et cetera. Then I used a string of SUMIF formulas to use them as the sum range for each type of partial match separately:

=(((SUMIF(D5:GZ5;1;D$652:GZ$652))+(SUMIF(D5:GZ5;"0,75";D$653:GZ$653))+(SUMIF(D5:GZ5;"2/3";D$654:GZ$654))+(SUMIF(D5:GZ5;"0,5";D$655:GZ$655))+(SUMIF(D5:GZ5;"1/3";D$656:GZ$656))+(SUMIF(D5:GZ5;"0,25";D$657:GZ$657)))/(SUMIF(D5:GZ5;">=0";D$652:GZ$652)))*100
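(The five scaled rows themselves need nothing more than a multiplied copy of the weight factor row. Assuming, as above, that Row 652 holds the full weight factors and Row 653 its 0,75 version, each cell of Row 653 can simply be:

=D652*0,75

and likewise for the 2/3, 0,5, 1/3 and 0,25 rows.)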

3.1 Lexicostatistics without the one-variant words

The simplest idea was to just omit the meaningless matches and retain the rest. This is in principle still the same lexicostatistical method of paragraph 1.3, but without the Swadesh words that only have one form (eighteen out of 100). All meaningless matches are now filtered out and the comparison is now based on, at most, 82 words. To that end, I use this formula:

=(SUMIFS(D5:GZ5;D$689:GZ$689;1;D5:GZ5;">0"))/(SUMIF(D5:GZ5;">=0";D$689:GZ$689))*100

Because this is a non-weighted method, the sum range is still simply the row with the ones and zeros of the actual comparison (Row 5, columns D to GZ). But there are two conditions to be satisfied before any match is counted: the corresponding value in Row 689 must be 1 (which means this Swadesh concept has more than one form, in other words, not all lects have symbol A). The second condition is more technical: only values greater than or equal to 0 are to be summed, otherwise Excel tries to total #N/B, which the formula cannot cope with.

3.2 One eleventh: number of variants divided by highest number of variants

The main objection to lexicostatistics is that every match and every difference is considered equal, even though in concepts that only have one form, matches mean much less – nothing, to be exact – than matches in concepts with eight or ten forms. After all, these are the matches that group the lects together, not the word for one.

This variant purely checks the number of different forms per concept and divides that by the highest number of forms found in the entire Swadesh list: eleven, hence the name.

With water, we only find one form, so every match for water counts as 1/11 = 0,09.

With feather, we find those eleven forms, so a match here counts as the full 1.

This is the formula:

=(((SUMIF(D5:GZ5;1;D$655:GZ$655))+ (SUMIF(D5:GZ5;"0,75";D$656:GZ$656))+ (SUMIF(D5:GZ5;"2/3";D$657:GZ$657))+ (SUMIF(D5:GZ5;"0,5";D$658:GZ$658))+ (SUMIF(D5:GZ5;"1/3";D$659:GZ$659))+ (SUMIF(D5:GZ5;"0,25";D$660:GZ$660)))/ (SUMIF(D5:GZ5;">=0";D$655:GZ$655)))*100

For every 1 – a full match – in Row 5, the corresponding weight factor in Row 655 is summed, for every 0,75 match that is Row 656, et cetera (see paragraph 3.0). The sum of those weight factors is then divided by the sum of the (full) weight factors corresponding to every Swadesh word available for the comparison between these two lects (Telamni and Chunut, in this case).
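The weight factors themselves are easy to compute; as a rough sketch in Python (the function name is mine, purely for illustration):

MAX_FORMS = 11                              # highest number of forms found (feather)

def one_eleventh_weight(n_forms):
    # weight factor for a Swadesh word with n_forms different forms
    return n_forms / MAX_FORMS

print(round(one_eleventh_weight(1), 2))     # water: 0.09
print(one_eleventh_weight(11))              # feather: 1.0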

This is a basic way to weight a comparison and we will see in the next chapter that the table with the results looks much more blue, now that meaningless matches only count as 0,09 – one eleventh.

A disadvantage is that now all weight factors are influenced by the one word with the highest number of forms, even though that number, given the data and the possibility of multiple submeanings in one Swadesh word, isn't necessarily reliable. In that case, one mistake may get a lot of influence. We assume feather has 11 different forms, but what if this is a hidden case of multiple meanings within one concept, or what if I or someone else made a mistake and wrote down a word that doesn't mean feather?

Another disadvantage is of a more practical nature: you need to know the number of different forms for each Swadesh word, but should forms like AZ and BY count as different forms (which they do now)?

3.3 Chances: Positive only

The next variant goes one step further. I decided to add a weight factor based on the chance of a match in each given Swadesh word. I let Excel count all matches per Swadesh word and divided that number by the total number of comparisons made for that Swadesh word. The weight factor for a match then equals 1 – [said chance]. In other words, the weight factor for a match is the same as the chance of a difference.

With the word water, that means the following: there are 33 – 34 including Toltichi – lects for which I have a translation of water. The total number of comparisons made is therefore 33 * 32 / 2 = 528. Because there is only one form of this word – something like ilik – the number of matches is also 528 (and the chance of a difference is thus 0):

528/528 = 1

1 – 1 = 0

In other words, a match between two lects for the word water counts as 0 in the weighted comparison, because a match means nothing here.

With tree, the situation is more or less the opposite. With this word, 325 comparisons have been made, of which there are only 48 matches (there are 10 different forms in 26 lects):

48/325 = 0,1477 → chance of a match

1 – 0,1477 = 0,8523 → chance of a difference = weight factor for a match

Each match with the concept tree thus counts as 0,8523, but this differs from Swadesh word to Swadesh word, and would change if more forms were ever added to the database. Other Swadesh words with 10 forms have other chances of a match and therefore other weight factors than tree as well.

This is the formula:

=(SUMIF(D5:GZ5;1;D$694:GZ$694)+SUMIF(D5:GZ5;"0,75";D$695:GZ$695)+SUMIF(D5:GZ5;"2/3";D$696:GZ$696)+SUMIF(D5:GZ5;"0,5";D$697:GZ$697)+SUMIF(D5:GZ5;"1/3";D$698:GZ$698)+SUMIF(D5:GZ5;"0,25";D$699:GZ$699))/(SUMIF(D5:GZ5;">=0";D$694:GZ$694))*100

The principle is the same as in the One Eleventh method, but with different weight factors.
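Outside Excel, the weight factor per Swadesh word could be sketched like this (using the counts for water and tree quoted above; the function name is mine, purely for illustration):

def match_weight(n_matches, n_comparisons):
    p_match = n_matches / n_comparisons     # chance of a match for this Swadesh word
    return 1 - p_match                      # weight for a match = chance of a difference

print(match_weight(528, 528))               # water: 0.0
print(round(match_weight(48, 325), 4))      # tree: 0.8523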

3.4 Chances: Positive and negative

As I mentioned above, matches in words that have the same form in every lect mean nothing, as opposed to a match in a word that has eight or ten forms. For a difference, something similar is true: if a word has many different forms, a difference means far less than in a situation with only two forms.

This is a variant of the previous method, picking up where that method left off: now the differences count, too, and they do so negatively.

The chance of 0,1477 of a match in the example given above now becomes a weight factor of -0,1477 for every – all but meaningless – difference between two lects in the comparison between their words for tree.

Summarizing, a match in tree counts positively as 0,8523, and a difference in tree counts as -0,1477:

48/325 = 0,1477 → chance of a match = (negative) weight factor for a difference

1 – 0,1477 = 0,8523 → chance of a difference = weight factor for a match

Then I had Excel sum all positive and negative weight factors for each lectduo comparison:

=SUMIF($D5:$GZ5;0;$D$711:$GZ$711)

plus

=((SUMIF(D5:GZ5;1;D$705:GZ$705))+ (SUMIF(D5:GZ5;"0,75";D$706:GZ$706))+ (SUMIF(D5:GZ5;"2/3";D$707:GZ$707))+ (SUMIF(D5:GZ5;"0,5";D$708:GZ$708))+ (SUMIF(D5:GZ5;"1/3";D$709:GZ$709))+ (SUMIF(D5:GZ5;"0,25";D$710:GZ$710)))
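A rough sketch of that combined sum in Python (the numbers are invented; note that, mirroring the formulas above, partial matches only contribute on the positive side):

# scores: comparison results for one lectduo (None = missing word)
# p_match: chance of a match per Swadesh word
scores  = [1, 0, None, 0.5, 0]
p_match = [1.0, 0.1477, 0.25, 0.4, 0.9]

total = 0.0
for s, p in zip(scores, p_match):
    if s is None:
        continue                    # skip missing words
    if s > 0:
        total += s * (1 - p)        # (partial) match: positive weight
    else:
        total -= p                  # difference: negative weight
print(round(total, 4))              # -0.7477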

The advantages of this method are that it is weighted, that the weight factor differs per Swadesh word and depends directly on the chance of a match or difference, and that differences play a role as well. We now have a fascinating result in which everything is connected to everything else and the average of the matches and differences is a perfect, balanced 0.

This is caused by the same principle as 0,5 * 2 having the same outcome as 5 * 0,2. Compare it, if you like, to a teeter-totter with a long arm (the 0,8523) and a short one (0,1477). Of course the long arm is much heavier than the short arm, but there are only 48 lectduos sitting on it (the matches), while there are 277 lectduos (the differences) on the short arm, so the teeter-totter is still balanced (48 * 0,8523 equals 277 * 0,1477, with minor rounding differences).

The downside is that it’s difficult to get the final results on a meaningful scale. You’re going both up and down from an average that isn’t a meaningful point, so there is no scale from 0 up to a higher number, like 100, but one that has the 0 floating somewhere in the middle.

It’s not clear what this average of 0 means, exactly (it is somewhat under the cut-off point that we would like to use to separate dialects from languages), much less what any random number like -5,07 should mean. What score should a lectduo have to count them as dialects of the same language?

To solve the scale problem at least to some extent, I converted the results to a 0-100 scale by adding the difference between the lowest number and 0 to every result, then dividing that by the highest result (including that difference) and multiplying by one hundred.

=(HW5+($HR$638/-1))/(HW$639+($HR$638/-1))*100

HW5 contains the result of the comparison between Telamni and Chunut (the sum of both the negative and positive weight factors as described above). HR638 gives the lowest number among the results, HR639 the highest.

Originally, the range of the results ran from -16,57 up to 35,19. Because the lowest result is negative, I multiplied it by -1 and added this (now positive) number to all results, which moved the range up to run from 0 to 51,76. I then divided every result by 51,76 and multiplied it by one hundred, which stretched the range out to 0 to 100.
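As a sketch, with the two extremes quoted above and one invented value in between, the rescaling amounts to:

results = [-16.57, -5.07, 35.19]                 # lowest, an arbitrary score, highest
low, high = min(results), max(results)
rescaled = [(r - low) / (high - low) * 100 for r in results]
print([round(r, 1) for r in rescaled])           # [0.0, 22.2, 100.0]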

That means there is a 0 and a 100 in the table (see paragraph 4.4). The immediate downside of this is that the results, unlike the results in the other methods that only work with positive matches, are stretched out over the entire scale, which has a polarizing effect. There is no minimum of 33% caused by words with only one or two forms (like we see in the lexicostatistical results); there is no conceivable maximum like you’d have in a lectduo that has 100 matches, because the result is different for every possible combination of available words. The average has now been moved to a meaningless 32,9.

The alternative is to simply add 50 to all results. In that case, 50 is the new average and lects deviate from that positively or negatively, without ending up below 0 or over 100.

Something else to keep in mind is that we don’t know for sure whether every Swadesh word contains one concept or multiple subconcepts. In the case of the latter, any difference in the comparisons means a rather harsh penalty – although this effect is slightly muffled, because while I counted the partial matches as partial matches, I failed to notice that these, of course, are also partial differences…!

3.5 Heggarty-inspired

Heggarty (2010) has various suggestions to make the traditional lexicostatistical method more effective, one of them being the idea to use longer Swadesh lists, not shorter ones, to take on the problem with loanwords. When you have a bigger sample and know which words are stable and which ones are not, you can see much better if languages have a common ancestor and are therefore diverging (many matches among the stable words), or if they are not, or only very distantly, related and converging through loanwords (many matches among the unstable words).

If one assumes the stable and unstable words in Yokuts reveal themselves through their number of different forms – this is thus Calude & Pagel (2011) in reverse: it’s not the frequency that tells something about the replacement rate, but the replacement rate that says something about the frequency – a calculation method for this is easily obtainable by using the weight factors of the previous method, without reversing them. Put differently: in the previous method, a high chance of a match was rewarded the least and a low chance of a match was rewarded the most. Heggarty (2010), however, argues to do the opposite: when two lects have ilik for water, and the chance of a match in water is 1, then the match should count as 1. On the other hand, the chance of a match in tree is only 0,15 and now only counts as 0,15.
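A sketch of the corresponding weight factor, the mirror image of the one in paragraph 3.3 (the function name is mine, purely for illustration):

def heggarty_weight(n_matches, n_comparisons):
    # a match in a stable word (one with a high chance of a match) now counts heavily
    return n_matches / n_comparisons

print(heggarty_weight(528, 528))            # water: 1.0
print(round(heggarty_weight(48, 325), 2))   # tree: 0.15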


4. New results

In this chapter, I will discuss and compare the results of the various methods I introduced in chapter 3 (also see the map of the area in the Introduction!).

4.1 Lexicostatistics without the one-variant words

Table 2: The lexicostatistic results without the words that had only one form in all lects

Compared to the actual lexicostatistical results, much out-of-place green and yellow has disappeared from between the larger lect groups (the Valleys and Northern Hill on one side, Tule-Kaweah and Kings River plus Gashowu on the other), enhancing the contours of those groups. This shows right away that many results leaned to a large extent on the 18 words that only had one form in all lects.

This is already a much better result, but the matches still all have the same weight in the comparisons.


4.2 One eleventh

Here we see that the Northern Hill lects Dumna and Chukchansi plus Toltichi form a neat, orange square, without as many orange high scores between them and other lects (Toltichi with various Southern and Northern Valley lects) as we’ve seen in the lexicostatistical variants.

The same is true for Buena Vista Yokuts (Tulamni and Hometwoli in purple) and, to a lesser extent, Tule-Kaweah (light green) and Kings River (yellow). Within these last two languages, Gawia and Ā-te-pitch are still very close (83,9%), and the percentages between Gashowu and Ayticha as well as Michahay are still rather high, at 71,7% and 73,6% respectively.

But within the Valley languages and Delta Yokuts, something else happens: here, the percentages between the less closely related lects are now much lower. Jasnil and Coconoon now only have a 49% match, and the percentage between Jasnil and Yachikamni has dropped under 70% (67%). We saw earlier that “Kings River” wasn’t that close to fellow Northern Valley lects Ta-kin, Chawchila and Hewchi-Eyulahua; now those percentages are only in the forties and fifties, and the percentage with the latter is as low as 38,3%.

It doesn’t get that extreme with the Southern Valley lects, but here, too, we see the green of percentages in the fifties within the Southern Valley square; Tinlinne in particular has relatively little in common with lects like Nutunutu and Choynok.

On the other hand, Choynok is still close to Hewchi-Eyulahua – 80% – and while the out-of-place high scores between Toltichi and a few Southern Valley lects had dropped under 70%, Chawchila still scores above that percentage with two of those same lects: Nutunutu and Wechihit.

Table 3: A weighted method in which the weight factor is the number of different forms divided by the highest number of forms found (eleven)

4.3 Chances: positive only

Table 4: Method with a weight factor based on the odds that a lectduo has a match for a given Swadesh word

The one thing that stands out here is how similar these results are to those of the previous method! Again we see orange squares with Northern Hill plus Toltichi, Tule-Kaweah, Kings River and Buena Vista Yokuts, and again the same cannot be said about Delta Yokuts and Northern and Southern Valley Yokuts. The contours of the larger Valley block are a bit more distinct (except for “Kings River”), and especially the percentages between lects that have little in common have dropped.
