Reverse-engineering the language of Thought


Academic year: 2021



Replication of "Kinship categories across languages reflect general communicative principles"

Jochem van Oorschot

Supervised by: Milica Denić and Jakub Szymanik

February 26, 2021


Contents

1 Abstract
2 Introduction
3 Background
  3.1 Murdock dataset
  3.2 Family trees
  3.3 Primitives and rules
4 Method
  4.1 Overview of the code from Kemp and Regier
    4.1.1 Important files
    4.1.2 Tree representation
    4.1.3 Primitives representation
    4.1.4 Applying rules to primitives
  4.2 Informativeness measure detailed
    4.2.1 Need probabilities
    4.2.2 Calculating the informativeness measure
  4.3 Complexity measure detailed
    4.3.1 Generating categories
    4.3.2 Removing the duplicate categories
    4.3.3 Naming
    4.3.4 Applying categories
  4.4 Tests
  4.5 Using the code
5 Results
6 Conclusion
7 Discussion
  7.1 Achievements
  7.2 Future research
  7.3 Limitations
Bibliography

1 Abstract

Within the field of cognitive science exists the hypothesis of the Language of Thought. This hypothesis states that even though humans speak different languages, we all share some basic concepts of how we describe the world around us. These universal and primitive concepts form the language of thought. Spoken words are labels for these concepts or for combinations of them. In this project, we will try to reverse-engineer some of these concepts by using languages from all over the world and a property of natural languages (which are, for example, English, Dutch, etc.). This property is that natural languages have been shown to support efficient communication by having a near-optimal solution to the trade-off between complexity and informativeness. The complexity is how difficult it is to remember or learn a language. Some languages might require a lot of different terms, making it possible to describe objects or events in the world very precisely, but also making the language hard to remember or learn. A language that can describe something very specific is very informative, in other words, has a high level of informativeness. The trade-off between these properties is that if a language is not very complex, it is usually harder to be specific about something, which decreases the informativeness.

A study by C. Kemp and T. Regier has shown that natural languages are indeed near-optimal solutions to this trade-off problem. To measure the complexity of the languages, they proposed a set of primitives that forms the Language of Thought. However, this set is not guaranteed to give the optimal solution to the trade-off. In this research we aim, first, to replicate the research of Kemp and Regier and, second, to use a different set to calculate the complexity. If this new set results in a better trade-off between complexity and informativeness, that set is more likely to be part of the Language of Thought.

2 Introduction

The language of thought hypothesis derives from Herbert Simon and Allen Newell's "Physical symbol system hypothesis" [Ber14]. Simon and Newell were two researchers who worked on the foundations of AI [Dia02] by using logical systems to mimic the problem-solving skills of humans1. Their hypothesis states

that "A physical symbol system has the necessary and sufficient means for general intelligent action" [Ber14]. One claim that comes with this hypothesis is, quote: "nothing can be capable of intelligent action unless it is a physical symbol system". This means that humans are symbolic systems as well. What Simon and Newell had in mind when talking about a symbolic system is, in short, a system that can combine symbols into more complex structures of symbols ('expressions') by using rules which specify how symbols can be manipulated. Computers, for example, do this by combining ones and zeros with different operations, creating new constructs: "0 1 1" in binary can be interpreted as three. Humans use this system by, for example, saying: all plants are living organisms. Another sentence would be: all flowers are plants. Humans can then use logic and say: all flowers are living organisms.

In 1975, Jerry Fodor took this hypothesis and applied it to the human mind [Res19] in his book The Language of Thought [Fod75]. In it, he stated

1. Eventually Simon and Newell succeeded in creating a program that could prove 38 of the first 52 theorems in Principia Mathematica, with one proof that actually improved on the original.


that the mind processes information like a symbolic system: it starts with basic concepts, or primitives, that are combined using logical rules. This results in more complex structures, or concepts, that refer to new objects in the world. If we want to refer to a particular cup we can say "the red cup", combining the symbol for "cup" and the symbol for "red". It can also be a method of acquiring new information, as demonstrated in the "flowers are living organisms" example in the previous paragraph. The system of primitives and rules is called the language of thought because of the resemblance between how thought is structured and how natural languages (i.e. English, Dutch, etc.) are structured.

With his book, Fodor worked out the Language of Thought Hypothesis (the LOTH), also referred to as "Mentalese". This resulted, among other things, in articles from other researchers that tried to discover which primitive symbols are part of the LoT and by which rules they are combined. Doing this is not a straightforward task because the symbols cannot be directly observed. They need to be reverse-engineered from the data that follows from them (for example: how we describe the world in natural languages). The approach we will be taking in this thesis is looking at the relation between complexity and informativeness in natural languages (i.e. English, Dutch, French, etc.). Zipf and Hawkins [Zip16], [Haw04] have suggested that natural languages are near-optimal solutions to the trade-off between these two components. T. Regier et al. [Ter15] supported this by saying that the function of a language is efficient communication. Good languages are therefore, according to them, "simple, which minimizes cognitive load, and informative, which maximizes communicative effectiveness". They propose that semantic systems in the world's languages tend to achieve a near-optimal trade-off between these two constraints. Another study by C. Kemp et al. [Cha18] also confirms that the optimization of this trade-off relation comes from the need for efficient communication. The trade-off is illustrated in the following example: having a small number of words for describing animals makes it easier to remember all the animals (low complexity) but harder to specify which animal you mean exactly (low informativeness). Every horse-like animal can be called a horse, but if you want to refer to a zebra specifically you would have to say "the horse with black and white stripes", making the communication less efficient. Assuming the hypothesis that natural languages are near-optimal solutions, we want to find a set of symbols (primitives) that enables us to recreate languages that are near-optimal.
The set that gives the optimal solution will most likely come from the LoT.

C. Kemp and T. Regier [Cha12] have shown that natural languages are indeed near-optimal solutions to the trade-off problem in the kinship domain. S. Steinert-Threlkeld conducted a similar study, "Quantifiers in natural language optimize the simplicity/informativeness trade-off" [Ste20], where he argued that quantifiers are a result of optimizing the trade-off relations in languages. Y. Xu, E. Liu and T. Regier showed that this also applies to numeral systems across cultures [Yan20]. Milica Denic et al. [Mil20] "build on previous work to establish the meaning space and featural make-up for indefinite pronouns, and show that indefinite pronoun systems across languages optimize the complexity/informativeness trade-off. This demonstrates that pressures for efficient communication shape both content and function word categories", saying essentially the same thing as S. Steinert-Threlkeld. Kemp and Regier applied this theory to the kinship domain by computing how complex the systems of kinship relations are in different languages. To do so, they required a set of symbols from the LoT that represented the kinship knowledge of humans. Kemp and Regier used a particular set, but different sets are possible for measuring the complexity, which might produce different results. The goal of this thesis is twofold: first, to replicate the paper of Kemp and Regier using Python and to add a description of the code and the procedure of Kemp and Regier. For the replication we will rely primarily on the original paper, the supplementary materials to the original paper, and the released Matlab code that can be found on their website [KR]. The code and methods from there will be referred to extensively for the rest of this thesis. This will enable other researchers to easily experiment with different hypothesized languages of thought themselves. The second part is to try out different sets of primitives ourselves to find the set that optimizes the trade-off relation the most; that set will most likely be part of the LoT. Unfortunately, the last goal had to be left to future research due to the time limit.

3 Background

The goal of C. Kemp and T. Regier was to show that natural languages are near-optimal solutions to efficient communication. They define efficient communication as the optimal ratio in the trade-off between complexity and informativeness. For this they define how to measure both (respectively: the number of rules and the summation of the cost of the individuals, see section 4, Method) and show for approximately 500 natural languages that they are indeed near-optimal. They acquire their languages from the Murdock corpus, which contains kinship terms from different languages.

3.1 Murdock dataset

The research of Kemp and Regier is based on the Murdock dataset [Mur70]. It contains kinship terminology of different natural languages; in other words, it is a dataset of how different languages from all over the world refer to family members. It is divided into eight sets: grandparents, grandchildren, uncles, aunts, nephews and nieces (male speaking), siblings, cross-cousins, and siblings-in-law. The data is stored by classifying different patterns. Most languages only have two terms for grandparents (grandmother (GrMo) and grandfather (GrFa)), which gives a pattern like GrMo - GrFa - GrMo - GrFa. Instead of storing this pattern for all of these languages, Murdock calls it A and refers to the pattern this way. By doing this for all the patterns that are found, he decreases the size of the dataset significantly. An example of how Murdock describes the patterns is shown in figure 1. Using these patterns it is possible to describe a whole language, as is shown in figure 2. Because the order of terms is fixed, the letters for patterns are reused. An A in the first column means pattern A from


"A" Bisexual Pattern. Two terms, distinguished by sex, which can be glossed as "grandfather" and "grandmother". With the following variant:
Aa – with separate terms for GrFa (ms), GrFa (ws), and GrMo.

Figure 1: Kin term pattern description by Murdock

grandparents; in the second column it means pattern A from grandchildren, etc. Later on we will talk about comparing natural languages to artificial languages, and for generating these artificial languages Kemp and Regier rely heavily on these codes.

Figure 2: Two lines from the Murdock dataset.

This file can be found, digitized, in the code from Kemp and Regier under the names "At10.cod" and "kinterm origin.txt". How these files together represent the Murdock data is explained in subsection 4.1.2.

3.2 Family trees

The kin classification systems from Murdock's dataset are used to refer to kin types. This is visualized in the family tree, where every kin type is indexed. The first position is linked to the maternal grandmother, the second to the maternal grandfather, etc. The numbering starts with the female tree (1-56), continues through the male tree (57-112), and ends with Alice and Bob (113 and 114 respectively), Alice and Bob being the female and male centers of the tree. This order is shown in figure 3 as well.

[Female tree (1-56) – Male tree (57-112) – Alice (113) – Bob (114)]

Figure 3: The order of a partition. Note that Alice and Bob are not included in the male and female tree. Alice is from the female tree (she is 113), and Bob is from the male tree (he is 114).

Having a link between the position of a term and the family member makes it possible to process the terms computationally. For example, if Alice refers to her mother, that term can be found in the partition at index 15. Finding that index in the tree shows what relation the term mother holds towards Alice.
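As an illustrative sketch of how this indexing can be used computationally (this is toy Python of our own, not the code of Kemp and Regier; the placeholder terms are made up):

```python
# A partition is modeled as a list of 112 kin terms, where position i-1
# holds the term a language uses for tree member i (female tree 1-56,
# male tree 57-112; the egos Alice (113) and Bob (114) are excluded).
partition = ["term%d" % i for i in range(1, 113)]  # placeholder terms

MOTHER_OF_ALICE = 15  # tree index of Alice's mother, from the text above

def term_for(partition, member_index):
    """Look up the kin term a language uses for a given tree member."""
    return partition[member_index - 1]

print(term_for(partition, MOTHER_OF_ALICE))  # prints "term15"
```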

The reason there are two trees is that in certain languages the reference to a family member depends on the gender of the person making the reference: there might be a different word for mother when referred to by a female than when referred to by a male. There also happens to be more information on the


male lineage, which will be discussed later. The trees can be found in figures 4 and 5 with the numbering of the members.

Figure 4: Female Tree

Figure 5: Male Tree

3.3 Primitives and rules

The primitives are the symbols that Kemp and Regier assume come from the Language of Thought; they refer to patterns and objects in the physical world. They say that some concepts, such as "Parents" and "Older", are primitive concepts of the LoT and can be combined via rules into more complex concepts such as "grandmother". These rules are proposed in figure 6.

For the application of the LoT, the combinations of the primitives are formalized as logical rules. This is shown in equation 1.

C(x, y) ⇔ A(x, y) ∧ B(x)

Mother(x, y) = parent(x, y) ∧ female(x)    (1)

The right part of the equation is the "intension". This is defined by Kemp and Regier as: "A definition constructed using the resources in figure 6". It


is a formula from the language of thought, constructed from the primitives that are combined by the rules. It can be used to describe concepts in the world, which is what the left side is called: a "concept" (used interchangeably with "category"). This concept has an "extension", which is the set of all the elements that can be described with the concept. The extension of "mother", for example, will contain pairs like (1,5), (1,6), (2,5), etc. (refer back to figure 4 to see what those codes refer to). For our purposes we will have unary and binary extensions. A unary extension describes concepts applicable to single individuals, like female or male, while a binary extension describes relations between individuals, like sister of father. These terms are formalized as follows: the extension of a unary concept X = {x | x is an element in the real world and X applies to x}. For a binary concept Y, the extension of Y = {(x,z) | x and z are elements in the real world and x stands in relation Y to z}.
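To make the distinction concrete, here is a small illustrative sketch (a toy three-person world of our own invention, not data from the thesis) representing extensions as Python sets:

```python
# Unary extension: the set of individuals a concept applies to.
# Binary extension: the set of (x, z) pairs where x stands in the relation to z.
female = {1, 2}            # assumed: individuals 1 and 2 are female
parent = {(1, 3), (2, 3)}  # assumed: 1 and 2 are parents of 3

# mother(x, y) = parent(x, y) AND female(x), as in equation 1
mother = {(x, y) for (x, y) in parent if x in female}
print(mother == {(1, 3), (2, 3)})  # True
```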

In creating new concepts it is possible to use both primitive and previously created non-primitive concepts (see equation 2), which can again lead to new concepts.

daughter(x, y) = child(x, y) ∧ female(x)

sister(x, y) = ∃z daughter(x, z) ∧ parent(z, y)    (2)

However, this can also become an infinite loop, because it can repeatedly create the same concept, as is shown in equation 3. The concept of mother remains the same, but the intension can vary infinitely. To prevent this a filter is applied, which will be explained in depth in the method section.

mother = mother ∧ female    (3)

The last important term is the ego-relative extension. This is the set that contains all the individuals that have a direct relation with the ego, or center, of the trees. In the words of Kemp and Regier: it is the set of all relatives x in the trees such that mother(x, Alice) or mother(x, Bob). Formalized, we say the ego-relative extension E = {e | e is a member of the family tree and E applies to e}. In practice the ego-relative extension of mother contains individuals 15 and 71. These can be found in figures 4 and 5.
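The ego-relative extension can be sketched in the same set notation (illustrative Python; the pairs are toy data based on the tree indices mentioned above):

```python
ALICE, BOB = 113, 114  # the egos of the female and male tree

def ego_relative_extension(binary_extension):
    """Keep only the relatives x with (x, Alice) or (x, Bob) in the relation."""
    return {x for (x, y) in binary_extension if y in (ALICE, BOB)}

# Toy binary extension of "mother": 15 is Alice's mother, 71 is Bob's,
# and (15, 30) is some unrelated pair that should be filtered out.
mother = {(15, ALICE), (71, BOB), (15, 30)}
print(ego_relative_extension(mother) == {15, 71})  # True
```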

4 Method

4.1 Overview of the code from Kemp and Regier

4.1.1 Important files

The most important files provided by Kemp and Regier are listed here with a small description; they will be explained more in depth later in this section.

• "corpusstatistics oct10.xls": Contains the need probabilities for the categories. It is an Excel sheet with the English and the German probabilities in the first columns, which are combined using Excel formulas into the third column.


Figure 6: Section A contains the primitives and section B the rules used to combine them. The last three rules are respectively: Inverse, Symmetric Closure and Transitive Closure.

• "At10.cod": Contains the keys for decoding the languages in "kinterm adjusted.txt" and "kinterm orig.txt".

• "kinterm adjusted.txt": Contains the usable encoded languages per province and society. Derived from "kinterm orig.txt", which contained more languages but had too much missing data.

• "mastertree.pdf": Displays the family tree for Alice and Bob. It shows which indices belong to which family member.

• "rwpartitions.txt": The partitions from natural languages that will eventually be measured.

• "componentbase.m": A Matlab file that provides all the relational matrices for the primitives and the most common categories.

• "makecomp.m": A Matlab file that combines the categories and filters out the duplicates. The order in which the categories are combined is very specific and should be taken into account when replicating.

• "reln2exm.m": Extracts the ego-relative extension from a relational matrix.

• "runanalysis.m": The main Matlab file. All the options for the code are specified here (i.e. create partitions, run parallel, etc.).

• "setps.m": Specifies the storage parameters (i.e. how much memory is allocated, where required files are located, etc.).

4.1.2 Tree representation

As mentioned before, the indices of the trees shown in figures 4 and 5 correspond with the columns of the partitions (i.e. the first element of the partition refers to the first member of the tree). Every row in the "rwpartitions.txt" file is one partition. The zeroth index, or first element of the line, is the frequency of the partition in the Murdock dataset. For example, suppose the first nine elements of a partition are [2 1 2 1 2 1 3 1 3]. The first element means that this particular pattern appears twice in the dataset. The other eight correspond with the line of grandparents in the female tree. The actual number is the name that the partition has for that particular family member. In this case the partition has a different name for maternal grandfathers (2) and paternal grandfathers (3). Grandmothers are named the same regardless of the line they belong to (1). The egos of the trees are not included in the partition, which explains why the partitions contain 113 elements (with the first element being the frequency).

These partitions can be decoded using the "At10.cod" and the "kinterm adjusted.txt" files. "At10.cod" contains all the different patterns that are found in the Murdock dataset for the different sections of the family tree, i.e. grandparents, aunts, etc. "Kinterm adjusted.txt" contains the information on how the patterns are distributed for each language. To generate the partition for a language one can take the line of numbers that belongs to a language (a partition) in "kinterm adjusted.txt" (1 1 2 3 2 ...) and match those accordingly with "At10.cod", the first one being the grandparents major pattern, the second the grandparents minor pattern, the third the grandchildren major pattern, and so on. Doing this would result in a partition of more members than the 112 it has now: the 16 patterns with 8 members per pattern would result in 128 members. However, Kemp and Regier decided to leave out the siblings-in-law in creating the partitions because it became computationally intractable, bringing the total number of patterns down to 14 and resulting in 112 members.
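The partition layout described above can be sketched as follows (hedged: the token layout is inferred from the description, not checked against the real files, and the example line is synthetic):

```python
def parse_partition_line(line):
    """Split one rwpartitions.txt-style line into (frequency, partition).

    The first number is how often the pattern occurs in the Murdock
    dataset; the remaining 112 numbers name the 112 tree members.
    """
    numbers = [int(tok) for tok in line.split()]
    frequency, partition = numbers[0], numbers[1:]
    assert len(partition) == 112, "expected 112 tree members per partition"
    return frequency, partition

# Synthetic line: frequency 2, the grandparent row [1 2 1 2 1 3 1 3],
# then dummy 1s for the remaining 104 members.
line = "2 1 2 1 2 1 3 1 3 " + " ".join(["1"] * 104)
freq, part = parse_partition_line(line)
print(freq, part[:8])  # 2 [1, 2, 1, 2, 1, 3, 1, 3]
```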

4.1.3 Primitives representation

To be able to apply logical rules to the primitives while simultaneously storing their semantic values, the primitives were represented as matrices. For this, Kemp and Regier made a distinction between unary and binary primitives: unary being the "female" and "male" primitives, binary being the "parents", "children", "older", "younger", "same sex" and "different sex" primitives. Keep in mind that these primitives are the ones used for the research of Kemp and Regier, but are still a choice made by them. A unary primitive refers to members of the family tree and a binary primitive refers to a relation between members of the tree. An empty unary matrix contains only zero values and has size 114x114, each position representing a member of a partition. From this, the female unary matrix is created by setting all the x-axis indices, or columns, corresponding to a female member to 1. Unfortunately, these indices had to be selected manually by looking at the family tree, because the indexation of the tree is too irregular regarding which members are female; writing a function that catches all the exceptions would have been more work. After setting all the necessary columns to one, the primitive female(x) is ready and looks like figure 7. Converting it to female(y) requires only a transpose. The male(x) primitive is represented as a 114x114 matrix filled with ones minus the female(x) matrix, and male(y) is once again the transpose of male(x).


1 0 1 0 1 0 1 1 ...
1 0 1 0 1 0 1 1 ...
1 0 1 0 1 0 1 1 ...
...

Figure 7: Matrix representation of the primitive female(x).
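In Python with NumPy, the construction just described might look like this (a sketch, not the original code: FEMALE_INDICES is a made-up placeholder for the manually selected female positions):

```python
import numpy as np

N = 114                      # members of both trees plus the two egos
FEMALE_INDICES = [0, 2, 4]   # placeholder; the real indices were read
                             # off the family tree by hand

female_x = np.zeros((N, N), dtype=int)
female_x[:, FEMALE_INDICES] = 1                 # columns of female members
female_y = female_x.T                           # female(y): just a transpose
male_x = np.ones((N, N), dtype=int) - female_x  # male(x): the complement
male_y = male_x.T
```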

A one in a binary primitive matrix means that there is a relation between the indices. Say that Parents(1, 5) = 1; then we know that 1 is the parent of 5. If there is no relation between the members the value will be 0. Instructions on how to create each of the primitives follow:

Parents/Children: All the parents have a certain range of children. Parents 1 and 2 only have child 9, but parents 9 and 10 have the range of children from 13 until 17. Kemp and Regier manually wrote down these pairs and put them in a matrix where the row indices correspond to the parents and the column indices to the children. As explained earlier, every index of this matrix where a parent relation exists gets the value 1, every other 0. For example: Parents[9][13] = 1 but Parents[9][10] = 0. To create the children primitive we transpose the parents matrix.

Older/Younger: For this, Kemp and Regier manually linked the individuals to their generation. Then, the members of each generation are mapped to every member that is in a lower generation, resulting in a matrix with similar functionality as the one created for parents.

Same sex/Different sex: For creating the primitives Female and Male, Kemp and Regier had already manually selected all indices of the members of the same sex. This primitive gets a one at every intersection of the female members, which is done by looping over the list of female member indices twice. Different sex can be created either by creating a 114x114 matrix of ones and subtracting the Same sex matrix, or by changing one of the loops to a loop over the male member indices.
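A toy reconstruction of these three constructions (illustrative only: the pair lists, generations and sexes below are invented for a six-person world, and we also mark male-male pairs as same sex so that the ones-minus complement is correct):

```python
import numpy as np

N = 6
parent_pairs = [(0, 2), (1, 2), (2, 4), (3, 4)]  # assumed toy pairs
generation = [0, 0, 1, 1, 2, 2]                  # assumed (0 = oldest)
female_ids = [0, 2, 4]                           # assumed female members
male_ids = [i for i in range(N) if i not in female_ids]

parents = np.zeros((N, N), dtype=int)
for p, c in parent_pairs:
    parents[p, c] = 1
children = parents.T                 # children is the transpose of parents

older = np.zeros((N, N), dtype=int)
for x in range(N):                   # map members of each generation to
    for y in range(N):               # every member of a later generation
        if generation[x] < generation[y]:
            older[x, y] = 1
younger = older.T

same_sex = np.zeros((N, N), dtype=int)
for ids in (female_ids, male_ids):   # loop over each sex's indices twice
    for x in ids:
        for y in ids:
            same_sex[x, y] = 1
different_sex = np.ones((N, N), dtype=int) - same_sex
```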

4.1.4 Applying rules to primitives

After creating the primitives, they need to be combined using the rules from figure 6B. The first six rules are self-explanatory and have a straightforward implementation; however, the last four will be highlighted in this section since their application to the matrices is not obvious.

The first one is the rule C(x, y) = ∃z A(x, z) ∧ B(z, y), or transitivity. Matrix C can be created by first taking the matrix product A ∗ B. Then the conjunction is taken with a matrix of ones of the same size minus the identity matrix. This ensures that every transitive relation between A and B is found, and by taking the conjunction all the self-references are removed. To give an


example:

A = [0 1 0; 0 0 0; 0 0 0]
B = [0 0 0; 0 0 1; 0 0 0]
A ∗ B = [0 0 1; 0 0 0; 0 0 0]
ones − I = [0 1 1; 1 0 1; 1 1 0]
C = (A ∗ B) ∧ (ones − I) = [0 0 1; 0 0 0; 0 0 0]

Figure 8: Matrix form of the transitive rule C(x, y) = ∃z A(x, z) ∧ B(z, y). The matrix product ensures that C(x, y) = 1 where A(x, z) = 1 and B(z, y) = 1; taking the conjunction sets the diagonal to 0, removing the self-references.

To give a conceptual idea of how this rule is applicable, there is the example of how it can be used to define siblings, shown in equation 2.

The second rule is C(x, y) = A(y, x), or the inverse. This can be achieved by transposing the matrix: by switching all the columns with the rows, every pair is reversed. Related to the real world: we can consider the child relation as the inverse of the parent relation.

The third rule is C(x, y) = A↔(x, y), or symmetric closure. For a relation to be symmetrically closed, every pair needs to have an inverse present. So if a matrix contains the relation (x,y), then after applying this rule it should contain (y,x) as well. Matrix C can be created by taking the disjunction of matrix A with its inverse. This rule exists for languages where, for example, grandchildren and grandparents refer to each other with the same term. It only works if the ego is one of the two; if that is not the case, the relation can be captured by transitive closure.

The final rule is transitive closure: C(x, y) = A+(x, y). This is a combination of transitivity and symmetric closure. In practical terms it means that if set A contains A(x,y), then all other possible paths from x to y should be in set A as well. Kemp and Regier achieve this in four steps. First, the diagonal of the matrix in question is stored for later. Second, they take the disjunction with the identity matrix to set the diagonal to one. Third, they raise the matrix to the power of twelve; this is the maximum depth for which new paths from x to y will be created. Finally, the diagonal is replaced with the original diagonal, and all the nonzero entries are set to one to maintain the binary property of the matrix. Figure 9 gives an example of how taking the power of a matrix results in new paths from x to y, to give more of an intuition of how this method works. This rule is required for systems where, for example, a grandparent and a parent are referred to with the same term, as mentioned before.


    0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0     ∨     1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1     =     1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1         1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1     3 =     1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1     ∗     1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1     ∗     1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1     =     1 2 1 0 0 1 2 1 0 0 1 2 0 0 0 1     ∗     1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1     =     1 3 3 1 0 1 3 3 0 0 1 3 0 0 0 1     =     1 1 1 1 0 1 1 1 0 0 1 1 0 0 0 1    

Figure 9: Matrix from of transitive closure rule C(x, y) = A+(x, y)
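The four rules can be sketched as matrix operations in Python/NumPy (a sketch following the descriptions above, not the Matlab code itself; the depth 12 is taken from the text):

```python
import numpy as np

def transitive(A, B):
    """C(x,y) = exists z: A(x,z) and B(z,y), with self-references removed."""
    n = len(A)
    off_diagonal = np.ones((n, n), dtype=int) - np.eye(n, dtype=int)
    return ((A @ B) > 0).astype(int) & off_diagonal

def inverse(A):
    """C(x,y) = A(y,x): transpose the matrix."""
    return A.T

def symmetric_closure(A):
    """C(x,y) = A<->(x,y): disjunction of A with its transpose."""
    return ((A + A.T) > 0).astype(int)

def transitive_closure(A, depth=12):
    """C(x,y) = A+(x,y): add all paths from x to y up to the given depth."""
    n = len(A)
    original_diag = np.diag(A).copy()                 # 1. store the diagonal
    M = ((A + np.eye(n, dtype=int)) > 0).astype(int)  # 2. disjoin with I
    M = np.linalg.matrix_power(M, depth)              # 3. raise to the power
    M = (M > 0).astype(int)                           # 4a. back to binary
    np.fill_diagonal(M, original_diag)                # 4b. restore diagonal
    return M

# The 3x3 transitivity example from figure 8:
A = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])
B = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]])
print(transitive(A, B))  # a single one, at position (0, 2)
```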

4.2 Informativeness measure detailed

As mentioned earlier, the informativeness of a language reflects how accurate the communication within a language is. In other words, how likely it is that in a conversation the two parties refer to exactly the same member by using some word in their language. Having one term for all grandparents is not very informative, since it requires extra information to refer to, for example, a grandfather. In formalizing this measure, Kemp and Regier took a slightly different approach by calculating for each language the communicative cost of referring to each member of the family tree. Languages with a higher communicative cost are less informative because it costs more to refer to a member. This is slightly nuanced with the use of need probabilities.

4.2.1 Need probabilities

The idea behind these probabilities is that not every member of the family tree is of the same importance; some might be referenced more than others in general. This could affect the informativeness of a language because, for example, if one language has two different words for grandparent and another has two different words for aunt while being identical in every other aspect, then the first language would be more informative, because grandparents are more often referred to than aunts. The distinction between grandparents will make communication more efficient than the distinction between aunts. The reason they are called "need" probabilities is that they stand for how often a term is needed for a reference.

Kemp and Regier acquired these by analyzing two corpora, the Corpus of Contemporary American English [Dav] and the German Reference Corpus [M K]. These were filtered for the use of every member in the family tree, including variants of a word like mother such as mum and mam. The probabilities were calculated individually for each corpus, after which they were added together, resulting in the probabilities found in the "corpusstatistics oct10.xls" file. Not all members are included in this file, for three reasons. First, the cousin analysis was suspended since it became computationally intractable to involve them, so no data on these members was gathered in the first place. This decreases the size of the trees from 56 to 40 members. Second, there was little data on great-grandparents, so they are left out as well, which changed the number of members from 40 to 32. And third, there was not enough data on the niblings3 section of the female tree, which is why they are removed from only that tree in calculating the informativeness, shrinking the female tree from 32 to 24 members. The data on the male tree was sufficient and therefore kept and used.

4.2.2 Calculating the informativeness measure

In formalizing the informativeness, Kemp and Regier used a formula from information theory, shown in equation 4, that calculates the additional communicative cost of referring to an individual. This cost is based on the number of members with the same name and the need probabilities of these members.

c_i = −log2 ( p_i / Σ_{j : z_j = z_i} p_j )    (4)

where c_i is the additional communicative cost of referring to an individual i, p_i is the need probability of the individual i, and p_j is the need probability of a member j that is referred to with the same term as i. The variable z is the vector that represents which family members are referred to by the same term. Hereby it is important to distinguish the member that is referred to from all the members that share its name: if there were only the term grandparent for all the grandparents (and not the distinction grandfather/grandmother), they would still not all have the same need probability, because the need probabilities refer to a location in the tree. To get the total communicative cost of a language one needs to calculate c_i for all the family tree members. The sum over p_j works as follows: if i is the first member in the family tree (the maternal grandmother), then p_j is collected for all the members with the same name as i. In

3. Niblings are children of brothers and sisters, not to be confused with cousins.


the English language this would be the probability of the maternal grandmother plus the paternal grandmother since these share the same term. Formalized this look like equation 5.

c_1 = -\log_2 \left( \frac{p(1)}{p(1) + p(3)} \right)    (5)

The variable z is defined by Kemp and Regier as follows: "z is a vector that represents a partition of the 24⁴ individuals into categories, where z_i represents the kinship category used to label individual i. For example, if Alice is an English speaker, then z_1 will equal z_3 because individuals 1 and 3 are both grandmothers."

To get the full communicative cost of a language, Kemp and Regier calculated the total cost for each of the trees (male and female) and took the average of those two.

C_{female} = \sum_{i=1}^{24} p_i c_i \qquad C_{male} = \sum_{i=1}^{32} p_i c_i \qquad C_{total} = C_{female} + C_{male}    (6)

The communicative cost of each language was stored in a separate file, to be plotted later against the complexity of the languages.
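Equations 4 and 6 can be sketched together in a few lines (a simplified illustration with a hypothetical function name and toy need probabilities, not the thesis code):

```python
import math

# Sketch of equation 4 plus the per-tree total: c_i = -log2(p_i / sum of
# p_j over all members j that share i's kinship term), and the total cost
# is the need-probability-weighted sum of the c_i. Data is illustrative.
def communicative_cost(need_probs, terms):
    """need_probs[i]: need probability of member i; terms[i]: the term
    (category label) that the language uses for member i."""
    total = 0.0
    for i, p_i in enumerate(need_probs):
        # probability mass of all members sharing member i's term
        same_term_mass = sum(p for p, t in zip(need_probs, terms)
                             if t == terms[i])
        c_i = -math.log2(p_i / same_term_mass)
        total += p_i * c_i
    return total

# Two grandmothers sharing one term each incur a cost of 1 bit.
shared = communicative_cost([0.5, 0.5], ["grandmother", "grandmother"])
# With distinct terms, reference is unambiguous and the cost is 0.
distinct = communicative_cost([0.5, 0.5], ["mat_gm", "pat_gm"])
```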

4.3 Complexity measure detailed

If a language contains many different words for different family members, it is generally considered to be a complex language. Kemp and Regier formalize this as follows:

"Assuming that kin classification systems are mentally encoded in a representation language, and that the complexity of a system corresponds to the length of its shortest description in this language"

The shortest description of a language is the smallest number of intensions required to define all the terms in the language. This does not necessarily mean that the number of different words equals the complexity. If a language has, for example, a word for sister but not for daughter, an extra rule is still required in the language to create the concept of daughter. How the concept of sister is created is described in figure 2. The final description of the English language from Kemp and Regier can be found in figure 10.

To be able to assign an intension to every term, Kemp and Regier first generated a set of intensions. This was done by combining the primitives in figure 6 and repeating this with the newly generated intensions. Having created the set of intensions, they tried to assign the shortest matching intension to each term.

⁴Because, as mentioned before, there was not enough data for need probabilities on the

Figure 10: Description of the English language by primitives and rules.

4.3.1 Generating categories

The first step in generating the categories is to combine all the primitives in every possible configuration. This is done by taking each primitive and iterating over the whole set of primitives, applying every possible rule to each combination. The results of the combinations are stored in another file and are called upon to create further categories. Kemp and Regier do this three times because, as they put it: "A depth-three expansion generates virtually all of the attested categories in the Murdock data, but a depth-two expansion does not adequately cover the space of attested categories".
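The expansion loop can be sketched as follows (a simplified illustration with hypothetical names; the duplicate filtering that the real code performs after every depth is omitted here, and categories are represented abstractly rather than as relational matrices):

```python
# Minimal sketch of depth-limited category generation. combine results
# are represented as (rule, a, b) tuples; the real code applies the rule
# to relational matrices instead. The j > i check skips symmetric
# duplicates, as discussed below for conjunction and disjunction.
def expand(categories, rules, depth):
    current = list(categories)
    for _ in range(depth):
        new = []
        for i, a in enumerate(current):
            for j, b in enumerate(current):
                if j > i:  # skip A-with-A and symmetric duplicates
                    for rule in rules:
                        new.append((rule, a, b))
        current = current + new
    return current

rules = ["conj", "disj"]
primitives = ["female", "parent"]
depth1 = expand(primitives, rules, 1)
```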

A small implementation detail that helps in understanding the code of Kemp and Regier is that when they apply the two binary rules for conjunction and disjunction (A(x,y) ∧ B(x,y) and A(x,y) ∨ B(x,y)) they leave out the application of a primitive to itself (A(x,y) ∧ A(x,y) and A(x,y) ∨ A(x,y)). In other words, they ensure that the primitives that have been used for the first argument (A(x, y)) of the rule are not used for the second argument (B(x, y)). This is done by comparing the current indices of the primitives. If the index of the second argument is higher than that of the first argument, the primitives are not the same and have not been combined before. If they had been combined before, the result would be a category that already exists, because conjunction and disjunction are commutative (i.e. A ∧ B = B ∧ A). A coded example can be found in figure 11. It would be possible to skip this extra measure, generate all the categories, and filter them later. However, by not generating

(17)

for i in binary_categories:
    for j in binary_categories:
        if j > i:
            apply conjunction rule
            apply disjunction rule
            apply transitive rule

Figure 11: Coded example of how Kemp and Regier reduced the generation of duplicate categories.

the duplicate categories in the first place, the time required by the program is decreased significantly. For the first depth this is not significant because there are only six categories (only the unary and the binary primitives), but for the third depth there are around 7700 categories. Instead of generating 7700 ∗ (3 ∗ 7700) ≈ 17.8 ∗ 10⁷ categories, it becomes 7700 + 7700 + 2 ∗ (7699 + 7698 + ... + 2 + 1) ≈ 59.3 ∗ 10⁶. This becomes especially valuable for removing all the duplicate categories.
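These operation counts can be checked with a few lines of arithmetic (illustrative only, with n = 7700 as stated above):

```python
# Verifying the operation counts for n = 7700 categories at depth three:
# the naive scheme applies three rules to every ordered pair, while the
# j > i scheme keeps all ordered pairs only for the non-commutative
# transitive rule and halves the pairs for conjunction and disjunction.
n = 7700
naive = n * (3 * n)                      # three rules, all ordered pairs
reduced = n + n + 2 * sum(range(1, n))   # 2*n transitive-ish terms + 2 * C(n,2)
```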

4.3.2 Removing the duplicate categories

After the execution of each depth, Kemp and Regier filtered the set of categories for duplicates (i.e. categories with the exact same extension). This meant comparing all the categories, which was computationally expensive; keeping the set small was therefore of great importance. Whenever a category had to be removed, the costs were compared. The cost of a category is calculated from how many primitives were added to form it. For example, the category generated by parent ∧ parent has a cost of one and will be removed when compared to the category parent (which has cost zero), since their extensions are identical but the first one has a higher cost. In case of equal costs the first category was picked.
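A minimal sketch of this filtering step (simplified: extensions are shown as sets of tree members instead of relational matrices, and the names are hypothetical):

```python
# Sketch of duplicate removal: categories with identical extensions are
# collapsed, keeping the one with the lowest cost; on a tie the earlier
# category wins (strict '<' comparison keeps the first seen).
def deduplicate(categories):
    """categories: list of (name, extension, cost) triples."""
    best = {}
    for name, extension, cost in categories:
        key = frozenset(extension)
        if key not in best or cost < best[key][2]:
            best[key] = (name, extension, cost)
    return list(best.values())

cats = [("parent", {1, 2}, 0),
        ("parent_and_parent", {1, 2}, 1),  # same extension, higher cost
        ("mother", {1}, 1)]
kept = deduplicate(cats)
```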

In depth three the categories are filtered by their ego-relative extension, because only this extension is necessary for the next step, which is applying the categories to the partitions. The reason this is not already done after depth one or two is that this extension is a small part of the matrix. This means that more duplicates of a matrix would be found that might not actually be the same if we looked at the full matrix. This will be explained more in depth in section 4.3.4.

4.3.3 Naming

The matrices that are generated using the rules and primitives automatically get a name assigned. Kemp and Regier manually created around seventy matrices that described different relations, but these do not cover all relations by far. They did this so that these matrices could be brought to the beginning of the


list and matched first when applying the categories to the partitions, which was more efficient, since these categories were used often, meaning that the optimal solution was found earlier. The names of the matrices that were not manually coded were saved in a small 2 × 2 × n matrix, where n corresponded to the total number of categories. Every category thus has a place on the z-axis; this way the z-axis keeps track of the relation between the categories and the names ((x, y, 2) in the names matrix gave the name that belonged to (x, y, 2) in the categories matrix). The 2 × 2 space was filled with the actual name: the left side indicated whether the primitives that were combined were unary or binary, and the right side indicated which primitive it was. The primitives were encoded as numbers:

• Unary
  1. Female
  2. Male

• Binary
  1. Parent
  2. Children
  3. Older
  4. Younger
  5. Same Sex
  6. Different Sex

After combining, the resulting matrix would look like figure 12. This does not include how the primitives were combined (i.e. which rule was applied).

1 2
2 1

Figure 12: An example of how names were stored by Kemp and Regier. Assuming the conjunction rule was applied, this would have been the name of the category that represented all mothers. The rule that was used can be reconstructed by applying the rules to the given primitives and comparing the result to the relational matrix that corresponds to this name.

In naming our own matrices we used the same system as Kemp and Regier to encode the primitives, but stored them differently. The names were put together with their rule in a list, which was wrapped in a new list every time it was combined with something else. If the primitives female ∧ parent are initially combined into a category, the name would be [11, 21, conj]. If these are then combined with another category, say older ∧ differentsex ([23, 26]), the result would be [[11, 21, conj], [23, 26, conj], conj].
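As a small sketch (the helper name is hypothetical; the numeric codes follow the encoding listed above, with the first digit marking unary/binary and the second marking which primitive):

```python
# Sketch of the nested-list naming scheme used in this thesis: each
# combination wraps the two operand names and the applied rule in a new
# list, so the full construction history of a category can be read back.
FEMALE, PARENT, OLDER, DIFF_SEX = 11, 21, 23, 26

def combine(name_a, name_b, rule):
    return [name_a, name_b, rule]

mother = combine(FEMALE, PARENT, "conj")        # female ∧ parent
older_diff = combine(OLDER, DIFF_SEX, "conj")   # older ∧ differentsex
nested = combine(mother, older_diff, "conj")
```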


               column 113    column 114
rows 1–56        a(i,113)      b(i,114)
rows 57–112      c(i,113)      d(i,114)
rows 113–114     e(i,113)      f(i,114)

Figure 13: Obtaining the ego-relative extension. These are the two rightmost columns of a relational matrix. They are stored in a 1-dimensional array in the order [a, d, b, c, e, f].

4.3.4 Applying categories

This is the process of using the categories to get the complexity of a language. The goal of this process is to get the smallest set of rules needed to define a language as is done in figure 10.

Step one is to match the generated categories with the members of the partition of a language. This can be done by comparing the ego-relative extension of a category. From now on a category will be described as a relational matrix, because this gives a better idea of what it is in a technical context.

The ego-relative extension is acquired by taking the last two columns of the relational matrix. In these columns the y in Relation(x, y) will always be either 113 or 114, and the x therefore describes how the person is related to the ego of the tree. Knowing this, you can find the relation of each member of the tree towards the ego in these columns: Parent(x, 113) will give you the parents of 113 and Children(x, 113) will give you the children of 113. If you wanted to know what 113 is a child of, you would have to invert the parents matrix; however, this is irrelevant for the ego-relative extension of a relational matrix, because the information required is the relation towards the ego, not vice versa. For the filtering with ego-relative extensions, Kemp and Regier also take the relations from the male family tree members towards Alice and vice versa. These relations are irrelevant when comparing ego-relative extensions, but it is relevant to know that this happens. This method of filtering allows multiple ego-relative extensions to be applicable. How this is stored exactly can be found in figure 13. The combination of vectors a and d is already the ego-relative extension: it contains all relations from the female tree towards Alice (the a-column in figure 13) and all relations from the male tree towards Bob (the d-column in figure 13). However, the columns b, c, e, and f are also incorporated by Kemp and Regier. In generating the categories, the duplicate categories were removed by comparing the extended ego-relative extension ([a, d, b, c, e, f]). In assigning the categories we only need [a, d], making it possible to assign multiple categories to a term.
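A sketch of extracting the [a, d] part of the ego-relative extension from a relational matrix (the 114 × 114 boolean layout and the 0-based indices here are a simplification of the real data structures, not the thesis code):

```python
# Sketch of obtaining an ego-relative extension. Members 113 and 114 are
# the two egos (Alice in the female tree, Bob in the male tree), which in
# 0-based indexing are columns 112 and 113. Vector a holds the female
# tree's relations to Alice, vector d the male tree's relations to Bob.
def ego_relative_extension(matrix):
    alice, bob = 112, 113                       # columns 113 and 114
    a = [row[alice] for row in matrix[:56]]     # female tree -> Alice
    d = [row[bob] for row in matrix[56:112]]    # male tree -> Bob
    return a + d

# Tiny fake matrix: only member 1 stands in the relation to Alice.
matrix = [[False] * 114 for _ in range(114)]
matrix[0][112] = True
ext = ego_relative_extension(matrix)
```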


For assigning the categories to the terms of a partition Kemp and Regier use a specific algorithm. First they create an active set of all terms in a partition that need to be assigned a category. Then they assign a randomly chosen fitting category and append any terms that this category needs a definition for (such as "daughter") to the active set. A term that has a category assigned is removed from the active set and stored in the set of definitions. This process continues until the active set is empty. The number of rules can then be counted, and this becomes the current best complexity for the partition. Then they backtrack through the set and use different categories to define the terms, until all possible options are used or a manually chosen threshold (the maximum number of systems of rules) is reached. The threshold used by Kemp and Regier and in our replication was 10,000. A higher threshold became computationally too expensive, but it can easily be adjusted if needed. They also ran analyses on, for example, only grandparents, where they set it to 100,000,000. If the threshold is not reached, it is certain that the minimum complexity has been found. The backtracking helps in the following case: a partition that uses the category "sister" requires the category "daughter" (see the example in figure 2). This means that the partition needs the extra rule for daughter (to create sister). Now, if the partition already needed the category "daughter", it does not require an extra rule, as it is already in the partition. If it does not need "daughter", however, the partition gets this rule anyway, because "sister" requires it, and the partition becomes more complex. This might not have been necessary, because it might be possible to create "sister" from other categories in the partition as well. These other options should be tried as well to ensure we get the minimum complexity of a partition.
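A minimal sketch of this backtracking search (the candidate sets, cost model, and function name are hypothetical; the real code works on categories and partitions, not strings):

```python
# Sketch of the assignment search: every term must be covered by one of
# its candidate definitions, a definition may in turn require other
# terms, and the search backtracks to find the smallest total number of
# rules. Already-defined terms are reused at no extra cost.
def min_rules(active, candidates, defined=frozenset(), used=0, best=None):
    """candidates[term]: list of (cost, required_terms) options."""
    if not active:
        return used if best is None else min(best, used)
    term = next(iter(active))
    rest = active - {term}
    for cost, requires in candidates[term]:
        new_terms = {t for t in requires if t not in defined}
        total = used + cost
        if best is None or total < best:       # prune hopeless branches
            best = min_rules(rest | new_terms, candidates,
                             defined | {term}, total, best)
    return best

candidates = {
    "sister": [(1, ["daughter"])],  # sister is defined via daughter
    "daughter": [(1, [])],
}
only_sister = min_rules({"sister"}, candidates)
both_terms = min_rules({"sister", "daughter"}, candidates)
```

Note that requiring both terms costs no more than requiring only "sister", which illustrates the rule reuse described above.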

The approach used for this thesis differs in one aspect. Usually, when a new term comes from an intension, the term is appended to the active set and defined later. However, due to the different naming process we use, we can immediately see how a term is defined. Because of this, we do not append new terms to the active set but add them to the set of defined terms immediately, because defining them later would require more operations. This speeds up the process.

4.4 Tests

To check whether the results of the informativeness were correct, they were plotted as can be seen in figure 14. It is a plot of the informativeness computed (cf. section 4.2) for each of the 500 partitions in the files from Kemp and Regier, set out against the word count of the corresponding partition. This plot should show a higher informativeness when the word count is higher, because more different words are available to describe members of the family tree. Following the same logic, a low word count should result in a low informativeness.

The results from the self-generated categories were compared per depth with those from Kemp and Regier. Our categories were generated in exactly the same order as theirs, which enabled a straightforward comparison of all the matrices. This way any abnormalities or differences were addressed efficiently. Both sets of categories for depth one and two were exactly the same. Depth


three was not tested, because generating Kemp and Regier's depth-three categories would have taken between 48 and 72 CPU hours. It was, however, important to check this depth because of the extra layer of ego-relative filtering. To check whether our filtering process was equal to theirs, we applied their ego-relative filtering to their depth one and two and compared that to our ego-relative-filtered depth one and two. These sets were identical as well.

4.5 Using the code

For the execution of the code of this thesis, four files have to be called. Here is a general overview of what each of the files does. A more in-depth explanation is given in section 4 and in the extensive comments in the files themselves.

• get informativeness.py: A class that calculates the informativeness of the "rwpartitions.txt" file. It uses the need probabilities and the formula for informativeness provided by Kemp and Regier and stores the result in "communicative cost.txt". In the initializer of the class, all the paths to the files to be used can be specified.

• generate concepts better names.py: Generates the concepts (or categories) based on the primitives specified in "primitives improved.py". It takes an argument that indicates whether it should store the data in the preset folder "concepts". It combines the primitives and categories into new categories using the "rules.py" file.

• compute complexity scores.py: Applies the categories generated in the file above to the partitions given by Kemp and Regier. It stores the result automatically in the categories folder.

• plot tradeoff.py: Plots the data generated by the previous programs.

After calling these files there should be the files "communicative cost.txt" and "complexity score.txt". These contain the data required for the final plot that shows the trade-off between informativeness and complexity. For this thesis we created the small file "plot tradeoff.py", which only plots the two data files in one plot.

There are also the files "pseudo2" and "get matlab data.txt". The pseudo file contains a small explanation in pseudocode of how the backtracking for matching the categories with partitions works. The matlab file contains code that can be used to extract the generated categories from the matlab code of Kemp and Regier. It also explains how to use it and where it was placed.

The artificial languages in figure 15 have not been processed because of computational restrictions, but they can be generated using Kemp and Regier's code. To do this, "dancinlinks.c" needs to be recompiled first with the local C compiler, and the pathway needs to be specified within "enumeratepartitions dlink.m". Then they can be generated by following the steps given by Kemp and Regier. The languages should have the same format as the partitions from the Murdock data and can therefore be processed the same way as natural languages. Note that


the artificial languages probably have one element less (the frequency as the first element) and that this is taken into account.

5 Results

Figure 14: The test results of the communicative cost plotted against the word count of the partitions. The graph drops as the number of words in a partition increases.


Figure 15: Kemp and Regier's informativeness against complexity for natural and artificial languages. The black circles represent the natural languages and the gray dots the artificial languages. Note that the natural languages all lie in the lower left corner, where the informativeness and complexity are at their lowest.

Figure 16: Informativeness plotted against the complexity for natural languages from this thesis.


6 Conclusion

In the comparison between figures 15 and 16, both the communicative cost and the complexity have higher values than those of Kemp and Regier, suggesting that we have done something different or made an error somewhere. The difference in communicative cost can be explained by the extra procedure Kemp and Regier applied to adjust for a stable population; due to the time limit this procedure has not been replicated. The difference in complexity is more difficult to explain. One possible explanation is a lack of computational power. This forced us to use a smaller set for the depth-three generation of categories and to again take a subset of what it produced. Better solutions with a lower complexity are therefore overlooked in most cases, which leads to a higher complexity. Another explanation is that the method used to calculate the complexity is flawed. Since the method used is different from that of Kemp and Regier, it is impossible to test parts of the code; only the result can be compared.

7 Discussion

7.1 Achievements

The second part of the thesis, testing a different set of primitives, was, as mentioned in the introduction, not accomplished due to lack of time. The replication of the paper, calculating the informativeness and complexity of languages from the Murdock dataset, was more successful, in the sense that the categories were generated correctly. The communicative cost and complexity measurements need more work, but they have been tested on smaller sets and proven successful there. The debugging process for larger sets is therefore, hopefully, within reach. After that, testing new primitives is only a matter of running the code and should require very little knowledge of the inner mechanics. The code for this will be published on GitHub, which can help other researchers test different primitives themselves in the future.

7.2 Future research

A few questions were, however, raised during the thesis. First of all, in the "At.10.cod" file there is a number for missing data in each category. It did not have any effect on the results as far as we know, but since it is unclear where it comes from, we also do not know whether it could have been useful. It was ignored due to the time limit, but it is something that should be looked into in future research.

Second, a part of the data that was ignored by Kemp and Regier is the geographical information that is available for each language. Since the goal of the thesis is oriented towards a universal language, it was not included. It might, however, be interesting to explore.

Finally, Kemp and Regier used artificial languages as a comparison to natural languages. The complexity of these artificial languages can be calculated with


the same procedure as used for natural languages in this research.

7.3 Limitations

One issue with the naming of the matrices was that the rules were not included. As mentioned in section 4.3.3, the rules that combine the primitives are not saved. The effect of this is that, in counting the rules for calculating the complexity, we only count how many combinations of primitives have been applied, regardless of the rules. Primitives 1 and 2 can be combined using a conjunction or a disjunction, but that does not matter for the complexity, resulting in a complexity that is lower than it should be for certain languages. We chose to use the rules in the first attempt at calculating the complexity. There are two files with the suffix "better names.py" that are a quick implementation that takes the rules that are used into account. This is, however, a feature added on top of the original implementation, so a complete overhaul of this mechanic would probably be more efficient. This feature has also not been tested extensively due to the time limit.

The last issue is that the assigning of categories is not optimal. This issue was already described by Kemp and Regier and was initially inherited by us, since we are replicating their code. Some categories that had a high complexity were pruned in the generating process but might decrease the complexity for some languages. If "sister" was in the partition but "daughter" was not, that would mean it needs an extra rule. However, if "sister" could be created by a combination of primitives and intensions that is bigger than "daughter" but uses only categories in the partition, the resulting complexity would have been lower. Kemp and Regier describe this in section 7.1 of their appendix as well.

One of the reasons we ran out of time to complete the second goal is that we underestimated how long it would take to understand the code of Kemp and Regier. There was documentation on how to run the code but very little on how the code worked. Therefore, replicating their research meant reading through all of their code. The aspects that slowed down this process the most were the lack of comments, the minimalistic variable names, and the various data structures that had to be deciphered.

Bibliography

[Mur70] George Peter Murdock. "Kin Term Patterns and Their Distribution". In: Ethnology 9.2 (1970), pp. 165–208. University of Pittsburgh – Of the Commonwealth System of Higher Education.

[Fod75] Jerry A. Fodor. The Language of Thought. Harvard University Press, 1975.

[Dia02] Diana H. Hook and Jeremy M. Norman. Origins of Cyberspace: A Library on the History of Computing, Networking and Telecommunications. Novato, 2002.

[Haw04] John A. Hawkins. Efficiency and Complexity in Grammars. Oxford University Press, 2004.

[Cha12] Charles Kemp and Terry Regier. "Kinship categories across languages reflect general communicative principles". In: Science 336 (2012), pp. 1049–1054. doi: 10.1126/science.1218811.

[Ber14] José Luis Bermúdez. Cognitive Science: An Introduction to the Science of the Mind. Second Edition. Cambridge University Press, New York, 2014.

[Ter15] Terry Regier, Charles Kemp, and Paul Kay. "Word Meanings across Languages Support Efficient Communication". In: The Handbook of Language Emergence 85 (2015).

[Zip16] George Kingsley Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Ravenio Books, 2016.

[Cha18] Charles Kemp, Yang Xu, and Terry Regier. "Semantic typology and efficient communication". In: Annual Review of Linguistics 4 (2018), pp. 109–128.

[Res19] Michael Rescorla. "The Language of Thought Hypothesis". In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Summer 2019. Metaphysics Research Lab, Stanford University, 2019.

[Mil20] Milica Denić, Shane Steinert-Threlkeld, and Jakub Szymanik. "Complexity/informativeness trade-off in the domain of indefinite pronouns". In: (2020).

[Ste20] Shane Steinert-Threlkeld. "Quantifiers in natural language optimize the simplicity/informativeness trade-off". In: Proceedings of the 22nd Amsterdam Colloquium (2020), pp. 413–522.

[Yan20] Yang Xu, Emmy Liu, and Terry Regier. "Numeral Systems Across Languages Support Efficient Communication: From Approximate Numerosity to Recursion". In: Open Mind 4 (2020), pp. 57–70. doi: 10.1162/opmi_a_00034.

[Dav] M. Davies. The Corpus of Contemporary American English (COCA): 400+ million words. url: http://www.americancorpus.org.

[KR] C. Kemp and T. Regier. Kinship categories across languages. url: http://www.charleskemp.com/kinship/.

[M K] M. Kupietz and H. Keibel. The Mannheim German reference corpus (DeReKo) as a basis for empirical linguistic research. url: http://www.ids-mannheim.de/cosmas2/.
