
Bachelor Informatica

Machine learning for Semantic Universals

Zi-Long Zhu

June 17, 2019

Supervisor(s): Shane Steinert-Threlkeld, Jakub Szymanik

Informatica
Universiteit van Amsterdam


Abstract

In natural languages there are many linguistic properties that are universal. Semantic universals have been formulated for generalized quantifiers and are usually a binary property, i.e. a quantifier either has a universal property or it does not. Furthermore, it has been shown that quantifiers with universal properties are easier to learn than quantifiers without these properties. However, in a recent study on the connectedness of quantifiers a ternary classification of the semantic universal monotonicity has been defined, where each class has a different learnability. This thesis provides a more fine-grained classification of connectedness in order to define a measure that can ascribe even more levels of monotonicity to a quantifier. Then, using machine learning, it is shown computationally that the different levels of monotonicity have different learnability.


Contents

1 Introduction
  1.1 Problem statement
  1.2 Thesis outline
2 Measure of Universality
  2.1 Monotonicity
  2.2 Connectedness
    2.2.1 Cardinality of the three classifications
    2.2.2 Finer-grained connectedness
  2.3 A measure for Monotonicity
3 Learning Model
  3.1 Recurrent Neural Network
    3.1.1 LSTM
  3.2 Data structure
  3.3 Data generation
    3.3.1 Curriculum for the 'even' quantifier
  3.4 Implementation
4 Experiments
  4.1 Learnability of quantifiers under monotonicity
    4.1.1 Results
  4.2 Curriculum learning of the 'even' quantifier
    4.2.1 Results
5 Discussion
6 Conclusions
A Proofs


CHAPTER 1

Introduction

The natural languages of mankind look and sound different, but when closely examined, large amounts of similarities are found in all linguistic substructures between the different languages. Two intuitive examples are that all languages have vowels and consonants¹ and that all languages exhibit lexical similarity, i.e. they have words that represent the same concepts.² Because (almost) all existing languages share some properties, there is a strong indication that the space of possible languages is constrained by universal linguistic properties, i.e. universals. By identifying these constraints it becomes conceivable to observe to what extent human languages cover the logically possible languages that humans could speak.

So, to better analyse the boundaries of human language, it is necessary to be able to differentiate between universals and non-universals, i.e. non-universal linguistic properties. In recent work (Steinert-Threlkeld and Szymanik In press) it has been hypothesised that learnability is a distinguishing property of universals concerning quantifiers. This was shown by training a Recurrent Neural Network (RNN) to learn quantifiers with and without universal properties. Three different universals were investigated: monotonicity, quantity and conservativity. The RNN was able to learn the quantifiers with the first two universal properties significantly faster than the quantifiers without them. In this thesis monotonicity will be investigated further; it will be explained in more detail in the next chapter.

In another recent study (Chemla, Buccola, and Dautriche 2018) three distinctions have been made between quantifiers: monotone, connected and not-connected. The authors have shown, in experiments with human participants, that monotone quantifiers are the easiest to learn, then the connected quantifiers, and the non-connected ones are the hardest. An example of a monotone quantifier is 'at least five', of a connected quantifier '2 to 4', and of a non-connected one 'exactly 1, 3, or 6'. In the context of the first-mentioned work, the non-universal class was split into two different categories, connected and not-connected, which were shown to have different learnability rates. Naturally, the question arises whether this classification into three types of quantifiers can be separated further into different grades of learnability.

1.1 Problem statement

The main goal of this thesis is to investigate the idea of a finer-grained classification for monotonicity via connectedness. The first step is to examine the ternary classification of Chemla, Buccola, and Dautriche 2018 to find a constructive way to dissect the three classes into more. A mathematical and a psychological argument will be given on how to achieve this. After the new classification is established, it is used to measure how monotone a quantifier is. The second step is then to determine the learnability of quantifiers under this measure of monotonicity. This will be done computationally, as in the work of Steinert-Threlkeld and Szymanik In press.

¹ See Hyman 2008.

² See Holman et al. 2011; there is no pair of languages with a lexical similarity value of 0, so every pair of languages shares words that have the same meaning.


The same RNN architecture will be used, where the learning model will learn a multitude of quantifiers. The expectation is that more monotone quantifiers are learned at significantly faster rates.

Furthermore, a different data generation algorithm with a bias will be proposed that impacts training in a positive way. In particular, the training is expected to be faster, without affecting the eventual result. This new algorithm will be extensively compared to the one used by Steinert-Threlkeld and Szymanik In press to see if it has its intended effect.

1.2 Thesis outline

The thesis is structured as follows. In Chapter 2 an introduction to quantifiers, universals and connectedness will be given. Subsequently, a justification of the newly proposed classification of connectedness is given and the classification itself is introduced. Then, using this classification, a measure for monotonicity will be defined. In Chapter 3 the implemented RNN and its specific architecture for the experiments are explained, and lastly the data and the data generation algorithms for the learning model are discussed. Next, in Chapter 4 the experiments and their results will be shown. Chapter 5 contains some comments on the significance of connectedness. Lastly, Chapter 6 contains the conclusion of the thesis, where the results are summarised.


CHAPTER 2

Measure of Universality

As mentioned in the introduction, the focus of the thesis will be universals concerning quantifiers. A quantifier is a type of determiner, such as 'every', 'some', 'most', 'five'. It is a device that takes a noun to make a Noun Phrase (NP), where it is used to indicate a quantity of members in the domain of the noun that adhere to some property. For example, consider the sentence: some apples are green. This sentence indicates that some objects that belong to the domain of 'apple' have the property of having the colour green. A common distinction made between determiners is that between simple and complex determiners, where the previously named examples would be considered simple, and determiners such as 'at least five', 'odd', 'two to seven', 'at most three' would be considered complex. The goal is to have a more fine-grained classification for certain types of quantifiers, which will be given in Section 2.3.

To describe determiners as set-theoretic objects, it is assumed that they are monadic type ⟨1, 1⟩ generalized quantifiers (Barwise and Cooper 1981). This means that for each universe M the quantifier describes a relation between two subsets A, B ⊆ M. Figure 2.1 gives a visual representation of the sets M, A and B. The subsets A ∩ B, A \ B, B \ A and M \ (A ∪ B) are also labelled in this figure. These subsets are of importance because they characterise the relation between M, A and B.

Figure 2.1: Venn diagram of a model ⟨M, A, B⟩, showing the regions A ∩ B, A \ B, B \ A and M \ (A ∪ B).

Formally, a quantifier Q will be defined as the set of models that are true under some relation that characterises Q, where a model is denoted by ⟨M, A, B⟩. Furthermore, A will be called the 'scope' of the model and B the 'restrictor'. For example:

Q_some = {⟨M, A, B⟩ : A ∩ B ≠ ∅}

Q_most = {⟨M, A, B⟩ : |A ∩ B| > |A \ B|}

Q_at most 4 = {⟨M, A, B⟩ : |A ∩ B| ≤ 4}
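To make the set-theoretic definitions concrete, the following minimal sketch (an illustration for this text, not code from the thesis) summarises a model by the sizes of the four regions of Figure 2.1 and evaluates the three example quantifiers:

    # Minimal sketch: a model is summarised by the sizes of the four zones of
    # Figure 2.1; each quantifier is a predicate on those counts.
    from dataclasses import dataclass

    @dataclass
    class Model:
        ab: int        # |A ∩ B|
        a_only: int    # |A \ B|
        b_only: int    # |B \ A|
        rest: int      # |M \ (A ∪ B)|

    def q_some(m: Model) -> bool:
        return m.ab != 0                 # A ∩ B is not empty

    def q_most(m: Model) -> bool:
        return m.ab > m.a_only           # |A ∩ B| > |A \ B|

    def q_at_most_4(m: Model) -> bool:
        return m.ab <= 4                 # |A ∩ B| <= 4

    m = Model(ab=3, a_only=1, b_only=2, rest=4)
    print(q_some(m), q_most(m), q_at_most_4(m))   # True True True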


2.1 Monotonicity

In mathematics, a function between ordered sets is monotonic if it preserves or reverses the given order. In the context of numbers, a function f preserves order if for two numbers x and y with x ≤ y it holds that f(x) ≤ f(y). This is called an upward monotone function, since f(x) never decreases when x increases in value. To extend this idea of preserving order to quantifiers, the following order between models is introduced in Definition 2.1:

Definition 2.1. ⟨M, A, B⟩ ≤ ⟨M, A, B′⟩ if and only if B ⊆ B′.

Now let χ_Q be the characteristic function of quantifier Q, which returns 1 if ⟨M, A, B⟩ ∈ Q and 0 otherwise. Then χ_Q can only be upward monotone if for subsets B, B′ with ⟨M, A, B⟩ ≤ ⟨M, A, B′⟩ it holds that χ_Q(⟨M, A, B⟩) ≤ χ_Q(⟨M, A, B′⟩). This implies that if a model is true under an upward monotone quantifier, then after any generalisation of the restrictor of the model it should still be true under the quantifier. An example in English would be:

(a) Many students can program in Python.

(b) Many students can program.

It is obvious that the restrictor 'can program' is more general than the restrictor 'can program in Python'. And so if sentence (a) is true, then sentence (b) must be true as well. Furthermore, notice that the choice of scope ('students') has no effect on this upward monotone relation: swap 'students' with, for example, 'software engineers', and the upward monotone relation still holds. This brings the following definition for an upward monotone quantifier³:

Definition 2.2. A quantifier Q is upward monotone if and only if whenever ⟨M, A, B⟩ ∈ Q and B ⊆ B′, then ⟨M, A, B′⟩ ∈ Q.

Now, consider the situation where the characteristic function χ_Q reverses the ordering, which means that for subsets B, B′ with ⟨M, A, B⟩ ≤ ⟨M, A, B′⟩ it holds that χ_Q(⟨M, A, B⟩) ≥ χ_Q(⟨M, A, B′⟩). Here the relation is inverted: a model that is true under the quantifier is still true when the restrictor is specified. In other words, χ_Q(⟨M, A, B⟩) never increases as the restrictor is generalised, which gives rise to the name downward monotone. An example in English would be the determiner 'few':

(i) Few students can program in Python.

(ii) Few students can program.

Observe that if sentence (ii) is true then sentence (i) must be true as well, which was the other way around in the first example of (a) and (b). The definition of a downward monotone quantifier reflects this inverted nature, since the only difference with Definition 2.2 is that the relation between B and B′ is flipped:

Definition 2.3. A quantifier Q is downward monotone if and only if whenever ⟨M, A, B⟩ ∈ Q and B′ ⊆ B, then ⟨M, A, B′⟩ ∈ Q.

And so a quantifier is monotone if:

Definition 2.4. A quantifier Q is monotone if it is either upward or downward monotone.

Using this definition, Barwise and Cooper 1981 have proposed the following universal for quantifiers:

Monotonicity Universal. All simple determiners are monotone.

³ See Steinert-Threlkeld and Szymanik In press, Section 2.1 Monotonicity; the definitions of upward and downward monotonicity come from this work.


This claims that there are no simple determiners that are not monotone. However, if a determiner is monotone, that does not mean it is a simple determiner. The determiner 'at least five', for example, is not simple, but it is monotone.

A sentence with a complex determiner that is not monotone is, for example:

• At least 5 or at most 2 students can program in Python easily.

'At least 5 or at most 2' is not monotone. Suppose that the above sentence is true, specifically because two students can program in Python easily, and let there be three students who 'can program in Python'; this is false under the given quantifier. This shows that there is a generalisation of the restrictor that makes the sentence false, so the quantifier cannot be upward monotone. If the sentence were:

• At least 5 or at most 2 students can program

and there are five students who can program, then there exists a specification of the restrictor, e.g. three students who 'can program in Python', that is again false under the quantifier 'At least 5 or at most 2'. Therefore, it cannot be downward monotone and is thereby not monotone.

2.2 Connectedness

In the previously mentioned study of Chemla, Buccola, and Dautriche 2018 the complexity or learnability of a quantifier is measured by its connectedness, where less connected quantifiers are harder to learn. The definition of connectedness for quantifiers is given as follows:

Definition 2.5. A quantifier Q is connected if and only if for any sets B, B′ and B″, if B ⊂ B′ ⊂ B″ or B″ ⊂ B′ ⊂ B, ⟨M, A, B⟩ ∈ Q and ⟨M, A, B″⟩ ∈ Q, then ⟨M, A, B′⟩ ∈ Q.

Using this definition, three classifications of connectedness are given for a quantifier Q. The first one is where Q is connected and its negation ¬Q is connected as well; this classification is equivalent to Q being monotone (Chemla, Buccola, and Dautriche 2018). The second classification is where Q (exclusive) or ¬Q is connected. The third and last classification is that Q and ¬Q are both disconnected. This will be called the CBD classification of connectedness. Examples for each class would be 'at least 4', '5 to 7' and 'even', respectively.

These classifications can be interpreted as a measure of monotonicity, where the first classification is the most monotone and the third the least monotone. Intuitively, the second class is 'somewhat' monotone. Take for example the quantifier 'at least 8 or at most 4'. It is clearly neither upward nor downward monotone, since there is a gap in the interval where the quantifier is considered true. However, this quantifier is a construction of two quantifiers, 'at least 8' and 'at most 4', and separately they are monotone. Also, its negation '5 to 7' is 'somewhat' monotone, since it is easy to make it monotone by shrinking the domain of discourse M to |M| = 7: the quantifier in this context becomes equivalent to 'at least 5', which is monotone. These arguments, however, are not applicable to the third class, because these quantifiers are constructions of multiple disconnected quantifiers, of which at least one is not monotone, which makes them intuitively the least monotone. Examples are 'exactly five or at least seven' and 'even'.
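Under the simplification that will be made formal in Section 2.2.1 (quantifiers that depend only on |A ∩ B|, over a finite domain), a count-based quantifier is connected exactly when its set of 'true' counts forms a single interval. The sketch below uses this to assign the three CBD classes; the helper names and the default bound of 20 are assumptions made for illustration, not part of the thesis:

    # Sketch, assuming a quantifier depends only on c = |A ∩ B| with 0 <= c <= max_count.
    def is_connected(truth_values):
        """A count-based quantifier is connected iff its true counts form one interval."""
        true_counts = [c for c, t in enumerate(truth_values) if t]
        if not true_counts:
            return True
        return true_counts == list(range(true_counts[0], true_counts[-1] + 1))

    def cbd_class(quantifier, max_count=20):
        truth = [quantifier(c) for c in range(max_count + 1)]
        negation = [not t for t in truth]
        q_conn, neg_conn = is_connected(truth), is_connected(negation)
        if q_conn and neg_conn:
            return 1   # both connected: monotone, e.g. 'at least 4'
        if q_conn or neg_conn:
            return 2   # exactly one connected, e.g. '5 to 7'
        return 3       # both disconnected, e.g. 'even'

    print(cbd_class(lambda c: c >= 4))          # 1
    print(cbd_class(lambda c: 5 <= c <= 7))     # 2
    print(cbd_class(lambda c: c % 2 == 0))      # 3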

Upon closer inspection of the classifications, stronger arguments will be found for measuring monotonicity by connectedness. Additionally, it will become apparent that the third CBD classification encompasses a far larger amount of quantifiers than the first two classifications.

2.2.1 Cardinality of the three classifications

For this inspection two assumptions are made for convenience's sake:

1. Only quantifiers that are defined by a relation with |A ∩ B| are considered.
2. The cardinality of M, A and B can be infinite.

Now let us define the monotonic quantifiers of the first CBD classification as a family of quantifiers Q1 and Q2, where Q2 is the negation of Q1:

Q1(n) = {⟨M, A, B⟩ : |A ∩ B| ≤ n}  (2.1)
Q2(n) = {⟨M, A, B⟩ : |A ∩ B| > n}  (2.2)

Then the set of first class (monotonic) quantifiers is

ℳ = {Q1(n) : n ∈ ℕ} ∪ {Q2(n) : n ∈ ℕ}  (2.3)

where ℕ = {0, 1, 2, ...}. This means that there is an infinite number of quantifiers in the first class; however, it will be shown that there are only countably many. A set is countable and countably infinite under the following two definitions:

Definition 2.6. A set S is countable if there exists an injective function f from S to the natural numbers ℕ = {0, 1, 2, 3, ...}.

Definition 2.7. A set S is countably infinite if a bijective function f : S → ℕ exists, where ℕ = {0, 1, 2, ...} is the set of natural numbers.

The following theorem⁴ about countably infinite sets will be helpful:

Theorem 2.8. The union of two countably infinite sets is countably infinite.

Lemma 2.9. For all n ∈ ℕ, Q1(n) and Q2(n) are connected, and ℳ is countably infinite.

Proof. (Q1 is connected): Take any triple B, B′ and B″ such that B ⊂ B′ ⊂ B″ or B″ ⊂ B′ ⊂ B, ⟨M, A, B⟩ ∈ Q1(n) and ⟨M, A, B″⟩ ∈ Q1(n). So |A ∩ B| ≤ n and |A ∩ B″| ≤ n. This means that |A ∩ B′| ≤ n, since A ∩ B′ ⊆ A ∩ B″ or A ∩ B′ ⊆ A ∩ B, because either B′ ⊂ B″ or B′ ⊂ B, respectively. Therefore ⟨M, A, B′⟩ ∈ Q1(n), and so by Definition 2.5 Q1 is connected.

(Q2 is connected): Take any triple B, B′ and B″ such that B ⊂ B′ ⊂ B″ or B″ ⊂ B′ ⊂ B, ⟨M, A, B⟩ ∈ Q2(n) and ⟨M, A, B″⟩ ∈ Q2(n). So |A ∩ B| > n and |A ∩ B″| > n. This means that |A ∩ B′| > n, since A ∩ B ⊆ A ∩ B′ or A ∩ B″ ⊆ A ∩ B′, because either B ⊂ B′ or B″ ⊂ B′, respectively. Therefore ⟨M, A, B′⟩ ∈ Q2(n), and so by Definition 2.5 Q2 is connected.

(ℳ is countably infinite): The families of quantifiers Q1(n) and Q2(n) both have one parameter n. Since n ∈ ℕ and ℕ is countably infinite, the sets {Q1(n) : n ∈ ℕ} and {Q2(n) : n ∈ ℕ} are countably infinite. It then follows from Theorem 2.8 that ℳ is countably infinite as well.

This means that the first class consists of quantifiers that only accept models in which |A ∩ B| lies in a single connected interval, and that there are countably infinitely many of them.

The second CBD class of quantifiers can be defined as a family of quantifiers Q3 and Q4, where Q4 is the negation of Q3:

Q3(n, m) = {⟨M, A, B⟩ : n ≤ |A ∩ B| ≤ m}  (2.4)
Q4(n, m) = {⟨M, A, B⟩ : |A ∩ B| < n} ∪ {⟨M, A, B⟩ : |A ∩ B| > m}  (2.5)

where n ∈ ℕ, m ∈ {1, 2, ...} and n ≤ m. Then the set of second class quantifiers is:

C2 = ⋃_{k=3}^{4} {Qk(n, m) : n ∈ ℕ, m ∈ {1, 2, ...} and n ≤ m}  (2.6)

To prove that the second class is countably infinite as well, the following theorem⁵ will be used:

Theorem 2.10. The Cartesian product of finitely many countably infinite sets is countably infinite.

Lemma 2.11. For all n ∈ ℕ and m ∈ {1, 2, ...} with n ≤ m, Q3(n, m) is connected, its negation Q4(n, m) is disconnected, and C2 is countably infinite.

Proof. (Q3 is connected): Take any triple B, B′ and B″ such that B ⊂ B′ ⊂ B″ or B″ ⊂ B′ ⊂ B, ⟨M, A, B⟩ ∈ Q3(n, m) and ⟨M, A, B″⟩ ∈ Q3(n, m). So n ≤ |A ∩ B| ≤ m and n ≤ |A ∩ B″| ≤ m. This means that n ≤ |A ∩ B′| ≤ m, since A ∩ B ⊆ A ∩ B′ ⊆ A ∩ B″ or A ∩ B″ ⊆ A ∩ B′ ⊆ A ∩ B, because either B ⊂ B′ ⊂ B″ or B″ ⊂ B′ ⊂ B, respectively. Therefore ⟨M, A, B′⟩ ∈ Q3(n, m), and so by Definition 2.5 Q3 is connected.

(Q4 is disconnected): Take B″ such that ⟨M, A, B″⟩ ∈ {⟨M, A, B⟩ : |A ∩ B| > m}, so that ⟨M, A, B″⟩ ∈ Q4(n, m). For B″ there exists a subset B′ such that n ≤ |A ∩ B′| ≤ m. For this B′ there exists a subset B such that |A ∩ B| < n, and so ⟨M, A, B⟩ ∈ Q4(n, m). So there exists a B′ with B ⊂ B′ ⊂ B″ but ⟨M, A, B′⟩ ∉ Q4(n, m). Therefore, Q4 is disconnected under Definition 2.5.

(C2 is countably infinite): The families of quantifiers Q3(n, m) and Q4(n, m) both have two parameters n and m, where the possible values are given by the set S = {(n, m) : n ≤ m} \ {(0, 0)}. Now S ⊆ ℕ², and from Theorem 2.10 it follows that ℕ² has a bijection to ℕ. Because S is a subset of ℕ², there is an injection from S to ℕ, so S is countable, and since S is evidently infinite, S is countably infinite. Because |S| = |{Q3(n, m) : (n, m) ∈ S}| = |{Q4(n, m) : (n, m) ∈ S}|, it follows with Theorem 2.8 that C2 must be countably infinite.

⁴ See Fletcher and Patty 1996, Foundations of Higher Mathematics, Proposition 7.12.

As a result, the first and second class are both countably infinite. The difference is that the second class consists of quantifiers that only accept models in which |A ∩ B| lies in a single connected interval, but whose negations have two connected intervals of |A ∩ B| values. This difference between the classes emerges from the fact that the second class has more parameters for defining the 'true' interval for |A ∩ B|. So, the two classes are also defined by their number of parameters. Another way to view the parameters is that they determine when a 'true' interval begins or ends. Because of this, the number of switches from 'true' to 'false' and vice versa corresponds to the number of parameters.

Then lastly, the third CBD class of quantifiers can be defined as a family of quantifiers Q5, Q6, Q7 and Q8:

Q5(n1, m1, ..., nN, mN) = ⋃_{i=1}^{N} {⟨M, A, B⟩ : ni ≤ |A ∩ B| ≤ mi}  (2.7)
Q6(n1, m1, ..., nN, mN) = ¬Q5  (2.8)
Q7(n0, n1, m1, ..., nN−1, mN−1) = {⟨M, A, B⟩ : |A ∩ B| ≤ n0} ∪ Q5(n1, m1, ..., nN−1, mN−1)  (2.9)
Q8(n0, n1, m1, ..., nN−1, mN−1) = ¬Q7  (2.10)

where N ∈ {2, 3, ...}, ni, mi ∈ ℕ, ni ≤ mi and nj + 1, mj + 1 < ni if j < i. Then the set of third class quantifiers is

C3 = ⋃_{k=5}^{8} {Qk(... | Nk) : Nk ∈ {2, 3, ...} s.t. ni ≤ mi and nj + 1, mj + 1 < ni if j < i}  (2.11)

Notice that Q5, Q6, Q7 and Q8 can have an infinite number of parameters: as N grows, so does the number of parameters. The notation Qk(... | N) represents the fact that the number of parameters depends on N. For the Q5 and Q7 types of quantifiers the number of parameters (NOP) is:

NOP(Q5) = 2N  (2.12)
NOP(Q7) = 2(N − 1) + 1  (2.13)

This indicates that the cardinality of C3 should be larger than that of ℳ and C2, because N is allowed to be (countably) infinite, which makes an infinite NOP possible.

Lemma 2.12. For all N ∈ {2, 3, ...}, ni, mi ∈ ℕ with ni ≤ mi and nj + 1, mj + 1 < ni if j < i, the quantifiers Q5, Q6, Q7 and Q8 are disconnected and C3 is uncountably infinite.

Proof. (Q5 is disconnected): Any Q5 quantifier must have at least two subsets of the form {⟨M, A, B⟩ : ni ≤ |A ∩ B| ≤ mi}, since N > 1. Take the two subsets with the two lowest indices, i = 1 and i = 2; note that the ni ≤ mi and mj + 1 < ni if j < i restrictions force an increasing order. Then this 'partial' quantifier is

Qpartial = {⟨M, A, B⟩ : n1 ≤ |A ∩ B| ≤ m1} ∪ {⟨M, A, B⟩ : n2 ≤ |A ∩ B| ≤ m2}

Now take B″ such that ⟨M, A, B″⟩ ∈ {⟨M, A, B⟩ : n2 ≤ |A ∩ B| ≤ m2}. Next take a subset B′ ⊂ B″ such that |A ∩ B′| = n2 − 1. This means that ⟨M, A, B′⟩ ∉ Qpartial, since m1 < n2 − 1 and n2 − 1 < n2. Lastly, take B ⊂ B′ such that n1 ≤ |A ∩ B| ≤ m1, so ⟨M, A, B⟩ ∈ Qpartial. Therefore, Qpartial is not connected under Definition 2.5. Consequently, Q5 is not connected either.

(Q6 is disconnected): For Q6 we again take Qpartial, but now its negation

¬Qpartial = {⟨M, A, B⟩ : |A ∩ B| < n1} ∪ {⟨M, A, B⟩ : m1 < |A ∩ B| < n2} ∪ {⟨M, A, B⟩ : |A ∩ B| > m2}

Now take B″ such that ⟨M, A, B″⟩ ∈ {⟨M, A, B⟩ : m1 < |A ∩ B| < n2}. Next take a subset B′ ⊂ B″ such that |A ∩ B′| = m1. This means that ⟨M, A, B′⟩ ∉ ¬Qpartial, since n1 ≤ m1. Lastly, take B ⊂ B′ such that |A ∩ B| < n1, so ⟨M, A, B⟩ ∈ ¬Qpartial. Therefore, ¬Qpartial is not connected under Definition 2.5. Consequently, Q6 = ¬Q5 is not connected either.

(Q7 and Q8 are disconnected): The proof for Q7 works the same as for Q6, and for Q8 the same as for Q5. The full proofs can be found in Appendix A.

(C3 is uncountably infinite): To prove this, Cantor's diagonal argument⁶ will be used. Consider the set T of all possible assignments of all possible numbers of parameters:

T = ⋃_{N=2}^{∞} {(n1, m1, ..., nN, mN) : ni, mi ∈ ℕ, ni ≤ mi and mj + 1 < ni if j < i}
  ∪ ⋃_{N=2}^{∞} {(n0, n1, m1, ..., nN−1, mN−1) : ni, mi ∈ ℕ, ni ≤ mi and mj + 1 < ni if j < i}

so that |C3| = |T|. Now assume that T is countably infinite. This means that T can be enumerated as follows:

0 ↔ t1 = (0, 2, 2, 4, 4)
1 ↔ t2 = (0, 1, 3, 7, 10, 20, 15, ...)
2 ↔ t3 = (1, 1, 3, 3, 5, 7)
3 ↔ t4 = (0, 4, 8, 12, 15, 17, 20, 23, ...)
4 ↔ t5 = (2, 5, 9, 11, 24, 30, 32, 40, ...)
5 ↔ t6 = (1, 2, 4, 5)
6 ↔ t7 = (1, 5, 7, 9, 24, 31, 35, 41)
...

where t1, t2, ..., tn, ... each correspond to a single unique element of ℕ. However, using the diagonal 0, 1, 3, 12, 24, (no value), 35, ... (the i-th entry of ti) it is easy to construct a sequence that is not in this enumeration. To construct such a sequence t, put a number at the i-th location that differs from the i-th number on the diagonal, and make sure that the sequence adheres to the restrictions put on ni and mi. If the diagonal has no value at the i-th location, any number can be placed, as long as it adheres to the previously mentioned restrictions. For example:

t = (1, 2, 4, 8, 12, 15, 17, ...)

By construction t ∈ T, and t differs from every ti in the enumeration, since they differ at the i-th position. Furthermore, because t is not in the given enumeration, it cannot uniquely correspond to an element of ℕ, which is a contradiction. As a consequence, T cannot be countably infinite. That means T is uncountably infinite and therefore C3 must be uncountably infinite as well.


The following Theorem 2.13 can now easily be proven.

Theorem 2.13. C3 has a larger cardinality than ℳ and C2.

Proof. This follows directly from Lemmas 2.9, 2.11 and 2.12, since C3 is uncountably infinite while ℳ and C2 are countably infinite.

2.2.2 Finer-grained connectedness

In the previous section it has become clear, by Theorem 2.13, that the third classification encompasses a far larger amount of quantifiers than the first two classes. This gives the justification to dissect the third class to create a finer-grained classification.

The question that then arises is how to break up the third classification. This can be answered by looking at the first two classifications. In the previous section it was discussed that the number of parameters is a unique property of these two classes: the first class has one parameter and the second class two. Since the third class can have a varying number of parameters, it makes sense to split the class on this property. It is then easy to see that the proof of countability of C2 with Theorem 2.10 can be extended to the new classes. So, under this new classification each class will have countably many quantifiers, except the class where N = ∞, since when N is infinite, Cantor's diagonal argument can still be applied to prove uncountability. If this class is disregarded, the newly proposed classification has the nice property that each class has the same cardinality.

Another argument for this kind of classification is the work done by Feldman 2000. In this work it is shown that, if concepts are described in 'incompressible' terms, shorter descriptions are easier to learn for humans. For example, the quantifier '4, 5, 6 or more' can be compressed to 'at least 4', which is incompressible (in English). And since the quantifier 'at least 4 or at most 2' is incompressible and has a longer description, it should be harder to learn than 'at least 4'. This measure of complexity can be extended to classifying quantifiers by their number of parameters, which is another justification to break up the third CBD class, since it contains quantifiers with varying numbers of parameters, whereas the newly proposed classification has an increasing and unique number of parameters for each class that is less connected.

Definition 2.14. Fine-grained classification of connectedness: The class of connectedness of a quantifier Q is determined and represented by its number of parameters. The parameters define the intervals of values of |A ∩ B| for which the characteristic function χ_Q returns 1. Quantifiers with a higher number of parameters are considered less connected.

A less formal definition can be given in terms of the number of switches. As mentioned in the previous section, this corresponds to the number of parameters. This is the case since the parameters determine where the truth intervals of a quantifier start and end, which are exactly the places where the switches take place.

In the next section a concrete measure of monotonicity will be introduced using this connectedness classification.

2.3 A measure for Monotonicity

In the previous two sections the cardinality of M was allowed to be infinite; from this section on this will no longer be the case. As mentioned in the previous section, when M is allowed to be infinite and |A ∩ B| is infinite as well, the 'infinite class' still contains an uncountable number of quantifiers. Furthermore, defining a normalised measure under infinity is not practical, and only experiments under finite models are possible and accounted for. For these three reasons only models with a finite cardinality of M will be considered.

Consequently, the number of parameters of possible quantifiers in such models is bounded by the size of M and varies under different sizes. Take for example the quantifier 'even'. From Figure 2.2 the total number of switches from 'true' to 'false' and vice versa can be counted.


Figure 2.2: The 'true' |A ∩ B| intervals of the quantifier 'even', shown for |A ∩ B| from 0 to 20 under two model sizes |M1| and |M2|.

Under |M1| = 10 it has 10 parameters, but under |M2| = 20 it has 20 parameters, while they should be equally 'non-monotonic'. This problem can be mitigated by defining the quantifiers under the largest cardinality that the model is allowed to have; in the example, 'even' would be defined under M2. Then, by normalising the number of parameters by the largest cardinality, the following measure of monotonicity can be defined:

Definition 2.15. The monotonicity M_M of a quantifier Q is defined by

M_M(Q) = 1 − (NOP(Q) − 1) / (|M| − 1)  (2.14)

where NOP(Q) is the number of parameters of quantifier Q and M is the model with the largest cardinality of all the considered models.

In the fraction, both the NOP and |M| are reduced by one, since the minimal possible number of parameters is 1.
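As an illustration of Definitions 2.14 and 2.15 (a sketch for this text, not the thesis' code): for a quantifier that depends only on c = |A ∩ B|, the number of parameters can be obtained by counting the true/false switches along c = 0, 1, ..., |M|, and the measure then follows directly:

    # Sketch: NOP as the number of true/false switches over c = |A ∩ B|,
    # and M_M(Q) = 1 - (NOP(Q) - 1) / (|M| - 1) as in equation (2.14).
    def nop(quantifier, max_count):
        truth = [quantifier(c) for c in range(max_count + 1)]
        return sum(1 for prev, cur in zip(truth, truth[1:]) if prev != cur)

    def monotonicity(quantifier, max_count=20):
        return 1.0 - (nop(quantifier, max_count) - 1) / (max_count - 1)

    print(monotonicity(lambda c: c >= 5))              # 1.0      (Q0, 'at least 5')
    print(monotonicity(lambda c: c >= 8 or c <= 2))    # ~0.9474  (Q1)
    print(monotonicity(lambda c: c % 2 == 0))          # 0.0      (Q4, 'even')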

Figure 2.3 shows ‘true’ intervals of different quantifiers which have different monotonicity measures and the specifics are given in table 2.1.

Figure 2.3: Different quantifiers Q0 to Q4 and their 'true' intervals for |A ∩ B| from 0 to 20, ordered by increasing difficulty. • means inclusive and ◦ exclusive.

         Q0          Q1                       Q2                   Q3                   Q4
Name:    At least 5  At least 8 or at most 2  periodic interval 4  periodic interval 2  even
M_M:     1.0000      0.9474                   0.7895               0.5263               0.0000

Table 2.1: Quantifiers from Figure 2.3. The names 'periodic interval 4' and 'periodic interval 2' originate from the size and periodicity of the intervals. Because 'periodic interval 4' has larger intervals than 'periodic interval 2', it also has a smaller number of parameters.

It is then expected that the quantifiers Q0, Q1, Q2, Q3 and Q4 have an increasing learning difficulty in the given order, due to their monotonicity measures. From this expectation follows the hypothesis:


Hypothesis 1. Quantifiers that have a lower monotonicity measure under definition 2.15 are harder to learn.


CHAPTER 3

Learning Model

3.1 Recurrent Neural Network

Since quantifiers are defined as set-theoretic objects, the goal is to let a learning model learn which models belong to which quantifier. For this purpose a learning model is required that can take sequential input from the elements of a model. This is necessary since quantifiers related to the quantity universal are order sensitive. This is possible by using a recurrent neural network (RNN). Another advantage is that the raw model can be fed to the RNN in this manner. Usually, for 'vanilla' neural networks (NN), the (expected) significant features are extracted from the input and fed to the NN. This helps the NN to learn faster, since it does not need to discover by itself which features are important. In the case of this study the network should learn what is important by itself, since learnability is of interest. The two differences between an NN and an RNN are its structure and the training method. Firstly, the hidden layers of an RNN loop back onto themselves as they handle sequential input at each time step. This looping mechanism passes the state h_{t−1} of the previous time step to the next one, which causes the previous states to have influence on the current one. Secondly, RNNs are trained by a technique called Backpropagation Through Time⁷ ⁸. In this algorithm the RNN gets unfolded into the number of time steps corresponding to that of the input sequence. Then the weights of the network are adjusted by the original backpropagation algorithm, backwards through the time steps.

Figure 3.1: RNN with two hidden layers h⁽¹⁾ and h⁽²⁾, folded and unfolded over the |M| time steps, where U and W denote the weights of the network.

⁷ This technique has been derived several times independently; see for example Robinson and Fallside 1987.
⁸ And Werbos 1988.


Furthermore, notice from Figure 3.1 that the weights W and U are the same at each step, so the backpropagation through time algorithm adjusts the same weights at each time step, as opposed to different ones. Also notice that the hidden layers of the RNN receive previous states h0 in the first time step. These are usually initialised as all zeroes, as will be the case in the implementation for the experiments.

3.1.1 LSTM

A long short-term memory (LSTM) network⁹ is a type of RNN where each node in the hidden layer keeps an additional state c_t. This c_t state is used as a memory cell to remember specific features throughout every time step in the network. The value of c_t depends on c_{t−1}, h_{t−1} and x_t, where the relation is visually depicted in Figure 3.2:

Figure 3.2: An LSTM unit. The squares represent neural network layers, where the label denotes the activation function (σ or tanh). The circles represent element-wise operations, where the label denotes the operator. Merging arrows means concatenation of two vectors and splitting arrows represents copying.

Formally the computations in the LSTM unit are defined by the following equations:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
ĉ_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
c_t = (f_t ⊗ c_{t−1}) ⊕ (i_t ⊗ ĉ_t)
h_t = o_t ⊗ tanh(c_t)

The intuition is that the LSTM unit works with three 'gates', where each gate regulates the flow of information. Each neural network node with activation function σ in Figure 3.2 is a gate. The gate that outputs f_t is called the forget gate; it regulates what information of the previous activation h_{t−1} and the input x_t is unnecessary. The gate that outputs i_t regulates what information of x_t and h_{t−1} is relevant; this gate is therefore called either the input or the update gate. The last gate, which outputs o_t, filters what information of the new memory state is passed through the output of the LSTM unit and is called the output gate. ĉ_t represents the new candidate update values for the memory cell c_t; these candidate values are computed by the neural network node that outputs ĉ_t. So, training an LSTM network involves additional weights: W_f, W_i, W_o and W_c. These weights determine how information flows through each unit and which information the network should remember or forget through each time step.
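A single LSTM step, following the equations above, can be sketched directly in NumPy (toy dimensions and random weights, purely illustrative; the thesis' actual implementation uses TensorFlow):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
        # Concatenate previous hidden state and current input, as in Figure 3.2.
        z = np.concatenate([h_prev, x_t])
        f_t = sigmoid(W_f @ z + b_f)          # forget gate
        i_t = sigmoid(W_i @ z + b_i)          # input/update gate
        c_hat = np.tanh(W_c @ z + b_c)        # candidate memory values
        o_t = sigmoid(W_o @ z + b_o)          # output gate
        c_t = f_t * c_prev + i_t * c_hat      # new memory cell
        h_t = o_t * np.tanh(c_t)              # new hidden state
        return h_t, c_t

    # Toy dimensions: 6-dimensional input (4 zone bits + 2 quantifier bits), 40 units.
    n_in, n_hid = 6, 40
    rng = np.random.default_rng(0)
    W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "fico"}
    b = {k: np.zeros(n_hid) for k in "fico"}
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    x = np.zeros(n_in); x[0] = 1.0; x[4] = 1.0     # one element in A ∩ B, quantifier 0
    h, c = lstm_step(x, h, c, W["f"], W["i"], W["c"], W["o"],
                     b["f"], b["i"], b["c"], b["o"])
    print(h.shape, c.shape)   # (40,) (40,)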

3.2 Data structure

Each data sample will be a tuple of a single model ⟨M, A, B⟩ and its attached truth value. The model is represented as a sequence of vectors, where each vector corresponds to an element o_i and is encoded as a one-hot vector, i.e. a vector in which one entry is '1' and the rest are '0'. The '1' denotes the membership of o_i in one of the following mutually exclusive sets: A ∩ B, A \ B, B \ A and M \ (A ∪ B), such that:

o_i = [1, 0, 0, 0] → o_i ∈ A ∩ B
o_i = [0, 1, 0, 0] → o_i ∈ A \ B
o_i = [0, 0, 1, 0] → o_i ∈ B \ A
o_i = [0, 0, 0, 1] → o_i ∈ M \ (A ∪ B)

A model M is then represented as an array of these elements:

M = [o_1, o_2, ..., o_|M|]

Since the goal is to let a single learning model learn all given quantifiers, each model needs to be annotated with the quantifier Q for which it is intended. Suppose n different quantifiers need to be trained; then the quantifiers are encoded as one-hot vectors, such that:

Q_0 = [1, 0, ..., 0]
Q_1 = [0, 1, ..., 0]
...
Q_{n−1} = [0, 0, ..., 1]

To indicate that a model ⟨M, A, B⟩ is used to learn a quantifier Q_j, the one-hot vector representation of Q_j will be concatenated to each element of the model ⟨M, A, B⟩. So a model ⟨M, A, B⟩ for quantifier Q_j is represented as:

⟨M, A, B⟩_{Q_j} = [o_1 Q_j, o_2 Q_j, ..., o_|M| Q_j]

where o_i Q_j denotes that the two vectors are concatenated. An example of a model with |M| = 4 and two possible quantifiers would be:

⟨M, A, B⟩_{Q_0} = [[1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]]    (3.1)

Then for each model it is determined whether it is 'true' or 'false', i.e. whether ⟨M, A, B⟩ ∈ Q_j or not. True and false are also encoded as one-hot vectors, where [1, 0] denotes 'true' and [0, 1] denotes 'false'.

3.3 Data generation

The data generation algorithm is taken from the work of Steinert-Threlkeld and Szymanik In press¹⁰:


Algorithm 1: Data generation algorithm
input: max len, num data, quants

data ← [ ]
while len(data) < num data do
    Choose N uniformly at random from between 1 and max len
    Choose Q uniformly at random from quants
    cur seq ← N randomly chosen items from {A ∩ B, A \ B, B \ A, M \ (A ∪ B)}
    Pad cur seq up to max len with null vectors
    if ⟨Q, cur seq, cur seq ∈ Q ?⟩ ∉ data then
        Add ⟨Q, cur seq, cur seq ∈ Q ?⟩ to data
shuffle(data)
balance by undersampling(data)
return train split(data), test split(data)

So each data sample is generated by first choosing a random number between 1 and max len, where max len is the maximum cardinality of a model. The data sample is then assigned a quantifier Q at random, and the sequence is generated as explained in the previous section. Then it is padded up to the maximum length. Take for example the model (3.1) with max len = 6; it would be padded as:

⟨M, A, B⟩_{Q_0} = [[1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0]]    (3.2)

This is necessary since the implementation of RNNs in TensorFlow¹¹ requires the maximum sequence length to be predefined. Then, by using a masking layer,¹² the network can skip the padded time steps. This way, the network accepts inputs of models with different sizes.

After the padding, the sequence is added as a data sample if it is unique. After num data unique samples are created, the samples are shuffled, in case there is some unknown bias. Subsequently, the data gets undersampled to balance the number of models across quantifiers and the number of true/false samples for each quantifier. This is achieved by finding the quantifier with the least number of models under a certain truth value, i.e. whether the model belongs to the quantifier or not, and selecting only that many models per truth value for each quantifier. This prevents the learning model from being biased by the data towards a specific quantifier. Lastly, the data is split 70%/30% into a training and a test set, respectively.
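A rough Python sketch of Algorithm 1, under the assumption that a quantifier is given as a predicate over the sequence of zone labels; padding with null vectors is left to the encoding step, and all helper names are illustrative rather than the thesis' code:

    import random

    def generate_data(max_len, num_data, quants, seed=0):
        random.seed(seed)
        data = set()
        while len(data) < num_data:
            n = random.randint(1, max_len)                      # model size
            q_idx = random.randrange(len(quants))               # quantifier
            zones = tuple(random.randrange(4) for _ in range(n))
            label = quants[q_idx](zones)                        # zones ∈ Q ?
            data.add((q_idx, zones, label))                     # uniqueness via set
        data = list(data)
        random.shuffle(data)                                    # guard against hidden bias
        data = balance_by_undersampling(data)
        split = int(0.7 * len(data))                            # 70% train / 30% test
        return data[:split], data[split:]

    def balance_by_undersampling(data):
        # Keep, per (quantifier, truth value), only as many samples as the rarest
        # combination has, so no quantifier or label dominates training.
        buckets = {}
        for sample in data:
            buckets.setdefault((sample[0], sample[2]), []).append(sample)
        smallest = min(len(b) for b in buckets.values())
        balanced = [s for b in buckets.values() for s in b[:smallest]]
        random.shuffle(balanced)
        return balanced

    at_least_5 = lambda zones: sum(z == 0 for z in zones) >= 5   # z == 0 means A ∩ B
    even = lambda zones: sum(z == 0 for z in zones) % 2 == 0
    train, test = generate_data(max_len=10, num_data=2000, quants=[at_least_5, even])
    print(len(train), len(test))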

3.3.1 Curriculum for the 'even' quantifier

However, in the case of the 'even' quantifier it might be convenient to bias the training data. Since it is the least monotone quantifier, it might take significantly longer to train, especially when |M| becomes several orders of magnitude larger. For this purpose curriculum training¹³ can be applied. The idea is to bring some structure to the data in such a way that it biases learning towards the 'even' quantifier. The required structure can be motivated by how the network could potentially learn the notion of 'even'. Let the following sequence denote whether each element o_i ∈ M is in A ∩ B:

[0, 1, 0, 1, 0, 1, 1]

where '1' denotes membership of the element at that location. For example, o_1 ∉ A ∩ B and o_2 ∈ A ∩ B. In this case the model is 'even', since it has an even number of ones. One could count the number of ones manually, by (perhaps) the following thought process: look at each number one by one from left to right and count up by one each time a '1' is observed, starting at zero. The thought at each time step could then be: zero, one, one, two, two, three, four; four is even, so the number of ones is even. Or one could alternate between odd and even every time a '1' is observed, starting with even: even, odd, odd, even, even, odd, even, so the number of ones is even. The intuition from these two examples is that determining whether the number of ones is even depends on what was previously observed. So it is essential to correctly determine at each time step what the count is, or whether the visited subsequence has an even or odd number of ones.

¹¹ TensorFlow is an open-source framework for machine learning. See https://www.tensorflow.org/
¹² See https://www.tensorflow.org/api_docs/python/tf/keras/layers/Masking
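The running-state intuition above can be made concrete with a few lines of Python (an illustration for this text, not the thesis' code): a single bit of state, updated at every time step, suffices to track the parity of the number of ones, which is exactly the kind of incremental computation a recurrent network has to represent.

    def is_even(sequence):
        parity_even = True          # state: "number of 1s seen so far is even"
        for bit in sequence:
            if bit == 1:
                parity_even = not parity_even
        return parity_even

    print(is_even([0, 1, 0, 1, 0, 1, 1]))   # True: four 1s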

Using this intuition, the data can be structured in such a way that the network learns the quantifier 'even' on models of increasing size. This ensures that the network learns on smaller examples first. Furthermore, it is required that the network learns the smaller examples correctly. This will be achieved by making 'duplicate' examples possible. Take for example the model (3.2), but with different padding:

⟨M, A, B⟩_{Q_0} = [[1, 0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0],
                  [0, 1, 0, 0, 1, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0]]

The padding at a different location makes this new model a valid 'duplicate'. This causes the data to contain many 'duplicates' for smaller models, but far fewer to none for larger models, because the models are chosen randomly and the number of possible models grows exponentially when the size increases.¹⁴ Consequently, the probability of choosing the same model becomes exponentially smaller as well, which causes there to be more training examples for smaller models. The last change to the data is to have a balanced number of samples for each model size, where the number of samples for each size is determined by the following equation:

#samples(size) = min(size limit, num data / max len)  (3.3)

where size limit is a rough underestimation of the total number of possible models. The estimation is done by calculating:

estimation = 4^size · C(max len, size) · num quants · 0.7

The binomial coefficient is required since the number of ways to pad a model can be seen as a combinatorial problem of stars and bars.¹⁵ The constant 0.7 is there firstly to ensure that not all possible samples are used for training and secondly to undershoot the estimation. If num data samples have not been generated by the time a model size of max len is reached, models of size max len are generated until the quota is reached. Algorithm 2 depicts this process.

Algorithm1(training data, 1 − training data ratio) represents that the test data is generated as in Algorithm 1; however, only data samples that are not in the training data are allowed in the test data. 1 − training data ratio is the ratio of test samples out of num data.

From this algorithm arises the following hypothesis:

¹⁴ To be exact, 4^|M|, since for each element there are 4 possible choices.

¹⁵ See 'An Introduction to Probability Theory and Its Applications', page 38. It is the problem of how many ways there are to order two kinds of objects (e.g. stars and bars), when there are n of one type of object and r of the other.


Hypothesis 2. a) The curriculum learning given by Algorithm 2 will only affect the learning of the 'even' and 'odd' quantifiers and not any other quantifier, when compared to Algorithm 1. b) The effect of the curriculum learning is that the 'even' quantifier is learned faster.

In the next chapter both data generation algorithms will be implemented to see if the curriculum has its intended effect.

Algorithm 2: Structured training data generation algorithm
input: max len, num data, quants, training data ratio

training data ← [ ]
num data ← num data · training data ratio
N ← 1
generated ← 0
while len(training data) < num data do
    Choose Q uniformly at random from quants
    cur seq ← N randomly chosen items from {A ∩ B, A \ B, B \ A, M \ (A ∪ B)}
    Randomly pad cur seq up to max len with null vectors
    if ⟨Q, cur seq, cur seq ∈ Q ?⟩ ∉ training data then
        Add ⟨Q, cur seq, cur seq ∈ Q ?⟩ to training data
        generated ← generated + 1
    if generated ≥ #samples(N) and N ≠ max len then
        N ← N + 1
        generated ← 0
shuffle(training data)
balance by undersampling(training data)
order by model size(training data)
test data ← Algorithm1(training data, 1 − training data ratio)
return training data, test data
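As an illustration of equation (3.3) and the size limit estimation (a sketch under the stated formulas; the function names are ours, not the thesis' code):

    from math import comb

    def size_limit(size, max_len, num_quants):
        # 4^size zone assignments, C(max_len, size) ways to place the padding,
        # scaled by 0.7 to undershoot and to keep samples back for testing.
        return int(4 ** size * comb(max_len, size) * num_quants * 0.7)

    def samples_per_size(size, max_len, num_data, num_quants):
        return min(size_limit(size, max_len, num_quants), num_data // max_len)

    for size in range(1, 6):
        print(size, samples_per_size(size, max_len=10, num_data=100000, num_quants=2))
    # Small sizes are capped by how few distinct models exist; larger sizes by the quota.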

3.4 Implementation

Now that all necessary components have been explained, we continue with the implemented learning model and how it is used in the experiments. The learning model consists of two parts. The first part, which takes the input, is an LSTM network with two hidden layers as depicted in Figure 3.1, only with each node being an LSTM unit as depicted in Figure 3.2. Each hidden layer has a total of 40 units. The last output y_|M| of the LSTM network is passed to the second part of the learning model, the softmax layer. This layer has two neurons to match the prediction, which has a vector length of two; the first neuron corresponds to 'true' and the second to 'false'. The layer has a softmax activation function¹⁶, which returns a probability distribution over the two possible outcomes. The prediction of the network is then equal to the outcome with the highest probability.


Figure 3.3: Visual representation of the learning model: the input ⟨M, A, B⟩_{Q_j} is fed to the LSTM network, whose last output y_|M| is passed to the softmax layer, which produces the prediction [1, 0] or [0, 1].

The whole network trains with a batch size of 32. Using the 32 training examples, the network outputs the corresponding predictions. Using these predictions and the actual data, the error is calculated with a cross-entropy loss function. The weights in the network are then adjusted by passing the error through the network using Backpropagation (Through Time). A batch is considered a single training step or iteration. To measure the progress during training for the experiments, after each training step the network is fed the test samples and its accuracy is measured. Furthermore, the Adam optimizer¹⁷ with learning rate 10⁻⁴ was used for the gradient descent. The learning model is implemented using TensorFlow.
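A minimal tf.keras sketch of the described architecture, assuming the encoding of Section 3.2 (two masked LSTM layers of 40 units, a 2-way softmax, cross-entropy loss, Adam with learning rate 10⁻⁴, batch size 32); the thesis used TensorFlow directly, so the layer choices and other details here are illustrative assumptions rather than the original implementation:

    import tensorflow as tf

    max_len, input_dim = 20, 4 + 5   # 4 zone bits + one-hot for 5 quantifiers

    model = tf.keras.Sequential([
        # Skip padded time steps (null vectors) so variable-size models are accepted.
        tf.keras.layers.Masking(mask_value=0.0, input_shape=(max_len, input_dim)),
        tf.keras.layers.LSTM(40, return_sequences=True),
        tf.keras.layers.LSTM(40),                    # only the last output y_|M|
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.summary()
    # model.fit(x_train, y_train, batch_size=32, validation_data=(x_test, y_test))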


CHAPTER 4

Experiments

4.1 Learnability of quantifiers under monotonicity

The first experiment tests hypothesis 1. For this purpose the introduced learning model learned the five quantifiers from Table 2.1, where the models had a maximum size of |M| = 20. A total of 350000 data samples were created using Algorithm 1 for each trial of learning, and in total 20 trials were run. The LSTM stopped learning after 35 epochs. To evaluate the difference in learnability of the quantifiers, the convergence point to 95% accuracy is measured for each quantifier in each trial. A quantifier is learned more easily if its convergence point is significantly different from that of the quantifier it is compared to. To see whether they are significantly apart, a paired t-test is calculated on the convergence points between two quantifiers; this test is done for each pair of quantifiers. The paired t-test is used, instead of the standard one, because only the differences between convergence points from the same network are of interest. Furthermore, a Bartlett's test¹⁸ (denoted by B) is performed to get a sense of whether the convergence points of different quantifiers have equal variance. For all statistical tests α = 0.05 was used.

Lastly, linear regression is performed on the convergence points. This makes the linear relation between monotonicity and learnability apparent and gives a rough estimate of when quantifiers with a given monotonicity would converge to the 95% accuracy threshold.
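The statistical analysis described above can be sketched with SciPy as follows; the arrays below are placeholder data for illustration only, not experimental results, and all variable names are assumptions:

    import numpy as np
    from scipy import stats

    # `conv[q]` would hold the 20 convergence points (training steps to reach 95%
    # accuracy) of quantifier q; here it is filled with synthetic placeholder data.
    rng = np.random.default_rng(0)
    conv = {q: rng.normal(loc=10000 * (q + 1), scale=500 * (q + 1), size=20)
            for q in range(5)}

    # Paired t-test between every pair of quantifiers (same network per trial).
    for a in range(5):
        for b in range(a + 1, 5):
            t, p = stats.ttest_rel(conv[a], conv[b])
            print(f"Q{a} vs Q{b}: t = {t:.2f}, p = {p:.2e}")

    # Bartlett's test for equality of variances across all five quantifiers.
    B, p = stats.bartlett(*conv.values())
    print(f"Bartlett: B = {B:.1f}, p = {p:.2e}")

    # Linear regression of convergence point on the monotonicity measure.
    monotonicity = np.array([1.0, 0.9474, 0.7895, 0.5263, 0.0])
    points = np.concatenate([conv[q] for q in range(5)])
    measure = np.repeat(monotonicity, 20)
    slope, intercept, r, _, _ = stats.linregress(measure, points)
    print(f"slope = {slope:.0f}, intercept = {intercept:.0f}, R^2 = {r**2:.3f}")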

4.1.1 Results

By visually inspecting the learning curves in Figure 4.1, the order of learnability is as expected: quantifiers with higher monotonicity are learned faster by the LSTM network than quantifiers with lower monotonicity. The learning curves clearly converge to 95% accuracy at different rates. This is statistically confirmed by the paired t-tests in Table 4.1, where each pair of quantifiers converges at significantly different rates. This result supports hypothesis 1 in the context of this thesis.

      Q1                Q2                Q3                Q4
Q0    t = −36.4         t = −38.8         t = −18.8         t = −21.9
      p = 1.42 · 10⁻¹⁷  p = 4.99 · 10⁻¹⁸  p = 8.11 · 10⁻¹³  p = 6.55 · 10⁻¹⁴
Q1                      t = −20.6         t = −16.6         t = −20.6
                        p = 1.88 · 10⁻¹³  p = 6.42 · 10⁻¹²  p = 1.81 · 10⁻¹³
Q2                                        t = −14.1         t = −18.6
                                          p = 8.27 · 10⁻¹¹  p = 1.00 · 10⁻¹²
Q3                                                          t = −9.68
                                                            p = 2.50 · 10⁻⁸

Table 4.1: Paired t-test results of the convergence points between each pair of quantifiers.


Figure 4.1 also seems to show that the result of this experiment does not depend on the choice of the 95% threshold; a threshold from 90% to 99% should still produce the same outcome. This is shown in Appendix B.

Figure 4.1: The learning curves of the five quantifiers, the bold lines are the median values.

The linear regression of the convergence points fits the data rather well (R² = 0.916). The fitted variables are: intercept = 104256 and slope = −103489. See the fitted line in Figure 4.2.

Figure 4.2: Linear regression of the individual convergence points.

An interesting observation is that the variance of the convergence points decreases when the monotonicity increases. The result of Bartlett's test (B = 233, p = 2.25 · 10⁻⁴⁹) confirms that at least their variances are not equal. A visual observation is that the individual distributions of the convergence points seem to be symmetric.

4.2 Curriculum learning of the 'even' quantifier

In this experiment hypothesis 2 is tested. To see whether Algorithm 2 only trains Q4 ('even') differently from the quantifiers in the previous experiment, the five quantifiers will each be learned separately. A single quantifier is learned by an LSTM first using the original data generation algorithm; then, on another LSTM, the quantifier is learned again with the curriculum training. A single trial consists of the quantifier being learned by the LSTMs using both algorithms, where each algorithm generates 150000 data samples. For each quantifier a total of 10 trials were performed and from these trials the 95% accuracy convergence points are measured. Subsequently, the t-test was used to see, for each quantifier, whether the convergence points from the two algorithms differ significantly. The paired t-test is not used here, since we are comparing convergence points obtained from different networks.

Next, the exact same experiment as in the previous section was performed, but with Algorithm 2. This was to confirm that the curriculum training does not change the result of the previous experiment. A skew test¹⁹ (its statistic is denoted with s) is performed on the convergence points of each quantifier under each training regime, to see whether the same kind of distribution applies under the curriculum training.

Lastly, the training with and without curriculum was applied to train multiple quantifiers at once: first with only Q0, secondly with Q0, Q4, then with Q0, Q2, Q4 and lastly with Q0, Q1, Q2, Q3, Q4. The purpose of this test was to see whether the speed-up of the curriculum training is constant or varies with the number of quantifiers that the LSTM network is learning. The speed-up is measured by dividing the means of the convergence points from the two types of training. Each case was repeated for 5 trials and the number of data samples was determined by num samples = num quant · 150000, except for the first and last case, since the experiment has indirectly already been performed for them; their results are taken from the previous parts of the experiment.

4.2.1 Results

Figure 4.3 shows that the convergence points are very close to each other for the first four quantifiers Q0, Q1, Q2 and Q3. However, the result of the t-test for Q0 is (t = −5.66, p = 2.27 · 10⁻⁵), which shows that its convergence points are actually significantly apart. For Q1 (t = −2.03, p = 0.0572), Q2 (t = −1.27, p = 0.220) and Q3 (t = 0.306, p = 0.763) the difference is insignificant. However, notice that for Q2 the training curves with no curriculum diverge from the ones with curriculum after the 95% accuracy threshold.

Figure 4.3: Learning curves of Q0, Q1, Q2 and Q3 with and without curriculum training.


For Q4 in Figure 4.4 the difference between the convergence points is obvious, in contrast to that of the other four quantifiers. The result of the t-test (t = 5.48, p = 3.35 · 10⁻⁵) confirms this, which means that under curriculum training Q4 is learned significantly faster by the LSTM network.

Figure 4.4: Learning curves of Q4 with and without curriculum training.

For the second part of the experiment, Figure 4.5 shows that the order in which the quantifiers are learned by the LSTM network did not change under curriculum training. Furthermore, Table 4.2 contains the paired t-tests between the quantifiers under curriculum training and shows that the quantifiers still converge at significantly different rates, which means that the curriculum training does not affect the result of the previous section. The main difference is that Q4 is learned faster than under Algorithm 1.

      Q1                Q2                Q3                Q4
Q0    t = −17.7         t = −28.8         t = −23.4         t = −22.1
      p = 2.26 · 10⁻¹²  p = 6.83 · 10⁻¹⁶  p = 2.23 · 10⁻¹⁴  p = 5.52 · 10⁻¹⁴
Q1                      t = −24.1         t = −21.2         t = −20.0
                        p = 1.42 · 10⁻¹⁴  p = 1.19 · 10⁻¹³  p = 2.91 · 10⁻¹³
Q2                                        t = −18.22        t = −17.7
                                          p = 1.49 · 10⁻¹²  p = 2.15 · 10⁻¹²
Q3                                                          t = −7.69
                                                            p = 6.28 · 10⁻⁷

Table 4.2: Paired t-test results of the convergence points between each pair of quantifiers under curriculum training.


Figure 4.5: The learning curves of the five quantifiers with curriculum training. The bold lines are the median values.

The linear regression for the convergence points with curriculum training fits the data slightly worse (R² = 0.913). The fitted variables are: intercept = 93411 and slope = −89137. The fitted line is shown in Figure 4.6.

Figure 4.6: Linear regression of the individual convergence points with curriculum training.

The same observation can be made as in the previous experiment: the variance decreases as the monotonicity increases, but the distribution of the convergence points looks different for Q4. Another Bartlett's test (B = 172, p = 3.67 · 10⁻³⁶) verifies that the variances of the quantifiers are not equal.

The results from the skew test show that the distributions of Q0 and Q4 are different under the two training regimes. For Q0 with no curriculum training the skewness is significantly different from 0, and with curriculum training insignificantly different; the reverse is true for Q4. For the other quantifiers it stays the same, with the skewness insignificantly different from 0. The results of the skew tests can be found in Table 4.3.

                          Q0            Q1       Q2      Q3      Q4
No curriculum    s-value  3.12          −0.156   0.528   1.30    0.897
                 p-value  1.81 · 10⁻³   0.875    0.597   0.194   0.370
With curriculum  s-value  1.17          1.81     0.473   0.511   3.25
                 p-value  0.241         0.0703   0.636   0.609   1.16 · 10⁻³

Table 4.3: Skew test results of the convergence points of each quantifier under both training regimes.


The first two parts of the experiment show that Q0 and Q4 are significantly affected by the curriculum training, which means part a) of hypothesis 2 is false.

Lastly, the result of the last part of the experiment is given in table 4.4:

number of quants    1      2      3      5
speed-up            1.30   1.30   1.31   1.13

Table 4.4: Speed-up of Q4 ('even') by curriculum training under different numbers of quantifiers learned.

This at least shows that part b) of hypothesis 2 is true: the curriculum training speeds up the training of Q4, even when multiple quantifiers are learned. However, up to three quantifiers the speed-up stays roughly constant, while with five quantifiers it is smaller.


CHAPTER 5

Discussion

The results of the first experiment give a substantial indication that hypothesis 1 is true: more monotone quantifiers are easier to learn by the implemented learning model. The linear regression shows that learnability is strongly linearly dependent on monotonicity. Consequently, it is also linearly dependent on connectedness, since the measure of monotonicity (2.14) is linearly dependent on the number of parameters, which defines connectedness. So, somehow, connectedness is an effective predictor of the learnability or complexity of a quantifier.

The argumentation of Chemla, Buccola, and Dautriche 2018 for this phenomenon comes from an important claim in the field of concept learning²⁰, which is that conjunction is easier than disjunction. In their definition of connectedness, less connected quantifiers are more disjunct and therefore harder to learn. This is also the case in the more fine-grained definition of connectedness in this thesis, because the number of parameters determines the number of switches between 'true' and 'false' intervals of |A ∩ B| of a quantifier, and the more switches there are, the more disjunct 'true' intervals there are. This kind of reasoning also falls in line with Feldman 2000, because more disjunct quantifiers seem to have a longer incompressible description length, and according to that work these are harder to learn.

The idea of incompressibility deserves further investigation. A description length is incompressible if any attempt at further compression results in loss of information. This relates to an important theorem in information theory, Shannon’s source coding theorem21, which is verbally stated as22:

N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; conversely if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost.
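For reference, the entropy in the statement above is the standard Shannon entropy of a discrete random variable X with outcome probabilities p(x):

H(X) = − Σ_x p(x) log₂ p(x)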

This shows that entropy determines a lower bound on compressibility, which means that simple quantifiers with shorter descriptions should have lower entropy than complex quantifiers with longer descriptions. In Carcassi, Steinert-Threlkeld, and Szymanik 2019, monotonicity is measured by using information entropy. The measure is given as23

mon(Q) = 1 − H(1_Q | 1_{≺Q}) / H(1_Q)    (5.1)

Where 1_Q and 1_{≺Q} are random variables, such that 1_Q is the truth value that a quantifier Q assigns to a model ⟨M, A, B⟩, and 1_{≺Q} indicates whether ⟨M, A, B⟩ has a submodel that Q considers true. H(1_Q) quantifies the uncertainty (entropy) of the truth value assigned to a model by Q.

20 See Goodman et al. 2008.
21 See Shannon 1948.
22 See MacKay 2003, taken from page 81.
23 For the derivation of the equation see Carcassi, Steinert-Threlkeld, and Szymanik 2019, section Measure of Monotonicity.


The conditional entropy H(1_Q | 1_{≺Q}) quantifies the uncertainty of the truth value assigned to a model by Q, given that it is known whether the model has a submodel that is true under Q. A model ⟨M, A, B′⟩ is a submodel of ⟨M, A, B⟩ if B′ ⊆ B and A ∩ B′ ⊆ A ∩ B. A monotone quantifier, e.g. Q0, gets monotonicity 1.0, because if there is a submodel that the quantifier considers true, then by the definition of monotonicity the quantifier must be true as well. So, knowing that a submodel is true removes all uncertainty and H(1_Q | 1_{≺Q}) / H(1_Q) becomes 0. The other quantifiers get monotonicity of: mon(Q1) = 0.996, mon(Q2) = 9.18·10⁻⁴, mon(Q3) = 2.01·10⁻⁵, mon(Q4) = 9.54·10⁻⁷. This measure complies with the measure given in this thesis, i.e. less connected quantifiers are also less monotone. It can be argued that this is the case because H(1_Q | 1_{≺Q}) / H(1_Q) represents the entropy of a quantifier and therefore determines the lower bound of compressibility. So, in this context mon(Q) measures how compressible a quantifier is under monotonicity and thus measures how complex a quantifier is by compressibility.

However, take for example the quantifier Q5 ‘at least 10 or at most 8 ’; its monotonicity is mon(Q5) = 0.374. So, the monotonicity in this measure can differ between quantifiers that have the same connectedness, since Q1 has the same class of connectedness as Q5, although this difference is of a smaller magnitude than the differences between quantifiers with different connectedness. It means that connectedness is not the only factor that influences compressibility, but it might indicate that connectedness is the strongest predictor of the learnability or complexity of a quantifier. This is however only based on a single example and should be investigated further in future work.
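To make the structure of this measure concrete, the sketch below computes the empirical entropies in (5.1) for quantifiers that depend only on n = |A ∩ B|, using a uniform distribution over n and a purely cardinality-based submodel relation. Both choices are simplifications of my own, so the resulting numbers will not match the values of Carcassi, Steinert-Threlkeld, and Szymanik 2019 quoted above; the sketch only illustrates why a monotone quantifier gets exactly 1.0 and why ‘even’ gets a value at (or, in the figures quoted above, near) 0.

from collections import Counter
from math import log2

# Minimal sketch of the entropy-based measure (5.1). A quantifier is taken to
# be a boolean function of n = |A ∩ B|, n is taken uniform over 0..N, and a
# "true submodel" exists iff the quantifier is true at some k <= n.
def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c)

def mon(q, N=500):
    pairs = []
    for n in range(N + 1):
        truth = q(n)                              # 1_Q for this model
        below = any(q(k) for k in range(n + 1))   # 1_{<Q}: a true "submodel" exists
        pairs.append((truth, below))

    h_q = entropy(Counter(t for t, _ in pairs))
    # Conditional entropy H(1_Q | 1_{<Q}) = sum over b of p(b) * H(1_Q | 1_{<Q} = b).
    h_cond = 0.0
    for b in (True, False):
        sub = [t for t, w in pairs if w == b]
        if sub:
            h_cond += len(sub) / len(pairs) * entropy(Counter(sub))
    return 1 - h_cond / h_q if h_q > 0 else 1.0

print(mon(lambda n: n >= 10))       # monotone 'at least 10' -> 1.0
print(mon(lambda n: n % 2 == 0))    # 'even' -> 0.0 under these assumptions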


CHAPTER 6

Conclusions

In this thesis a robust and fine-grained classification of connectedness has been introduced, which was used to create a measure of monotonicity. This new classification emerged from the one introduced by Chemla, Buccola, and Dautriche 2018. The main realisation about the previous classification was that the third class actually contained different types of quantifiers, determined by a different number of parameters. Splitting this class by the number of parameters gave rise to the new finer-grained classification, where each class has the same potential cardinality. Furthermore, this classification has a basis in psychology: it was shown by Feldman 2000 that concepts with a smaller incompressible description length are easier to learn, and the number of parameters can be seen as an incompressible description for each quantifier.

Then, by using an LSTM network, the learnability of five quantifiers with different monotonicity was determined computationally. The results of the experiments have shown that quantifiers with a different number of parameters, i.e. a different class of connectedness, are learned at significantly different rates, where quantifiers that are monotone are learned faster than less monotone quantifiers. Furthermore, the convergence points of quantifiers with different monotonicity seem to follow a linear trend, which gives a strong indication that connectedness is an effective predictor for the complexity or learnability of a quantifier. Further evidence can be found by looking into information-theoretic properties of quantifiers: it seems that connectedness influences the entropy of a quantifier most significantly, compared to other potential factors. These claims, however, need to be substantiated in future research.

Lastly, this thesis provided curriculum learning for the ‘even’ quantifier. This was achieved by biasing the data generation algorithm to create more examples on smaller models and by ordering the data by model size. The argumentation for these changes was purely intuitive: to be able to identify whether a sequence contains an ‘even’ number of things, it is essential to know whether its subsequences were ‘even’ or not as well. The curriculum training mostly had its intended effect: mainly the convergence point of ‘even’ was affected, and affected in such a way that it is learned faster. Most importantly, it had no effect on the order of learning between quantifiers with different monotonicity.
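As a rough illustration of the ordering step described above, a minimal sketch is given below; the data layout (sequences paired with truth labels) and the use of sequence length as the model size are assumptions for illustration, not the thesis implementation:

# Minimal sketch: order training examples by model size, so that small models
# are presented first. `examples` is an assumed list of (sequence, label) pairs.
def curriculum_order(examples):
    return sorted(examples, key=lambda ex: len(ex[0]))

examples = [([1, 0, 1, 1], True), ([1, 0], False), ([1, 1, 1], False)]
print(curriculum_order(examples))   # shortest (smallest-model) example first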


APPENDIX A

Proofs

Here we finish the proof of Lemma 2.12 on page 14 and show that Q7 and Q8 are disconnected.

Proof. (Q7 is disconnected): Any Q7 quantifier must have at least one subset defined as {⟨M, A, B⟩ : |A ∩ B| ≤ n0} and one subset defined as {⟨M, A, B⟩ : n1 ≤ |A ∩ B| ≤ m1}. Then a ‘partial’ quantifier can be defined as

Qpartial = {⟨M, A, B⟩ : |A ∩ B| ≤ n0} ∪ {⟨M, A, B⟩ : n1 ≤ |A ∩ B| ≤ m1}

Now let us take B″ such that ⟨M, A, B″⟩ ∈ {⟨M, A, B⟩ : n2 ≤ |A ∩ B| ≤ m2}. Next take the subset B′ ⊂ B″ such that |A ∩ B′| = n0 + 1. This means that ⟨M, A, B′⟩ ∉ Qpartial, since n0 < n0 + 1 and n0 + 1 < n1. Lastly take B ⊂ B′ such that |A ∩ B| ≤ n0. So, ⟨M, A, B⟩ ∈ Qpartial. Therefore, Qpartial is not connected under definition 2.5. Consequently, Q7 is not connected as well.

(Q8 is disconnected): For Q8 we again take Qpartial, but now its negation

¬Qpartial = {⟨M, A, B⟩ : n0 < |A ∩ B| < n1} ∪ {⟨M, A, B⟩ : m1 < |A ∩ B|}

Now let us take B″ such that ⟨M, A, B″⟩ ∈ {⟨M, A, B⟩ : m1 < |A ∩ B|}. Next take the subset B′ ⊂ B″ such that |A ∩ B′| = m1. This means that ⟨M, A, B′⟩ ∉ ¬Qpartial, since n1 ≤ m1. Lastly take B ⊂ B′ such that n0 < |A ∩ B| < n1. So, ⟨M, A, B⟩ ∈ ¬Qpartial. Therefore, ¬Qpartial is not connected under definition 2.5. Consequently, Q8 = ¬Q7 is not connected as well.

APPENDIX B

Learning rates at different thresholds

To show that the learning rates (from the experiment in section 4.1) are still significantly apart at accuracy thresholds of 90% and 99%, a paired t-test is done at these two values.

Q0 vs Q1: t = −33.9, p = 4.81·10⁻¹⁷
Q0 vs Q2: t = −35.3, p = 2.42·10⁻¹⁷
Q0 vs Q3: t = −21.0, p = 1.36·10⁻¹³
Q0 vs Q4: t = −22.4, p = 4.64·10⁻¹⁴
Q1 vs Q2: t = −27.4, p = 1.63·10⁻¹³
Q1 vs Q3: t = −19.9, p = 3.19·10⁻¹³
Q1 vs Q4: t = −21.68, p = 7.19·10⁻¹⁴
Q2 vs Q3: t = −17.0, p = 4.26·10⁻¹²
Q2 vs Q4: t = −20.1, p = 2.84·10⁻¹³
Q3 vs Q4: t = −12.4, p = 6.20·10⁻¹⁰

Table B.1: Paired t-test results with accuracy threshold of 90%.

Table B.1 contains the results of the paired t-tests and shows that the learning curves pass the 90% threshold at significantly different rates. Figure B.1 is the same as in section 4.1.1, but with the threshold at 90%.

Figure B.1: The learning curves of the five quantifiers, the bold lines are the median values.
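The quantities compared in these tables could be obtained along the following lines; this is a sketch with placeholder curves rather than the evaluation code of the experiments, and any smoothing of the accuracy curves is ignored:

from scipy import stats

# Minimal sketch: find the first training step at which an accuracy curve
# passes a threshold, then compare two quantifiers with a paired t-test
# over runs. All curves below are placeholders.
def convergence_point(accuracies, threshold):
    for step, acc in enumerate(accuracies):
        if acc >= threshold:
            return step
    return None  # the curve never reaches the threshold

q0_runs = [[0.5, 0.7, 0.86, 0.91, 0.95], [0.5, 0.74, 0.9, 0.93, 0.96]]
q1_runs = [[0.5, 0.6, 0.7, 0.82, 0.91], [0.5, 0.58, 0.72, 0.85, 0.92]]

q0_points = [convergence_point(run, 0.90) for run in q0_runs]
q1_points = [convergence_point(run, 0.90) for run in q1_runs]

t_stat, p_value = stats.ttest_rel(q0_points, q1_points)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")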


Q0 vs Q1: t = −16.7, p = 5.54·10⁻¹²
Q0 vs Q2: t = −24.0, p = 1.49·10⁻¹⁴
Q0 vs Q3: t = −20.7, p = 1.36·10⁻¹³
Q0 vs Q4: t = −20.1, p = 2.67·10⁻¹³
Q1 vs Q2: t = −21.3, p = 1.05·10⁻¹³
Q1 vs Q3: t = −18.9, p = 7.83·10⁻¹³
Q1 vs Q4: t = −18.8, p = 8.02·10⁻¹³
Q2 vs Q3: t = −17.1, p = 3.73·10⁻¹²
Q2 vs Q4: t = −17.4, p = 2.93·10⁻¹²
Q3 vs Q4: t = −6.27, p = 8.54·10⁻⁶

Table B.2: Paired t-test results with accuracy threshold of 99%.


Bibliography

Barwise, Jon and Robin Cooper (1981). “Generalized quantifiers and natural language”. In: Philosophy, language, and artificial intelligence. Springer, pp. 241–301.

Bengio, Yoshua et al. (2009). “Curriculum learning”. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp. 41–48.

Carcassi, Fausto, Shane Steinert-Threlkeld, and Jakub Szymanik (2019). “The emergence of monotone quantifiers via iterated learning”.

Chemla, Emmanuel, Brian Buccola, and Isabelle Dautriche (2018). “Connecting content and logical words”. In: Journal of Semantics.

D’Agostino, Ralph B, Albert Belanger, and Ralph B D’Agostino Jr (1990). “A suggestion for using powerful and informative tests of normality”. In: The American Statistician 44.4, pp. 316–321.

Feldman, Jacob (2000). “Minimization of Boolean complexity in human concept learning”. In: Nature 407.6804, p. 630.

Feller, William (1957). An introduction to probability theory and its applications.

Fletcher, Peter and C Wayne Patty (1996). Foundations of higher mathematics. Brooks/Cole publishing company.

Givant, Steven and Paul Halmos (2008). Introduction to Boolean algebras. Springer Science & Business Media.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep learning. MIT press.

Goodman, Noah D et al. (2008). “A rational analysis of rule-based concept learning”. In: Cognitive science 32.1, pp. 108–154.

Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long short-term memory”. In: Neural computation 9.8, pp. 1735–1780.

Holman, Eric W et al. (2011). “Automated dating of the world’s language families based on lexical similarity”. In: Current Anthropology 52.6.

Hyman, Larry M (2008). “Universals in phonology”. In: The linguistic review 25.1-2, pp. 83–137.

Kingma, Diederik P and Jimmy Ba (2014). “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980.

MacKay, David J. C. (2003). Information theory, inference and learning algorithms. Cambridge University Press.

Robinson, AJ and Frank Fallside (1987). The utility driven dynamic error propagation network. University of Cambridge Department of Engineering.

Schilling, René L (2005). Measures, Integrals and Martingales. Cambridge University Press.

Shannon, Claude Elwood (1948). “A mathematical theory of communication”. In: Bell system technical journal 27.3, pp. 379–423.

Snedecor, George W and William G Cochran (1989). Statistical Methods, eighth edition. Iowa State University Press, Ames, Iowa.

Steinert-Threlkeld, Shane and Jakub Szymanik (In press). “Learnability and semantic universals”. In: Semantics and Pragmatics.

Werbos, Paul J (1988). “Generalization of backpropagation with application to a recurrent gas market model”. In: Neural networks 1.4, pp. 339–356.
