Modelling the assertability of generic sentences using positive operators

Layout: typeset by the author using LaTeX.

Modelling the assertability of generic sentences using positive operators

Barry Servaas
10764321

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. M.A.F. Lewis
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 904
1098 XG Amsterdam


Abstract

Generic sentences express general truths which are neither universal nor specific. Due to this lack of specificity, it is hard to determine whether such sentences are acceptable. We propose a new method to determine their assertability, based on the hyponymy relation between positive operators; this choice is motivated by the past success of hyponymy-based measures of association between words. Here, the hyponymy relation is determined between parts of the sentence instead of between words. To calculate the final assertability of a sentence, an assertability metric based on conditional probability is used. In the end, no strong correlation could be established between the calculated assertability of sentences and their assessed scores.


Chapter 1

Introduction

Generic sentences are sentences such as ‘Dogs bark,’ or ‘Ducks lay eggs,’ which express general truths about the world. These truths are neither specific, as in ‘a dog barks,’ nor universal, as in ‘all dogs bark’ [Carlson, 1982, 1989, 1995].

Much commonsense knowledge is encoded in these sentences. Their validity, however, is hard to determine using typical rules, because many of these valid sentences are more often false than true.

As an example, the sentence ‘Ducks lay eggs,’ would generally be considered valid, despite the fact that only female ducks lay eggs, and only under specific circumstances. Assuming an equal gender ratio, fewer than half of all ducks lay eggs. Under a simple majority rule for determining validity, this sentence would be invalid. In spite of this, the sentence is still considered valid, as are even less probable generic sentences.

Because of this, it is difficult to validate a generic sentence based on observations of reality. Kochari et al. [2020] propose a range of methods for quantifying the validity of generic sentences. These are based on finding alternatives to parts of the sentence: the first method is based on an alternative interpretation of the attribute part, and the second on finding alternatives to the subject that it can be compared to. However, Kochari et al. [2020] do not give concrete methods for building word or sentence representations from which the validity of these sentences can be calculated.


Here, this association will be measured by modelling the meaning of words as vectors. The contents of these vectors are based on the usage of the words in natural language and are used to represent their meaning [Landauer and Dumais, 1997, Mitchell and Lapata, 2008]. This is based on the assumption that words which occur in similar contexts have similar meanings. These vectors are used because there are consistent methods to construct them, because they make it possible to calculate the similarity between two words, and because they have proven successful in the past. Here, these vectors will be used in a computational approach to measuring this association, and also in measuring the graded hyponymy between the two parts of a sentence.

A system capable of reliably predicting whether a generic sentence is true can be used in systems that need to fill gaps in their knowledge base using a deductive approach. This is because basing the assertability of a sentence on word meanings simply reasons further on the knowledge the system already has - in contrast to methods that are based on observing the natural world.

The goal of this thesis is to find a method for asserting whether a sentence is acceptable. The outline is as follows: chapter 2 discusses prior research on the assertability of generic sentences and on measuring association with positive operators. Chapter 3 lays out the specific system developed for this work. Chapter 4 presents the resources used in the experiments and the experimental protocols. Chapter 5 presents the results. Finally, chapter 6 discusses these results and draws a conclusion.


Chapter 2

Background

2.1 An Alternative-based Approach to Generics

In Kochari et al. [2020], a range of methods for validating generic sentences, based on conditional probability, is proposed. This is because a simple majority rule does not work for most generic sentences: as stated in the introduction, a sentence such as ‘Ducks lay eggs,’ would not pass under such a rule. Because of this, two new approaches to validating generic sentences are proposed.

The notion of a set of alternatives is a key idea here. A set of alternatives is a collection of words that can be considered substitutes for a part of the sentence. One could have a set of alternatives to a noun: for an animal, for example, a collection of other animals could take its place. For a verb phrase, similar verb phrases could substitute it.

The first method described here allows for alternative meanings of the verb phrase, to make the sentence more precise with respect to what is assumed to be meant by it. E.g., an alternative to the sentence ‘Birds lay eggs,’ would be ‘Birds that are capable of giving live birth lay eggs.’ The equation corresponding to this method is shown in equation 2.1. To determine the truth of the sentence ‘Birds lay eggs,’ G would then be set to “birds”, f to “lay eggs”, and Alt(f) to the alternative “give live birth”.

P(f | G ∩ ⋃Alt(f)) > 1/2  (2.1)

Equation 2.1: The first method of Kochari et al. [2020] for determining the assertability of a sentence ‘Gs f.’ Here, G is the subject of the sentence, f the attribute ascribed to G, and Alt(f) the set of alternatives to f. The formula is the conditional probability of the verb phrase f for a subject G narrowed down to those considered eligible under the requirements of Alt(f).


This method is rejected, however, as there are many cases where it still does not yield a valid sentence. An example is the sentence ‘Crocodiles die before they reach two weeks of age,’ which is generally true but should be considered invalid.

The second method described instead uses alternatives to the subject of the sentence. Here a generic sentence is taken to imply that the subject of the sentence is more likely to make the sentence true than other, comparable subjects. For example, in the case of the sentence ‘Birds lay eggs,’ most birds do not lay eggs; compared to other animals such as dogs or horses, however, birds are far more likely to lay eggs.

The calculation for this therefore measures the probability against that of an alternative sentence. The corresponding equation is equation 2.2. If the sentence ‘Birds lay eggs,’ is used in this equation, G is replaced with birds, f with laying eggs, and Alt(G) with the set of alternatives to birds: a list of other animals.

Assertability(‘Gs f’) = P(f | G) − P(f | ⋃Alt(G))  (2.2)

Equation 2.2: The method of deciding assertability used here. G represents the subject, f the attribute ascribed to G, and Alt(G) the set of alternatives to G.

A variant of this equation will be used in this project. Since no real probabilities are used in this project, a substitute quantity takes the place of the probability, as will be shown in equation 3.1.
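As a toy illustration of equation 2.2, the calculation looks as follows; the numbers below are invented for illustration and not taken from any dataset:

    # Hypothetical probabilities for 'Birds lay eggs' (equation 2.2)
    p_f_given_G   = 0.50   # assumed P(lay eggs | birds)
    p_f_given_alt = 0.25   # assumed P(lay eggs | union of alternatives: dogs, horses, ...)

    assertability = p_f_given_G - p_f_given_alt
    print(assertability)   # 0.25 > 0: the sentence comes out as assertable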

2.2 Representing words as positive operators

Representing words as vectors has many benefits, such as the possibility to calculate the similarity between two vectors. However, there is no clear way to combine these vectors into sentences, or to assess the validity of sentences with them. Therefore, we represent words as positive operators, following Lewis [2019].

A positive operator is a matrix M with the following properties:

• ⟨v, Mv⟩ ≥ 0 for all v ∈ V

• M is self-adjoint.

To represent words as positive operators, each word is first viewed as a collection of its hyponyms. A hyponym of a word is a more specific version of that word; for example, “poodle” is a hyponym of “dog.” For a word, a vector is obtained for each hyponym, and the operator of the word is then constructed using equation 2.5. An example of such positive operator creation is shown in figure 2.1.¹

¹ In the thesis, if there are no hyponyms present, a word is represented by its own vector after it is transformed into an operator using equation 2.4.


Once the words have a positive operator representation, these representations can be placed in a Löwner order by means of equation 2.3: if D is a positive matrix, then A ⊑ B. This ordering is interpreted as hyponymy.

D = B − A  (2.3)

The graded hyponymy measure was introduced by Lewis [2019], who attempts to find a measure of hyponymy between two words that is graded instead of absolute.

An issue with the regular usage of hyponymy is that D will not always be strictly positive; it can also be partially positive. This is because in natural language, a hyponymy relation between two concepts is usually not meant to be absolute. The hyponymy used here is therefore graded instead of absolute, to reflect this real-life situation. A good example is that dogs are often pets, but not always - here the hyponymy would need to be graded. Because of this, a measure is needed to evaluate the degree of hyponymy.

Lewis [2019] also describes the usage of positive operators: these word representations are constructed by taking all hyponyms of a word and transforming the vectors of these hyponyms together into one positive operator.

         Furry  Pursuit  Action  Ears  Barks  Meows
    Dog    4       1       0      5     10      0
    Cat    8       0       0      5      0     10

    doḡ =
       16   4   0   20   40    0
        4   1   0    5   10    0
        0   0   0    0    0    0
       20   5   0   25   50    0
       40  10   0   50  100    0
        0   0   0    0    0    0

    cat̄ =
       64   0   0   40    0   80
        0   0   0    0    0    0
        0   0   0    0    0    0
       40   0   0   25    0   50
        0   0   0    0    0    0
       80   0   0   50    0  100

    animal̄ = doḡ + cat̄ =
       80   4   0   60   40   80
        4   1   0    5   10    0
        0   0   0    0    0    0
       60   5   0   50   50   50
       40  10   0   50  100    0
       80   0   0   50    0  100

Figure 2.1: Example of how positive operators are obtained. In reality, the vectors used are far more complex than the ones used here, as they need to accommodate more definitions.

v̄ := v vᵀ  (2.4)

Equation 2.4: A simple method to obtain positive operators. Here v is the vector representation of a word, and v̄ is the corresponding positive operator representation.

z̄ := {v, w, x} ↦ v̄ + w̄ + x̄  (2.5)

Equation 2.5: A more general method to obtain positive operators. v̄, w̄, and x̄ are the positive operator representations of the hyponyms of the original word, obtained using equation 2.4.
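Equations 2.4 and 2.5 can be sketched in a few lines of numpy; the function names are ours, and the word vectors are assumed to be given:

    import numpy as np

    def vector_to_operator(v):
        """Equation 2.4: turn a word vector into a rank-1 positive operator."""
        return np.outer(v, v)

    def word_operator(hyponym_vectors):
        """Equation 2.5: sum the rank-1 operators of a word's hyponyms."""
        return sum(vector_to_operator(v) for v in hyponym_vectors)

    # Reproduces the 'animal' operator of figure 2.1:
    dog = np.array([4, 1, 0, 5, 10, 0])
    cat = np.array([8, 0, 0, 5, 0, 10])
    animal = word_operator([dog, cat])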

As was stated before, this hyponymy measure is graded and therefore needs to be evaluated further. To evaluate this hyponymy, Lewis introduces two new measures: k_AB and k_E. k_AB is not used here.

k_E works here as a stand-in for the term 1 − P(A ∩ ¬B)/P(A), which is used in calculating conditional probabilities. The reason why this should work is explained further in section 3.1. k_E is constructed by first diagonalizing B − A, which results in a real-valued diagonal matrix. After that, a matrix E is made by setting all positive eigenvalues of the former to 0 and turning all negative eigenvalues positive. This E is then sufficient to turn D = B − A + E positive. After this, equation 2.6 is used to create k_E.

k_E = 1 − ‖E‖ / ‖A‖  (2.6)

Equation 2.6: The equation used to compute k_E. Here, ‖·‖ denotes the Frobenius norm.

Finally, this results in k_E as a measure of graded hyponymy. k_E ranges from 0 to 1, where 1 means a complete overlap between A and B, and 0 means no overlap.
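A sketch of the k_E computation as defined above (the helper name is ours; numpy's eigh handles the diagonalization, and numpy's matrix norm defaults to the Frobenius norm):

    import numpy as np

    def k_e(A, B):
        """Graded hyponymy k_E = 1 - ||E||/||A|| (equation 2.6)."""
        # Diagonalize B - A; eigh applies to self-adjoint matrices.
        eigvals, eigvecs = np.linalg.eigh(B - A)
        # E flips the negative eigenvalues and zeroes the rest,
        # so that D = B - A + E is positive.
        neg = np.where(eigvals < 0, -eigvals, 0.0)
        E = eigvecs @ np.diag(neg) @ eigvecs.T
        return 1.0 - np.linalg.norm(E) / np.linalg.norm(A)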

Lewis [2019] also introduces several equations to combine two words into a single operator:

[word1 word2] = [word1] ⊙ [word2]  (2.7)

[word1 word2] = [word1]^(1/2) [word2] [word1]^(1/2)  (2.8)

[word1 word2] = [word2]^(1/2) [word1] [word2]^(1/2)  (2.9)

Here, the first equation is point-wise multiplication between the operators. The second and third equations are forms of matrix multiplication under which the result does not lose its self-adjoint property.
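A sketch of these combination methods (scipy.linalg.sqrtm provides the matrix square root; for positive operators its result is real up to numerical noise):

    import numpy as np
    from scipy.linalg import sqrtm

    def combine_pointwise(A, B):
        """Equation 2.7: element-wise (Hadamard) product."""
        return A * B

    def combine_sandwich(A, B):
        """Equation 2.8: A^(1/2) B A^(1/2); swap the arguments for equation 2.9."""
        root = np.real(sqrtm(A))   # discard spurious imaginary parts
        return root @ B @ root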

2.3 Other related works

The expectation for this research is based on the thesis of Bosscher [2020], where association between words is examined based on the hyponyms of those words, in order to capture implicit biases. This is done by examining the graded hyponymy between the two. That method is also used here; the difference is that the validity of generic sentences is examined, instead of the association between two words.


Chapter 3

Method

3.1 Theory

As was stated before, to determine the assertability of a sentence, an alternative measure of conditional probability, k_E(A, B), is used here. This is done using equation 2.2 from Kochari et al. [2020]. This works because the calculations of the conditional probability and of k_E are very similar, as this section elaborates.

In the case of a conditional probability,

P(B|A) = P(A ∩ B) / P(A).

This can then be rewritten as

P(A ∩ B) / P(A) = (P(A ∩ B) + P(A ∩ ¬B) − P(A ∩ ¬B)) / P(A)
                = P(A) / P(A) − P(A ∩ ¬B) / P(A)
                = 1 − P(A ∩ ¬B) / P(A).

The k_E measure is made by taking the matrix that results from B − A, where A and B are the positive operators of the subject and the verb phrase respectively, and making E the smallest matrix sufficient to turn D = B − A + E positive. E is in effect the error term that makes B − A positive; it is therefore upper bounded by the size of A, as B − A + A is always positive. E is thus roughly equivalent to P(A ∩ ¬B), which means that

P(B|A) = 1 − P(A ∩ ¬B) / P(A) ≈ 1 − ‖E‖ / ‖A‖ = k_E(A, B).

Figure 3.1: [Two Venn diagrams: one labelled with A, B, A ∩ B, and A ∩ ¬B, the other with A, B, A ∩ B, and E.] The diagram shows that the background of the k_E measure is similar to that of the regular conditional probability.

3.2 Assembling the sentence representations

Once both the subject and sentence operator are composed, the hyponymy relation between them can be determined. This is done by placing them in a Löwner order [Löwner, 1934], as described in section 2.2. This order is determined by subtracting one matrix from the other, as shown in equation 2.3.

An important note on the hyponymy measure used here is that it was made with the intention of being used on individual words, as in Lewis’ example. In this work it is assumed that it can also be used to evaluate parts of sentences, and that these can be hyponymous to one another. For example, in equation 2.2 the quantity P(f|G) is calculated, where G is a noun, such as “birds”, and f is a verb phrase, such as “lay eggs”. In order to apply the measure k_E to “birds” and “lay eggs”, a single operator needs to be generated for “lay eggs”. To do this, the operators for “lay” and for “eggs” are combined, following equations 2.7 - 2.9.

The k_E acquired from this process is the equivalent of P(f|G) from equation 2.2.
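A sketch of this step, reusing the hypothetical helpers from chapter 2 (word_operator, combine_pointwise, k_e) on assumed lists of hyponym vectors:

    # Operators built as in equation 2.5 (the *_hyponym_vectors lists are assumptions):
    birds = word_operator(bird_hyponym_vectors)   # subject operator G
    lay   = word_operator(lay_hyponym_vectors)
    eggs  = word_operator(egg_hyponym_vectors)

    # A single operator for the verb phrase "lay eggs" (here via equation 2.7):
    lay_eggs = combine_pointwise(lay, eggs)

    # Stand-in for P(f | G) in equation 2.2:
    p_f_given_g = k_e(birds, lay_eggs)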

3.3 Evaluation

While k_E could already be used as the assertability of a sentence, it does not have all the properties that are desired here. The assertability of a sentence ‘Gs are f’ should be high even if the probability of it being true under the given circumstances is low, as long as the probability of ‘Gs are f’ is still higher than in every other circumstance.

To account for this, alternative sentences ‘Alt(Gs) are f’ are generated for every sentence ‘Gs are f’. The intended effect is that a sentence such as ‘Dogs bark,’ has as alternative ‘Animals bark.’ In accordance with equation 2.2, we then calculate P(bark|dogs) − P(bark|animals).

After this, the alternative sentence is processed as normal and its k_E is obtained as well. The assertability is then calculated using the following equation:

Assertability = k_E(A, B) − k_E(⋃Alt(A), B)  (3.1)

A problem with the above equation is that the assertability can turn out lower than intended if the k_E of the alternative subject is close to 1. Therefore, another assertability measure is proposed to account for this low potential:

Relative assertability = (k_E(A, B) − k_E(⋃Alt(A), B)) / (1 − k_E(⋃Alt(A), B))  (3.2)

The assertability ranges between -1 and 1, while the relative assertability takes values from −∞ to ∞. For both, however, lower values mean a low assertability and higher values mean a high assertability.
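Equations 3.1 and 3.2 as a short sketch, again using the hypothetical k_e helper:

    def assertability(A, B, alt_A):
        """Equation 3.1: difference with the alternative subject's k_E."""
        return k_e(A, B) - k_e(alt_A, B)

    def relative_assertability(A, B, alt_A):
        """Equation 3.2: rescale by the headroom left above the alternative."""
        base = k_e(alt_A, B)
        return (k_e(A, B) - base) / (1.0 - base)   # diverges as base approaches 1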

In the end, there are three measures by which to evaluate effectiveness: conditional probability k_E, assertability, and relative assertability.


Chapter 4

Experiments and Resources

4.1 Experiments

The results of most of these experiments will be measured with Spearman’s rank correlation coefficient, in order to discover the correlation between the result and the assessed score. The ROC/AUC measure will also be used, to a more limited degree.
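Both measures are available off the shelf; a sketch with scipy and scikit-learn, on hypothetical paired lists of computed values and assessed scores:

    from scipy.stats import spearmanr
    from sklearn.metrics import roc_auc_score

    computed = [0.42, -0.10, 0.75]   # hypothetical assertability values
    assessed = [0.60, 0.05, 0.90]    # hypothetical assessed scores

    rho, p_value = spearmanr(computed, assessed)

    # For ROC/AUC the scores are binarized (see section 4.1.6):
    labels = [score >= 0.6 for score in assessed]
    auc = roc_auc_score(labels, computed)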

4.1.1 DummyKB

A dummy set was made for testing purposes, to test whether the system was functioning properly.

The contents of the dummy set are sentences of the form ‘X is X,’ for true sentences, and ‘X is Y,’ where X is a concept unrelated to Y, for false sentences. These sentences have truth values of 0.999 and 0.001 respectively.

The reason this is included in the experiments is that the dataset is much smaller than GenericsKB and therefore gives more insight into its results. Every operator combination will be used on it. It is expected that this will show much stronger correlations than any other method, as all sentences in this dataset are either trivially true or trivially false, while every other dataset has many sentences with a questionable truth value. This also makes it easier to understand when a result is unpredictable. Because of the small size of the dataset, it is also viable to work with higher-dimensional GloVe vectors; therefore, these are also tested on the dummy dataset, unlike with the other datasets.

The full Dummy dataset is written down in Appendix A.


4.1.2 GenericsKB

Generic sentences used in the system are obtained from GenericsKB [Bhakthavatsalam et al., 2020], where the system is evaluated on the GenericsKB set and the GenericsKB-best set. These sentences are supplied with their source, term, optional quantifier, score, and the sentence itself. For each sentence, the supplied term, sentence, and truth value are used. The term seems to correspond closely to the grammatical subject and is used as such. All sentences longer than three words are removed, as these would require more complex handling of grammar. Furthermore, all sentences with a truth value of 1.0 are removed, because these are usually - possibly always - automatically generated, which would skew the results, as the system was designed with natural language in mind.
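A sketch of this filtering, assuming the dataset has been loaded into a pandas DataFrame; the file path and column names are ours, not necessarily those of the GenericsKB release:

    import pandas as pd

    kb = pd.read_csv("GenericsKB.tsv", sep="\t")   # assumed path and format

    # Keep only sentences of at most three words ...
    kb = kb[kb["SENTENCE"].str.split().str.len() <= 3]
    # ... and drop machine-generated entries with a score of exactly 1.0.
    kb = kb[kb["SCORE"] < 1.0]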

The operator values used for the words in this system are obtained from GloVe [Pennington et al., 2014]. The values used here are taken from the 50-dimensional and the 100-dimensional sets, because higher-dimensional sets took too long to process.

The hyponymy and hypernymy relations used in this model are obtained from WordNet [Miller, 1998]. These are used by going down one layer of hyponymy to construct word operators, and by going up two layers to determine the scope in which alternative subjects are searched for, to be used in the alternative sentences.
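A sketch of these WordNet lookups with NLTK; the traversal over all synsets mirrors the all-definitions choice described in section 4.3:

    from nltk.corpus import wordnet as wn

    def hyponym_lemmas(word):
        """One layer of hyponyms, across all synsets of the word."""
        lemmas = set()
        for synset in wn.synsets(word):
            for hyponym in synset.hyponyms():
                lemmas.update(lemma.name() for lemma in hyponym.lemmas())
        return lemmas

    def two_layers_up(word):
        """Hypernym synsets two layers up, used to scope alternative subjects."""
        ancestors = []
        for synset in wn.synsets(word):
            for parent in synset.hypernyms():
                ancestors.extend(parent.hypernyms())
        return ancestors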

4.1.3 Higher quality sentences

Included among the GenericsKB datasets is a set consisting of sentences that were picked for being of high quality, namely those with a score higher than 0.23. These are evaluated separately, and are expected to show a stronger correlation.

4.1.4 WordNet and TupleKB generated sentences

Earlier it was stated that the sentences with a score of 1.0 were discarded, as these were machine-generated. Despite this, they will still be evaluated separately to obtain more insight into what data correlates well with the system. Here, it is expected that a stronger correlation will be found than with the normal settings, because the machine-generated sentences appear to mostly be bare plural sentences, and are therefore much closer to the intentions of the system than the normal dataset.

4.1.5 Different methods of operator combination

Three methods were introduced to combine operators into a sentence operator (equations 2.7 - 2.9). All three of these will be experimented with. Due to time constraints, only the pointwise method will be used on the WordNet/TupleKB data. There is no a priori reason to expect any one method to perform better.


4.1.6 ROC/AUC

A final test will be performed on a separately made dataset that only contains data with a score of at least 0.6 or at most 0.1. The intention is to test the performance of the system at dividing the results into the binary categories of true and false. The computed values are therefore compared to a true or false label, where true sentences have a score of at least 0.6 and false sentences a score of at most 0.1. The set of sentences is then assigned a ROC-AUC score. Due to time constraints, this will only be done for the pointwise multiplication combination method. Important to note for this dataset is that a large majority of its sentences are labelled as false.

4.2 Resources

The data used to determine the hyponymy relation between the two parts of the sentence is obtained from two databases: WordNet [Miller, 1998] and GloVe [Pennington et al., 2014]. WordNet is primarily used to obtain the hyponyms of the words used in the sentence, while the values from GloVe are used in constructing vector representations of the words. The generic sentences themselves are obtained from GenericsKB [Bhakthavatsalam et al., 2020].

4.3 Building Operators

Before the sentences can be analyzed, they must be made suitable for that process, by removing all symbols that are not present in the dataset of words; this ensures that as many sentences as possible can be processed. To do this, all sentences are turned to lowercase and all non-alphanumeric symbols are removed. After this, all words are turned into their lemma form, the form one would find in a dictionary; e.g., the word “birds” is turned into “bird”. Then the subject of the sentence is separated from the sentence, leaving two parts to be analyzed: the ‘subject’ part and the ‘sentence’ part. All sentences containing words for which no information can be found are discarded.
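A sketch of this preprocessing; the WordNet lemmatizer is a plausible choice given the project's WordNet dependency, though the original implementation may differ:

    import re
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def preprocess(sentence):
        """Lowercase, strip non-alphanumeric symbols, and lemmatize."""
        sentence = sentence.lower()
        sentence = re.sub(r"[^a-z0-9 ]", "", sentence)
        return [lemmatizer.lemmatize(word) for word in sentence.split()]

    print(preprocess("Birds lay eggs."))   # ['bird', 'lay', 'egg']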

Once a sentence has been made suitable for analysis, the words can be transformed into their operator representations. To obtain the positive operator representations of the words, Lewis’ method described in section 2.2 is used. For this project it was decided that after the positive operator representations have been made, they are normalized by dividing each matrix by its highest eigenvalue. This is done because results would otherwise be unfairly skewed: an operator to which many different hyponyms are added would be weighed much more heavily than it should be.
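The normalization step in numpy (eigvalsh returns the eigenvalues of a self-adjoint matrix in ascending order):

    import numpy as np

    def normalize_operator(M):
        """Divide a positive operator by its largest eigenvalue."""
        return M / np.linalg.eigvalsh(M)[-1]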


It should be noted here that a word can have several definitions, and these definitions have different hyponyms. The ideal for this research would be to always take only the intended definition, but that was unfeasible. Therefore, we decided to add the operators of all hyponyms of all definitions together to make the word operator.

Once the words have a positive operator representation, the words from the verb phrase are combined so that only two operators remain. This is done according to equations 2.7 - 2.9; the specific equation used depends on the experimental setting. Only sections of two or more words are combined, because a longer sentence would become more complex, and this needless complexity would interfere with the results.

Once only two operators of the sentence are left, the k_E can be made by placing these in a Löwner order and creating the k_E out of its error term, as was shown in section 2.2. The k_E of the alternative sentences is based on the same principles.

Alternative sentences are generated by replacing the subject of the original sentence with an alternative subject. This alternative subject is obtained by taking the original subject and going two hypernymous layers up in its WordNet hierarchy. Then all words that are two hyponymous levels down from that higher layer are taken, and their corresponding positive operators are accumulated to construct the alternative subject. This means that an alternative subject is not something concrete such as “animals”, as would be desired; instead it is a more abstract accumulation of words that are close to the original word in the WordNet hierarchy.
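A sketch of this construction, combining the earlier hypothetical helpers (two_layers_up, word_operator) with a glove dictionary that is assumed to map lemmas to vectors:

    def alternative_subject_operator(word, glove):
        """Accumulate operators of all words two hyponymous levels below
        the hypernyms two layers above the original subject."""
        vectors = []
        for ancestor in two_layers_up(word):
            for child in ancestor.hyponyms():
                for grandchild in child.hyponyms():
                    for lemma in grandchild.lemmas():
                        if lemma.name() in glove:
                            vectors.append(glove[lemma.name()])
        return word_operator(vectors)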


Chapter 5

Results

5.1 Spearman’s rank correlation coefficient

Table 5.1 shows the dummy results. Initial exploration shows that the strongest correlation generally comes from the relative assertability metric. In general, the pointwise multiplication method seems to have the strongest correlation results, except in the case of regular assertability - but those results are unreliable. Also of note is that in the cases of matrix multiplication, the relative assertability’s correlation is much lower than with pointwise multiplication. Oddly, the k_E results of the pointwise multiplication are also much more positive than those of the other multiplication methods. On the other hand, the regular assertability metric performs much worse here - though this is largely explained by the uncertainty value of these results.

For the GenericsKB database, there seems to be little correlation on any variant between assertability and the assessed score of the sentences. It is notable that all the relative assertabilities shown seem to be inversely correlated with the assessed scores. The strongest correlation is on the high quality knowledge base with the relative assertability metric on pointwise multiplication, but this too is inverse.

It was expected that increased dimensionality would increase correlation, as this would add nuance to the data. This was not significantly shown, however.

It is notable that every k_E shown has an inverse correlation, even in the cases of the dummy datasets. This finding is actually consistent with the theory, however. The k_E is similar to the conditional probability, and it was stated before that valid generic sentences are in many cases more often false than true - a sentence’s real assertability is only supposed to show when compared to alternative cases.

[word1] ⊙ [word2]                    k_E       Assertability   Rel. Assertability
50d Dummy                            0.344     0.0313*         0.751
100d Dummy                           0.751     -0.0626*        0.751
200d Dummy                           0.595     0.0626*         0.751
300d Dummy                           0.720     0.282           0.751

[word1]^(1/2)[word2][word1]^(1/2)    k_E       Assertability   Rel. Assertability
50d Dummy                            -0.376    0.532           0.454
100d Dummy                           -0.313    0.407           0.580
200d Dummy                           -0.376    0.470           0.532
300d Dummy                           -0.470    0.564           0.344

[word2]^(1/2)[word1][word2]^(1/2)    k_E       Assertability   Rel. Assertability
50d Dummy                            -0.376    0.501           0.517
100d Dummy                           -0.282    0.532           0.658
200d Dummy                           -0.376    0.532           0.0314*
300d Dummy                           -0.564    0.564           0.344

Table 5.1: These tables show the Spearman’s rank correlation coefficient of each measure of assertability with DummyKB, using the combination method shown in the top left. “50d” means that the 50-dimensional GloVe vectors were used. A result marked with * is considered unreliable because of a high uncertainty value.

[word1] ⊙ [word2]                    k_E        Assertability   Rel. assertability
50d                                  -0.16      0.00398         -0.0391
100d                                 -0.145     0.0297          -0.00624
50d best                             -0.0397    -0.0982         -0.126
100d best                            -0.0834    -0.0479         -0.0711
50d WordNet                          -0.198     -0.050          -0.129
100d WordNet                         -0.274     0.00693         -0.0861
50d best WordNet                     -0.00436   0.00609         0.00321
100d best WordNet                    0.0641     0.0151          0.00800

[word1]^(1/2)[word2][word1]^(1/2)    k_E        Assertability   Rel. assertability
50d                                  -0.103     0.0154          -0.0198
100d                                 -0.131     0.00219         -0.0234
50d best                             -0.146     -0.00853*       -0.0779
100d best                            -0.145     0.0147          -0.0547

[word2]^(1/2)[word1][word2]^(1/2)    k_E        Assertability   Rel. assertability
50d                                  -0.158     0.0127          -0.0807
100d                                 -0.184     0.0316          -0.052
50d best                             -0.0521    -0.0316         -0.0765
100d best                            -0.0968    0.0226          -0.0315

Table 5.2: This table shows the Spearman’s rank correlation coefficient of each measure of assertability with each dataset, using the combination method shown in the top left. “50d” means that the 50-dimensional GloVe vectors were used. “WordNet” means that the dataset was not filtered to exclude the sentences with a truth value of 1.0, and “best” means that the higher-quality GenericsKB-best dataset was used. A result marked with * is considered unreliable because of a high uncertainty value.

The two methods of matrix multiplication show no real difference in correlation, but the pointwise method does show a slightly stronger correlation in several cases. Notable is also that the pointwise method produced the same relative assertability on every dummy test. It is, however, questionable whether this is significant enough to claim with certainty that pointwise multiplication is the better method.

Among the three ‘natural’ datasets, WordNet seems to perform the best, but not significantly so. The ‘best’ dataset also seems to perform slightly better than the plain GenericsKB set. This is in line with the prediction, but the difference in performance between best and WordNet is not significant enough to draw a conclusion. Notable is that if the best dataset is combined with not filtering out the WordNet sentences, the correlation is positive.

In general, with the GenericsKB sets the correlations are so weak that no real conclusion can be drawn from them.


5.2 ROC/AUC

[word1] ⊙ [word2]    k_E      Assertability   Rel. Assertability
50d                  0.352    0.454           ?
100d                 0.317    0.418           ?
50d Dummy            0.25     0.146           0.182
100d Dummy           0.292    0.229           0.091
200d Dummy           0.25     0.187           0.187
300d Dummy           0.187    0.125           0.396

Table 5.3: This table shows the ROC-AUC score of each measure of assertability with each dataset. Only the pointwise multiplication method was used. “50d” means that 50-dimensional GloVe vectors were used. “Dummy” means that the dummy dataset was used instead of the normal GenericsKB set. A result with a ? instead of a number could not be obtained.

In table 5.3, the results of the ROC/AUC analysis are shown. There are no results for the GenericsKB sets on relative assertability, as this metric can return infinity, which the measure cannot process. It is shown here that the results for the dummy dataset are actually much worse than those on the GenericsKB version, while the opposite is true for the Spearman’s coefficient. The finding that the system classifies more sentences wrongly than rightly is consistent with the Spearman’s correlation findings for GenericsKB, which also showed negative correlations. Once again, no significant difference is shown between the different dimensionalities.

5.3 Analysis

5.3.1 Spearman’s coefficient

Since many of the results seem questionable, a sentence is dissected here to see what could have gone wrong. For this, the sentence ‘Cats are pencils,’ is chosen. This sentence is chosen from the dummy set and is clearly intended to be false, but of all false sentences in DummyKB it has the highest relative assertability.

The way the GloVe vectors are chosen works for each word in three layers: the definition layer, the hyponym layer, and the lemma layer. All of these will be examined here.

The word “cat” in this sentence has 10 different possible definitions. While it is intended in the sense of the housecat here, notable wrong definitions used are “guy”, “caterpillar”, “computerized tomography”, and the verb “vomit”.


In the hyponym layer of “cat”, most definitions show no corresponding hyponyms. However, the “big cat” definition of “cat” has 9 hyponyms, while the intended definition has only 2. There is only 1 other hyponym in this layer, so this puts a lot of weight on the wrong definition here.

The final layer is the lemma layer. The intended definition of “cat” has 5 different lemmas associated with its hyponyms; however, only 1 of these corresponds with a GloVe vector. “cat” has 23 different unintended lemmas, of which 9 seem likely to be in the GloVe dataset.

In the end, this means that for the word “cat”, only 1 of the 10 vectors used actually corresponds to the intended definition. If most of the wrong definitions had a similarly high number of hyponyms instead of none, this ratio would be even smaller.

The other words in the sentence have much more predictable definitions, and therefore no questionable GloVe vectors are used for them.

5.3.2 ROC/AUC

The ROC/AUC results were also strange. While the dummy dataset performed well on Spearman’s coefficient, it showed very poor results on the ROC measure. The explanation seems to be that the small sample size made outliers show up more prominently. The ROC measure cannot accept infinity as an answer, and these results were therefore removed from the results for relative assertability; one of them was one of the four sentences marked as true. It is assumed that the results would match the Spearman’s correlation results more closely if the dataset were large enough to be statistically significant.


Chapter 6

Discussion and Conclusion

It is clear that there is little correlation between the acceptability of the sentences and the assertability score as calculated here. There are several possible reasons for this, which are discussed first.

Most words in a sentence have several possible definitions, while only one should be used. It was chosen to take every possible definition into account here, but the sentence would have a more accurate meaning if only the right definition were chosen for each word.

The analysis in the results section also reflects this issue: the inclusion of wrong definitions often gives a sentence a completely wrong meaning.

The knowledge base used has many sentences with a questionable truth value. Many of the sentences can be political in nature, nonsensical, or improperly tagged. It would have been preferred to have a knowledge base that had simpler sentences that are clearly either true or false. The intention of this research had been for sentences that are of a bare plural form, e.g., ‘Dogs bark,’ or, ‘Cats are mammals.’ Future research would benefit from using such a knowledge base with little ambiguity in the correctness of sentences, but currently no other knowledge bases on generic sentences seem to exist.

Only generic GloVe vectors have been used to determine the values of the positive operators. It seems likely that performance would improve if custom vectors were generated, based on the generic sentences dataset used.

For the generation of alternative sentences, moving up two layers to determine the scope of the alternative subjects was chosen, because moving up a single layer was considered too narrow a scope to obtain a suitable alternative. For example, moving up a single layer from the word “dog” would place it in the scope of canines. Moving up a static number of layers is not a good way to obtain the right scope, however, and a dynamic method for determining the scope would generate more suitable alternative sentences.


Many of these sentences are statements about causality. Therefore, in future research it would be productive to find a way to make generic sentences fit in a causal model of calculation.

The assertability of a generic sentence could not be determined well by measuring the association of its component words in this system. However, there are many points where the attempt could be improved. Datasets more specific to this research seem like the most important improvement, as results are highly influenced by their contents. More datasets on generic sentences would also have been useful, as the usage of a single dataset makes the results heavily biased towards that set.


Bibliography

Greg N. Carlson. Generic terms and generic sentences. Journal of Philosophical Logic, 11(2):145–181, 1982.

Greg N. Carlson. On the semantic composition of English generic sentences. In Properties, Types and Meaning, pages 167–192. Springer, 1989.

Gregory N. Carlson. Truth conditions of generic sentences: Two contrasting views. The Generic Book, pages 224–237, 1995.

Arnold Kochari, Robert van Rooij, and Katrin Schulz. Generics and alternatives. Frontiers in Psychology, 11, 2020.

Thomas K. Landauer and Susan T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211, 1997.

Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, 2008.

Martha Lewis. Compositional hyponymy with positive operators. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 638–647, 2019.

Jelle Bosscher. Capturing implicit biases with positive operators, 2020.

Karl Löwner. Über monotone Matrixfunktionen. Mathematische Zeitschrift, 38(1):177–216, 1934.

Sumithra Bhakthavatsalam, Chloe Anastasiades, and Peter Clark. GenericsKB: A knowledge base of generic statements. arXiv preprint arXiv:2005.00660, 2020.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

George A. Miller. WordNet: An Electronic Lexical Database. MIT Press, 1998.


Appendices


Appendix A

DummyKB

SOURCE  TERM     QUANTIFIER  GENERIC SENTENCE      SCORE
Dummy   Dogs                 Dogs are Dogs         0.999
Dummy   Cats                 Cats are Cats         0.999
Dummy   Cats                 Cats are Dogs         0.001
Dummy   Dogs                 Dogs are Cats         0.001
Dummy   People               People are People     0.999
Dummy   People               People are Dogs       0.001
Dummy   People               People are Cats       0.001
Dummy   Cats                 Cats are People       0.001
Dummy   Dogs                 Dogs are People       0.001
Dummy   Pencils              Pencils are Pencils   0.999
Dummy   Pencils              Pencils are Dogs      0.001
Dummy   Pencils              Pencils are Cats      0.001
Dummy   Pencils              Pencils are People    0.001
Dummy   Dogs                 Dogs are Pencils      0.001
Dummy   Cats                 Cats are Pencils      0.001
Dummy   People               People are Pencils    0.001
