Capturing Implicit Biases With Positive Operators

(1)

Capturing Implicit Biases With

Positive Operators

(2)

Layout: typeset by the author using LA_TEX.

(3)

Capturing Implicit Biases With Positive

Operators

Jelle M. Bosscher 10776583 Bachelor thesis Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam Faculty of Science

Science Park 904 1098 XH Amsterdam

Supervisors

dr. M.A.F. (Martha) Lewis dr. K. (Katrin) Schulz

Institute for Logic, Language and Computation Faculty of Science

University of Amsterdam Science Park 107 1098 XG Amsterdam

(4)

Abstract

The Implicit Association Test (IAT) is able to identify implicit biases held by humans with a categorization task. The test shows that, among other things, humans associate flowers more with pleasant attributes, and insects more with unpleasant attributes. In response to these findings,Caliskan et al.(2017) developed the Word Embeddings Association Test (WEAT) to extract the same implicit biases from text corpora using machine learning methods. The WEAT uses word embeddings to successfully identify the same implicit biases. However, the IAT is based on a categorization task, whereas the WEAT uses word similarity to measure differences in association. In this thesis, we propose a new method for retrieving these implicit biases from word embeddings. Our new method is based on a measure that actually targets categorization instead of similarity. In order to use this measure we must make use of a different representation of words in distributional semantics: positive operators. We construct these positive operators from two different sources of hyponymy. The resulting method we developed is the Positive Operator Association Test (POAT). The results indicate that by introducing such a measure we are able to outperform the WEAT in the effort of replicating the IAT. Our method holds promise for improving the ability of identifying implicit biases in machine learning methods.

(5)

Acknowledgments

It’s been a long road, but I think I’m finally there. So I want to take the time to thank the people that got me here and, in particular, those who assisted me in writing this thesis.

Firstly, I would like to thank my supervisors for the interesting ideas that have become my thesis and for the effort and time they put into guiding me through this: Dr. Martha Lewis, for the research you are doing is exciting and I appreciate being a part of it and Dr. Katrin Schulz, for without you this thesis would have been an incoherent mess. I would also like to thank my fellow student Emiel Sanders and my brother Hemmo Bosscher for taking the time to proofread this thesis, like they did for so many of my other writing assignments. Additionally, I would like to thank Hemmo Bosscher and Anna Zubkova for allowing me to spend so many days in their home over the past few months where they provided for me and took care of me so I could focus on other things.

This thesis, however, is not just the result of three months of work. It is an accumulation of the last three years in particular. Therefore, I would like to thank my fellow student Jesse de Wringer for our many collaborations and being the student I aspire to be. I would also like to thank Hugh Mee Wong, a fellow student as well, for her appreciation of pretty LA_{TEX formatting or the moments we got too excited about}

logic. But mostly I just want to thank her for being there, I would not have done everything half as well without her.

Lastly, I would like to thank my parents and my sister for all their love and for believing in me.

(7)

CHAPTER 1 Introduction

Modelling words as vectors has been an extremely successful way of representing word meaning in a way that can be programmed into a computer (Landauer and Dumais, 1997; Mikolov, Sutskever, et al., 2013; Pennington et al., 2014). These word vectors have been shown to be affected by the ideas and beliefs of the humans that generated the corpora they are extracted from (Bullinaria and Levy, 2007; Stubbs, 1996). Caliskan et al. (2017) showed that our biases and stereotypical beliefs could also be extracted from these representations. In order to identify implicit biases held by human speakers, the results from a physiological test called the Implicit Association Test (IAT) (Greenwald et al., 1998b) are used as a benchmark. Caliskan et al.(2017) then compared these results to that of their Natural Language Processing (NLP) version of the IAT called the Word Embedding Association Test (WEAT). Utilizing widely adopted distributional models, mainly focusing on GloVe embeddings (Pennington et al., 2014) but also word2vec (Mikolov, Chen, et al., 2013), they were able to replicate every association documented by the IAT that they tested. This leads them to expect that human biases are in general retrievable from statistical properties of language use.

However, the two test methods used to extract implicit biases from human speakers on the one hand and from corpora on the other differ clearly in methodology. The IAT uses a categorization task between a target concept and attribute. The WEAT uses a similarity measure to test for bias in the corpus. There are important differences between linking words (or concepts) by asking whether one can be categorized under the other or asking how similar they are to each other. For instance, a distance measure is symmetric and therefore does not capture the same meaning as an asymmetric measure that is able to categorize target concepts into an attribute category as is done in the IAT. This gives rise to the question whether we can also find the same implicit biases in corpora when the focus lies on extracting information about categorization from them.

The goal of this thesis was to develop such an alternative test for implicit biases in corpora that is based on categorization instead of a similarity measure. Concretely, we used the positive operator representation of words that is capable of graded categorization in terms of hyponymy. To build the representations, we used two different sources of hyponymy and measure the bias using two methods described inLewis(2019) on graded hyponymy. The resulting test method for implicit biases in corpora is called the Positive Operator Association Test (POAT). Our findings indicate that the POAT is indeed able to replicate the results from the WEAT in most of the applicable tests. We will evaluate the quality of this measure by again comparing the predictions it makes with the well-established results of the IAT.

(8)

CHAPTER 2 Background

Before we go into the details of the new method for extracting implicit biases from corpora we have developed, we will first discuss the methodology applied in Caliskan et al. in more detail. This will set the background against which we will develop our proposal and allows us to introduce the key concepts and methods that we used.

2.1 The Implicit Association Test

As stated above, the goal ofCaliskan et al.(2017) was to test whether one can extract biases held by human speakers from the corpora they produce. In order to check whether the methodology makes correct predic-tions, they compared their results to established IAT findings. The IAT is a test within social psychology designed to measure the strength of a person’s subconscious association between mental representations of objects. It is frequently used to test for implicit stereotypes or biases held by a test subject. The way it captures these associations is by measuring the difference in response times in a series of categorization tasks. Each test is slightly different in setup, but uses the same sets of words between which the subject has to make associations: two sets of target concepts and two sets of attribute words. The IAT measures the response time for a test subject to classify the target words with attribute words. The null hypothesis that is tested is that there is no difference in response time between the sets of target words and either of the attribute words. An example of such a test is the pleasant and unpleasant associations with flowers and insects. In the flowers and insects example the attribute words are two sets containing positive and negative words, respectively (e.g., freedom, health, love, .. and accident, death, grief, ..), the target words a set flower species (e.g., aster, clover, poppy, ..) and a set of insect species (e.g., fly, maggot, tarantula, ..). Since most people will behave similarly in this specific example, associating pleasant words more with flowers and negative words more with insects, this served to test the principal assumption underlying the implicit association test (Greenwald et al., 1998a): “That associations can be revealed by mapping two discrimination tasks alternately onto a single pair of responses”. Greenwald et al.(1998b) show that the test subjects have stronger associations between flowers and pleasant words and insect and unpleasant words by using two different measures. The strength of association is shown firstly by measuring the effect size of these timings, which estimate the standardized effect size over both sets of target words and secondly by performing a permutation test, which estimates the likelihood that the null hypothesis should be rejected. Furthermore, many well-known biases have been confirmed by the IAT, including European-American vs. African-American names as targets words and pleasant vs. unpleasant words as attribute words or the difference between math vs. arts target words and male vs. female attributes (Monteith and Pettit, 2011; Nosek et al., 2002a; 2002b).

2.2 The Word-Embeddings Association Test

Caliskan et al.(2017) asked themselves whether the biases that people hold would affect the text corpora they generate and therefore be picked up by computational models of word meaning that are trained on these corpora. To answer this question Caliskan et al. looked at word embeddings (Mikolov, Sutskever, et al., 2013). Word embeddings are vectors whose entries are characterized by the context of the words it encodes. In a relatively simple example fromLandauer and Dumais(1997) called Latent Semantic Analysis (LSA), a vector for a word w is constructed by counting the words that appear in close proximity to w in

(9)

the corpus. Every word is then represented as a vector, with one entry for each word in the vocabulary and the count of how many times those words occur close to it as its elements. These vectors are subsequently reduced in dimensionality, resulting in a representation where these word embeddings encode the context of the word. More complex methods of building word embeddings from corpora are word2vec (Mikolov, Chen, et al., 2013) and GloVe (Pennington et al., 2014). Word2vec uses local information of words encoded as tuples (x, y) (where y appears in the context of x) to generate word embeddings with a two-layer neural network. GloVe, on the other hand, builds vector representations of words by looking at both local and global context (hence the name GloVe: Global Vectors) of the words. All the models share one important feature: the words that have similar meaning are shown to be close together when represented as word embeddings. This notion of similarity is then computable in terms of distance in the vector space of the embeddings. More recently there have been many advancements in a family of embeddings called contextual embeddings. Examples include BERT (Devlin et al., 2018) and ERNIE (Sun et al., 2019). Contextual embeddings can assign the same word different embedding based on its specific context, capturing meaning across multiple varied contexts. Caliskan et al.do not specify why they refrain from using these newer family of embeddings, but supposedly do so because they are interested in capturing bias across a corpus, not in specific context.

The goal of Caliskan et al. (2017) amounted to replicating different demonstrations of implicit human bias fromGreenwald et al.(1998b) with the widely used word embeddings word2vec (Mikolov, Chen, et al., 2013) and GloVe (Pennington et al., 2014). Greenwald et al.(1998b) uses the IAT to identify the implicit biases by comparing the associations between two sets of target words and two sets of attribute words. To replicate these findings they developed a new test called the Word-Embeddings Association Test (WEAT). The WEAT is a statistical test that is based on the IAT, using the same measures but testing on word embeddings instead of test subjects and taking the distance between a pair of vectors as analogous to the reaction time in the IAT. They determine the distance between two vectors using the cosine similarity, a measure of correlation. Their null hypothesis is the same as the one used in the IAT. Using the WEAT Caliskan et. al. attempted to replicate eight different demonstrations of implicit human bias fromGreenwald et al.(1998b). They successfully replicated every association that they tested and highlights the possibility that all implicit human biases are present in the statistical representation of language, and that the WEAT is capable of detecting it. Moreover, the ability of the WEAT to not only show that word embeddings encode stereotyped biases but also other knowledge, such as the pleasantness of a flower or the permanence of physical diseases, support the distributional hypothesis in linguistics: much of what we mean by the meaning of a word can be captured from its context (Sahlgren, 2008).

It is worth emphasizing the fact that the results, the effect sizes and p-values, do not have the exact same interpretation as the IAT because of the difference in ‘test subjects’: the IAT experiments are performed by individual people and their results are averaged over a relatively small group whereas the WEAT bases its results on a very large corpus which in turn is based on the input of millions of people.

However there is a clear methodological problem with the way the WEAT is defined. While the IAT is a categorization task that measures the difference in response time of the test subject when they are tasked to categorize a concept into an attribute category, the WEAT uses similarity as a proxy for these response times. The central question of this thesis is whether we can also find implicit biases when focusing on categorization information that can be extracted from corpora. The goal is to develop a test that does exactly that. This test will be introduced in the next chapter.

(10)

CHAPTER 3 Method

To develop this alternative test we will build on the work ofBankova et al.(2019) andLewis(2019) and use positive operators as representations of word meaning that can be extracted from corpora. We use positive operators because they have an ordering which can be interpreted as categorization.

In this chapter we will firstly introduce the notion of hyponymy, and a graded version, with the help of an example and describe how the hyponyms of a word can be gathered. Afterwards, we introduce positive operators as a method that represent words as collections of hyponyms within a vector based model. Finally, we will use these positive operators to identify the implicit biases in corpora using the same measure of differential association as the WEAT. The entire implementation of the POAT and the reimplementation of the WEAT are built in Python3. The packages that are used consist of NLTK (Loper and Bird, 2002) to interact with WordNet, Pandas (McKinney, 2010) to work with large datasets and NumPy (Oliphant, 2006–) to manipulate large vectors and matrices. The final implementation can be found here: https://github.com/jellebosscher/POAT

3.1 Hyponyms

The notion of categorization is related to subsethood. For example, if we say that oaks are trees, this is true because, in fact, every oak is indeed a tree. This relation is called hyponymy: oak is a hyponym of tree. In standard word embeddings, there is no natural way to represent this relationship, although there have been various approaches to providing measures of hyponymy between word vectors (Kotlerman et al., 2010;

Lenci and Benotto, 2012; Weeds et al., 2014).

The relationship of hyponymy can be represented as subsethood. Given information about all the entities we wish to talk about, and which category they belong to, hyponymy is the subsethood relation. If we represent oak as all the instances that are oaks, and tree as all the instances that are trees, the fact that oak is a hyponym of tree is modeled by: oak ⊆ tree. We may also wish to introduce a graded notion of hyponymy, to reflect the fact that some categories are not perfect subcategories of others: For example, dog is a pet. dog is not fully contained in pet because there are dogs that are not pets.

To build the sets corresponding to each word we gather the information about hyponymy relations from databases such as WordNet (Fellbaum, 1998) or Microsoft’s Concept Graph (Wang et al., 2015; Wu et al., 2012). These databases are arranged as trees, with general categories such as ‘entity’ at the root of the tree and very specific instances at the leaves. To generate the sets corresponding to each word we can take the leaf instances that fall under each word. Another way of generating sets for each word, which will have the same subsethood relation, is to take the set of all hyponyms that fall under a given word. We will later discuss the benefits of this second option. We now go on to describe WordNet and Microsoft’s Concept Graph in detail.

WordNet In contrast to distributional models of language, human-curated models of semantics also exist. One such model is WordNet (Fellbaum, 1998). WordNet is a large lexical database of the English language that groups words into sets of cognitive synonyms called synsets. For each of these synsets the database includes the semantic relations between them in addition to semantic definitions and example usage. Com-bining definition and example usage with synonymy means that WordNet can be seen as a combination of a dictionary and a thesaurus. Together, these relations and synsets produce the WordNet hierarchy. The downward edges in WordNet capture the hyponymy relation and the upward edges the hypernym relation (the reverse of hyponymy). Thus, to construct a set of hyponyms of a word, we gather the transitive closure

(11)

of the hyponymy relation. Meaning that starting from a specific sense in the WordNet hierarchy, all the downward paths are traversed and all senses in these paths are collected together. Furthermore, this is done for all senses in the synset in a given word. Finally resulting in a set containing all hyponyms of all senses of a word.

Microsoft’s Concept Graph Microsoft’s concept Graph (Wang et al., 2015; Wu et al., 2012) is a very large graph consisting of 5,376,526 unique concepts, 12,501,527 unique instances, and 85,101,174 is-a relations. The between the concepts and instances are automatically derived from large corpora using Hearst patterns (Hearst, 1992). Hearst patterns detect hyponym-hypernym relations between pairs of words by looking for specific text patterns. For example: y such as x or y including x and z. The result is the ability to pick out pairs (x,y), (z,y) in a corpus such that x is-a y and z is-a y. To construct a set of hyponyms for a certain word we gather all the concepts that appear in an is-a relation with the specific word.

3.2 Positive operators

We wish to represent a word as a set of hyponyms within a vector-based model. In order to do so, we use the notion of a positive operator. A positive operators over a vector space Rn is a matrix A with the following two properties:

• ∀v ∈ V.h~v, M~vi ≥ 0 • M is self-adjoint

In particular, given any vector ~_{w ∈ R}n_{, the outer product of that vector with itself, ~}_{w ~}_wT_{, forms a positive}

operator, and furthermore positivity is preserved under addition. Positive operators have an ordering to them called the L¨owner ordering (L¨owner, 1934) given by

A ≤ B iff B − A is also positive. (3.1)

We interpret this ordering as hyponymy. In Lewis(2019), two graded versions of the L¨owner ordering are proposed, which can be used to represent graded hyponymy. In order to build positive operators, we use the collections of hyponyms generated from WordNet or Microsoft’s Concept Graph and represent a word by taking all of its hyponyms.

Building the positive operators Since we are replicating the efforts ofCaliskan et al.(2017), this paper uses the same word embeddings as used for the WEAT1_{: GloVe embeddings (}_{Pennington et al., 2014}_{). The reason}

they chose the GloVe embeddings is to ensure impartiality versus training their own embeddings, without possibly sneaking in their own bias in the process. Additionally, the use of the GloVe embeddings simplified their procedure and, since they are widely used in machine learning methods, allows them to replicate the effects in real applications.

The authors of the GloVe embeddings offer several different embeddings trained on different corpora. Caliskan selected the largest of the four corpora aptly named the “Common Crawl”. This corpus is obtained by a web crawler2 _{that has scraped the contents of a large portion of the Internet. The result is a corpus}

containing 840 billion case sensitive tokens, which results in a vocabulary of 2.2 million different tokens. The actual embeddings used byCaliskan et al.(2017) are the 300 dimensional vectors. For the same reason we use GloVe, we use the specific version of the embeddings that are extracted from the “Common Crawl” corpus.

To build the positive operators we use a highly similar method to the approach in Lewis (2019). The words are modeled as collections of their instances. These collections are sets of vectors lifted into the larger space of Rn

⊗ Rn_{, where ⊗ is the tensor product of vector spaces. This is done by taking the outer product}

of the vectors in Rn_{, where n is the amount of dimensions:}

¯

v := ~v~vT (3.2)

The result is a 300 by 300 matrix representing a single word. To model a collection of words, all their matrix representations are summed together. For example, the set of words vectors {~v, ~w, ~x} is organized as follows:

1_{The WEAT initially used GloVe embeddings. After already successfully replicating all the implicit biases they tested for} using GloVe, they repeated the same experiments with word2vec. The word2vec results were significantly worse than the GloVe results.

(12)

M = {¯v, ¯w, ¯x} 7→ ¯v + ¯w + ¯_{x ∈ R}n⊗ Rn

= ~v~vT + ~w ~wT+ ~x~xT ∈ Rn

⊗ Rn (3.3)

The eigenvalues and eigenvectors of M summarize the information it contains to some extent. A simple matrix of the form v = ~v~vT _{has one non-zero eigenvalue. A positive operator, which is a summation of}

multiple of these matrices, may have more than one non-zero eigenvalues and a broader concept will have more non-zero eigenvalues, indicating that the concept covers a larger subspace of Rn_{. We will use the}

eigenvalues and eigenvectors to introduce our measure of graded hyponymy based on the L¨owner ordering. Finally, to represent words using positive operators, we use (3.3) to add together the vector representations of the sets of all hyponyms for each word.

Evaluating positive operator representation In order to verify that the positive operators we built still contain at least the same information as regular word embeddings in terms of similarity, we compare them to the results of multiple works on word similarity. We do this by using the word vector evaluation package provided byFaruqui and Dyer. The package compares the similarity scores for a set of vectors to that of 13 different datasets. In line withLewis(2019) we select a subset of these to compare on, namely theMiller and Charles (1991), MEN (Bruni et al., 2012), Rubenstein and Goodenough (1965), SIMLEX-999 (Hill et al., 2015), SimVerb (Gerz et al., 2016), WordSim353 (Finkelstein et al., 2001) andYang and Powers (2006) datasets. This selection is based on omitting datasets focusing on rare words and dataset splits on relatedness instead of similarity. To use the package with matrices we had to unroll the positive operators row by row, resulting in vectors with 90000 elements. As can be seen inTable 3.1, the positive operators perform similar to regular GloVe embeddings on most datasets and even outperform them on some. Since we know that these positive operators still represent at least as well as regular word embeddings, we can move on to the next step: using the positive operators to measure graded hyponymy.

Embeddings Positive operators

MC 0.7026 0.7101 MEN 0.7375 0.7300 RG 0.7662 0.7300 SIMLEX-999 0.3705 0.3927 SimVerb 0.2267 0.2548 WS-353 0.6054 0.5890 YP-130 0.5613 0.6589

Table 3.1: Performance of positive operators based on GloVe embeddings and built using hyponyms derived from WordNet versus regular GloVe embeddings. Bold values indicate the highest value of each comparison.

3.3 Measuring categorization

As stated inSection 2.2, the WEAT uses cosine similarity to measure difference in association between sets of target words and attribute words. The results are then compared to those obtained using the IAT for the same sets of target words. However, the IAT uses a notion of categorization as a way of measuring association. It measures the difference of the time it takes to successfully categorize the target concepts into a certain category, which in the case of the IAT is a class of attributes, for each set of target concepts. The use of positive operators now allows us to measure the strength of association between two words in terms of graded hyponymy: given two positive operators A and B, graded hyponymy measures to what extent A is a hyponym of B. Our hypothesis is that this measure mimics the workings of the IAT more closely. The asymmetric graded hyponymy measure serves as a proxy for reaction time in the IAT. We expect that a target concept that is categorized into a certain attribute class easily by the subjects in the IAT will also have higher measure of graded hyponymy. Next, we will discuss how measuring graded hyponymy is possible using positive operators and how this allows us to measure differential association between sets of words.

(13)

Lewis (2019) proposes two different measures of graded hyponymy which are both considered in this thesis. The first measure is named KE and measures the proportional relation between an error matrix E

and positive operator A. Let B also be a positive operator and let A vKE B be a function that returns the

degree to which A is a hyponym of B. If A ≤ B then B −A = D where D is also a positive operator. However, if A 6≤ B we need to add in an error term in order to make sure that D is still positive: B − A + E = D. E then corresponds to the part of A “not contained” in B. In a set-theoretic analogy, where A and B are sets, − is set difference, + is union and ≤ is subsethood, E is as illustrated inFigure 3.1.

B A

E

Figure 3.1: This Venn diagram represents the size of E if A 6≤ B in order to make sure that D = B − A + E is positive.

Subsequently, we can use the size of E and the L¨owner ordering to compute our measure of graded hyponymy. We do this by constructing the error matrix E as follows (steps taken fromLewis(2019)):

1. Firstly diagonalize B − A, resulting in a real valued matrix, since B − A is real symmetric.

2. Then, construct a matrix E by setting all positive entries of B − A to 0 and changing the sign of all negative eigenvalues.

The value of A vKE B, i.e. the size of E as a proportion of A, is then calculated as follows:

A vKE B = 1 −

||E||

||A|| (3.4)

If B is only represented by a single word the values for KE will mostly be low, analogous to measuring

subsethood in A ⊆ B when B has only a single element.

The second measure that Lewis proposes to measure graded hyponymy is KBA. This measure looks at

the eigenvalues of B − A to determine to what extent A is a hyponym of B. More concretely, KBAmeasure

the proportion of negative and positive eigenvalues where (| · |) denote the absolute value of the term:

A vKBAB = P iλi P i|λi| (3.5)

If all eigenvalues in B − A are positive the result will be 1 and −1 if all eigenvalues are negative. In other words, if all eigenvalues of B − A are positive, A is fully contained in B and is therefore a hyponym of B to the fullest extent. Both measures of graded hyponymy can now be used to calculate the difference in association between a set of target concepts and a set of attribute words by means of graded categorization.

3.4 Measuring difference of association

Now, instead of taking the difference of the time it takes to successfully categorize a target concept into an attribute category, we take the difference in graded hyponymy between a target concept and the attributes to measure association. This paper uses the same method to measure differential association as Caliskan et. al. except instead of using similarity between single words we use graded hyponymy. To illustrate the workings of this method, consider the non-offensive bias found inGreenwald et al.(1998b) when comparing the associations between two sets of target concepts: flowers vs. insects, and two sets of attribute words:

(14)

pleasant vs. unpleasant. The target concepts in this example contain different types of flowers (e.g., aster, clover, . . . ) and different types of insects (e.g., ant, caterpillar, . . . ). The attributes consist of a collection of words that are known to be strongly associated with the terms pleasant (e.g., love, peace, . . . ) and unpleasant (e.g., murder, sickness, . . . ). The null hypothesis is that the difference in association when comparing the sets of target words and to each attribute set is zero. This means that the relative similarity of the set of flowers compared to the attribute sets pleasant and unpleasant is equal to that of the set of insects compared to the same attribute sets. The likelihood of the null hypothesis is computed using a permutation test. This test measures the probability that a random permutation of attribute words yields the observed difference in sample means.

To formalize all of the above, let X and Y be two sets of target words and let A and B be two sets of attribute words such that the sets of target words and the sets of attribute words have the same amount of words respectively.3 Then recall the definition of the function A vKE B as above. The test statistic that

allows the acceptance or rejection of the null hypothesis is:

s(X, Y, A, B) = X x∈X s(x, A, B) −X y∈Y s(y, A, B) (3.6) where

s(m, A, B) = meann∈A(m vKEa) − meanb∈B(m vKEb), (3.7)

and m represents the positive operator of the word m as given in (3.7). Therefore, s(m, A, B) measures the association of w with the sets of attribute words. If m is associated more with A this value will be positive, if m is associated more with B this value will be negative. The formula s(X, Y, A, B) thus measures the differential association between the two sets of target words and the two sets of attribute words.

The permutation test computes the probability that the test statistic of a random permutation of the target words is larger than that of the original sets. The goal of the permutation test is to calculate the likelihood of the null hypothesis. Recall that the null hypothesis states that there is no difference in terms of association between the target and both attributes sets. This probability is named the one sided p-value and is determined as follows:

Pri[s(Xi, Yi, A, B) > s(X, Y, A, B)] (3.8)

Finally, the measures of difference of associations between the target sets and attribute sets is done by computing the effect size. More specifically, we compute Cohen’s d (Cohen, 2013) for each experiment as follows:

meanx∈Xs(x, A, B) − meany∈Ys(y, A, B)

std devw∈X∪Ys(w, A, B)

(3.9)

This is a normalized measure that shows how distributed two distributions are. In other words, the effect size estimates the difference of association. A larger positive value indicates that the first target set is associated more strongly with the first attribute set and the second target set more with the second attribute set. It is possible to use KBAinstead of KE by merely substituting all the occurrences of one for the other. We can

now use the above methods to find out whether we can replicate the results. To evaluate the results, we will compare them to the WEAT and the IAT.

3.5 The case study

The WEAT, and by extension the POAT, is a supervised method of detecting implicit bias: we need to select the bias we want to look for. In other words, we must define sets of target concepts and attributes. Caliskan et al.(2017) replicated eight different findings from the IAT. The target concepts and attributes used in those experiments are listed inTable 2, together with the sizes of these sets. Note that some sets have the same name, but this does not mean that these are necessarily equal or even subsets of each other. For example, the first occurrence of ‘European-American vs. African-American names’ is taken from Greenwald et al. (1998b) whereas the second occurrence is fromBertrand and Mullainathan(2004) and they only contain 13 names that are the same across both sets. The order of the target and attributes sets is meaningful: the expected implicit bias (based on WEAT results) is that the first target set is associated more with the first attribute set and the second target set is associated more with the second attribute set.

3_{This is needed to determine a correct test statistic using (}_3.6_{), since it sums over the values associations for each of target} sets.

(15)

Target words Attribute words NA NT Reference

1 Flowers vs. insects Pleasant vs. unpleasant 25 × 2 25 × 2 Greenwald et al., 1998b

2 Instruments vs. weapons Pleasant vs. unpleasant 25 × 2 25 × 2 Greenwald et al.,1998b

3 European-American vs. African-American names Pleasant vs. unpleasant 32 × 2 25 × 2 Greenwald et al.,1998b

4 European-American vs. African-American names same as rows 1-3 16 × 2 25 × 2 Bertrand and Mullainathan, 2004

5 European-American vs. African-American names same as row 10 16 × 2 8 × 2 Bertrand and Mullainathan,2004

6 Male vs. female names Career vs. family 8 × 2 8 × 2 Monteith and Pettit, 2011

7 Mental vs. physical disease Temporary vs. permanent 6 × 2 7 × 2 Nosek et al., 2002a

8 Math vs. arts Male vs. female terms 8 × 2 8 × 2 Nosek et al.,2002a

9 Science vs. arts Male vs. female terms 8 × 2 8 × 2 Nosek et al.,2002b

10 Young vs. old people’s names Pleasant vs. unpleasant 8 × 2 8 × 2 Nosek et al.,2002a

Table 3.2: This table lists ten different experiments consisting of nine different sets of target words and eight sets of attribute words tested for implicit association using the IAT. Each row corresponds to one such experiment where NA and NT indicate the number of target words and attribute words, respectively. The order of the classes (e.g.,

flowers and insects, pleasant and unpleasant) in the target words and attribute words show the association that has been documented by the IAT: flowers are implicitly associated more with pleasant and insects more with unpleasant.

The interaction with WordNet is not case sensitive, therefore the hyponyms gathered from WordNet are not always as expected. For instance, Jack is not only interpreted as a proper noun, but also as a noun in multiple synsets and even as a verb. This leads to Jack gathering over 250 hyponyms that have nothing to do with the proper noun sense of the word. Therefore, we decided to not collect any hyponyms for proper nouns and represent them by single vector positive operators. This decision makes sense intuitively: the name of a person should not have any hyponyms.

We removed all words from target sets and attribute sets that did not appear or had too few occurrences in the GloVe embeddings dataset. Because we are replicating the WEAT, we did this in exactly the same way as is done in Caliskan et al. (2017). All the resulting sets of target and attributes sets are listed in AppendixA.1.

Initially, we decided to drop three IAT sets: European-American vs. African-American names – pleasant vs. unpleasant, male vs. female names – career vs. family and young vs. old people’s names – pleasant vs. unpleasant. That was because all of these consist of target sets entirely made up out of proper nouns. This is a problem since WordNet only has entries for a very small amount of famous names, resulting in many operators being built without any hyponyms. However, since these proper nouns only occur in the target sets, we decided to keep them. As they need to be classified into a category instead of the other way around, these sets should still perform well on the POAT.

(16)

CHAPTER 4 Results

In this chapter we first present the results of four versions of the POAT. Furthermore, we will highlight and briefly discuss interesting findings for each version. We test on the the same sets of target words and attribute words as are used byCaliskan et al.(2017) (seeA.1. Firstly,Table 4.1shows the results of the POAT with hyponyms derived from WordNet and using the KE as the measure of graded hyponymy. Secondly,Table

4.2 shows the results of the POAT using the KBA instead of the KE, but the hyponyms are still derived

from WordNet. The third version of the POAT uses Microsoft’s Concept Graph to derive the hyponyms of each word and uses the KE measure. The results for this test are presented in Table 4.3. Table 4.4shows

the performance of the POAT using KE on single vector positive operators. Lastly,Table 4.5highlights the

results of the POAT for two related experiments where we test the effect of different attribute sets. We end this section by briefly discussing appendixA.2 which presents the number of hyponyms for every word in the target and attribute sets.

Table 4.1presents the results for the ten experiments also used in Caliskan et al.(2017) with KE as a

measure of graded hyponymy and hyponyms derived from WordNet. Each experiment is represented as a single row in the table and compares the effect sizes and p-values for the IAT, the WEAT and the POAT. This table shows that the POAT is able to replicate most of the results to a certain extent. The POAT performs well on non-offensive experiments such as the differential association between flowers vs. insects and pleasant vs. unpleasant. Additionally, the POAT records stronger stereotyping in the association tested in the last two rows, as well as in the European-American vs. African-American – pleasant vs. unpleasant experiment from Bertrand and Mullainathan (2004) (row 5, Table 3) when using the attributes from the young vs. old people’s names experiment. Two experiments that were not replicated well by the POAT are those in row 7 and 8. The first one instead shows a reversed association and the second records a low effect size and a high likelihood of the null hypotheses holding up.

Target words Attribute words IAT WEAT POAT

NT NA d P d P

1 Flowers vs. insects Pleasant vs. unpleasant 25 × 2 25 × 2 1.35 10−8 _1.50 ₁₀−7 _1.39 ₁₀−6

2 Musical instruments vs. weapons Pleasant vs. unpleasant 25 × 2 25 × 2 1.66 10−10 _1.53 ₁₀−7 _1.47 ₁₀−7

3 European-American vs. African-American Pleasant vs. unpleasant 32 × 2 25 × 2 1.17 10−5 _1.41 ₁₀−8 _0.89 ₁₀−3

4 European-American vs. African-American Pleasant vs. unpleasant† _{16 × 2} _{25 × 2} _– _– _1.50 ₁₀−4 _1.04 ₁₀−2

5 European-American vs. African-American Pleasant vs. unpleasant‡ _{16 × 2} _{8 × 2} _– _– _1.28 ₁₀−3 _1.58 ₁₀−5

6 Male vs. female names Career vs. family 8 × 2 8 × 2 0.72 < 10−2 _1.81 ₁₀−3 _1.68 ₁₀−3

7 Mental vs. physical disease Temporary vs. permanent 6 × 2 7 × 2 1.01 10−2 _1.38 ₁₀−2 _−1.51 ₁₀−2

8 Science vs. arts Male vs. female 8 × 2 8 × 2 1.47 10−24 _1.24 ₁₀−2 _−0.001 _0.50

9 Math vs. arts Male vs. female 8 × 2 8 × 2 0.82 < 10−2 1.06 10−1 1.25 10−2

10 Young vs. old people’s names Pleasant vs. unpleasant 8 × 2 8 × 2 1.42 < 10−2 _1.21 ₁₀−2 _1.29 ₁₀−2

Table 4.1: This table shows the effect size (Cohen’s d) and p-values for the WEAT and the POAT using the KE

measure and hyponyms derived from WordNet. Each row concerns a different implicit bias documented by the IAT. In each case the first set of target words is found to be more compatible with the first set of attributes words, and the second set of target words with the second set of attributes. The columns NT and NA indicated the number of

target words and attribute words, respectively. The bold values highlight the POAT or WEAT effect size closest to that of the IAT.†The attributes for this experiment are the same attributes as are used in the Flowers vs. insects experiment. ‡The attributes for this experiment are the same attributes as are used in the Young vs. old people’s names experiment.

(17)

Using the other measure proposed byLewis(2019), namely KBA, yields the results documented inTable

4.2. This measure performs poorly on most experiments, with effect sizes close to zero and relatively high p-values. However, it is able to correctly identify the bias between mental vs. physical diseases and temporary vs. permanent attributes. Due to this poor performance we did not test the KBA measure on the POAT

versions using the Microsoft Concept Graph or without using hyponyms because of the poor performance on the WordNet version. Some preliminary tests not presented here indicated the KBAwould perform similarly

on those versions.

NT NA d p d p d p

1 Flowers vs. Insects Pleasant vs unpleasant 25 × 2 25 × 2 1.35 10−8 1.50 10−7 −0.07 0.59 2 Musical instruments vs. Weapons Pleasant vs unpleasant 25 × 2 25 × 2 1.66 10−10 _1.53 ₁₀−7 _−0.48 _0.96

3 European-American vs. African-American Pleasant vs. unpleasant 32 × 2 25 × 2 1.17 10−5 _1.41 ₁₀−8 _−0.09 _0.66

4 European-American vs. African-American Pleasant vs. unpleasant† _{16 × 2} _{25 × 2} _– _– _1.50 ₁₀−4 _0.37 _0.13

5 European-American vs. African-American Pleasant vs. unpleasant‡ _{16 × 2} _{8 × 2} _– _– _1.28 ₁₀−3 _0.41 _0.03

6 Male vs. female names Career vs. family 8 × 2 8 × 2 0.72 < 10−2 _1.81 ₁₀−3 _−0.25 _0.69

7 Mental vs. Physical disease Temporary vs. permanent 6 × 2 7 × 2 1.01 10−2 _1.38 ₁₀−2 _1.14 ₁₀−1

8 Science vs. Arts Male vs. female 8 × 2 8 × 2 1.47 10−24 _1.24 ₁₀−2 _−0.24 _0.68

9 Math vs. Arts Male vs. female 8 × 2 8 × 2 0.82 < 10−2 _1.06 _0.18 _−0.16 _0.63

10 Young vs. old people’s names Pleasant vs. unpleasant 8 × 2 8 × 2 1.42 < 10−2 1.21 10−2 0.50 0.16

Table 4.2: This table shows the effect size (Cohen’s d) and p-values for the WEAT and the POAT using the KBA

measure and hyponyms derived from WordNet. Each row concerns a different implicit bias documented by the IAT. In each case the first set of target words is found to be more compatible with the first set of attributes words, and the second set of target words with the second set of attributes. The columns NT and NA indicated the number of

target words and attribute words, respectively. The bold values highlight the POAT or WEAT effect size closest to that of the IAT.†The attributes for this experiment are the same attributes as are used in the Flowers vs. insects experiment. ‡The attributes for this experiment are the same attributes as are used in the Young vs. old people’s names experiment.

Table 4.3 summarizes the results of the POAT when the positive operators are built using hyponyms gathered from the Microsoft Concept Graph. This version of the POAT seems to perform similarly compared to the version using hyponyms derived by WordNet. Furthermore, there is a visible pattern across both types of the test: in both cases it performs poorly compared to the WEAT on the experiments in rows 7 and 8. One clear difference between the two versions is found in the math vs. arts – male vs. female experiment: It yields significant results when using the WordNet version of the POAT compared, but finds little difference in association when building positive operators using hyponyms derived by the Microsoft Concept Graph. Conversely, the Microsoft Concept Graph version outperforms the WordNet version on the mental vs. physical – temporary vs. permanent experiments compared to the WEAT and the IAT.

NT NA d p d p d p

6 Male vs. female names Career vs. family 8 × 2 8 × 2 0.72 < 10−2 1.81 10−3 1.55 10−3 7 Mental vs. physical disease Temporary vs. permanent 6 × 2 7 × 2 1.01 10−2 _1.38 ₁₀−2 _0.68 _0.11

8 Science vs. arts Male vs. female 8 × 2 8 × 2 1.47 10−24 _1.24 ₁₀−2 _−0.68 ₁₀−1

9 Math vs. arts Male vs. female 8 × 2 8 × 2 0.82 < 10−2 _1.06 ₁₀−1 _0.65 ₁₀−1

10 Young vs. old people’s names Pleasant vs. unpleasant 8 × 2 8 × 2 1.42 < 10−2 _1.21 ₁₀−2 _0.61 _0.11

Table 4.3: This table shows the effect size (Cohen’s d) and p-values for the WEAT and the POAT using the KE measure and hyponyms derived from Microsoft’s Concept Graph. Each row concerns a different implicit bias

documented by the IAT. In each case the first set of target words is found to be more compatible with the first set of attributes words, and the second set of target words with the second set of attributes. The columns NT and

NAindicated the number of target words and attribute words, respectively. The bold values highlight the POAT or

WEAT effect size closest to that of the IAT.†The attributes for this experiment are the same attributes as are used in the Flowers vs. insects experiment. ‡The attributes for this experiment are the same attributes as are used in the Young vs. old people’s names experiment.

(18)

The results inTable 4.1indicated that the POAT performed poorly on a couple of experiments where the amount of hyponyms for the target words are not fairly distributed compared to the amount of hyponyms for the attribute words (A.2). To investigate this issue we also ran all ten combinations of target concepts and attributes without gathering the hyponyms of the words. We use the same measure as in the regular POAT, but build the positive operators from the single word embeddings for each specific word. The results for these tests are presented inTable 4.4and the most difference in the rows 7 and 8: both now show large positive effect sizes and slightly smaller p-values compared to those of the WEAT. This version of the POAT performs best on all tested IAT findings. The found effect sizes are closer to the IAT effect sizes in 6 out of 8 experiments compared to the WEAT.

NT NA d P d P d P

6 Male vs. female names Career vs. family 8 × 2 8 × 2 0.72 < 10−2 _1.81 ₁₀−3 _1.74 ₁₀−3

7 Mental vs. physical disease Temporary vs. permanent 6 × 2 7 × 2 1.01 10−2 _1.38 ₁₀−2 _1.26 ₁₀−1

8 Science vs. arts Male vs. female 8 × 2 8 × 2 1.47 10−24 _1.24 ₁₀−2 _1.06 ₁₀−1

9 Math vs. arts Male vs. female 8 × 2 8 × 2 0.82 < 10−2 _1.06 ₁₀−1 _1.00 ₁₀−1

10 Young vs. old people’s names Pleasant vs. unpleasant 8 × 2 8 × 2 1.42 < 10−2 _1.21 ₁₀−2 _1.52 ₁₀−2

Table 4.4: This table shows the effect size (Cohen’s d) and p-values for the WEAT and the POAT using the KE

measure, represented without hyponyms. Each row concerns a different implicit bias documented by the IAT. In each case the first set of target words is found to be more compatible with the first set of attributes words, and the second set of target words with the second set of attributes. The columns NTand NAindicated the number of target

words and attribute words, respectively. †The attributes for this experiment are the same attributes as are used in the Flowers vs. insects experiment. ‡The attributes for this experiment are the same attributes as are used in the Young vs. old people’s names experiment.

As can be seen inTable 4.1, the POAT values of the experiment that measures the differential association between the science vs. arts target concepts and male vs. female attributes are significantly different from the WEAT values. However, we found that setting the science vs. arts target concepts against similar attribute words improved the results significantly. To investigate the cause of this discrepancy we compared the results of this experiment to the results when using the attribute from math vs. arts – male vs. female, since both have the same category of attribute words. (Table 4.5) We also tested using the math vs. arts target concepts and the male vs. female attributes from science vs arts – male vs. female. In both cases, when using the attribute words from math vs. arts the POAT is able to detect a clear implicit bias, whereas when using the attribute words from science vs. arts the POAT shows that there is almost no difference in association between the target concepts and the attributes.

Target words Attribute words WEAT Science-arts att. Math-arts att.

NT NA d P d P d P

Science vs. Arts Male vs. female 8 × 2 8 × 2 1.24 10−2 −0.001 0.50 1.34 10−2 Math vs. Arts Male vs. female 8 × 2 8 × 2 1.06 10−1 −0.03 0.48 1.25 10−2

Table 4.5: A comparison between the effect sizes and p-values of different combinations of target and attribute sets using the POAT. The first column of values correspond to the original WEAT results using the original target and attribute words for both Science vs. Arts and Math vs. Arts. The second and third column indicate the results when using either the Science vs. Arts attributes or the Math vs. Arts attributes.

Another noteworthy result is the large negative effect size (−1.51) of the mental vs. physical disease – temporary vs. permanent experiment. This experiment is the only experiment to indicate a significant bias in the ‘wrong’ direction (compared to the expectations looking at the IAT and the WEAT). This low effect size indicates that mental diseases are associated more with permanent attributes and physical diseases more with temporary attributes. We will come back to this point and its cause in the discussion.

(19)

Finally, appendixA.2lists, per experiment, the amounts of hyponyms gathered for each word by WordNet and Microsoft Concept Graph. As far as the number of hyponyms go, WordNet generally contains more hyponyms than the Microsoft Concept Graph. On average across experiments, the Microsoft Concept Graph no hyponyms for half of the attribute words, whereas in WordNet this is true for just four attributes across all attribute sets. Additionally, the Microsoft Concept Graph has many outliers. For example, in the target concepts for Musical instruments vs. weapons the word club has 2955 hyponyms, while the average of the other words is around 65. Similarly in the target concepts for math vs. arts, the target concept technology has 22.516 hyponyms while the others average around 300. One of the most problematic experiments is Mental vs. physical disease. Here, the Microsoft Concept Graph is only able to find hyponyms for 7 out of 26 target concepts and attributes.

(20)

CHAPTER 5 Discussion

In this chapter we will discuss the results presented in Chapter 4. Firstly, we will go over the results of the different versions of the POAT as presented in the previous chapter. Following that, we will discuss the different methods of gathering hyponyms we applied and the performance of the different measures used in the POAT. Lastly, we will highlight to what extent graded hyponymy adds to the findings of the WEAT.

5.1 Different configurations of the POAT

The version of the POAT that uses KE as a measure of graded hyponymy and hyponyms derived from

WordNet is able to replicate eight out of ten findings that were documented by the IAT and replicated by the WEAT. We will discuss the two findings it was not able to replicate later on in this section. This version of the POAT has high effect sizes for experiments that are well represented in terms of hyponyms. This means that the words in the target and attribute sets have expansive entries in WordNet (A.2) which leads to the positive operators of the words being constructed out of many hyponyms. For example, in three out of four experiments consisting of pleasant vs. unpleasant attributes that are also tested by the IAT (Table 4.1: rows 1, 2 and 10), the POAT measures high effect sizes and low p-values.

The difference between the POAT versions using KE and KBAis substantial. This is because the KBA

measure does not express single word containment in the same way as the KE. Even if the positive operators

of two words are more or less disjoint, the value for KBA could still be high. Additionally, the values for

KBA are not distributed well, with very little representation around 0 (only if A ≈ B), and more crowded

around 1 and some around 1. Since most of the values for graded categorization are high, the differential association is lower. This can be clearly seen inTable 4.2. In conclusion, the POAT performs better with a measure of graded hyponymy whose values are distributed more evenly across its range.

Overall the WordNet and Microsoft Concept Graph approaches of gathering hyponyms perform similarly. Both methods have the most issues with the same three experiments (Table 4.1: rows 7,8 and 9) and are able to correctly identify implicit biases in the other seven. The WordNet method is stronger at measuring the significance of the findings, which are the p-values. Although the Microsoft Concept Graph measures high p-values for the experiments concerning European-American vs. African-American, the WordNet matches the IAT more closely. This is especially the case in rows 3, 6 and 7. However, the results when using the Microsoft Concept Graph to gather hyponyms are less reliable: as can be seen in AppendixA.2half of the attributes are only represented by the word itself, meaning that the KE can not measure the containment

effectively in half of the cases.

The POAT using KE to measure differential association performs very well on single word vectors

repre-sentations. This version is the only configuration that outperforms the WEAT if we only look at the effect sizes and evaluate in terms of closeness to the IAT effect sizes. However, the findings are not as statistically significant as those of the POAT with hyponyms derived from WordNet which has smaller p-values in most experiments. The KE measure on single vector positive operators can not be interpreted in the same way

as when using positive operators built up out of multiple hyponyms. It no longer measures a relation of containment, or subsethood, but a different asymmetric relation between two points.

(21)

5.2 Mental vs. physical – Temporary vs. permanent

In the experiment that attempts to find a differential association between mental vs. physical disease target concepts and temporary vs. permanent attributes the POAT records an effect size of −1.51, indicating a stronger association between mental disease and permanent and physical disease and temporary. This result is the exact inverse of what was expected when looking at the WEAT and the IAT. However, the test that does not take into account the hyponyms of a word actually performs in line with WEAT results on this experiment. Since eight of ten experiments performed according to the WEAT, it can be established that it is not the mere fact of adding hyponyms to the representations that cause this discrepancy. Instead, our results indicate that the action of adding the hyponyms to the positive operators in the POAT actually measures different biases. The cause of bias-reversal is a combination of two factors in the gathering of hyponyms. The first factor that weighs into these results is the bias induced by the editors of WordNet. WordNet is a human curated database on relations between senses of words, so there is an element of human influence on the hyponyms. Secondly, the operators of two words in the set of physical disease concepts, namely ‘illness’ and ‘disease’, are built up out of significantly more hyponyms than the other concepts. These specific two words consist of over a 1.000 hyponyms each, while the other target concepts of this test consist of an average of 36 hyponyms. Having such ‘large’ operators in the target concepts makes categorization difficult. Conceptually, imagine the difference between categorizing a very broad concept into a very specific category and vice versa.

Including illness and disease d p POAT: wordnet hyponyms −1.51 10−2

POAT: no hyponyms 1.26 10−1

WEAT 1.38 10−2

Excluding illness and disease d p POAT: wordnet hyponyms −1.66 10−2

POAT: no hyponyms 1.07 10−1

WEAT 1.30 10−1

Table 5.1: POAT and WEAT results for the differential association between mental vs physical diseases targets concepts and temporary vs permanent attributes. The top table uses the original sets for both the target concepts and attributes. The bottom table displayes the POAT and WEAT results after the removal of illness and disease from the target words of physical disease and two random one from mental disease. Still using the original attributes.

To further investigate the mental vs. physical disease – temporary vs. permanent experiment,Table 5.1 displays the results of several tests where we excluded specific words from the sets of words and switched between gathering hyponyms of these words or not. The table shows that, firstly, the wrong differential association is measured by the POAT when representing positive operators by all of its hyponyms from WordNet. Even after removing the two largest positive operators from the target concepts (the words represented by the most hyponyms), the same contrast is still measured. Secondly, by using the WEAT as a baseline, we can determine that after removing these two words, the correct differential association can still be measured. For this specific experiment, we can therefore conclude that the wrong association most likely originates from the biases introduced to WordNet by its creators.

Conversely, the Microsoft Concept Graph version of the POAT performs well on this experiment, in line with the IAT and WEAT (Table 4.3). Unfortunately, this is mostly due to the fact that only three out of 26 words in the target and attributes sets have hyponyms in the method. Two of these words have over 450 hyponyms, while the other target words only average around 1. Both words belong to the physical disease target concepts. Therefore, we removed them and two random ones from the mental diseases as we did with the WordNet version above. We saw little change, the effect size increased from 1.11 to 1.36 and the p-value stayed the same. Since 23 out of 26 target concepts and attributes only gathered a single hyponym (the word itself), the performance is mostly similar to that of the POAT without hyponyms (Table 4.4).

(22)

5.3 Science vs. arts – Female vs. male

Two other experiments that deserve some further discussion are: math vs. arts – male vs. female and science vs. arts – male vs. female. The POAT records the correct differential association for the first experiment while the there is barely any difference found in the second, even though the target concepts are very similar and the attribute classes (male vs. female) are exactly the same. The observed difference can be traced back to the words that make up the attribute set for this experiment and the hyponyms that are gathered for each of these words. While both math vs. arts and science vs. arts had an average of around 80 hyponyms for the target words, the attributes of the former also averaged at around 80 hyponyms but the attributes of the latter had approximately ten per word. Additionally, the attributes of science vs. arts are narrower and are therefore harder to categorize into. For example, math vs. arts has attributes such as female, consisting of 280 hyponyms, and woman, consisting of 255 hyponyms. Whereas the ‘best’ attributes in science vs. arts are sister, consisting of 13 hyponyms, and mother, consisting of 35 hyponyms. Because the attributes of science vs. arts are harder to categorize into, the graded hyponymy can be relatively low for each association and that makes the differential association difficult to be measured by the standard POAT. To determine these effects we cross checked the results with both target sets and attribute sets which are shown inTable 4.5. It is evident that the attributes from science vs. arts perform poorly, even on the target words from math vs. arts. The table also shows that the science vs. arts target words perform much better on the math vs. arts attributes as they are able to show the correct differential association. For the same reasons, we can see similar performance for these two experiments when using the Microsoft positive operators. The POAT is again not able to replicate the IAT findings for science vs. arts, it even records a reversed bias, and it is able to determine a positive effect size for the math vs. arts experiment.

5.4 Graded hyponymy as a proxy for graded categorization

In all experiments where the attributes were well represented in terms of the amount of hyponyms (rows 1, 2, 3, 6, 7, 9 and 10, (Table 4.1), the POAT performed at least comparable to the IAT and found more statistical significant differential association compared to the WEAT in most. A more complete method of deriving hyponyms could improve these results. Also, the biases introduced by the curators of WordNet (which is clearly visible in row 7, (Table 4.1) likely has a similar effect on other experiments. An unsupervised model such as the Microsoft Concept Graph that is able to find the appropriate hyponyms for every word can possible use the KE measure to extract the implicit biases from corpora using our methods, and could

possibly replicate the IAT results more accurately. If we compare the version without hyponyms to the version that uses the KE measure and hyponyms derived from WordNet, we find that adding the hyponyms

to the POAT affects the detected bias both positively and negatively. In most experiments the WordNet version makes the results less biased. In one specific example it finds totally different biases. One of the reasons for this is that using hyponyms in the representation adds a type of noise to the test. Due to the inconsistency of WordNet, the coverage of the attributes is not stable across all experiments.

(23)

CHAPTER 6 Conclusion

In order to replicate the results of the WEAT in a way that mimics the principles behind the IAT more closely, we proposed a method called the POAT that measures differential association using graded categorization in terms of graded hyponymy. The POAT uses positive operators to represent words as a collection of word embeddings based on the hyponyms of the word. This representation allows us to define a measure of graded hyponymy thanks to the L¨owner ordering on positive operators. In nine out of ten experiments the POAT is able to recognize implicit biases in the word embeddings.

The performance of the POAT is strongly influenced by the method of deriving hyponyms. We have shown that certain biases introduced by the curators of WordNet are visible in the results of the POAT. Furthermore, we used unsupervised sources of hyponymy in the form of the Microsoft Concept Graph. This performed worse than WordNet due to under and over representation of the hyponyms for some words. Moreover, it was more difficult to replicate the implicit biases in cases where the attributes were narrow (i.e., represented by fewer hyponyms). We have also shown that the use of the KE measure as a proxy

on single vector positive operators outperforms the WEAT on six out of eight replications of IAT findings. This indicates that the use of an asymmetric measure to determine differential association is better at detecting implicit bias in word embeddings than the symmetric distance measure of the WEAT. We propose significance testing on the results of the replicated IAT findings by the POAT. The next step should be to identify exactly why the POAT performs so well on single vector positive operators.

Although positive operators built out of hyponyms derived from WordNet are good candidates for the POAT as they perform well on similarity tests (Table 3.1) and carry implicit biases, neither WordNet and Microsoft’s Concept Graph contain an entry for every word. Nor do they contain all hyponyms of a word and in several cases the entry has zero hyponyms. An example word type that for which this issue was prominent are the male and female pronouns that were part of the IAT experiments. Out of six pronouns only two had entries and of those two neither contained the correct synset. Therefore, in order to make this method as dependable as possible the problematic word categories must be identified and remedied with some other method of deriving hyponyms. A possible way to gather the hyponyms of pronouns is to look into the arguments of a pronoun. These can be derived from a corpus by taking the lemma of the verb argument (‘He walks to the store.’, ‘walk’). For nouns and verbs it is possible to replace the missing words with other entries. Using word similarity measures, we could find the most similar WordNet entries and use this for the experiment instead. Of course, we instead simply hope for a more complete model.

One other advantage of our method to detect biases in word meanings is that positive operators fit well inside a compositional framework (Lewis, 2019). This allows us to form phrases and sentences as well as generic sentences. Generic sentences such as “mosquitos carry malaria” express regularities. Using positive operators gives the potential to assess associations between words and subphrases, such as mosquitos and carry malaria.

Another interesting observation we made that one should look closer at in future research is that using the KE measure on positive operators built out of single vectors could also lead to interesting results.

Normally, the relation between two points is measured in terms of distance, which is symmetric. It would be interesting to work out the applications of an asymmetric measure like KE on word representation in

(24)

Bibliography

Bankova, D., Coecke, B., Lewis, M., & Marsden, D. (2019). Graded hyponymy for compositional distribu-tional semantics. Journal of Language Modelling, 6 (2), 225–260.

Bertrand, M., & Mullainathan, S. (2004). Are emily and greg more employable than lakisha and jamal? a field experiment on labor market discrimination. American economic review, 94 (4), 991–1013. Bruni, E., Boleda, G., Baroni, M., & Tran, N.-K. (2012). Distributional semantics in technicolor, In

Proceed-ings of the 50th annual meeting of the association for computational linguistics: Long papers-volume 1. Association for Computational Linguistics.

Bullinaria, J. A., & Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior research methods, 39 (3), 510–526.

Caliskan, A., Bryson, J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356 (6334), 183–186.https://doi.org/10.1126/science.aal4230 Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic press.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional trans-formers for language understanding.

Faruqui, M., & Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org, In Proceedings of acl: System demonstrations.

Fellbaum, C. (1998). Wordnet: An electronic lexical database. Bradford Books.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited, In Proceedings of the 10th international conference on world wide web.

Gerz, D., Vuli´c, I., Hill, F., Reichart, R., & Korhonen, A. (2016). Simverb-3500: A large-scale evaluation set of verb similarity. arXiv:1608.00869.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998a). Measuring individual differences in implicit cognition: The implicit association test. Journal of personality and social psychology, 74 6, 1464–80. Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998b). Measuring individual differences in implicit cognition: The implicit association test. Journal of personality and social psychology, 74 (6), 1464. Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora, In Proceedings of the 14th

conference on computational linguistics-volume 2. Association for Computational Linguistics. Hill, F., Reichart, R., & Korhonen, A. (2015). Simlex-999: Evaluating semantic models with (genuine)

simi-larity estimation. Computational Linguistics, 41 (4), 665–695.

Kotlerman, L., Dagan, I., Szpektor, I., & Zhitomirsky-Geffet, M. (2010). Directional distributional similarity for lexical inference. Natural Language Engineering, 16 (4), 359.

Landauer, T. K., & Dumais, S. T. (1997). A solution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104 (2), 211. Lenci, A., & Benotto, G. (2012). Identifying hypernyms in distributional semantic spaces, In * sem 2012:

The first joint conference on lexical and computational semantics–volume 1: Proceedings of the main conference and the shared task, and volume 2: Proceedings of the sixth international workshop on semantic evaluation (semeval 2012).

Lewis, M. (2019). Compositional hyponymy with positive operators, In Proceedings of the international con-ference on recent advances in natural language processing (ranlp 2019), Varna, Bulgaria, INCOMA Ltd.https://doi.org/10.26615/978-954-452-056-4-075

Loper, E., & Bird, S. (2002). Nltk: The natural language toolkit. CoRR, cs.CL/0205028. http://dblp.uni-trier.de/db/journals/corr/corr0205.html#cs-CL-0205028

L¨owner, K. (1934). ¨Uber monotone matrixfunktionen. Mathematische Zeitschrift, 38 (1), 177–216.

McKinney, W. (2010). Data Structures for Statistical Computing in Python (S. van der Walt & J. Millman, Eds.). In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th Python in Science Conference. https://doi.org/10.25080/Majora-92bf1922-00a

Capturing Implicit Biases With Positive Operators

Capturing Implicit Biases With

Positive Operators

Capturing Implicit Biases With Positive

Operators

Abstract

Contents

Acknowledgments

CHAPTER 1

Introduction

CHAPTER 2

Background

2.1

The Implicit Association Test

2.2

The Word-Embeddings Association Test

CHAPTER 3

Method

3.1

Hyponyms

3.2

Positive operators

3.3

Measuring categorization

3.4

Measuring difference of association

3.5

The case study

CHAPTER 4

Results

CHAPTER 5

Discussion

5.1

Different configurations of the POAT

5.2

Mental vs. physical – Temporary vs. permanent

5.3

Science vs. arts – Female vs. male

5.4

Graded hyponymy as a proxy for graded categorization

CHAPTER 6

Conclusion

Bibliography