
Predicting individual cognitive biases from bias measurements in word embeddings


Layout: typeset by the author using LaTeX.


Predicting individual cognitive biases from bias measurements in word embeddings

Jim Buissink
10441433

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors
Dr. K. Schultz
Dr. P.L. Mirabile

Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Abstract

Word embeddings, which are vector representations of words, are currently widely used in natural language processing (NLP) tasks. Recent research shows that these word embeddings inherit undesirable word associations, such as stereotypes and prejudices, that are present in the corpora they are trained on. Two proposed methods to measure such biases are the WEAT (Caliskan et al., 2017) and the POAT (Bosscher, 2019). Both take inspiration from the Implicit Association Test (IAT), a method that is commonly used in psychology to measure implicit stereotypes held by individuals. However, the methodology that the WEAT and the POAT use does not allow us to compare their results. In this project, a methodology is developed that does allow for this comparison to be made. This is done by building different regression models which attempt to predict results found in IAT experiments from the measurements made by the WEAT and the POAT. Four different regression models were tested, and each one showed poor performance for both the WEAT measurements and the POAT measurements. The WEAT measurements did, however, always perform at least as well as the POAT measurements. From this we conclude that our methodology allows comparisons that are theoretically more sound than those made under the original methodology of the WEAT and POAT, although the resulting predictions carry little statistical significance.


Contents

1 Introduction
2 Background
  2.1 Implicit Association Test
  2.2 Project Implicit
  2.3 Word-Embedding Association Test
  2.4 Positive Operator Association Test
3 Method
  3.1 Individual cognitive biases
    3.1.1 Replacing images
    3.1.2 Cleaning data
  3.2 NLP measurements
  3.3 Model Selection
4 Results
  4.1 Complete dataset
  4.2 Truncated dataset
5 Discussion
  5.1 Comparing the regression models
  5.2 Evaluating word lists
  5.3 Issues with WEAT
  5.4 Issues with POAT
6 Conclusion
A Word Lists


1 | Introduction

In the field of natural language processing (NLP), we often use large sets of text corpora to train and develop our applications. Because computers cannot interpret textual data in the same manner as humans can and are better at handling numerical data, words are often represented as real-valued vectors [1]. There are several techniques for mapping the words in a vocabulary to vectors, which as a collective are called word embeddings [2]. It has been shown that word embeddings capture and conserve semantics from the corpora they are trained on [3][4]. This could mean that the ideas and beliefs of the writers contributing to the text corpora are also inherited by the corresponding word embeddings, which is how cultural stereotypes and prejudices (biases) can be preserved through AI applications [5].

To demonstrate the presence of human-like psychological biases in word embeddings, Caliskan et al. [6] developed a statistical test called the Word Embedding Association Test (WEAT). Their method draws inspiration from the Implicit Association Test (IAT) [7], a method often used in psychology to measure the strength of an individual's implicit bias towards different concepts. This test has shown, among other things, that people generally associate flowers with more pleasant attributes and insects with more unpleasant attributes. The IAT has also been used to study more harmful biases, such as racial and gender biases.

One of the limitations of the WEAT is the evaluation of the method. Caliskan et al. claim that the WEAT is capable of measuring the presence of biases in word embeddings because they are able to replicate the results from different IAT studies [7]–[11]. More specifically, they claim that their method is capable of measuring the magnitude of biases, as they are able to replicate the effect size for a specific bias, described in terms of Cohen's d [12]. However, it is not theoretically sound to compare the effect sizes obtained from the WEAT and the IAT. These effect sizes are calculated from different measurements that use different metrics: for the IAT they are based on the reaction times of test subjects, and for the WEAT on the distances between vectors. Also, the IAT measurements are obtained from


individual humans, whilst the WEAT measurements are obtained from collectively created corpora. Therefore, it is not justifiable to draw conclusions about the magnitude of the biases present in the corpora through the use of the WEAT. Other methods of measuring the presence of bias in word embeddings have been proposed. One such method is the Positive Operator Association Test (POAT) [13], which Bosscher developed as an alternative to the WEAT. Bosscher claims that the POAT is theoretically closer to the IAT, but it also uses effect sizes to quantify the presence of bias. The POAT, in its turn, calculates effect sizes through the notion of graded hyponymy, a measure that indicates how much a word can be categorized under another word. In line with the criticism presented in the previous paragraph, the performance of the WEAT and that of the POAT cannot be compared through their reported effect sizes. Thus, one is unable to evaluate whether the WEAT or the POAT is more capable of measuring bias in word embeddings. This is problematic, as we want to find the best possible method of measuring bias in word embeddings. Such a method could, for instance, serve as an instrument for further research regarding the presence of bias in word embeddings.

The purpose of this thesis is to develop a methodology which allows us to compare the different proposed methods of measuring bias in word embeddings. This is approached by making statistical models that attempt to predict individual cognitive biases found in IAT experiments from the measurements of the WEAT and POAT. This methodology will then be used to assess and compare the two proposed methods.

This thesis is divided into five further sections. The first serves as a technical background and describes the details of the IAT, WEAT and POAT. The second describes our approach and goes into detail on the data collection and model selection. The third presents the results from our predictive models. The fourth discusses the results and reflects on possible improvements to the methodology. Finally, the fifth summarizes and concludes our research.


2 | Background

The following chapter serves as a more detailed introduction to the methods discussed throughout the thesis and introduces important concepts. First, the method used in this thesis to measure biases in the implicit beliefs that people hold will be introduced: the Implicit Association Test (IAT) from Greenwald et al. [7]. Second, a database of IAT experiments, Project Implicit, will be introduced, from which we will use the study materials for our own experiments and the results as the predicted variables in our statistical models [14]. Third, the first of the two methods for measuring biases in word embeddings that will be evaluated in this thesis will be introduced: the Word-Embedding Association Test (WEAT) from Caliskan et al. [6]. Finally, the other method that will be evaluated in this thesis will be introduced: the Positive Operator Association Test (POAT) from Bosscher [13].

2.1 Implicit Association Test

The Implicit Association Test (IAT) is a cognitive response task developed to measure implicit attitudes, which are automatic actions or judgements of which the performer is not aware [7]. If these attitudes favor one concept, person or group over others, we speak of biases. Biases are generally considered unfair, and the term serves as an umbrella for prejudice and stereotypes. These implicit attitudes, or biases, are captured through a categorization task. In this task, subjects are asked to quickly sort words into categories that are either on the left side or on the right side of a computer screen and have to respond with corresponding keys. The time it takes subjects to sort the stimuli into the correct categories is then measured. The stimuli are divided into four sets: two sets of target concepts and two sets of attributes. The target sets consist of stimuli related to the concept in which we are interested (e.g. fat or black people in one set, thin or white people in the other) and the attribute sets consist of stimuli that are used as evaluations (e.g. healthy or nice in one set, unhealthy or rude in the other). The experiment is divided into several blocks, of which the first, second


and fourth serve as introductory instructions, where the subjects are asked to sort either the target sets or the attribute sets into categories. In the third block, one of the target sets is paired with one of the attribute sets (e.g. fat + good, thin + bad), and in the fifth block this combination is reversed (e.g. fat + bad, thin + good). The order in which these pairings are made differs between experiments. The difference in average latency between these blocks is used to measure the subject's implicit bias.

If the subject feels neutral towards the concept, there should be no difference between the average reaction times of block three and block five. However, if the subject has an implicit association between the concept and one of the attributes, it should be easier for the subject to pair these two together, and thus the subject will respond faster. As an example, if a subject is biased towards thin people over fat people, the average time it takes the subject to categorize stimuli is lower when the pairings are compatible (e.g. thin + good) than when they are incompatible (e.g. fat + good). To quantify this bias, Greenwald et al. measure effect sizes by dividing the difference between the mean latencies of block three and block five by their pooled standard deviation. Effect sizes measure the strength of a relationship between different variables; conventional small, medium, and large sizes have values of 0.2, 0.5, and 0.8, respectively.
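To make this computation concrete, the following minimal sketch (not taken from the thesis; the latency lists are hypothetical, and the pooled standard deviation is taken over the two blocks combined) derives such an effect size for a single subject:

```python
import numpy as np

def iat_effect_size(latencies_compatible, latencies_incompatible):
    """Difference between the mean latencies of the two critical blocks,
    divided by their pooled standard deviation."""
    b_comp = np.asarray(latencies_compatible, dtype=float)
    b_incomp = np.asarray(latencies_incompatible, dtype=float)
    pooled_std = np.std(np.concatenate([b_comp, b_incomp]), ddof=1)
    return (b_incomp.mean() - b_comp.mean()) / pooled_std

# Hypothetical latencies in milliseconds for one subject
compatible = [650, 700, 680, 720, 690]      # e.g. thin + good
incompatible = [820, 790, 860, 800, 830]    # e.g. fat + good
print(iat_effect_size(compatible, incompatible))  # positive: faster on the compatible pairing
```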

In their paper, Greenwald et al. [7] validate the IAT with an experiment where subjects are asked to pair flowers or insects with positive or negative attributes. Their expectation is that, in general, people experience flowers as more pleasant than insects. The results show that people responded faster when combining flowers with pleasant words than when combining insects with pleasant words. Thus, the subjects feel more positively towards flowers than insects, which confirms the hypothesis. Subsequently, the IAT has been used to show the presence of other, more harmful, biases concerning, for example, gender and race.

2.2 Project Implicit

For our research, we need the results of many different IAT experiments, which we will later try to predict with the help of our statistical models. Rather than conducting these experiments ourselves, we make use of existing work done by the researchers of Project Implicit (PI) [14]. PI is an organization and international collaboration of researchers who are interested in implicit biases and who host online IAT experiments. As the IAT experiments are available to everyone with access to the internet, there is a much larger participant pool than in controlled experiments where subjects are selected, resulting in less selection


bias (not to be confused with the cognitive biases the experiments measure). There are 15 IAT studies available on the PI demonstration site, most of which have been collecting data for over 10 years. A list of the IAT studies can

be found in Table 2.1. Datasets for these experiments are provided through an Open Science Framework project [15], and consist of the results from all subjects who participated in an experiment and the study material for said experiment. Different from the IAT experiments of Greenwald et al. [7] is that for some experiments, the study material also consists of images rather than words. From Table 2.1, studies 1 through 13 were used in this thesis, and full descriptions can be found in Appendix A. The transgender study is excluded from our thesis as there is not enough data, since the study only recently started. The president study is excluded as well, as its study material changed whenever a new president of the United States of America was elected, which resulted in many different biases instead of one. The skintone and race studies are combined in our thesis, as they measure the same bias once we transform their study materials to fit our work, which is explained in detail in Section 3.1.1.

No. | Study name | Years | Size | Concept stimuli | Attribute stimuli
1 | Age | 2002-2019 | 2.709.678 | Young vs. Old | Pleasant vs. Unpleasant
2 | Arab-Muslim | 2004-2019 | 977.626 | Arab-Muslim vs. Other people names | Pleasant vs. Unpleasant words
3 | Asian | 2004-2019 | 1.108.403 | Asian vs. White faces | American vs. Foreign landmarks
4 | Disability | 2004-2019 | 1.001.198 | Disabled vs. Abled abilities | Pleasant vs. Unpleasant words
5 | Gender-Career | 2005-2019 | 2.688.547 | Male vs. Female names | Career vs. Family words
6 | Gender-Science | 2003-2019 | 1.873.129 | Male vs. Female family members | Science vs. Arts
7 | Native-American | 2004-2019 | 591.795 | Native vs. White faces | American vs. Foreign landmarks
8 | Race | 2002-2019 | 875.209 | European-American vs. African-American names | Pleasant vs. Unpleasant words
9 | Religion | 2004-2009 | 292.782 | Jewish vs. Other religious items | Pleasant vs. Unpleasant words
10 | Sexuality | 2004-2019 | 3.752.026 | Heterosexual vs. Homosexual couples | Pleasant vs. Unpleasant words
11 | Skintone | 2004-2019 | 2.555.866 | Light-skinned vs. Dark-skinned faces | Pleasant vs. Unpleasant words
12 | Weapons | 2004-2019 | 1.535.705 | Light-skinned vs. Dark-skinned faces | Harmless objects vs. Weapons
13 | Weight | 2004-2019 | 2.786.468 | Overweight vs. Thin faces | Pleasant vs. Unpleasant words
14 | Transgender | - | - | - | -
15 | Presidents | - | - | - | -

Table 2.1: List of the studies available on Project Implicit. Each row describes a dataset; the 'Years' column indicates the period over which the data has been collected and the 'Size' column indicates how many experiments were done during these years.

2.3 Word-Embedding Association Test

One of the two methods of measuring bias in word embeddings that will be evaluated in this thesis is the Word-Embedding Association Test (WEAT) [6]. The WEAT is the result of Caliskan et al. translating the IAT to word embeddings with the purpose of measuring whether they contain biases. A word embedding is


a representation of text in which words are mapped to vectors of real numbers. There are different methods to generate these mappings, although they all share one essential trait: words with a similar semantic meaning will be close together in the vector space. In their research, Caliskan et al. used the GloVe method to learn these mappings [16]. The GloVe method is built on the idea that semantic relationships between words can be extracted from the probability that they occur together in a corpus. It separates itself from other word embedding techniques in that it does not depend on local statistics (context information of words) but on global statistics (co-occurrence of words). Caliskan et al. used pre-trained GloVe embeddings for their research, which simplifies the reproduction of their results and ensures impartiality.

Whereas the IAT uses the difference in reaction time for pairing words from the concept sets with the attribute sets to calculate the IAT effect, the WEAT uses the distance between a pair of vectors to calculate the WEAT effect. This distance is defined as the cosine similarity of the two vectors. Given two vectors \vec{x} and \vec{y}, the cosine similarity is calculated as follows:

\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}    (2.1)

The association between a concept and the attributes is then calculated as the difference between the mean distances to the two attribute sets. Given two sets of attribute words A and B, the association of a word w with the attributes is calculated as follows:

s(w, A, B) = \mathrm{mean}_{a \in A} \cos(\vec{w}, \vec{a}) - \mathrm{mean}_{b \in B} \cos(\vec{w}, \vec{b})    (2.2)

Comparable to the IAT, effect sizes are calculated from the difference in mean associations divided by the pooled standard deviation. Given two target sets X and Y, the effect size is calculated as follows:

d = \frac{\mathrm{mean}_{x \in X}\, s(x, A, B) - \mathrm{mean}_{y \in Y}\, s(y, A, B)}{\mathrm{std\text{-}dev}_{w \in X \cup Y}\, s(w, A, B)}    (2.3)

In their paper, Caliskan et al. try to replicate the findings from 8 well-known IAT

experiments [7]–[11] through the use of the WEAT on word embeddings. Using

the same stimuli as the researchers used during their IAT experiments, Caliskan et al. calculate effect sizes using the WEAT. For all studies, the IAT effect sizes and the WEAT effect sizes are evaluated as a large effect. However, these effect sizes can not be interpreted in the same fashion. The IAT experiments are done in a


controlled setting with a select few subjects, whilst the WEAT experiments are done through the use of a large corpus: a collection of texts written by many people. The IAT can be and has been used to draw conclusions about individual subjects or the population of which the subjects are part. The WEAT is only able to draw conclusions about the subjects (who created the corpus) as a collective, with no possibility to observe individual differences. Furthermore, the variables used to calculate the IAT and WEAT effect sizes (reaction time and cosine similarity, respectively) measure completely different things. Concluding, the reported IAT and WEAT effect sizes may seem similar but could very well carry different connotations. Therefore, a direct comparison between IAT effect sizes and WEAT effect sizes is not justifiable.
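To make Equations 2.1-2.3 concrete, here is a minimal NumPy sketch (this is not Caliskan et al.'s implementation; the toy embedding dictionary and word lists below are hypothetical stand-ins for GloVe vectors and IAT stimuli):

```python
import numpy as np

def cosine(x, y):
    """Equation 2.1: cosine similarity of two vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def association(w, A, B, emb):
    """Equation 2.2: mean similarity of w to attribute set A minus that to B."""
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    """Equation 2.3: difference in mean associations over the pooled standard deviation."""
    s_X = [association(x, A, B, emb) for x in X]
    s_Y = [association(y, A, B, emb) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Toy 3-dimensional embeddings; in the thesis these would be 300-dimensional GloVe vectors.
rng = np.random.default_rng(0)
words = ["rose", "tulip", "ant", "wasp", "lovely", "peace", "rotten", "agony"]
emb = {w: rng.normal(size=3) for w in words}
print(weat_effect_size(["rose", "tulip"], ["ant", "wasp"],
                       ["lovely", "peace"], ["rotten", "agony"], emb))
```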

2.4 Positive Operator Association Test

The second method of measuring bias in word embeddings that will be evaluated in this thesis is the Positive Operator Association Test (POAT) [13]. The POAT was developed as an alternative to the WEAT, as Bosscher criticizes the fact that the IAT and the WEAT differ greatly in their methodology. He states that whilst the IAT uses a categorization task to measure associations between concepts and attributes, the WEAT uses similarity to measure these associations. The relationship between the concepts and attributes, when distance is used as a measure, is symmetrical. Such a symmetrical measure cannot capture the same meaning as the asymmetric measure used in the IAT. This was his motivation to create a method which can measure bias in word embeddings through the use of categorization information.

Whilst the WEAT uses the difference between the mean distances to the attributes as a measure of association, Bosscher introduces a new measure: graded hyponymy. Hyponymy is the semantic is-a relationship between hyponyms and hypernyms (e.g., an eagle is a hyponym of its hypernym bird). Using this hyponymy, a word can be described by its set of hyponyms. Bosscher used WordNet, a large human-curated lexical database [17], to construct such sets. The word, now represented as its set of hyponyms, is then translated to a vector-based model.

Bosscher does this through the notion of a positive operator². He uses positive operators as they have an ordering, called the Löwner ordering [18]. This Löwner ordering is defined by Equation 2.4, which Bosscher interprets as hyponymy:

A < B \iff B - A \text{ is also positive}    (2.4)

² https://encyclopediaofmath.org/wiki/Positive_operator

Some words are not fully expressed through hyponymy (e.g., a dog could be


a pet, but not all dogs are pets), which requires some form of graded hyponymy. Bosscher uses a measure of graded hyponymy proposed by Lewis [19], K_E, which measures the proportional relationship between a positive operator A and an error matrix E. This error matrix E is constructed from the difference between two positive operators, B - A, and the graded hyponymy is calculated as follows:

A \sqsubseteq_{K_E} B = 1 - \frac{\|E\|}{\|A\|}    (2.5)

The measure A ⊑_{K_E} B indicates how much the positive operator A is a subset of the positive operator B, and thus how much the word represented by A is a hyponym of the word represented by B. The difference in graded hyponymy is then used to measure the association between a concept and the attributes. Given two sets of attribute words A and B, the association of a word w with the attributes is calculated as follows:

s(w, A, B) = \mathrm{mean}_{a \in A}\,(w \sqsubseteq_{K_E} a) - \mathrm{mean}_{b \in B}\,(w \sqsubseteq_{K_E} b)    (2.6)
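As a rough illustration of Equation 2.5 (this is not Bosscher's implementation; taking E to be the part of B - A associated with its negative eigenvalues is only one possible reading of Lewis's K_E measure, and the matrices below are hypothetical):

```python
import numpy as np

def ke_hyponymy(A, B):
    """Equation 2.5: 1 - ||E|| / ||A||, with E a positive 'error' operator
    extracted here from the negative eigenvalues of B - A."""
    eigvals, eigvecs = np.linalg.eigh(B - A)      # B - A is symmetric, so eigh applies
    neg = np.minimum(eigvals, 0.0)                # only the violations of A < B remain
    E = eigvecs @ np.diag(-neg) @ eigvecs.T       # positive error operator
    return 1.0 - np.linalg.norm(E) / np.linalg.norm(A)

# Hypothetical positive operators built from outer products (v v^T is positive semi-definite)
rng = np.random.default_rng(1)
v, w = rng.normal(size=3), rng.normal(size=3)
A = np.outer(v, v)
B = np.outer(v, v) + np.outer(w, w)
print(ke_hyponymy(A, B))   # 1.0, since B - A is itself positive and the error part vanishes
```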

Similarly to the WEAT, POAT effect sizes are calculated from the difference in mean associations divided by the pooled standard deviation. Given two target sets X and Y, effect sizes are calculated as follows:

d = \frac{\mathrm{mean}_{x \in X}\, s(x, A, B) - \mathrm{mean}_{y \in Y}\, s(y, A, B)}{\mathrm{std\text{-}dev}_{w \in X \cup Y}\, s(w, A, B)}    (2.7)

To assess the POAT, Bosscher takes the same approach as Caliskan et al. did in their paper. Using the same stimuli and corpus that Caliskan et al. used for their research, Bosscher calculated the POAT effect sizes. He compared the effect sizes of the IAT, WEAT and POAT experiments, from which he concludes that the POAT does a better job at replicating the IAT results, as for certain experiments the difference between the POAT effect sizes and the IAT effect sizes is smaller than the difference between the WEAT effect sizes and the IAT effect sizes³. Again, as already stated in Section 2.3, comparing the effect sizes is not justifiable, as different metrics have been used to calculate them. The goal of this thesis is to build a methodology that does allow us to compare the results of different methods of measuring bias in word embeddings. In the next chapter the approach used will be explained.

³ Different POAT configurations (e.g., another measure of graded hyponymy, different


3 | Method

The following chapter will discuss the implementation of the predictive models. As mentioned before, the goal is to create statistical models that attempt to predict the data from IAT experiments using the data of the different NLP methods. Firstly, the process of obtaining the data for the individual cognitive biases will be explained. Afterwards, an explanation for the process of obtaining the data from the NLP methods will be given. Finally, the model selection will be explained.

All work has been done in Python 3 [20]. To manipulate and analyze large datasets, the Pandas library was used [21]. For computations, the NumPy package was used [22]. The statistical models were implemented through the use of three packages: scikit-learn for the linear and polynomial models [23], XGBoost for the gradient boosted model [24] and the Statsmodels module for the linear mixed model [25]. All code and data used can be requested by email¹.

¹ jim.buissink@student.uva.nl

3.1 Individual cognitive biases

As mentioned in Section 2.2, we are using the data from 13 different IAT studies which can be found on PI. However, some alterations are in order before the data can be used.

3.1.1 Replacing images

Unlike the experiments from Greenwald et al. [7], where the sole task was to categorize words, participants in the PI experiments were also asked to categorize images. For 10 out of the 13 studies, the study material contained images in either of the two types of sets (concept and attribute). This presents us with a problem, as the general goal of this thesis is to do research on text corpora. Therefore, substitutions were made for all of the images that were used during the PI experiments. The substitutions need to satisfy two requirements. First, the original



concept that the image represents needs to be preserved. Second, the influence of an implicit personal preference for specific terms has to be minimized. As we are measuring the presence of bias in word embeddings, it is important that we do not add any extra bias from our own implicit attitudes to the stimuli.

During this substitution process we encountered three possible scenarios, and the following procedures were applied:

1. The image depicts an object or place. In this scenario the image is substituted with a frequently used word that describes the object, or with the place name. An example of an image from the religion IAT is given in Figure 3.1. In this example, we substituted the image with the word dreidl.

Figure 3.1: An image depicting a dreidl, which is then replaced by the word dreidl.

2. The image depicts a person of a specific ethnic group. In this scenario the image is substituted with a first name commonly used for people of the depicted ethnic group. The names are chosen randomly from a list provided by Caliskan et al. [6], who in turn used the names from the original IAT experiments but removed names that did not occur frequently enough in the corpus. An exception is made for the Asian IAT, where such a list was absent. For this study, 6 names (3 male, 3 female) were generated through a random name generator [26]. An example of an image from the Asian IAT is given in Figure 3.2. In this example, the image is substituted with the name Takei.

Figure 3.2: An image depicting someone of Asian heritage, which is replaced by the name Takei


3. The image depicts a more ambiguous concept. In this scenario the image is substituted with a word chosen from a list containing the concept and its synonyms. The list of synonyms is obtained from an online thesaurus [27]. The thesaurus is sorted on relevancy, with the most relevant synonyms at the top, from which the required number of words was chosen in descending order. An example of an image from the weight IAT is given in Figure 3.3. In this example, an image from the 'fat' concept set is being replaced. From the list of synonyms for the word 'fat' (big, bulging, bulky, ..., weighty, whalelike), the first and thus most relevant item is picked: big.

Figure 3.3: An image depicting an overweight person, which is replaced by the word big.

Through this process, all of the images from the PI study materials have been replaced with corresponding words. For the complete list of replacements, see Appendix A.

3.1.2 Cleaning data

The initial publication by Greenwald et al. [7] reports the results of experiments done in a controlled environment, where researchers ensure that noise in the data is kept to a minimum. For example, only data from subjects who finish the entire experiment is used in their computations, and subjects cannot be distracted during a trial and start a different task. In contrast to this, we are using the results of experiments that are done over the internet. Subjects who start an experiment can end prematurely or pause during a trial, which leads to incomplete and incorrect results. To adapt the IAT scores to this new way of conducting IAT experiments, Greenwald et al. [28] proposed a new, improved scoring algorithm. Two extra steps that are added to the conventional algorithm are as follows:

1. "Eliminate trials with latencies > 10,000 milliseconds"


2. “Eliminate subjects for whom more than 10% of trials have latency less than 300 milliseconds”

The first step deals with subjects that paused in one way or another during a trial, as the resulting enlarged latency for a single trial would shift the mean latency of the entire block. The second step deals with subjects that guessed during trials, as a latency of less than 300 milliseconds is deemed too fast a response. Subjects with too many of these fast responses show increased error rates and are eliminated from the results.

We applied these selection criteria to our dataset, as we want to exclude these inaccuracies from our data. Because the datasets do not contain information on separate trials, but rather an average over all trials, an adaptation to the first criterion has been made: subjects whose reported standard deviations of the response times fall in the upper quartile are eliminated. Additionally, subjects with missing values are eliminated as well. The corresponding

pseudocode can be seen in Algorithm 1:

Algorithm 1: Data cleaning

for all subjects in dataset do
    if percentage of trials under 300 ms > 10 then
        delete subject
    else if IAT score = NaN then
        delete subject
    else
        for all blocks do
            if std of latency in block is in Q3 then
                delete subject
            end if
        end for
    end if
end for

From the resulting dataset, the calculated IAT score for each subject is used as the predicted value in the model. In the next section the approach for the calculation of the WEAT and POAT scores is explained.

3.2 NLP measurements

To obtain our predictor variables, we have to apply the WEAT and the POAT to a corpus, using the same stimuli as were used in the PI experiments. The implementation of the POAT² was provided by Jelle Bosscher, and the WEAT was reimplemented by Oskar van der Wal³.

² Available on
³ A PhD student in the Bias Barometer research group, https://bias-barometer.github.io/

For the choice of the corpus to which we apply the WEAT and the POAT, we follow in the footsteps of the researchers that developed the methods. Both the original WEAT [6] and POAT [13] papers used pretrained GloVe embeddings [16], specifically an embedding trained on the Common Crawl corpus. This corpus has been obtained through a large-scale crawl of the web, resulting in 840 billion tokens. These tokens are case-sensitive, which results in 2.2 million unique ones. As this is also the largest corpus available, it should be the most representative of the general public, as many authors from all over the world helped to create it. For these reasons, the Common Crawl GloVe embeddings were also used for the calculation of the predictor variables.
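As an illustration, the pretrained vectors can be loaded from the published glove.840B.300d.txt file roughly as follows (a minimal sketch under the assumption of 300-dimensional vectors; the file path is an example and no error handling is included):

```python
import numpy as np

def load_glove(path, dim=300):
    """Read a GloVe text file into a {word: vector} dictionary.
    Splitting from the right handles the few tokens that contain spaces."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])
            embeddings[word] = np.asarray(parts[-dim:], dtype=float)
    return embeddings

# emb = load_glove("glove.840B.300d.txt")   # roughly 2.2 million case-sensitive tokens
```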

As our predictor variables, we use the differential association of the two target sets with the attribute sets. Given two sets of target words X and Y and two sets of attribute words A and B, this measurement is obtained as follows:

s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)    (3.1)

We already showed how to calculate the association of a word with the attributes, s(w, A, B), for the WEAT in Equation 2.2 and for the POAT in Equation 2.6. Positive values of s(X, Y, A, B) mean that the target set X is more associated with attribute set A than Y is associated with A. For negative values, the opposite is true. For each study in our dataset, we calculated the WEAT and POAT measures using the corresponding stimuli; these serve as our predictor variables.

3.3 Model Selection

Having discussed how we obtain the predictor and predicted variables, the next part of this chapter explains the reasoning behind the model selection. The goal of the model is to predict a continuous dependent variable from a continuous independent variable. For this reason, a regression model seems most suitable. There are several different types of regression models available, and we will fit four different types of models to our data. This results in eight different models in total: four using the WEAT measurements as predictor variables and four using the POAT measurements as predictor variables. In the following subsections, our motivation for using the different types of regression models is explained.



Linear Model

The first model is a linear regression model. Linear regression is one of the most basic types of regression analysis and serves as a starting point; it is the most straightforward model to use and to understand. Only if the predictor and the predicted variables are linearly related do we expect the linear model to perform fairly well. The linear model was implemented using the LinearRegression class from the scikit-learn Python package [23].

Polynomial Model

The second model is a polynomial regression model. Whilst polynomial regression is technically still a type of linear regression, it does allow for a nonlinear relationship between the dependent and independent variable. Polynomials can be used to model curvature, whereas with linear regression a straight line is fitted to the data. The polynomial model will therefore fit the data more closely than the linear model when the relationship between the variables is nonlinear. Different polynomial orders were tested and similar results were found for k ≥ 3; therefore, a polynomial of degree k = 3 was used. This was done using the LinearRegression and PolynomialFeatures classes from the scikit-learn Python package [23].
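A minimal sketch of how these first two models can be fitted with scikit-learn; the weat and iat_scores arrays are hypothetical stand-ins for the predictor and predicted variables:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: one WEAT measurement per subject and that subject's IAT score
weat = np.array([[0.046], [0.049], [0.102], [0.134], [0.097]])
iat_scores = np.array([0.41, 0.03, 0.20, 0.44, 0.39])

linear = LinearRegression().fit(weat, iat_scores)
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(weat, iat_scores)
print(linear.score(weat, iat_scores), poly.score(weat, iat_scores))  # R^2 on the training data
```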

Linear Mixed Model

The third model is a linear mixed model (LMM) [29]. LMMs are extended versions of simple linear models and allow for both fixed and random effects. They are generally used when there is some form of hierarchical structure in the data. In our dataset, the IAT scores vary per subject, as a different score is reported for each subject; even within the same study there is a lot of variability in the IAT scores. On the contrary, the WEAT and POAT measurements are repeated measurements with no variability within a specific study. To deal with this hierarchical data, we could simply take the average IAT score over all subjects of a study, but it is better to take advantage of all the data that we have. Therefore we implemented an LMM in which the study is modelled as a random effect. The LMM was implemented using the Statsmodels package [25].
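One way to express such a model with Statsmodels is a random intercept per study; a sketch with a tiny hypothetical long-format DataFrame (one row per subject, with the study's WEAT measurement repeated within each study):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data; the real fit uses the full cleaned PI dataset
df = pd.DataFrame({
    "study": ["Age"] * 3 + ["Disability"] * 3 + ["Religion"] * 3,
    "weat": [0.046] * 3 + [0.134] * 3 + [-0.048] * 3,
    "iat_score": [0.35, 0.48, 0.41, 0.52, 0.39, 0.44, -0.10, -0.25, -0.15],
})

# Fixed effect for the WEAT measurement, random intercept for each study
model = smf.mixedlm("iat_score ~ weat", data=df, groups=df["study"])
result = model.fit()
print(result.params)
```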

Gradient boosting

The final model is more of an ensemble of models than a single specific model. XGBoost [24] is an all-round algorithm typically used to solve regression predictive modeling problems. It revolves around training and combining individual models to obtain single predictions. It is praised for its fast training speed


and excellent performance, and can be considered the current state of the art. As the results of this model would be difficult for us to interpret, we use it mainly to see what is possible, and it serves as a kind of gold standard. The XGBoost model was implemented using the corresponding XGBoost package.
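A sketch of this last model with the xgboost package, reusing the hypothetical weat and iat_scores arrays from the scikit-learn example above:

```python
import numpy as np
from xgboost import XGBRegressor

weat = np.array([[0.046], [0.049], [0.102], [0.134], [0.097]])
iat_scores = np.array([0.41, 0.03, 0.20, 0.44, 0.39])

# A small gradient boosted ensemble of regression trees
booster = XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
booster.fit(weat, iat_scores)
print(booster.score(weat, iat_scores))   # R^2 on the training data
```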


4 | Results

In this chapter the performance of the two methods discussed in prior chapters, and consequently the performance of the chosen predictive models, is evaluated. First, a description of the dataset is presented. Second, the relationship between the variables of the dataset is identified through a correlation matrix. Third, the predictions made by the models are evaluated. Finally, after mentioning an issue with the current dataset regarding similarity in the word sets, a truncated dataset is presented. From this dataset, a new correlation matrix was calculated and its predictions evaluated.

4.1 Complete dataset

A description of the complete dataset can be found in Table 4.1. Each row indicates a particular study. Whilst the table only reports the average IAT score, the dataset consists of the individual scores for each subject. The IAT scores range from -1.953 to 1.922 (µ = 0.314, σ = 0.432). Using the stimuli from the PI studies, the WEAT measurements (µ = 0.049, σ = 0.033) and POAT measurements (µ = 0.011, σ = 0.019) were calculated. Already visible from the data is that for most studies, the signs of the predictor and predicted variables agree.

To understand the relationship between the variables, their correlation is examined. A correlation matrix can be found in Table 4.2. There is a very weak linear relationship between the IAT scores and the POAT measurements (ρ = 0.043), and a stronger, though still weak, linear relationship between the IAT scores and the WEAT measurements (ρ = 0.123). This does not indicate the absence of a relation between the variables, but rather that the relationship is not linear.


No. | Target words | Attribute words | NT | NA | N | IAT | WEAT | POAT
1 | Young vs. Old | Pleasant vs. Unpleasant | 6 x 6 | 8 x 8 | 571.006 | 0.414 | 0.046 | 0.001
2 | Arab-Muslim vs. Other names | Pleasant vs. Unpleasant | 10 x 10 | 8 x 8 | 180.837 | 0.030 | 0.049 | -0.003
3 | Asian vs. European-American names | Foreign vs. American landmarks | 4 x 4 | 4 x 4 | 172.287 | 0.202 | 0.102 | 0.007
4 | Disabled vs. Abled | Pleasant vs. Unpleasant | 4 x 4 | 6 x 6 | 163.198 | 0.440 | 0.134 | 0.020
5 | Male vs. Female names | Career vs. Family | 5 x 5 | 7 x 7 | 566.949 | 0.390 | 0.097 | 0.058
6 | Male vs. Female | Science vs. Arts | 8 x 8 | 7 x 7 | 386.496 | 0.327 | 0.054 | 0.005
7 | European-American vs. Native-American names | American vs. Foreign places | 9 x 9 | 5 x 5 | 101.215 | 0.139 | -0.008 | -0.004
8 | Judaism vs. Other religions | Pleasant vs. Unpleasant | 5 x 5 | 8 x 8 | 74.493 | -0.176 | -0.048 | 0.007
9 | Heterosexual vs. Homosexual | Pleasant vs. Unpleasant | 4 x 4 | 8 x 8 | 562.218 | 0.221 | 0.028 | 0.015
10 | European-American vs. African-American names | Pleasant vs. Unpleasant | 12 x 12 | 8 x 8 | 666.902 | 0.303 | 0.024 | 0.015
11 | African-American vs. European-American names | Weapons vs. Harmless | 12 x 12 | 5 x 5 | 428.394 | 0.333 | 0.038 | 0.005
12 | Underweight vs. Overweight | Pleasant vs. Unpleasant | 10 x 10 | 8 x 8 | 581.974 | 0.402 | 0.038 | -0.008

Table 4.1: Summary of the complete dataset. Each row represents a different bias that was studied by PI. NT: number of target words. NA: number of attribute words. N: number of subjects. Reported are the average IAT score (IAT), the WEAT measurement (WEAT) and the POAT measurement (POAT).

     | IAT   | WEAT  | POAT
IAT  | 1     | 0.123 | 0.043
WEAT | 0.123 | 1     | 0.587
POAT | 0.043 | 0.587 | 1

Table 4.2: Correlation matrix for all variables from the dataset described in Table 4.1. The correlation coefficients are measured with Pearson's r.

The evaluations of the predictions made by the different models can be found in Table 4.3. The performance is evaluated through the explained variance, R². As an extra measure, the correlation coefficient between the predictions and the true values is reported. Using the WEAT measurements resulted in better or equal predictions for every model compared to the predictions made using the POAT measurements. The mixed-effects model made the best predictions overall, albeit that the difference with the XGBoost model's predictions is negligible; the performance of these two models is to be deemed equivalent.

   | Linear model  | Polynomial model | Mixed model   | XGBoost model
   | WEAT  | POAT  | WEAT  | POAT     | WEAT  | POAT  | WEAT  | POAT
R² | 0.015 | 0.002 | 0.027 | 0.005    | 0.071 | 0.070 | 0.071 | 0.071
r  | 0.123 | 0.043 | 0.164 | 0.079    | 0.265 | 0.265 | 0.265 | 0.266

Table 4.3: Summary of the evaluations of the predictive models. For every model, the predictions using the WEAT measurements and the predictions using the POAT measurements are evaluated. The first row reports the explained variance (R²) and the second row reports the correlation coefficient measured in Pearson's r.


4.2 Truncated dataset

The results in Table 4.3 show poor performance for all experiments. One possible cause is that the substitution of images with words created word sets in which some words are not similar to their neighbours, and thus not to the concept the set is trying to portray. If too many of the words in a set are dissimilar, the word set might end up portraying a completely different concept, or no concept at all. To investigate whether this was the case, the WEAT has been used to identify such word lists. Let X ∪ A be the combination of the first set of target words X and the first set of attribute words A. Let Y ∪ B be the combination of the second set of target words Y and the second set of attribute words B. To determine whether a study needs to be deselected, the average similarity of its combined sets is compared against a threshold.

TX∪A = P wX∪As(w, X ∪ A, Y ∪ B) kX ∪ Ak − σX∪A TY ∪B = P wY ∪Bs(w, X ∪ A, Y ∪ B) kY ∪ Bk + σY ∪B (4.1) where s(w, X ∪ A, Y ∪ B) is the similarity between a word and the set it belongs to (e.g., if w is an attribute, to the attribute sets A and B), and σ is the standard deviation of similarity of the combined set. The thresholds for the complete dataset

are TX∪A = 0.136 and TY ∪B = −0.097. After the inner similarity for word lists

was evaluated, four studies were deemed unsatisfactory. These are: the Gender-Science, the Weight, the Skintone and the Weapon study. The results for all studies can be found in Table 4.4.


Study | Similarity F | Similarity G
Thresholds | 0.136 | -0.07
Arab | 0.295 | -0.099
Asian | 0.187 | -0.429
Gender-Career | 0.176 | -0.238
Sexuality | 0.139 | -0.449
Religion | 0.34 | -0.238
Gender-Science * | 0.12 | -0.109
Native-American | 0.467 | -0.363
Weight * | 0.066 | -0.139
Disability | 0.411 | -0.257
Skintone * | 0.321 | -0.081
Weapon * | 0.321 | -0.081
Age | 0.210 | -0.198

Table 4.4: Average similarity of all words in the combined sets F = X ∪ A and G = Y ∪ B for each study. Studies whose average similarity does not pass the thresholds of Equation 4.1 are marked with * and were removed from the truncated dataset.

After the removal of the unsatisfactory studies, a new correlation matrix was calculated, which can be seen in Table 4.5. This correlation matrix shows larger correlation coefficients for all variables compared to the correlation coefficients of the previous matrix in Table 4.2. This indicates that using the truncated dataset might also improve the predictive quality of the models.

     | IAT   | WEAT  | POAT
IAT  | 1     | 0.200 | 0.127
WEAT | 0.200 | 1     | 0.590
POAT | 0.127 | 0.590 | 1

Table 4.5: Correlation matrix for the variables from the truncated dataset described in Section 4.2. The correlation coefficients are measured with Pearson's r.

Table 4.6 summarizes the evaluation using the truncated dataset. Comparing the results with Table 4.3, all values improve on their counterparts. Both tables show that the WEAT measurements add more value to the predictive models than the POAT measurements do, as the reported R² values are at least equal and mostly higher. Additionally, the mixed-effects model and the XGBoost model have the best scores overall in both tables.


   | Linear model  | Polynomial model | Mixed model   | XGBoost model
   | WEAT  | POAT  | WEAT  | POAT     | WEAT  | POAT  | WEAT  | POAT
R² | 0.040 | 0.016 | 0.050 | 0.016    | 0.109 | 0.109 | 0.108 | 0.108
r  | 0.200 | 0.128 | 0.223 | 0.128    | 0.330 | 0.330 | 0.329 | 0.329

Table 4.6: Summary of the evaluations of the predictive models, using the truncated dataset from Table 4.4. For every model, the predictions using the WEAT measurements and the predictions using the POAT measurements are evaluated. The first row reports the explained variance (R²) and the second row reports the correlation coefficient measured in Pearson's r.


5 | Discussion

In this chapter we discuss the results from Chapter 4. First, we compare the findings of the different statistical models with each other. Afterwards, we highlight some weaknesses of our methodology. For this discussion, we reflect on the quality of the word lists. We then introduce some issues of the WEAT concerning manipulation of the word sets, using a paper by Ethayarajh et al. [30]. Lastly, an explanation for the poor performance of the POAT is given regarding the absence of hyponyms.

5.1 Comparing the regression models

From the results presented in Table 4.3 and Table 4.6, we make two noteworthy observations. The first is that, despite the low explained variance scores, there is a small but consistent difference between the linear and the polynomial model, in favor of the latter. This is a good indication that we are indeed dealing with a nonlinear relationship, which makes it harder to intuitively understand how the variables interact. The second is that the LMM and the XGBoost model report the same evaluation scores. A possible explanation is that both models fitted the exact same function to the data; however, because of the complex nature of the XGBoost model, it remains unclear whether this was the case. The purpose of including the XGBoost model was to create a kind of gold standard, as it is considered state of the art. Because the LMM is less complex but achieves the same predictive quality, we consider the LMM the best match for our problem.

5.2 Evaluating word lists

As previously seen in Section 4.2, we used the WEAT to check whether the words from a target or attribute set are more related to the other words within the same set or to the words of its counterpart. When too many words are more related to the words of the opposite set, the target or attribute set loses its meaning. This


amount depends on the size of the set and the strength of the similarity. The results in Table 4.4 demonstrate that this occurred for four of the tested studies. For three of these four studies, namely the weight, skintone and weapon studies, images from the original study materials were replaced with words. These substitutions could account for the loss of meaning, although the loss also occurred for the gender-science study, for which no substitutions were made. For this study, insignificant similarity was present for all types of word sets. The results in Table 4.6 show an increased performance for all models when these flawed studies were removed from the dataset. Applying this knowledge to the way in which IAT experiments are conducted, we see a use for the WEAT in evaluating the quality of word lists before they are used in experiments. If IAT experiments are conducted with flawed target and attribute sets, the IAT data could become noisier, which then hurts the predictive performance of the models.

5.3 Issues with WEAT

To reflect on a possible cause of the poor performance of the models when the WEAT data is used, work by Ethayarajh et al. [30] will be discussed. In their paper, Ethayarajh et al. question the validity of the WEAT for measuring bias in word embeddings. One shortcoming of the WEAT they point out is that, during the development of the attribute sets, one could cherry-pick words to obtain a desired association. As an example, they test whether the word 'dog' has a stronger association with male words than with female words, relative to the word 'cat'. Without too much effort, one could construct several different reasonable sets for each attribute. The word choice, however, can result in completely different associations. An example of such a manipulation can be seen in Table 5.1.

Target Word Sets | Attribute Word Sets | Effect Size | p-value | Outcome
{dog} vs. {cat} | {masculine} vs. {feminine} | 2.0 | 0.0 | male-associated
                | {actress} vs. {actor} | -2.0 | 0.5 | inconclusive
                | {womanly} vs. {manly} | 2.0 | 0 | female-associated

Table 5.1: This table shows that the WEAT can be manipulated to obtain desired associations through a clever choice of attribute words. Note that the effect sizes are positive for both the male-association and the female-association, as the ordering of the attributes is reversed. The effect size is negative when the attributes are 'actress' and 'actor', but the outcome is statistically insignificant. Similar outcomes are obtained when the size of the word sets is increased.

Linking this side effect of the WEAT to the work done in this thesis, it stresses that the substitution of images into word sets asks for as little interference from the person doing the substitutions as possible. As the choice of words heavily influences the outcome of the test, we should limit the researchers' degrees of freedom when it comes to the development of word sets. Moreover, as a small alteration in the word sets can drastically change the results, the WEAT has poor generalizability.

5.4 Issues with POAT

Not only did the usage of the WEAT data result in poor predictions, usage of the POAT data performed even worse. To give a possible explanation for this, we once more take a look at the word sets. Bosscher developed the POAT because he stated that the IAT is a categorization task, and that a test for bias in word embeddings should behave in a similar manner. In his work, he suggested that the notion of categorization is directly related to hyponymy. Whilst these type-of relationships mostly hold for nouns and verbs, they do not apply to proper nouns and adverbs. If we look at the study materials in Appendix A, 7 out of the 13 studies (studies 1, 3, 4, 8, 9, 10, 12) use proper nouns in one or more of the word sets, and 8 out of the 13 studies (studies 3, 4, 5, 6, 7, 10, 11, 13) use adverbs. It is likely that studies with such word sets will not perform well under the POAT, as no or few hyponyms can be found, which in turn results in inaccurate or incorrect predictions when their outcomes are used in the predictive models.

In his thesis, Bosscher also encountered poor performance for studies where the average number of hyponyms of the target set and the attribute set differ a lot. He tested another version of the POAT, this time without gathering hyponyms for words but still using positive operators. Comparing the two versions, the version without hyponyms reported effect sizes closer to the ones found in the IAT experiments, but only for the studies on which the other version did not perform well. However, that version of the POAT loses its original intention of creating a test revolving around categorization. Concluding, the POAT version which does use hyponymy only performs well on certain word sets (where the numbers of hyponyms are equally distributed among the word sets) and the POAT version without hyponymy does not work in its intended way. As the word sets used in our research consist mostly of proper nouns and adverbs, the POAT is not a suitable test for predicting IAT scores.


6 | Conclusion

Our goal for this thesis was to find an answer to the question: Is it possible to create a methodology that allows us to compare different methods of measuring bias in word embeddings? We approached this problem by using regression models to predict the cognitive biases found in IAT experiments from the measurements of the WEAT and the POAT. We found that the models that used the WEAT measurements were able to make equal or better predictions than the models that used the POAT measurements. Therefore, we can conclude that it is possible to compare different methods of measuring bias in word embeddings and to evaluate their ability to measure bias, albeit that the significance of the resulting predictions is limited.

Nonetheless, there is still a lot of room for further research on this topic. The data that we used for our predicted variables was obtained from online IAT experiments. The stimuli for these experiments consisted of both images and text. In order to obtain the WEAT and POAT measurements using the same stimuli, we had to substitute the images with corresponding words. The difference in the use of visual and textual stimuli could influence the magnitude of the biases found in test subjects. Therefore, research should be done on the impact of this difference by asking subjects to do two IAT experiments on the same topic: one using visual stimuli and the other using textual stimuli. If there are significant differences in the results, our research should be replicated without the use of IAT experiments that make use of visual stimuli.

Furthermore, as we saw in Section 4.2, the performance of the predictive models is strongly dependent on the quality of the word sets that we use to calculate the WEAT and POAT measurements. This quality is defined through the inner word similarity of the set, which one can calculate using the WEAT. Not only did we encounter unsatisfactory word sets for the studies where we had to alter the stimuli, we also encountered unsatisfactory word sets for studies where we left the stimuli unchanged. This suggests that the IAT experiments made use of word sets which do not accurately translate to the concepts they wish to represent. Thus, further research has to be done on how to improve the


composition of word sets which are used as stimuli in IAT experiments.

Moreover, we have established that regression models using WEAT measurements were able to make better predictions than the models using POAT measurements, as can be seen in Table 4.6. However, these measurements were obtained through the use of word sets which consisted mostly of proper nouns and adjectives. For these words, no or few hyponyms can be found, which hurts the performance of the POAT. For this reason, our research could be replicated with different word sets whose words are well represented in terms of the number of hyponyms they have.

Additionally, only two methods of measuring bias in word embeddings have been taken into consideration in this thesis. There are other methods, for instance the RIPA proposed by Ethayarajh et al. [30], whose critique of the WEAT we discussed in Section 5.3, which have been left out. For us to be able to generalize our methodology to all kinds of such methods, rather than only the WEAT and POAT, our efforts have to be replicated with other methods.

Our research does, however, add some new insights regarding the research topic. We found that one could use the WEAT to evaluate the quality of word sets before IAT experiments are conducted. We developed a methodology to assess the study material, using Equation 4.1; an example of this can be seen in Table 4.4. Also, we found that measurements made by the WEAT and POAT are related to the results of individual IAT experiments, albeit that these relationships are nonlinear. Concluding, using these measurements and the IAT experiments, we showed that it is possible to create theoretically sound models to predict the cognitive biases found in the IAT experiments from the WEAT and POAT measurements. But, as the predictive performance of the models is not significant, our research is merely a starting point and should be improved upon with regard to the topics discussed in this chapter.


Bibliography

[1] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in neural information processing systems, 2013, pp. 3111–3119.
[2] P. D. Turney and P. Pantel, "From frequency to meaning: Vector space models of semantics," Journal of artificial intelligence research, vol. 37, pp. 141–188, 2010.

[3] J. A. Bullinaria and J. P. Levy, “Extracting semantic representations from word co-occurrence statistics: A computational study,” Behavior research methods, vol. 39, no. 3, pp. 510–526, 2007.

[4] M. Stubbs, Text and corpus analysis: Computer-assisted studies of language and culture. Blackwell Oxford, 1996.

[5] S. Barocas and A. D. Selbst, “Big data’s disparate impact,” Calif. L. Rev., vol. 104, p. 671, 2016.

[6] A. Caliskan, J. J. Bryson, and A. Narayanan, “Semantics derived automat-ically from language corpora contain human-like biases,” Science, vol. 356, no. 6334, pp. 183–186, 2017.

[7] A. G. Greenwald, D. E. McGhee, and J. L. Schwartz, “Measuring individual differences in implicit cognition: The implicit association test.,” Journal of personality and social psychology, vol. 74, no. 6, p. 1464, 1998.

[8] B. A. Nosek, M. R. Banaji, and A. G. Greenwald, "Math = male, me = female, therefore math ≠ me.," Journal of personality and social psychology, vol. 83, no. 1, p. 44, 2002.

[9] L. L. Monteith and J. W. Pettit, “Implicit and explicit stigmatizing attitudes and stereotypes about depression,” Journal of Social and Clinical Psychology, vol. 30, no. 5, pp. 484–505, 2011.

[10] M. Bertrand and S. Mullainathan, “Are emily and greg more employable than lakisha and jamal? a field experiment on labor market discrimination,” American economic review, vol. 94, no. 4, pp. 991–1013, 2004.

(32)

BIBLIOGRAPHY 29 of36

[11] B. A. Nosek, M. R. Banaji, and A. G. Greenwald, “Harvesting implicit group attitudes and beliefs from a demonstration web site.,” Group Dynamics: The-ory, Research, and Practice, vol. 6, no. 1, p. 101, 2002.

[12] J. Cohen, Statistical power analysis for the behavioral sciences. Academic press, 2013.

[13] J. M. Bosscher, “Capturing implicit biases with positive operators,” B.S. The-sis, M.S. theThe-sis, FNWI, University of Amsterdam, Amsterdam, Jun. 2020. [14] F. K. Xu, B. A. Nosek, A. G. Greenwald, K. Ratliff, Y. Bar-Anan, E.

Uman-sky, M. R. Banaji, N. Lofaro, and C. Smith, “Project implicit demo website datasets,” 2016.

[15] ——, Project implicit demo website datasets, 2021. doi:https://doi.org/

10.17605/OSF.IO/Y9HIQ.

[16] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543. [17] G. A. Miller, WordNet: An electronic lexical database. MIT press, 1998. [18] K. Löwner, “Über monotone matrixfunktionen,” Mathematische Zeitschrift,

vol. 38, no. 1, pp. 177–216, 1934.

[19] M. Lewis, “Compositional hyponymy with positive operators,” in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 2019, pp. 638–647.

[20] G. Van Rossum and F. L. Drake, Python 3 Reference Manual. Scotts Valley, CA: CreateSpace, 2009, isbn: 1441412697.

[21] W. McKinney et al., “Data structures for statistical computing in python,” in Proceedings of the 9th Python in Science Conference, Austin, TX, vol. 445, 2010, pp. 51–56.

[22] T. E. Oliphant, A guide to NumPy. Trelgol Publishing USA, 2006, vol. 1. [23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,

M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[24] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” Aug.

2016, pp. 785–794. doi: 10.1145/2939672.2939785.

[25] S. Seabold and J. Perktold, “Statsmodels: Econometric and statistical mod-eling with python,” in 9th Python in Science Conference, 2010.

(33)

BIBLIOGRAPHY 30 of36

[26] Language name generator • the ultimate bank of 100,000+ names. [Online].

Available: https : / / blog . reedsy . com / character - name - generator /

language.

[27] The world’s favorite online thesaurus! [Online]. Available: https : / / www .

thesaurus.com/.

[28] A. G. Greenwald, B. A. Nosek, and M. R. Banaji, “Understanding and using the implicit association test: I. an improved scoring algorithm.,” Journal of personality and social psychology, vol. 85, no. 2, p. 197, 2003.

[29] A. Gelman and J. Hill, Data analysis using regression and multilevel/hierarchical models. Cambridge university press, 2006.

[30] K. Ethayarajh, D. Duvenaud, and G. Hirst, “Understanding undesirable word embedding associations,” arXiv preprint arXiv:1908.06361, 2019.

(34)

A | Word Lists

The original study materials can be found on the Project Implicit webpage (https://www.projectimplicit.net/resources/study-materials/). Listed below are the word sets that were used for our research. Studies for which alterations were made have a description of the changes.

1. Gender-Career IAT

• Female stimuli are: Rebecca, Michelle, Emily, Julia, Anna
• Male stimuli are: Ben, Paul, Daniel, John, Jeffrey

• Career stimuli are: Career, Corporation, Salary, Office, Professional, Management, Business

• Family stimuli are: Wedding, Marriage, Parents, Relatives, Family, Home, Children

2. Gender-Science IAT

• Female stimuli are: mother, wife, aunt, woman, girl, female, grandma, daughter

• Male stimuli are: man, son, father, boy, uncle, grandpa, husband, male
• Science stimuli are: astronomy, math, chemistry, physics, biology, geology, engineering

• Liberal Arts stimuli are: history, arts, humanities, english, philosophy, music, literature

3. Arab-Muslim IAT

• Good words are: laughter, happy, joy, love, glorious, pleasure, peace, wonderful

• Bad words are: failure, agony, awful, nasty, terrible, horrible, nasty, evil.


• Arab-Muslim names are: Hakim, Sharif, Yousef, Wahib, Muhsin, Salim, Karim, Habib, Ashraf, Akbar

• Other People names are: Ernesto, Matthais, Maarten, Philippe, Guillame, Benoit, Takuya, Kazuki, Chaiyo, Marcelo

4. Skintone IAT

Original study materials were: 12 dark-skinned (8 male, 4 female) and 12 light-skinned (8 male, 4 female) sketched faces. Changed into 12 random African American names (8 male, 4 female) and 12 random European American names (8 male, 4 female) selected from the list provided by Caliskan et al. (in WEAT 3).

• European American names: Geoffrey, Brett, Brad, Todd, Matthew, Brendan, Neil, Greg, Meredith, Sarah, Anne, Allison

• African American names: Tyrone, Darnell, Kareem, Leroy, Rasheed, Jermaine, Hakim, Jamal, Kenya, Lakisha, Aisha, Keisha

• Good words: laughter, happy, joy, love, glorious, pleasure, peace, wonderful

• Bad words: failure, agony, awful, nasty, terrible, horrible, hurt, evil

5. Age IAT

Original study materials were: 6 young (3 male, 3 female) and 6 old (3 male, 3 female) morphed white faces. Changed into 6 synonyms for the adjective old and 6 synonyms for the adjective young. The most relevant synonyms as provided by Thesaurus.com have been used; when more synonyms were given than necessary, the first items from the list have been picked.

• Old words: aged, ancient, decrepit, elderly, gray, mature

• Young words: budding, inexperienced, new, youthful, adolescent, blooming

• Good words: happy, wonderful, love, pleasure, peace, joy, glorious, laughter

• Bad words: hurt, agony, evil, nasty, terrible, horrible, failure, awful

6. Weight IAT

Original study materials were: 10 fat (5 male, 5 female) and 10 thin (5 male, 5 female) morphed faces of multiple ethnicities. The same face is represented as thin and fat. Changed into 10 synonyms for the adjective fat and 10 synonyms for the adjective thin.


• Fat words: big, bulging, bulky, chunky, heavy, hefty, inflated, large, meaty, obese

• Thin words: delicate, fragile, gaunt, lean, meager, narrow, skinny, slim, small, attenuate, attenuated

• Good words: laughter, happy, joy, love, glorious, pleasure, peace, and wonderful

• Bad words: failure, agony, awful, nasty, terrible, horrible, nasty, and evil

7. Religion IAT

Multiple studies ran on Project Implicit for this IAT. We will be looking at a study that ran from 2004-2009 because it has the largest dataset of the bunch. For this study, the materials were 5 Jewish and 5 other-religion images. The items depicted in the images have been named appropriately.

Exception: Judaism instead of the Star of David.

• Jewish words: shabbat, menorah, dreidel, torah, Judaism
• Other religious words: cross, Buddha, crucifix, Shiva, totem

• Good words: celebrate, cheerful, smiling, friend, laughing, delight, beautiful, magnificent

• Bad words: bothersome, horrible, pain, yucky, rotten, gross, hurtful, hatred

8. Weapon IAT

Original study materials were the same images of faces used in the skintone IAT; these will be replaced using the same names. Furthermore, 14 more images were used, 7 of weapons and 7 of harmless objects. These have been replaced by the items they are depicting, gathered from the filename.

• European American names: Geoffrey, Brett, Brad, Todd, Matthew, Brendan, Neil, Greg, Meredith, Sarah, Anne, Allison

• African American names: Tyrone, Darnell, Kareem, Leroy, Rasheed, Jermaine, Hakim, Jamal, Kenya, Lakisha, Aisha, Keisha

• Weapons: revolver, sword, grenade, rifle, mace, cannon, axe
• Harmless objects: soda, camera, bottle, wallet, ice


9. Native American IAT

Original study materials were: 8 White American (4 male, 4 female) and 8 Native American (4 male, 4 female) historical portraits. These will be replaced by European American names (the same first 4 male and 4 female names as in the skintone IAT) and the 8 largest Native American tribes (Source: U.S. Census Bureau, Census 2010). Furthermore, images of 5 American and 5 foreign natural landmarks (e.g., Grand Canyon and Mt. Everest) have been used; these have been replaced by American states/cities and foreign countries/cities. These words were used in previous versions of the IAT instead of the images currently being used.

• White American names: Geoffrey, Brett, Brad, Todd, Meredith, Sarah, Anne, Allison

• Native American tribes: Navajo, Cherokee, Sioux, Chippewa, Choctaw, Apache, Pueblo, Iroquois

• American words: Miami, Missouri, Ohio, Seattle, Utah
• Foreign words: France, Italy, Moscow, Oslo, Warsaw

10. Race IAT

Original study materials were images of 6 black (3 male, 3 female) and 6 white (3 male, 3 female) young faces. These are replaced by African American names and European American names from the list in the skintone IAT. The first 3 male and 3 female names were picked.

• European American names: Geoffrey, Brett, Brad, Meredith, Sarah, Anne

• African American names: Tyrone, Darnell, Kareem, Lakisha, Aisha, Keisha

• Good words are: joy, happy, laughter, love, glorious, pleasure, peace, wonderful
• Bad words are: evil, agony, awful, nasty, terrible, horrible, failure, hurt

11. Disability IAT

The original study materials used were 8 images (4 abled and 4 disabled) representing abilities in the form of street signs or related representational forms. They are replaced by the words that are depicted in the images. Exceptions are the word disabled for the image of a guide dog and blind for the image of a walking stick.


• Abled words: skiing, walking, running, hiking

• Good words are: joy, love, glorious, pleasure, peace, wonderful
• Bad words are: evil, agony, nasty, terrible, rotten, bomb

12. Asian IAT

Original study materials used were 6 Asian (3 male, 3 female) and 6 white (3 male, 3 female) faces. These will be replaced by the same European American names as listed in the skintone IAT and by random Chinese, Japanese and Hindi first names (3 male and 3 female). Also included were 6 American and 6 foreign landmarks, which were replaced by the countries/cities where they can be found for the foreign landmarks and by the states for the American landmarks. Exception: the Statue of Liberty cannot be captured by a single word (New York), so Chicago has been used instead.

The random names were generated with: https://blog.reedsy.com/character-name-generator/language

• Asian names: Cheng, Bai, Takei, Hayashi, Ayaan, Aishwarya

• European American names: Geoffrey, Brett, Brad, Meredith, Sarah, Anne
• Foreign landmarks: England, Egypt, London, Paris, Australia, Pisa

• American landmarks: California, Washington, Seattle, Hollywood, Chicago, Missouri

13. Sexuality IAT

Original study material consisted of 2 gay (male), 2 lesbian, and 2 heterosexual image representations. These images were replaced by synonyms for the concepts the images represent.

• Gay (male) words: gay, homosexual, homo, homophile
• Lesbian words: gay, homosexual, lesbian, queer

• Heterosexual words: straight, heterosexual, het, hetero

• Good words are: beautiful, superb, joyful, lovely, glorious, pleasure, marvelous, wonderful

• Bad words are: humiliate, agony, awful, nasty, terrible, horrible, tragic, painful

14. President IAT

Original study material consisted of 6 images of headshots of the current president and 10 images of headshots of other recent presidents. As the data collection for this IAT has been ongoing since 2003, there have been 3 current presidents (Trump, Obama and Bush). To find 6 replacement words for all presidents would be impossible, so it is suggested to omit this IAT.
