
Evaluating the Cost of Group Fairness Interventions in Machine Learning

N/A
N/A
Protected

Academic year: 2021

Share "Evaluating the Cost of Group Fairness Interventions in Machine Learning"

Copied!
52
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

MSc Artificial Intelligence

Master Thesis

Evaluating the cost of group fairness

interventions in machine learning

by

Joosje Goedhart

10738193

October 11, 2020

48 ECTS, February 2020 - October 2020

Supervisor:

PROF. DR. HINDA HANED

Assessor:

PROF. DR. MAARTEN DE RIJKE

Supervisors Municipality of Amsterdam:

RIK HELWEGEN
JEROEN SILVIS
DAAN SMIT


Abstract

The recent successes of machine learning algorithms have led to increased concern about unfair or discriminatory automated decision making when algorithms are applied to personal data. This has resulted in the proposal of various definitions of algorithmic fairness. The majority of these definitions belong to a category called group fairness, which partitions individuals into protected groups, based on gender, race or other sensitive attributes, and requires these groups to be treated equally on average. However, imposing group fairness is generally assumed to reduce classification utility. Moreover, group fairness does not imply individual fairness, which requires that similar pairs of individuals are treated similarly. Yet insights into when and how group fairness interventions come at a cost in terms of classification utility and individual fairness, and how this cost relates to the data, are lacking.

To address this gap, we propose a test-bed that enables the evaluation of the cost of group fairness interventions. Our test-bed allows for the comparison of multiple group fairness intervention algorithms on various real-life datasets. We hypothesise that group fairness interventions come at a larger cost if the data of the groups follow different distributions. To assess this, we propose and evaluate a criterion to measure the difference in the distribution of the data over the groups. As existing individual fairness metrics have limitations, we propose a metric to capture the extent to which a classifier meets individual fairness: the Lipschitz Violation metric.

To evaluate the effectiveness of our test-bed, we study the cost of group fairness interventions across 500 experimental settings that span ten group fairness intervention algorithms, five datasets and ten train/test splits. Our results show that imposing group fairness comes at a significantly higher cost in terms of classification utility if the label distributions of the groups are very different. This implies that group fairness interventions come at a larger cost when they are most needed, i.e. when there is large inequality between the groups for which fairness is required. Additionally, we found that the performance of group fairness intervention algorithms varies per dataset, and that these algorithms are sensitive to the choice of train/test split. We also found that some group fairness algorithms trained on imbalanced data face a high risk of learning process collapse, i.e. they learn to only predict the majority label to achieve acceptable accuracy and perfect fairness. Finally, the proposed individual fairness metric shows too much variation across datasets and algorithms. Since individual fairness is of substantial importance in automated decision making that employs personal data, several directions for future research are discussed for defining a metric that successfully captures the individual fairness of a classifier.


Acknowledgements

This thesis marks the end of the incredible journey my student time has been. After moving from Wageningen to Amsterdam in 2014 to study econometrics, I managed to make it to the final destination: a master in Artificial Intelligence at the University of Amsterdam. The decision to study A.I. could not have been more fulfilling. How cool is it to be involved in this rapidly evolving field, and to learn about and experiment with the newest techniques. A big shout out to everyone involved in the program for making it such a great experience.

This thesis has been one of the hardest and most exciting hurdles I have encountered during my studies. Not only because I never spent nine months studying one topic before, but also because the topic, fairness, contains a bit more ethics and philosophy than what I was used to. Luckily, I did not have to go through this process alone.

Hinda, thank you for being beyond the best supervisor I could have wished for. I would probably still be reading papers at this point if it was not for your ability to keep me on track. Your immense commitment in terms of time, wisdom, trust and support has been indescribably meaningful as well as inspiring. Thank you for showing me that doing research can be fun, especially if you collaborate with great people.

This thesis was supported by the City of Amsterdam. Rik, thank you for inspiring me to become involved in the field of fair machine learning and for the fruitful discussions on what fairness should and should not entail. Jeroen, Daan, and Iva, thank you for always showing interest in what I have been up to and for your motivational talks. Lastly, Loek, thank you for your support during this process, for always listening to my complaints, and for being the best and multi-skilled colleague I will probably ever have.

I am very grateful to my friends. Rosa, Rianne, Freya, thank you for still being by my side after all these years. I am excited to experience what life has in store for us. Helen, Sophie, thank you for enduring countless moments at the UvA campus with me, and for being a large source of joy and support over the past years.

Finally, I would like to thank my family. Jeroen, Meke, thank you for being you and for being the greatest siblings I know. Mama, papa, thank you for always believing in me, for giving me all the freedom I needed while supporting me to find my way. I cannot wait to see what is next.


Contents

Abstract
Acknowledgements

1 Introduction
  1.1 The Need for Algorithmic Fairness
  1.2 Research Questions
  1.3 Fair ML at the Municipality of Amsterdam

2 Preliminaries
  2.1 Fairness Terminology and Notation
  2.2 How to Define Fairness?
    2.2.1 Group Fairness Definitions Explained
    2.2.2 Individual Fairness Definitions Explained
    2.2.3 Classification Utility in Fair ML
  2.3 Fairness Intervention Algorithms
    2.3.1 Group Fairness Interventions
    2.3.2 Individual Fairness Interventions
  2.4 The Scope of This Thesis

3 Related Work
  3.1 The Tension between Fairness and Utility
    3.1.1 Theoretical Evaluations
    3.1.2 Empirical Evaluations
  3.2 The Apparent Conflict between Group and Individual Fairness

4 Methodology
  4.1 Evaluating Individual Fairness (RQ2)
    4.1.1 Defining Similarity of Individuals
    4.1.2 Defining Similar Treatment of Individuals
    4.1.3 Consistency Metric
    4.1.4 Lipschitz Violation Metric
    4.1.5 Consistency or Lipschitz Violation
  4.2 Evaluating the Difference in Data Distribution between the Groups (RQ3)

5 Test-Bed for Evaluating the Cost of Group Fairness Interventions
  5.1 A Pipeline for Evaluating the Cost of Group Fairness Interventions on Real-Life Data
    5.1.1 Phase 1 - Data and Data Pre-processing
    5.1.2 Phase 2 - Group Fairness Intervention Algorithms
    5.1.3 Phase 3 - Evaluation of Group Fairness Intervention Outcomes
  5.2 Evaluating Individual Fairness Metrics on Synthetic Data

6 Results
  6.1 Evaluating Individual Fairness Metrics on Synthetic Data (RQ2)
  6.2 Base Rates for Each Dataset (RQ3)
  6.3 Evaluation of Test-Bed Efficiency
    6.3.1 Learning Process Collapse
    6.3.2 Test-Bed Evaluation Main Results
    6.3.3 Evaluation of the Cost of Group Fairness Interventions Within a Dataset
    6.3.4 The Cost of Group Fairness Interventions Across Datasets

7 Conclusion and Discussion
  7.1 Findings and Implications
  7.2 Future Directions
  7.3 Recommendations for Stakeholders and Practitioners

A Appendix
  A.1 The Influence of the Parameter K on the Lipschitz Violation Metric
  A.2 Additional Results for Group Fairness and Classification Utility

1 | Introduction

1.1 The Need for Algorithmic Fairness

The world we live in is filled with biases with respect to personal traits. For example, in the U.S. people of colour are more often convicted than Caucasian people for the same crime (Commission, 2018), and scientific papers with female authors receive less appreciation than exactly the same papers with male authors (Paludi and Bauer, 1983). Such examples are not limited to the United States: the Netherlands Institute for Social Research reports that job applicants with a non-Western name have a significantly lower chance of being invited to an interview compared to Dutch applicants with a similar resume (Nievers and Andriessen, 2010). It is evident that such disadvantageous biases lead to discrimination, the practice of treating someone or a particular group in society less fairly than others. Data will naturally inherit these biases (Calders and Žliobaitė, 2013). Additionally, the data collection process itself can be biased. Recently, it was revealed that the Dutch Tax and Customs Administration has performed ethnic profiling for years: individuals were targeted for investigation based on whether they held a double nationality (Washington-Post). Such practices might further increase the bias with respect to disadvantaged groups in the collected data (Kallus and Zhou, 2018).

The recent successes of machine learning algorithms have fuelled an increased interest in automated decision making within a wide range of domains. This includes the setting of insurance rates, the allocation of police, the provision of health care and the admission of students. Such algorithms are often deployed without the purpose of actively discriminating against certain groups of people. However, when deploying algorithms to make decisions about people, it is important to consider the risk of reproducing biases against disadvantaged groups present in the data. Bias and discrimination with respect to sensitive personal traits relate to a concept central to this work: fairness. Fair machine learning, hereafter Fair ML, is the field that aims to deploy algorithmic decision making models that prevent discrimination against individuals, while maintaining utility. In this thesis we focus on fairness in classification tasks. The need for Fair ML becomes clear when considering recent real-life examples where the use of automated decision making resulted in active discrimination against certain groups. For example, Vigdor (2019) reveals gender discrimination in the credit limits of the Apple Card, Dastin (2018) reports gender bias in Amazon's recruiting tool and Angwin et al. (2016) describe racial bias in recidivism prediction instruments.

To bring fairness into the domain of machine learning and classification, Fair ML researchers have proposed a myriad of fairness definitions. At a high level, fairness definitions can be divided into two categories: group fairness and individual fairness. Group fairness definitions partition individuals into protected groups (often based on gender, race or another sensitive personal attribute) and require these groups to be treated equally on average in terms of some classifier statistic, such as accuracy, prediction rate or false positive rate. These groups are referred to as privileged groups, those receiving advantageous treatment, and unprivileged groups, those receiving disadvantageous treatment. Various papers have proposed classification mechanisms that comply with group fairness definitions (Feldman et al., 2015), (Calmon et al., 2017), (Zemel et al., 2013), (Zafar et al., 2017), (Kamishima et al., 2012), (Hardt et al., 2016). In contrast, individual fairness was introduced in (Dwork et al., 2012) to alleviate the limitations of group fairness. Individual fairness demands that similar individuals be treated similarly irrespective of group membership. The practical adoption of individual fairness is pending, as existing individual fairness frameworks are often subject to too many assumptions (Kearns et al., 2019).

The Cost of Group Fairness

Group fairness and individual fairness are considered to be incompatible when, as a consequence of complying with group fairness, pairs of individuals who are otherwise similar but differ in a sensitive personal attribute are assigned different outcomes (Dwork et al., 2012), (Lahoti et al., 2019), (Kearns et al., 2019). We consider a dummy example for illustration. An employer invites an equal proportion of Western and non-Western applicants to a job interview to ensure both groups are treated equally. As a result of this decision, a Western applicant is not invited despite the fact that her qualities are similar to those of one of the invited non-Western applicants. Many research papers have proposed ways to minimise the assumed conflict between group and individual fairness (Zemel et al., 2013), (Calmon et al., 2017), (Lahoti et al., 2019), but insight into when and how group fairness comes at a cost in terms of individual fairness is lacking.

In addition to the incompatibility of group fairness and individual fairness, it is widely assumed that enforcing group fairness results in a loss in classification utility (Dutta et al., 2020), (Kleinberg et al., 2016), (Zafar et al., 2017). For example, Zafar et al. (2017) develop a framework to satisfy group fairness in classification and evaluate it on several real-life datasets, showing that meeting group fairness comes at a cost in terms of classification utility, where the magnitude of this cost differs per dataset. Similar results have been found in (Feldman et al., 2015), (Zliobaite, 2015a) and (Hardt et al., 2016). However, conclusions on how this cost is influenced by the distribution of the data over the groups are missing.

1.2 Research Questions

Previous efforts have provided insufficient insight into the cost of group fairness interventions. We believe that it is important to take a step back and investigate when and how group fairness interventions come at a cost. To our knowledge this has not been investigated in a dataset-transcending manner. We hypothesise that group fairness interventions come at a higher cost if the data of the groups follow different distributions. Understanding the relationship between the cost of group fairness interventions, in terms of individual fairness and classification utility, and the distribution of the data over the groups can provide the Fair ML community with much needed insights into how the implications of group fairness interventions differ per use case (dataset). Therefore, the main research question this thesis aims to answer is:

How does the cost of group fairness interventions - in terms of (i) individual fairness and (ii) classification utility - depend on the distribution of the data over the privileged and unprivileged group?

We develop a test-bed to evaluate and quantify the cost of group fairness interventions and to relate this cost to the distributional difference of the groups that ought to be treated equally. Our test-bed is a transparent and replicable platform that allows for the evaluation and comparison of multiple group fairness intervention algorithms on various real-life datasets. We predominantly focus on the most popular and widely adopted notion of group fairness: statistical parity (discussed in Chapter 3). We measure classification utility as accuracy and F1-score, where the latter has not been considered in relation to fairness costs in earlier work. Individual fairness was originally introduced in (Dwork et al., 2012) as an optimisation constraint called the Lipschitz Condition. As other individual fairness metrics have severe limitations, which we discuss in Chapter 4, we use the original Lipschitz constraint as a metric to measure individual fairness. This metric is called the Lipschitz Violation metric in the sequel. Our detailed research questions are then as follows:

RQ1 How to evaluate the cost of group fairness interventions in terms of classification utility?

RQ2 How to evaluate the cost of group fairness interventions in terms of individual fairness?

RQ3 How to measure the difference in the distribution of the data over the privileged and unprivileged group?

1.3 Fair ML at the Municipality of Amsterdam

The fast rise of machine learning techniques makes it possible to improve the efficiency and precision of processes within institutions such as municipalities. For example, the City of Amsterdam is currently experimenting with using A.I. to automate and optimise garbage truck routes. However, when A.I. techniques are employed on personal data, it is important to ensure machine learning methods are fair, i.e. that they do not discriminate against certain groups. If this can be ensured, the City of Amsterdam can benefit from A.I. within a wide domain, ranging from the allocation of financial resources to primary schools to detecting illegal Airbnb rentals. In 2018 the City of Amsterdam was brought into disrepute: System Risk Indicator (SyRI), an algorithm to detect abuse of social welfare schemes, was claimed to discriminate against particular neighbourhoods, especially those housing many people with a migration background or with a low average income.1 In 2020 the Dutch court concluded that SyRI's decision making process was not transparent enough and that it possibly induced discriminatory or stigmatising effects. However, the court did encourage the use of A.I. technology for detecting social welfare fraud.2 This has led to the development of the FairTrade method (Helwegen and Braaksma, 2020), a fairness intervention algorithm that relies on causal reasoning. However, the actual implementation of such fairness intervention algorithms is pending, as general insights into their behaviour and implications are still lacking. This thesis aims to fill this gap by providing stakeholders and fairness practitioners with insights into the cost of fairness interventions and the relationship between this cost and the data of the groups for which fairness is required.

1 https://www.volkskrant.nl/nieuws-achtergrond/een-druk-op-de-knop-van-de-computer-en-je-wordt-opeens-verdacht-van-fraude b539dfde/

2

2 | Preliminaries

In this chapter, we prepare the ground for our work by reviewing the group and individual fairness definitions and algorithms our work builds upon. We explain why a better understanding of the cost of fairness interventions and its relationship to the data of the groups is much needed. We close this chapter by further formalising the scope of this thesis. We now start by discussing the terminology required to reason unambiguously about Fair ML.

2.1 Fairness Terminology and Notation

Fair ML aims to deploy fairness-aware machine learning methods. This ranges from resource allocation, e.g. the distribution of police officers over neighbourhoods (Elzayn et al., 2019) or the distribution of financial resources to public schools, to real-valued target prediction, e.g. risk scores in lending or grades of college students (Agarwal et al., 2019). This thesis, in line with the majority of Fair ML, focuses on fair classification tasks. In fair classification, the predicted outcomes of an algorithm trained on data about individuals are meant to be non-discriminatory, or fair, with respect to one or multiple protected attributes of those individuals. A protected attribute, also known as a sensitive attribute, is a personal trait such as race, gender, age, religion or physical ability. We use the terms 'sensitive' and 'protected' interchangeably. The values of these sensitive attributes partition the population into privileged and unprivileged groups, where the latter has been at a systematic disadvantage historically. An example of an unprivileged group is people of colour. Practices such as redlining (Zenou and Boccard, 2000) - the denial or limitation of financial services to specific neighbourhoods mainly because their residents are people of colour - have led to a socio-economic disadvantage for this group. The predicted outcomes in a classification task are generally called targets or labels. In fair classification, a favourable label is a label that is advantageous to the receiver. An example illustrates the introduced terms. Imagine a decision-making scenario in which a bank uses a machine learning model to determine which loan applications are approved (the favourable label) or denied (the unfavourable label). The bank observes that men have a higher historical probability of receiving loans compared to women. The protected attribute in such a scenario would be sex, which partitions the population into a privileged group (men) and an unprivileged group (women).

In this thesis capital letters (e.g. $Z$) denote random variables and bold capital letters (e.g. $\mathbf{Z}$) denote a set of random variables, i.e. a dataset. For the fair classification problems considered in this thesis, we assume to have access to a dataset $\mathbf{D}$ consisting of $n$ i.i.d. samples $\{(A, X_1, \ldots, X_{m-1}, Y)\}$, where $A$ denotes the protected attribute, $X_1, \ldots, X_{m-1}$ denote other non-protected variables used for decision making, to which we refer as features, and $Y$ is the target to be predicted. For simplicity we consider classification tasks such that $Y$ is a binary target and $A$ is a single binary protected attribute.

We now introduce notation that will be useful throughout this thesis.

• $Y \in \{0, 1\}$, where $Y = 1$ corresponds to the favourable label and $Y = 0$ to the unfavourable label; $\mathbf{Y}$ denotes the set of labels

• $A \in \{0, 1\}$, where individuals with $A = 1$ belong to the privileged group and individuals with $A = 0$ to the unprivileged group; $\mathbf{A}$ denotes the set of protected attributes

• $\mathbf{X} \in \mathbb{R}^{N \times M}$ denotes the set of features

• $N$ is the number of individuals in the dataset

• $M$ is the number of features in the dataset

We consider classification models or classifiers that use the features (and possibly the protected attribute) to learn a mapping $M : \mathbf{X} \rightarrow \Delta(\mathbf{Y})$ (or $M : \{\mathbf{X}, \mathbf{A}\} \rightarrow \Delta(\mathbf{Y})$) that predicts $Y$ given $X$ while preserving some fairness condition with respect to $A$. $\Delta(\mathbf{Y})$ denotes the probabilities with which the classifier predicts the favourable label. This predicted probability is denoted as $M(u)$ for a specific individual $u \in \mathbf{X}$. The predicted labels, denoted as $\hat{Y}$, are generated by rounding the predicted probabilities to the nearest integer; for a specific individual $u \in \mathbf{X}$ the predicted label is denoted as $\text{Round}(M(u))$.
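To make this notation concrete, the following minimal sketch shows one way the dataset $\mathbf{D}$ and a classifier $M$ of this form could be represented in Python; the column names and the toy classifier are hypothetical and serve only to illustrate the notation, not the thesis pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset D: binary protected attribute A, features X, binary target Y.
D = pd.DataFrame({
    "A":  [1, 0, 1, 0],           # 1 = privileged group, 0 = unprivileged group
    "X1": [5.0, 4.8, 2.1, 2.0],   # non-protected feature
    "X2": [1, 0, 1, 0],           # non-protected feature
    "Y":  [1, 1, 0, 0],           # 1 = favourable label, 0 = unfavourable label
})

def M(u: np.ndarray) -> float:
    """Toy classifier: maps a feature vector u to a predicted probability of the favourable label."""
    return float(1.0 / (1.0 + np.exp(-(0.5 * u[0] - 1.0))))

X = D[["X1", "X2"]].to_numpy()
probs = np.array([M(u) for u in X])   # predicted probabilities M(u)
y_hat = np.round(probs).astype(int)   # predicted labels Round(M(u))
```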

2.2 How to Define Fairness?

To bring fairness into the domain of machine learning and classification, several definitions of algorithmic fairness have been proposed. The literature on Fair ML definitions has grown too extensive to summarise comprehensively; we refer to (Mitchell et al., 2018) for a concise and recent survey. We now discuss the fairness definitions our framework builds upon. Along the way we point out the gaps between the mathematical notions of fairness and the larger social goals these concepts were introduced to address. Subsequently, we briefly discuss utility in the context of Fair ML.

2.2.1 Group Fairness Definitions Explained

Group fairness aims at equal treatment of the privileged and unprivileged group. In the context of a classifier, this means that a classifier's statistic, e.g. false positive rates, error rates or false negative rates, is equal for the two groups. While the required type of equal treatment depends on the task at hand (Mitchell et al., 2018), this thesis focuses on the most popular and widely adopted notion of group fairness (Feldman et al., 2015), (Zemel et al., 2013), (Calders and Verwer, 2010), known as statistical parity, also referred to as demographic parity or predictive rate parity. To avoid confusion we solely use the term statistical parity. Statistical parity requires the proportion of individuals in a group receiving the positive (negative) classification to be equal to the proportion in the population as a whole. Hardt et al. (2016) note that all group fairness definitions make limiting assumptions. However, the reason for us to focus on the cost of intervening on statistical parity, rather than other group fairness metrics, is three-fold:

1. Statistical parity has the advantage of not relying on the true labels of the individuals, as is for example the case when requiring parity in terms of classification errors such as false positive rates or true negative rates. Defining fairness in terms of those labels may still result in unfairness, as the labelling itself can be biased (Jiang and Nachum, 2020). For example, Commission (2018) shows that people of colour in the U.S. are more often sentenced than Caucasian people for the same crime, and González et al. (2019) show that women with the same qualifications as men are less often hired for the same position. Error-based definitions of group fairness will view such scenarios as fair, while they are evidently undesirable. Statistical parity does not face this issue as it solely uses the predicted labels returned by the classifier.

2. Statistical parity has proven to be the most comprehensible metric for individuals without expertise in fairness, machine learning or both (Saha et al., 2019). It is important that these non-experts are able to understand and criticise the metrics incorporated in the systems they are subject to, as they are the people eventually impacted by automated decision making systems (Barocas and Selbst, 2016).

3. Statistical parity is acknowledged to be among the most popular and widely adopted definitions of group fairness in the literature. In order to investigate the cost of group fairness interventions, we need successful group fairness intervention algorithms. These algorithms are widely available for statistical parity (Calders and Verwer, 2010), (Kamishima et al., 2012), (Zemel et al., 2013), (Feldman et al., 2015), (Zafar et al., 2017), but to our knowledge there is a lack of algorithms for other types of fairness interventions.

Definition (Statistical Parity). A classification model $M : \mathbf{X} \rightarrow \Delta(\mathbf{Y})$ satisfies statistical parity if the following condition holds for its predictions $\hat{Y}$:

$$P(\hat{Y} = 1 \mid A = 0) = P(\hat{Y} = 1 \mid A = 1)$$

which is equivalent to $P(\hat{Y} = 0 \mid A = 0) = P(\hat{Y} = 0 \mid A = 1)$.
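As an illustration, statistical parity can be checked empirically by comparing the favourable-prediction rates of the two groups. The sketch below is a minimal example with hypothetical arrays of predicted labels and protected attributes.

```python
import numpy as np

def statistical_parity_difference(y_hat: np.ndarray, a: np.ndarray) -> float:
    """P(Y_hat = 1 | A = 0) - P(Y_hat = 1 | A = 1); zero means statistical parity holds exactly."""
    rate_unprivileged = y_hat[a == 0].mean()
    rate_privileged = y_hat[a == 1].mean()
    return float(rate_unprivileged - rate_privileged)

# Toy example: predictions for six individuals, three per group.
y_hat = np.array([1, 0, 1, 1, 0, 1])
a     = np.array([0, 0, 0, 1, 1, 1])
print(statistical_parity_difference(y_hat, a))  # 0.0 -> parity satisfied in this toy example
```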

Criticism of Statistical Parity

A prevalent criticism of statistical parity is that it provides weak guarantees from an individual's point of view. It is often suggested that achieving fairness on average over the population can still harm people on an individual level (MacCarthy, 2017), (Dwork et al., 2012). To illustrate this criticism, consider the following scenario. The recruitment manager of a large company wants to invite three out of six job applicants for a new sales function. The applicants' data can be found in Figure 2.1, where a green or yellow colour corresponds to an applicant belonging to the ethnic majority or minority respectively. To meet statistical parity the recruitment manager invites the two most qualified green candidates and the one most qualified yellow candidate, as seen in Figure 2.2. However, now applicant D is invited while applicant C, who has more work experience than D, is not. This example shows that satisfying statistical parity alone does not take the features of individuals into account, which might harm certain individuals.

2.2.2 Individual Fairness Definitions Explained

Individual fairness was defined in (Dwork et al., 2012) to address the limitations of group fairness. Dwork et al. (2012) capture fairness by the principle that any two individuals who are similar with respect to a particular task should be treated similarly. In the recruitment example, this would imply that individual B and individual C would both be invited to the interview. This notion is formalised as a Lipschitz condition on a classifier.


Figure 2.2: Possible invitation scenario in green satisfying statistical parity

Definition (Lipschitz Condition). (Dwork et al., 2012) A classification model $M : \mathbf{X} \rightarrow \Delta(\mathbf{Y})$ satisfies the $(D, d)$-Lipschitz condition if for every pair of individuals $u, v \in \mathbf{X}$ we have

$$D(M(u), M(v)) \le d(u, v)$$

where $d$ is a similarity metric over individuals describing the extent to which pairs of individuals should be regarded as similar, and $D$ is a distance metric over the individuals' predicted probabilities returned by the classifier. The metric $d$ over individuals measures how similar two individuals are. The metric $D$ over the predicted probabilities measures how similarly two individuals are treated. We reconsider the recruitment scenario from Section 2.2.1 for illustration. A classifier $M$ was trained to return the probability that an applicant should be invited to the interview. In this task, an example of $d$ for a pair of individuals could be their difference in work experience, and an example of $D$ the difference in the predicted probabilities returned by $M$.
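As an illustration, checking the Lipschitz condition for a single pair of individuals could look as follows. The choice of the absolute probability difference for $D$ and the Euclidean feature distance for $d$ is an assumption made for this sketch only (both choices are discussed further in Chapter 4).

```python
import numpy as np

def satisfies_lipschitz(prob_u: float, prob_v: float, u: np.ndarray, v: np.ndarray) -> bool:
    """Check D(M(u), M(v)) <= d(u, v) for one pair, with D the absolute difference of
    predicted probabilities and d the Euclidean distance between (standardised) feature vectors."""
    D = abs(prob_u - prob_v)       # distance between how the two individuals are treated
    d = float(np.linalg.norm(u - v))  # distance between the two individuals themselves
    return D <= d
```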

Dwork et al. (2012) formalise their individual fairness as an optimisation problem that maximises the classifier's utility such that the Lipschitz Condition is satisfied. Throughout their framework, they assume $D$ is chosen to be a metric over probabilities, such as the total variation distance. The Lipschitz condition provides a strong guarantee of fairness on an individual level, but has one major obstacle to deployment, as the distance metric that defines the similarity between individuals is assumed to be given (Zemel et al., 2013), (Lahoti et al., 2019), (Kearns et al., 2019). Dwork et al. (2012) acknowledge that society's ability to develop such a metric is the most challenging aspect of their individual fairness proposal.

An Example: College Admissions in the United States

In order to illustrate the difficulty of finding a metric that correctly describes the similarity of individuals with respect to a given task, we consider the example given in (Lahoti et al., 2019). Imagine the task of selecting students for Graduate School in the U.S. College admissions tests can be taken multiple times, and students only report their best score for admissions. Furthermore, each retake of a test comes at a financial cost. On average, admission scores for African-American students are lower than for white students (Brooks, 1992). When deciding which individuals are similar, there is no access to information about the number of resits or possibly about private tutoring. Therefore, a fairness expert might deem an African-American student with a relatively lower test score to be similar to a Caucasian student with a slightly higher score. However, it is not easy to quantify this information with a similarity metric.

2.2.3 Classification Utility in Fair ML

Fair ML classification tasks face the challenge of preventing discrimination against unprivileged groups while at the same time achieving utility for the classifier. Although fairness is of great importance, policy makers or companies might refrain from adopting fair models with unsatisfactory utility. For instance, a fair automated lending system that does not discriminate with respect to race or gender, but that provides loans to a vast number of people who are not capable of repayment, has no practical value. When evaluating the utility of a classifier, it is commonly acknowledged that only reporting accuracy is problematic when the data is imbalanced, which is often the case in Fair ML tasks (Asuncion and Newman, 2007).

However, we observed that a surprising number of Fair ML works only report accuracy when evaluating the utility of their fairness intervention algorithms (Zemel et al., 2013), (Louizos et al., 2015), (Raff et al., 2018). This is problematic, as a trivial model that maps all individuals to the majority class label achieves perfect statistical parity and individual fairness, but does not learn anything. As fairness models usually optimise two possibly contrary objectives simultaneously (fairness and classification utility), there is a high risk that the learning process collapses, meaning the majority class is predicted for all individuals. Therefore, to adequately evaluate a Fair ML classifier's utility, it is important to also consider performance metrics related to false positive/negative rates.
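This risk can be made concrete with a small example: on imbalanced data, a collapsed model that always predicts the majority class scores high accuracy while its F1-score exposes that nothing was learned. The sketch below uses scikit-learn for the metrics, which is an assumption made for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced ground truth: 90% unfavourable labels, 10% favourable labels.
y_true = np.array([0] * 90 + [1] * 10)
# A collapsed learner that always predicts the majority (unfavourable) class.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.9 -> looks acceptable
print(f1_score(y_true, y_pred))        # 0.0 -> reveals that nothing was learned
```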

2.3 Fairness Intervention Algorithms

To evaluate the cost of fairness interventions we need mechanisms that perform such interventions. We refer to a classification mechanism as a fairness intervention if it aims to satisfy some fairness condition while maintaining utility for the classifier. Various papers have recently proposed such fairness intervention algorithms. In this section we critically review prominent fairness interventions in both group fairness and individual fairness.

2.3.1 Group Fairness Interventions

Models that intervene on group fairness are typically categorised into three types: pre-processing, in-processing and post-processing mechanisms. For a complete overview of existing group fairness mechanisms we refer to (Pessach and Shmueli, 2020).

• Pre-processing mechanisms aim to modify the input data such that all information about group membership is obfuscated. This should guarantee that any classifier applied to the modified data is fair. Such approaches are motivated by the idea that the bias in the training data is the cause of the discrimination (Friedler et al., 2019). One such algorithm that we will analyse in this thesis is that of (Feldman et al., 2015), which modifies the features such that the marginal distributions of each feature become similar for the privileged and unprivileged group. Another example is the work of (Calmon et al., 2017) that learns to map the features to a latent space independent of the sensitive attribute.

• In-processing mechanisms are most common in Fair ML. Such techniques modify specific learning algorithms by imposing additional constraints with respect to the fairness of the outcomes. An example is the algorithm of (Zafar et al., 2017) that adds constraints to a classification model in order to satisfy statistical parity. Kamishima et al. (2012) add a regularisation term to a cross-entropy objective function that penalises the mutual information between the classifier's predictions and the protected attribute. Another work that relies on both pre- and in-processing is that of (Zemel et al., 2013). Their algorithm, Learning Fair Representations, learns a modified representation of the features using a multi-objective classification loss function that optimises statistical parity and accuracy simultaneously.

• Post-processing mechanisms modify the results of already trained classifiers to obtain the desired group fairness constraints on the privileged and unprivileged group. For example, Hardt et al. (2016) propose a mechanism that post-processes the decisions of a trained classifier by flipping them such that fairness conditions are met (a generic sketch of this post-processing strategy is given after this list).
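The post-processing idea can be illustrated with a small sketch: choosing one decision threshold per group so that roughly the same fraction of each group receives the favourable label. This is a generic illustration of the strategy under the notation of Section 2.1, not a faithful implementation of the algorithm of Hardt et al. (2016); all function names are hypothetical.

```python
import numpy as np

def group_thresholds_for_parity(probs: np.ndarray, a: np.ndarray, target_rate: float) -> dict:
    """Pick one decision threshold per group so that roughly `target_rate` of each group
    receives the favourable label (a crude route towards statistical parity)."""
    thresholds = {}
    for group in (0, 1):
        group_probs = probs[a == group]
        # The (1 - target_rate)-quantile sends approximately target_rate of the group above it.
        thresholds[group] = np.quantile(group_probs, 1.0 - target_rate)
    return thresholds

def post_process(probs: np.ndarray, a: np.ndarray, thresholds: dict) -> np.ndarray:
    """Apply the group-specific thresholds to obtain adjusted predicted labels."""
    return np.array([int(p > thresholds[g]) for p, g in zip(probs, a)])
```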

Limitations The majority of the discussed fairness intervention algorithms evaluate their proposed method on two or more real-life datasets, exceptions being (Kamishima et al., 2012) and (Hardt et al., 2016). Usually, performance (in terms of fairness and utility) differs significantly across these datasets. For example, in (Calmon et al., 2017) the algorithm from (Zemel et al., 2013) is used as a baseline. The algorithm of (Zemel et al., 2013) achieves excellent statistical parity on one of the tasks considered (recidivism prediction) yet poor statistical parity on another task (income prediction). It is disconcerting that conclusions about the relationship between such performance differences and the datasets are absent in the discussed works.

2.3.2 Individual Fairness Interventions

Much work has aimed to bridge the gap between the strong semantics of the individual fairness definition of (Dwork et al., 2012) and its immediate applicability. The developed individual fairness mechanisms can be separated into those that require feedback in their learning process (e.g. from a regulator or oracle), and those that do not require feedback. We now discuss each category and its limitations.

Individual Fairness Mechanisms with Feedback

Unlike Dwork et al. (2012), Kim et al. (2018) do not assume the entire distance metric d over individuals to be known to the learner. Instead they assume the metric can be queried for a limited number of individual pairs. With this information they define sub-populations, such that individuals within each sub-population are treated similarly in their fairness intervention algorithm, named metric multi-fairness. Gillen et al. (2018) and Bechavod et al. (2020) do not require access to the distance metric in quantitative form. In their online learning settings a regulator detects fairness violations between pairs of individuals without enunciation of a quantitative measure. These violations are used to learn a distance metric over individuals.

Limitations These individual fairness algorithms have one major limitation: they assume some sort of oracle that can provide the classifier with information about the similarity of individuals. This assumption limits the adoption and scalability of the discussed algorithms, as such oracles are hardly ever available in practice. Moreover, when available, multiple oracles will possibly yield different and subjective metrics.

Individual Fairness Mechanisms without Feedback

Parallel to oracle-based individual fairness algorithms, Mukherjee et al. (2020) learn a distance metric over pairs of individuals without requiring feedback from a regulator or oracle. Their metric is learned from a set of comparable individuals. These are chosen to be individuals with a different protected attribute value but with the same label. Another line of thought is that of (Kearns et al., 2019). In their average individual fairness algorithm each individual is subject to decisions made by multiple automated decision-making systems for a given period of time. A discussed example is the admission to public schools, where students apply not just to one school but to many, each making their own admission decisions. Their algorithm ensures that the error rates, defined as the average over all classification tasks considered, are equalised across all individuals.

Limitations Although these algorithms make an important step towards the operationalisation of individual fairness, we do not agree with how individuals are compared or considered similar. Their definitions of similarity do not address the problem of target bias discussed in Section 2.2.1. For example, a man and a woman with the same qualities but with a different target, i.e. hired previously and not hired previously, are considered dissimilar in (Mukherjee et al., 2020) and (Kearns et al., 2019). In line with Dwork et al. (2012) we argue that the similarity or comparability of individuals should be judged based on their features rather than their labels.

2.4 The Scope of This Thesis

Algorithmic fairness definitions belong to two, possibly contradicting, categories: group fairness and individual fairness. Subsequently, an abundance of group fairness algorithms has been developed. However, generalisable conclusions that relate the achieved fairness and utility of these algorithms to aspects of the data they are trained on are lacking. Additionally, we saw that existing individual fairness algorithms are precarious: they either deviate from the original individual fairness definition in (Dwork et al., 2012), by defining similarity in terms of labels rather than features, or they rely on a subjective human oracle, which is undesirable. We argue that it is important to take a step back. Insight into existing fairness definitions is needed: when they coincide, when they contradict, when the loss in utility is too severe for the fairness model to be feasible, and how this all relates to the data at hand. These insights can be generated by considering the cost of fairness interventions for varying datasets. Because individual fairness algorithms lack practical implementations, we focus on the cost of intervening on group fairness, specifically such that the algorithm complies with statistical parity, for the reasons discussed in Section 2.2.1. We measure this cost in terms of individual fairness and classification utility and investigate how it depends on the difference in the data distributions of the groups. To the best of our knowledge, this cost has not been investigated on a variety of real-life datasets.

3 | Related Work

The Fair ML community has produced a large body of work that evaluates the mutual interaction of fairness measures and the interaction between fairness and the utility of the classifier. The term trade-off is often used here to indicate a balance or compromise between two or more desirable but incompatible fairness requirements. Kim et al. (2018) distinguish between a fairness-fairness and a fairness-performance trade-off. The first term refers to the possible incompatibility of different group fairness definitions and the latter to the possible incompatibility of group fairness and classification utility. Using the terminology introduced by (Kim et al., 2018), we investigate group-fairness vs. utility and group-fairness vs. individual-fairness trade-offs. Note that we use the term utility to refer to a classifier's utility where (Kim et al., 2018) use the term performance; we reserve the term performance for a classifier's results in terms of both utility and fairness. We now discuss related work in each trade-off category.

3.1 The Tension between Fairness and Utility

The tension between group fairness and classification utility has been evaluated theoretically and empirically.

3.1.1 Theoretical Evaluations

A line of work that theoretically investigates the relationship between group fairness and utility is that of (Kleinberg et al., 2016) and (Kim et al., 2018), where the incompatibility of fairness and utility is related to the base rates of the privileged and unprivileged group. These base rates are defined as the fraction of truly positive labels in a group. Kleinberg et al. (2016) formally prove that satisfying group fairness will result in lower accuracy in case the groups' base rates are unequal. Kim et al. (2018) provide a model-independent tool to understand the potential fairness trade-offs exhibited by classification models. They mathematically derive conditions in terms of utility (False Omission Rate and Precision) and base rates under which several group fairness notions can be met simultaneously. Although these works relate the group-fairness vs. utility trade-off to the base rates of the data, the derived relations are merely static and theoretical. Kim et al. (2018) and Kleinberg et al. (2016) consider scenarios where the base rates of the groups are either equal or different. However, actual base rate values are not considered. How the cost in utility caused by a small difference in base rates compares to the cost caused by a large difference remains undiscussed. We observe that perfectly equal base rates are rare in practice, and therefore such conditions have little value to practitioners performing fairness interventions. Therefore, we will investigate how the cost of group fairness interventions in terms of utility depends on the data, rather than which hard constraints we should set on the data in order to be able to attain group fairness.


3.1.2 Empirical Evaluations

A line of work that empirically studies group-fairness vs. utility trade-offs for group fairness models on specific datasets is that of (Feldman et al., 2015), (Zliobaite, 2015a), (Hardt et al., 2016) and (Zafar et al., 2017). For example, (Zafar et al., 2017) and (Feldman et al., 2015) both evaluate their group fairness intervention algorithms, which aim to meet statistical parity, on two real-world datasets. Their evaluations show that satisfying statistical parity comes at a (small) cost in terms of utility. These results are obtained by changing the hyperparameters of their algorithms, which results in different fairness and utility values. However, these investigations assess the group-fairness vs. utility trade-offs for single datasets without providing guidance for other sets of data.

A recent effort relevant to our work is that of (Friedler et al., 2019). The authors conducted a comprehensive analysis of four fairness intervention algorithms and created an open-source benchmark that facilitates the direct comparison of these fairness intervention algorithms. They focus on group fairness algorithms that aim to comply with statistical parity. Their benchmark also allows for examination of the impact of different pre-processing techniques and the impact of different train/test splits. Their findings lead to an important conclusion: the performance of group fairness algorithms heavily depends on the data, i.e. different data pre-processing techniques or different train/test splits yield different results in fairness and utility. However, to our knowledge the exact relationship between the performance of such algorithms and the data remains understudied in the literature. Our work aims to fill this gap by investigating how the base rates of the groups influence the cost of group fairness interventions in terms of utility for the algorithms considered in (Friedler et al., 2019). The evaluation of the cost of group fairness interventions in terms of classification utility is assessed in RQ1 as posed in Section 1.2. Subsequently, answering RQ3 allows us to relate this cost to the data of the privileged and unprivileged group. Note that we use the term cost rather than trade-off throughout this thesis. We argue that speaking of the cost of satisfying A in terms of B is more informative and interpretable for practitioners performing fairness interventions than speaking of a trade-off between A and B, which refers to a more general incompatibility instead of a directional relationship.

3.2 The Apparent Conflict between Group and Individual Fairness

Dwork et al. (2012) mention the possible incompatibility of group fairness (Statistical Parity) and individual fairness (Lipschitz Condition). They study the extent to which a Lipschitz mapping (a classification model satisfying the Lipschitz Condition while maximising utility) can violate statistical parity. Using linear programming they prove that the statistical parity difference, measured as $P(\hat{Y} = 1 \mid A = 0) - P(\hat{Y} = 1 \mid A = 1)$, of a classification model that satisfies the Lipschitz Condition (with an appropriate similarity metric over the features of individuals and their predicted outcomes) is smaller than or equal to the earthmover distance (Hitchcock, 1941) between the privileged and the unprivileged group. In general, the earthmover distance measures the distance between probability distributions. In (Dwork et al., 2012), it measures the distance between the joint distributions of features and labels of the privileged and unprivileged group. For example, in case the individuals in the privileged and unprivileged group follow a different joint distribution, the earthmover distance is large and statistical parity can be violated: the Lipschitz condition then does not guarantee statistical parity. In the opposite case, i.e. a small earthmover distance, the Lipschitz condition implies statistical parity.

This yields an important conclusion: if the data of the groups follow a different distribution, group fairness (statistical parity) and individual fairness (Lipschitz condition) are incompatible. On the other hand, in case of a similar distribution, group fairness and individual fairness are compatible. Although these (in)compatibilities are formally derived, they face one major limitation: in their proof, (Dwork et al., 2012) assume access to a suitable distance measure between individuals and between their predicted outcomes. We know from Sections 2.2.2 and 2.3.2 that such distance metrics are available neither in the existing literature nor in practice. Therefore, verifying whether the described conclusion also holds in practice is not possible. To enable this verification and to investigate how the conflict between group fairness and individual fairness depends on the difference in data of the privileged and unprivileged group, we flip the approach of (Dwork et al., 2012). Rather than investigating the cost in terms of group fairness associated with satisfying individual fairness, we investigate the cost in terms of individual fairness associated with satisfying group fairness. It is possible to evaluate this relationship on real-life datasets as group fairness models are widely available and have successfully been applied to these datasets, which is not the case for individual fairness models.

There are prior efforts that have attempted to optimise group fairness and individual fairness simultaneously (Zemel et al., 2013), (Lahoti et al., 2019). However, if group fairness and individual fairness are truly incompatible in some cases, such frameworks will optimise two contradicting objectives, with unreliable results as a consequence. We argue that before such methods are developed, this possible group-fairness vs. individual-fairness trade-off needs specification. To our knowledge, the practical (in)compatibility of group fairness and individual fairness has not been formalised yet, neither within a dataset nor across datasets. Our proposed test-bed allows for both. This is done by relating the cost of satisfying group fairness, in terms of individual fairness, to the data of the privileged and the unprivileged group. In this way the much needed specification is delivered. RQ2 posed in Section 1.2 assesses the evaluation of the cost of group fairness interventions in terms of individual fairness. RQ3 enables us to relate this cost to the data of the groups.

4 | Methodology

In Section 3.2 we saw that the Fair ML field lacks knowledge on how intervening on group fairness influences the individual fairness of a classifier. Additionally, we saw that the cost of group fairness interventions has not been related to the difference in the distribution of the data over the groups in real-life settings. The goal of this thesis is to fill these gaps: to enable the evaluation of the cost of group fairness interventions in terms of classification utility and individual fairness, and to relate this cost to the difference in the data distributions of the groups. In this chapter we describe how we enable this evaluation by (i.) providing a method to evaluate the individual fairness of a model's outcomes (RQ2) and (ii.) providing a method to compare the distributions of the privileged and the unprivileged group (RQ3). In Chapter 5 we describe how these methods allow us to answer our main research question.

4.1 Evaluating Individual Fairness (RQ2)

In this section we describe the quantitative evaluation of individual fairness required to answer RQ2. In Sections 2.2.2 and 2.3.2 individual fairness was introduced and reviewed as an imposed constraint on a classifier. However, to evaluate individual fairness, we are not interested in constraints, but in a metric that captures the extent to which a classifier treats similar individuals similarly. We demand that this metric is applicable to an arbitrary classifier trained on an arbitrary dataset, without requiring information from an oracle or regulator. Defining such a metric raises two questions: (i.) How to define similarity of individuals? (ii.) How to define similar treatment of individuals?

4.1.1 Defining Similarity of Individuals

Defining a similarity metric between individuals is a challenging task. We know from Section 2.3.2 that such metrics are not available in practice. Nonetheless, to be able to capture the similarity of individuals, in line with (Zemel et al., 2013) we assume that the similarity between a pair of individuals is given by the standardised Euclidean distance between their features, defined in Equation 4.1. Although this assumption does not meet the requirements of an appropriate similarity metric discussed in Section 2.2.2, such an assumption is necessary when investigating individual fairness in real-life classification tasks.

$$d_{\text{Euclidean}}(u_s, v_s) = d_{\text{Euclidean}}(v_s, u_s) = \sqrt{\sum_{j=1}^{M} (u_{sj} - v_{sj})^2} \tag{4.1}$$

where $u_s, v_s$ denote the standardised feature vectors of two individuals $u, v \in \mathbf{X}$. Standardisation is important when features with different scales are compared. For example, if individuals have a feature age (in years) and a feature income (in euros), the Euclidean distance without standardised features will mainly report differences in income. Standardisation allows for an equal contribution of each feature to the reported difference.

Alternative distance metrics: While we use the standardised Euclidean distance, other distance metrics could be equally valid depending on the definition of similarity that is assumed. For example, if individuals are considered similar if their features are at a close angle from each other, cosine distance could be a valid metric. Additionally, weighting the features can be relevant for tasks where certain features are of greater importance for defining similarity.

4.1.2 Defining Similar Treatment of Individuals

In the Lipschitz Condition from (Dwork et al., 2012) pairs of individuals are said to be treated similarly if their predicted probabilities are close. Similar treatment is then measured by a distance function D over these probabilities. For cases with binary labels, the Total Variation Norm can be used (Dwork et al., 2012), which is the absolute difference between the predicted probabilities, as defined in Equation 4.2.

$$D_{\text{TotalVariationNorm}}(M(u), M(v)) = |M(u) - M(v)| \tag{4.2}$$

where $M$ follows the notation introduced in Section 2.1.
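For illustration, minimal sketches of the two distances in Equations 4.1 and 4.2 could look as follows; standardising features per column beforehand is the assumption made here, and the function names are hypothetical.

```python
import numpy as np

def standardise(X: np.ndarray) -> np.ndarray:
    """Standardise each feature (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def d_euclidean(u_s: np.ndarray, v_s: np.ndarray) -> float:
    """Equation 4.1: Euclidean distance between two standardised feature vectors."""
    return float(np.sqrt(np.sum((u_s - v_s) ** 2)))

def D_total_variation(prob_u: float, prob_v: float) -> float:
    """Equation 4.2: absolute difference between two predicted probabilities."""
    return abs(prob_u - prob_v)
```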

4.1.3 Consistency Metric

To our knowledge, the only metric available in the literature that aims to capture the individual fairness of a classifier is the Consistency metric defined in (Zemel et al., 2013). Consistency, given in Equation 4.3, assesses the model classifications locally in the input space by comparing the prediction of the classifier for an individual to the predictions for its k nearest neighbours in the feature space. It is used by (Raff et al., 2018) and (Lahoti et al., 2019).

$$k\text{-Consistency} = 1 - \frac{1}{N} \sum_{u \in \mathbf{X}} \Big| \text{Round}(M(u)) - \frac{1}{k} \sum_{v \in kNN(u)} \text{Round}(M(v)) \Big| \tag{4.3}$$

where $kNN(u)$ returns the $k$ individuals with the smallest Euclidean distance to individual $u$. Consistency values close to 1 indicate that similar individuals are treated similarly. This metric defines the extent to which 'similar individuals are treated similarly' as follows: (i.) individuals are compared against their $k$ nearest neighbours instead of against all individuals as in (Dwork et al., 2012), (ii.) similar treatment is considered with respect to predicted labels, rather than the predicted probabilities of these labels as in (Dwork et al., 2012). To understand the differences between these definitions, we consider an example.

An Example: Recruiting and Consistency Consider the recruitment scenario in Section 2.2.1, to which we now add an extra feature: the IQ of the applicant. We calculate 2-Consistency for the classification scenario in Figure 4.1, where a prediction of 1 corresponds to the favourable label (the applicant is hired). This model seems to comply with the definition of individual fairness from (Dwork et al., 2012): individuals with similar work experience and intelligence receive the same prediction. However, the corresponding consistency score is low, namely 0.667. This is due to the fact that individuals are always compared against their nearest neighbours, even if the distance to these neighbours is large. For example, individual A, who does not resemble any of the other applicants, is still compared against its two nearest neighbours, B and C, while it is evident from Figure 4.1 that these individuals should not be considered similar. Such an effect might vanish if the population considered is larger, but the problem remains for outliers.

Figure 4.1: Visualisation of the features and outcomes of a fictitious recruitment scenario. Applicants A-F are to be invited for a job interview based on their work experience (Y-axis) and IQ (X-axis). Filled red circles (pred: 1.0) correspond to the favourable label, i.e. being invited to the interview, and unfilled red circles to the unfavourable label.
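To make Equation 4.3 concrete, a minimal sketch of how k-Consistency could be computed with scikit-learn's nearest-neighbour search is given below; the function name and interface are our own illustration and are not part of (Zemel et al., 2013)'s implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_consistency(X, probs, k):
    """k-Consistency (Equation 4.3): one minus the mean absolute difference
    between an individual's predicted label and the mean predicted label of
    its k nearest neighbours in (Euclidean) feature space."""
    y_hat = np.round(probs)                      # Round(M(u)): predicted labels
    # Query k + 1 neighbours because each point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbour_labels = y_hat[idx[:, 1:]]         # drop the point itself
    return 1.0 - np.mean(np.abs(y_hat - neighbour_labels.mean(axis=1)))
```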

4.1.4 Lipschitz Violation Metric

We propose a new metric to measure the individual fairness of a classifier in order to address the limitations of the Consistency metric. We name our metric Lipschitz Violation as it turns the original Lipschitz Condition from (Dwork et al., 2012) into a metric. Our metric, given in Equation 4.4, measures individual fairness by the mean number of times the Lipschitz Condition from (Dwork et al., 2012) is violated.

$$\text{Lipschitz-Violation} = \frac{1}{N(N-1)} \sum_{u \neq v \in X} \big[\, K \cdot D(M(u), M(v)) \leq d(u, v) \,\big] \qquad (4.4)$$

where the summation is over all possible pairs (u, v) and where K is a hyperparameter. The brackets [.] return a value of 1 when the Lipschitz condition in brackets is violated, and 0 otherwise. Small values of our metric imply more individual fairness than large values. However, values close to zero might still indicate a large amount of violation. For example, for a dataset consisting of 1,000 individuals, i.e. 999,000 possible pairs, a classifier with a Lipschitz Violation metric equal to 0.001 still results in 999 pairs violating individual fairness. In our test-bed we choose d(u, v) and D(M(u), M(v)) to be the standardised Euclidean distance and the Total Variation Norm respectively, for reasons discussed in Sections 4.1.1 and 4.1.2.

The parameter K is a positive hyperparameter that controls the rigidity of our fairness metric. Smaller values of K give rise to smaller values of the Lipschitz Violation metric, i.e. ‘more individual fairness’, since a small K allows similar individuals in feature space to have (very) different predicted probabilities without violating the condition. A large value of K implies that individual fairness is more constrained, i.e. similar individuals in feature space should have (very) similar predicted probabilities. In our experiment in Section 6, we will investigate how this hyperparameter influences the Lipschitz Violation metric.
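A minimal sketch of Equation 4.4 is shown below, using the standardised Euclidean distance for d and the Total Variation Norm for D as motivated above; the function name and the vectorised implementation are our own illustration rather than the exact code used in our experiments.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def lipschitz_violation(X, probs, K=1.0):
    """Lipschitz Violation (Equation 4.4): the fraction of pairs (u, v), u != v,
    for which K * D(M(u), M(v)) <= d(u, v) is violated, with d the standardised
    Euclidean distance and D the Total Variation Norm."""
    d = squareform(pdist(X, metric='seuclidean'))   # d(u, v)
    D = np.abs(probs[:, None] - probs[None, :])     # |M(u) - M(v)|
    violated = K * D > d                            # condition violated
    np.fill_diagonal(violated, False)               # exclude u == v
    n = len(probs)
    return violated.sum() / (n * (n - 1))
```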

4.1.5 Consistency or Lipschitz Violation

In summary, the proposed Lipschitz Violation metric in Equation 4.4 differs from the Consistency metric in Equation 4.3 (Zemel et al., 2013) in two ways. First, it compares each individual to all other individuals, rather than just the nearest neighbours. Second, it defines similar treatment in terms of the predicted probabilities of labels rather than the predicted labels (Dwork et al., 2012). In Section 5.2, we will see the implications of these differences.

4.2 Evaluating the Difference in Data Distribution between the Groups (RQ3)

A prerequisite for answering RQ3, discussed in Section 1.2, is a measure of the difference in the distribution of the data over the privileged and the unprivileged group. In line with (Hardt et al., 2016) and (Kim et al., 2020), we use the base rates to measure the data inequality between the privileged and unprivileged group. Our metric, defined in Equation 4.5, is the base rate difference between the privileged and unprivileged group, i.e. the difference between their probabilities of truly belonging to the favourable class.

$$\text{Base Rate Difference} = P[Y = 1 \mid A = 0] - P[Y = 1 \mid A = 1] \qquad (4.5)$$
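For completeness, a small sketch of Equation 4.5 in code; here y holds the true binary labels and a the binary sensitive attribute, with the group coding following the equation above, and the toy values are made up for illustration.

```python
import numpy as np

def base_rate_difference(y, a):
    """Base Rate Difference (Equation 4.5): P[Y=1 | A=0] - P[Y=1 | A=1]."""
    y, a = np.asarray(y), np.asarray(a)
    return y[a == 0].mean() - y[a == 1].mean()

# Toy example: group A=0 has a base rate of 0.75, group A=1 of 0.25.
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(base_rate_difference(y, a))  # 0.5
```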

Instead of considering only the difference in labels between the groups, we could consider the difference in the joint distribution of labels and features of each group. For example, (Dwork et al., 2012) use the earthmover distance over the joint distribution of features and labels to quantify the difference in data between the privileged and unprivileged group. Their framework, however, is merely theoretical: in practice we do not know the true distribution of each group's data, which is required to compute the earthmover distance. There are methods to approximate the earthmover distance between two datasets; for example, Alvarez-Melis and Fusi (2020) use optimal transport techniques to derive a quantitative metric that compares two datasets. Implementing such methods is beyond the scope of this thesis and remains a possibility for future research.

In the next chapter, we outline the test-bed for evaluating the cost of group fairness interventions, using the metrics we introduced in this chapter.


5 | Test-Bed for Evaluating the Cost of Group Fairness Interventions

In this chapter, we describe our test-bed for the evaluation of fairness interventions. This test-bed allows for a transparent evaluation and comparison of multiple group fairness intervention algorithms on various real-life datasets, which enables us to answer our main and detailed research questions as described in Section 1.2. In Section 5.1 we describe the experimental pipeline our test-bed relies on. In Section 5.2, we conduct an experiment to explore and justify the metrics for individual fairness discussed in Section 2.3.2, which gives additional insights into RQ2.

5.1 A Pipeline for Evaluating the Cost of Group Fairness Interventions on Real-Life Data

To investigate how the relationship between the cost of group fairness interventions and the difference in the distribution of the data over the privileged and unprivileged group takes shape in real-life classification tasks, we evaluate the outcomes of ten group fairness intervention algorithms on five real-life datasets. Our test-bed guides this process by providing an experimental pipeline for the generation and evaluation of group fairness intervention outcomes. This pipeline builds on the benchmark developed in (Friedler et al., 2019), which we extend with additional fairness metrics and debiasing algorithms. Our pipeline consists of three phases and is outlined in Box 5.1.1. In Phase 1 we prepare the datasets. In Phase 2 we generate the outcomes of several group fairness intervention algorithms on these datasets. In Phase 3, we evaluate these outcomes in terms of group fairness, individual fairness, classification utility and base rate differences. In what follows, we describe each phase in detail.


Box 5.1.1 Experimental Pipeline

Phase 1. Data and Data Pre-processing
  1A. Data: description of datasets
  1B. Data pre-processing: ensure that each dataset is pre-processed in the same manner
  1C. Increasing the number of evaluations: evaluate each dataset for multiple train/test splits

Phase 2. Group Fairness Intervention Algorithms
  2A. Generating group fairness intervention outcomes: description of group fairness algorithms and baselines
  2B. Parameter settings for fairness intervention algorithms: description of hyperparameter settings

Phase 3. Evaluation of Group Fairness Intervention Outcomes
  3A. Evaluating statistical parity: measure the statistical parity of a model
  3B. Evaluating classification utility: measure Accuracy and F1-Score
  3C. Evaluating individual fairness: measure Consistency and Lipschitz Violation
  3D. Evaluating the difference in data distribution of the groups: measure Base Rate Difference
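The box above can be read as a nested loop over datasets, train/test splits and algorithms. The toy sketch below illustrates this structure on synthetic data with two plain scikit-learn baselines; it is a simplified stand-in for the actual benchmark of (Friedler et al., 2019), and the final metric is one common way to quantify statistical parity, namely the gap in predicted positive rates between the groups.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Phase 1: toy pre-processed data (binary label y, binary sensitive attribute a).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
a = rng.integers(0, 2, size=300)
y = (X[:, 0] + 0.5 * a + rng.normal(size=300) > 0).astype(int)

# Phase 1C: ten random splits, with 1/3 of the data held out for evaluation.
splitter = ShuffleSplit(n_splits=10, test_size=1/3, random_state=0)

# Phase 2: the algorithms to compare (here only two plain baselines).
algorithms = {'DecisionTree': DecisionTreeClassifier(random_state=0),
              'LR': LogisticRegression(max_iter=1000)}

# Phase 3: evaluate every (split, algorithm) combination.
results = []
for split_id, (tr, te) in enumerate(splitter.split(X)):
    for name, clf in algorithms.items():
        preds = clf.fit(X[tr], y[tr]).predict(X[te])
        results.append({
            'split': split_id,
            'algorithm': name,
            'accuracy': accuracy_score(y[te], preds),
            'f1': f1_score(y[te], preds),
            # Statistical parity difference: gap in predicted positive rates.
            'spd': preds[a[te] == 0].mean() - preds[a[te] == 1].mean(),
        })
```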

5.1.1 Phase 1 - Data and Data Pre-processing

Phase 1 of our pipeline consists of three components. 1A describes the datasets we use to enable the practical evaluation of the cost of group fairness interventions. Subsequently, 1B describes how each dataset is pre-processed to ensure all algorithms can be objectively compared, and 1C describes how and why we increase the number of evaluations by performing different train/test splits for each dataset. We now describe each component in more detail.

1A. Data

The selected datasets originate from real-world classification tasks covering a variety of domains in which certain groups have been subjected to disadvantageous treatment. The domains we consider are recruiting and promotion, credit-worthiness and recidivism prediction. A prerequisite of the selected datasets is that they cover a wide range of base rate differences, allowing us to adequately investigate the relationship between the difference in the data distributions of the groups and the cost of group fairness interventions. In Section 6.2 we verify whether this is the case. A summary of the datasets can be found in Table 5.1. We now discuss each dataset in detail.


Ricci: Firefighter Promotion Exam Scores This dataset (Miao, 2010) was used as part of a U.S. court case (Ricci v. DeStefano) that dealt with racial discrimination. Ricci consists of administered exams of Connecticut firefighters intended to qualify for a promotion in New Haven’s fire department. The dataset has 118 entries and five attributes, including the sensitive attribute race, where Caucasian/Hispanic individuals belong to the privileged group and people of colour belong to the unprivileged group. The task is to predict whether a firefighter receives a promotion.

The German Credit dataset This dataset (Hofmann, 1994) consists of 1,000 individuals with twenty features. The task is to predict whether they have good or bad credit risk. The dataset includes two binary sensitive attributes: sex, for which the privileged and unprivileged group are males and females respectively, and age, for which the privileged and unprivileged group are adults (> 25 years old) and adolescents (≤ 25 years old) respectively.

COMPAS Recidivism Risk Score This dataset (Larson et al., 2016) was collected by the newsroom ProPublica about the use of the COMPAS software, a risk assessment tool used by U.S. courts to assess the likelihood of a defendant becoming a recidivist. For each of the 6,167 defendants, it includes information such as the charge degree and the number of prior offences, along with the sensitive attribute race (Caucasians/Hispanics being the privileged and people of colour being the unprivileged group). The task is to predict recidivism, i.e. whether an individual will be arrested again within two years (the unfavourable label).

COMPAS Violent Recidivism Risk Score This dataset (Larson et al., 2016) is used for the same task as the recidivism prediction task described above, but the focus is now on predicting a violent re-offence within two years (the unfavourable label). This dataset contains 4,010 individuals, who are again divided into a privileged group (Caucasians/Hispanics) and an unprivileged group (people of colour).

Dataset              N      M   Favourable Label        Protected Attribute
Ricci                118    3   Promotion               Race
German               1,000  20  Good Credit             Sex
German               1,000  20  Good Credit             Age
Propublica           6,167  8   No Recidivism           Race
Propublica Violent   4,010  8   No Violent Recidivism   Race

Table 5.1: Datasets used in our experimental pipeline, with N the number of observations and M the number of features.

1B. Data Pre-processing: Ensuring Equal Input

Implementations that develop or evaluate group fairness interventions frequently combine the pre-processing of the data with the algorithm itself (Zafar et al., 2017), (Calmon et al., 2017). However, Friedler et al. (2019) show that choices made in pre-processing clearly affect the performance of fairness algorithms. Therefore, to reliably compare group fairness intervention algorithms, it is crucial that all algorithms receive the same data as input. This is the purpose of Step 1B. Each dataset is pre-processed such that (i.) the label to be predicted is binary, (ii.) the sensitive attribute with respect to which fairness is required is binary, and (iii.) all features are numerical, i.e. categorical features are one-hot encoded. The last requirement is necessary as some of the algorithms we consider cannot handle categorical features.
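As an illustration of requirements (i.)-(iii.), the sketch below pre-processes a small made-up dataframe with pandas; the column names and values are hypothetical and do not correspond to any of the actual datasets.

```python
import pandas as pd

# Hypothetical raw data: a categorical label, a categorical sensitive
# attribute and a categorical feature.
df = pd.DataFrame({
    'age':     [22, 35, 48, 29],
    'sex':     ['female', 'male', 'male', 'female'],
    'purpose': ['car', 'education', 'car', 'business'],
    'credit':  ['good', 'bad', 'good', 'good'],
})

df['credit'] = (df['credit'] == 'good').astype(int)   # (i.)  binary label
df['sex'] = (df['sex'] == 'male').astype(int)         # (ii.) binary sensitive attribute
df = pd.get_dummies(df, columns=['purpose'])          # (iii.) one-hot encode categoricals
```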

1C. Increasing the Number of Evaluations

To increase the robustness of our investigation, we increase the number of fairness intervention outcomes generated for each dataset. We do this by evaluating different train/test splits of each dataset: we randomly split each processed dataset 10 times, where each random split uses 2/3 of the data to train the fairness intervention algorithm and the remaining 1/3 to evaluate the trained algorithm in Phase 3.

5.1.2 Phase 2 - Group Fairness Intervention Algorithms

The statistical parity intervention algorithms are those used by (Friedler et al., 2019), along with the Learning Fair Representations algorithm from (Zemel et al., 2013). The choice of these algorithms was based on diversity of fairness interventions and accessibility of source-code. For Learning Fair Representations, we adapted the implementation of (Bellamy et al., 2019) such that it integrates into the benchmark structure of (Friedler et al., 2019). In Phase 2A we discuss the algorithms and in Phase 2B their corresponding hyperparameter settings.

2A. Group Fairness Intervention Algorithms

Along with the group fairness intervention algorithms described below, we evaluate each train/test pair on four classifier baselines, whose only purpose is to maximise classification utility. These baselines are Decision Trees, Gaussian Naive Bayes (GaussianNB), Logistic Regression (LR) and Support Vector Machines (SVM). This results in a total of ten algorithms.
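For reference, these four baselines correspond to standard scikit-learn estimators; the sketch below shows one plausible instantiation (the hyperparameters here are defaults and may differ from the settings described in Phase 2B).

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

baselines = {
    'DecisionTree': DecisionTreeClassifier(),
    'GaussianNB':   GaussianNB(),
    'LR':           LogisticRegression(max_iter=1000),
    # probability=True so that predicted probabilities are available for
    # the individual fairness metrics of Section 4.1.
    'SVM':          SVC(probability=True),
}
```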

Feldman et al. (2015): provide a pre-processing approach that modifies the features such that the marginal distribution of each feature becomes similar for the privileged and unprivileged groups, while relevant information for classification is preserved. Recall that the aim of pre-processing techniques is that any classifier applied to the modified data satisfies group fairness. Since this algorithm only pre-processes the data, we evaluate this approach on four classifiers: Decision Trees (DecisionTree), Gaussian Naive Bayes (GaussianNB), Logistic Regression (LR) and Support Vector Machines (SVM), to which we refer as Feldman-DecisionTree, Feldman-GaussianNB, etc.
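The repair idea can be sketched for a single numeric feature as follows: within each group, a value is replaced by the median, across groups, of the value found at the same within-group quantile, so that the groups' marginal distributions are aligned while the rank order within each group (and hence much of the predictive signal) is preserved. The function below is a simplified illustration of this 'full repair', not the implementation of (Feldman et al., 2015) used in our experiments.

```python
import numpy as np

def repair_feature(x, a):
    """Simplified full repair of one numeric feature x for a binary group
    attribute a: both groups are mapped onto a common ('median') distribution
    while preserving within-group ranks."""
    x = np.asarray(x, dtype=float)
    repaired = np.empty_like(x)
    groups = [x[a == g] for g in (0, 1)]
    for g in (0, 1):
        xg = x[a == g]
        # Quantile level occupied by each value within its own group.
        q = (np.argsort(np.argsort(xg)) + 0.5) / len(xg)
        # Replace by the median over groups of each group's q-th quantile.
        repaired[a == g] = np.median([np.quantile(other, q) for other in groups], axis=0)
    return repaired
```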

Zafar et al. (2017): provide an in-processing approach that reformulates the non-convex Statistical Parity into a relaxed, convex definition applicable to decision boundary classifiers such as logistic regression or support vector machines. They propose two related optimisation problems: one that maximises accuracy under fairness constraints, and one that maximises fairness under accuracy constraints. In line with (Friedler et al., 2019) we refer to these versions as ZafarAccuracy and ZafarFairness respectively. The first may be used in cases where compliance with statistical parity (or Zafar et al. (2017)'s relaxed version of it) is required, whereas the latter may be used when some form of unfairness is allowed to meet business needs in terms of accuracy.
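To give an impression of the relaxation, the sketch below fits a logistic regression whose decision boundary is constrained so that the empirical covariance between the sensitive attribute and the signed distance to the boundary stays within a small threshold c, which is the convex proxy for statistical parity used by this line of work. The implementation details (SLSQP solver, function names, the value of c) are our own simplification, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def constrained_logistic_regression(X, y, a, c=0.01):
    """Minimise logistic loss subject to |cov(a, theta^T x)| <= c, a convex
    relaxation of statistical parity for decision-boundary classifiers.
    Assumes y in {0, 1} and a binary sensitive attribute a."""
    X1 = np.hstack([X, np.ones((len(X), 1))])      # add intercept column
    s = 2 * y - 1                                   # labels in {-1, +1}
    a_centred = a - a.mean()

    def loss(theta):
        return np.mean(np.logaddexp(0.0, -s * (X1 @ theta)))   # logistic loss

    def cov(theta):
        return np.mean(a_centred * (X1 @ theta))                # fairness proxy

    constraints = [{'type': 'ineq', 'fun': lambda t: c - cov(t)},
                   {'type': 'ineq', 'fun': lambda t: c + cov(t)}]
    res = minimize(loss, x0=np.zeros(X1.shape[1]),
                   method='SLSQP', constraints=constraints)
    return res.x
```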

Zemel et al. (2013): provide a combined pre- and in-processing approach that learns a modified representation of the features and simultaneously learns the parameters of a classification model. Their neural-network-like structure learns a mapping from the features to a prototype space independent of the sensitive attribute. These prototypes are then used as input for a classification model.
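The mapping to prototypes can be sketched as a softmax over (negative) distances, after which per-prototype label scores produce a prediction. The sketch below shows only this forward pass, with our own variable names and a plain squared Euclidean distance; it omits the learning of the prototypes, weights and the fairness and reconstruction terms of the original method.

```python
import numpy as np

def lfr_forward(X, prototypes, w):
    """Forward pass of a Learning-Fair-Representations-style model:
    soft-assign each individual to prototypes and combine per-prototype
    label scores w into a predicted probability."""
    # Squared Euclidean distance from each individual to each prototype.
    d = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    M = np.exp(-d)
    M /= M.sum(axis=1, keepdims=True)   # softmax over prototypes
    return M @ w                        # predicted probability per individual
```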
