
Survivorship bias

Exploring the effect of survivorship bias mitigation on the performance of a recidivism prediction tool

Layout: typeset by the author using LaTeX.

Exploring the effect of survivorship bias mitigation on the performance of a recidivism prediction tool

Method and challenges

Ninande Y. Vermeer 11269057

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors
UvA: dhr. dr. R.G.F. Winkels, Leibniz Center for Law, Faculty of Law, University of Amsterdam, Roeterseilandcampus, building A, 6th floor, 1018 WV Amsterdam
KPMG: dr. A.W.F. Boer

June 26, 2020


Abstract

Survivorship bias is the fallacy of focusing on entities that survived a certain selection process and overlooking the entities that did not. This common form of bias can lead to wrong conclusions. AI Fairness 360 is an open-source toolkit, made by IBM, that can detect and handle bias. The toolkit detects bias according to different notions of fairness and contains several bias mitigation techniques. However, what if the bias in the dataset is not bias, but rather a justified imbalance? Mitigating “bias” that is in fact justified is undesirable, since it can have a serious negative impact on the performance of a prediction tool based on machine learning. The tool can become unusable. In order to make well-informed product design decisions, it would be valuable to be able to run simulations of bias mitigation in several situations to explore its impact. This thesis attempts to create a tool that can run such a simulation to explore the effect of survivorship bias mitigation on the performance of a recidivism prediction tool. The results reveal a lack of research on this subject. For that reason, the main contribution of this thesis is an indication of the challenges that come with the creation of such a simulation tool.


Contents

1 Introduction
2 Literature
   2.1 The origin of survivorship bias
   2.2 Recidivism
   2.3 AI Fairness 360 and bias mitigation
   2.4 Causal diagrams
3 Research Question
4 Method
   4.1 Technological requirements
   4.2 Generating the start dataset
   4.3 Recidivism prediction tool
   4.4 Total simulation
5 Results and Evaluation
   5.1 70 percent males and 50 percent females recidivate
       5.1.1 Without noise
       5.1.2 500 new individuals instead of 5000
   5.2 40 percent males and 30 percent females recidivate
   5.3 10 percent males and 3 percent females recidivate
6 Conclusion
7 Discussion and future work
Appendices
A Conditional Probability table
B Results

1 | Introduction

The substitution of human decision making with Artificial Intelligence (AI) technologies increases the dependence on trustworthy datasets. A dataset forms the foundation of a product. When the dataset contains bias, the outcome will be biased too. Therefore, the subject of bias is studied extensively in the fields of Artificial Intelligence and Data Science. All datasets are subsets of reality. If this subset contains systematic errors, for example because it incorrectly concentrates on entities with a certain trait, the dataset will be biased. In this case some entities belong to the group of interest, but they are not observed and therefore they will not become part of the dataset. Subsequently, valuable and even essential information can be missed. In effect, some individuals of the group of interest have not “survived” a certain selection process, which makes them unobservable. The resulting bias is called survivorship bias. Survivorship bias and the effects of its mitigation are the subjects of this thesis.

Survivorship bias is common in all kinds of domains. A clear example concerns studying success. Such studies are mostly aimed at successful people. However, studying people that tried to become successful and failed can be at least as important. Apparently, the people that did not succeed had habits, traits or less luck that kept them from becoming successful. Studying both the people that “survived” and the people that “did not survive” the process of becoming successful is essential to understand how one can truly become successful. Another example is disease prediction. If a disease prediction tool is trained only on features of people that were diagnosed with the disease, survivorship bias arises: the features of people with the disease but without a diagnosis are overlooked.

Sometimes, survivorship bias is difficult to recognise, since it requires counter-intuitive thinking: one has to look at both the visible and the invisible, the entities that took the same path but did not “survive”. This awareness is necessary to collect trustworthy datasets which lead to less biased prediction tools. After having a reasonable understanding of the concept of survivorship bias, it is desirable to explore the effect of its mitigation on product performance, since this can help with making design decisions. For example, if a dataset is debiased while the bias was actually an explainable imbalance, the accuracy of the predictions can decrease substantially. The prediction tool might be fair, but it will be unusable. However, if the bias is not justifiable, the prediction tool can discriminate against individuals and give unreliable results. It is a challenge to balance performance and fairness. A simulation of bias mitigation in prediction tools can help with taking into account all possible consequences of design decisions to make a well-informed choice. This thesis aims to raise awareness of survivorship bias and to explore the challenges that come with creating a simulation tool to explore the effect of survivorship bias mitigation on product performance. In order to do that, a prototype is created that investigates the effect of survivorship bias in the domain of recidivism prediction. Recidivism is the propensity for relapsing into unjust behaviour. It is valuable to be able to predict recidivism for criminals, since it helps with making decisions in their trial processes. This can benefit the criminal or contribute to the safety of other citizens.

AI Fairness 360 (AIF360), developed by IBM, is an open-source Python toolkit that aims to examine, report and mitigate bias to prevent algorithmic unfairness in an industrial setting, and to help fairness researchers with sharing and evaluating their work.[1] The toolkit is used for the prototype since it is widely utilised and provides a general framework to build upon. Hopefully, in the future, AIF360 will also be able to simulate the effect of design decisions concerning bias mitigation to improve product designs. This thesis aims to contribute to this goal by proposing new research questions, based on the challenges faced during this project and the obtained results.

This document is organised as follows. Section 2 describes the theoretical foundation: it explains the origin of survivorship bias, current developments in recidivism prediction, the AI Fairness 360 toolkit of IBM and causal diagrams. Section 3 addresses the research context and question. Sections 4, 5 and 6 respectively report the approach, the results and the conclusions. The discussion and future work are combined in section 7.

2 | Literature

2.1 The origin of survivorship bias

The term “survivorship bias” originates from the Second World War.[2] America needed new technologies and mathematical insights to stay ahead of its enemies. Therefore, the United States formed the Applied Mathematics Panel, a group of eleven specialised research groups aimed at finding mathematical insights to enhance strategies and armour. The research group of Abraham Wald investigated aircraft survivability in order to reinforce warplanes. Only 50 percent of the warplanes survived a battle, and even a small increase of this probability would save a significant number of lives. Their first strategy was to examine every plane that returned from battle and to reinforce the parts that were heavily damaged. However, apparently, the planes that returned from battle were able to return safely with those heavily damaged areas. This insight led them to a new approach. They reinforced all parts that were not heavily damaged, because the damage to those parts was probably the reason the other planes did not survive. This was the start of the concept of survivorship bias. First, they overlooked the attributes of the non-survivors, leading to incorrect conclusions. Then, they changed their perspective and saved a significant number of lives.

2.2 Recidivism

This thesis examines survivorship bias in recidivism prediction. Recidivism is the tendency to relapse into undesirable behaviour despite the experience of negative consequences of previous similar behaviour. It is desirable to have an assessment tool that can indicate whether there is a risk that a criminal will repeat the crime after release, because this information can be used to make decisions in their trial processes. For example, the police can observe this person more carefully after his or her release. However, if this assessment tool is trained on biased data, the predictions will be biased too. This can lead to unreliable results and discrimination based on protected social attributes like race and gender. An example of a criminal risk assessment tool is COMPAS, an abbreviation of “Correctional Offender Management Profiling for Alternative Sanctions”.[3] The company Northpointe developed the tool in 1998 and since 2000 it has been widely used to predict the risk that a criminal will recidivate within two years of assessment. These predictions are based on 137 features about the individual and their past. The performance and fairness of COMPAS have been thoroughly explored.

Julia Dressel and Hany Farid did research on the performance of COMPAS.[3] They compared the accuracy of COMPAS with the overall accuracy of human assessment. The participants of the study had to predict the risk of recidivism for a defendant, based on a short description of the defendant’s gender, age and criminal history. Although these participants had little to no expertise in criminal justice, they achieved an accuracy of 62.8 percent. This is comparable with the 65.2 percent accuracy of COMPAS. Thus, non-experts were almost as accurate as the COMPAS software. They also compared the widely used recidivism prediction tool with a logistic regression model based on the same seven features as the human participants. That classifier achieved an accuracy of 66.6 percent. This suggests that a classifier based on seven features performs better than a classifier based on 137 features. Comparing the accuracies of COMPAS, human assessment and a simple classifier casts serious doubt on the effort of predicting the recidivism risk algorithmically.

Moreover, the protected attribute race was not included in the training dataset of COMPAS, but ProPublica, a nonprofit organisation involved in investigative journalism, indicated racial disparity in its predictions. The assessment tool seemed to overpredict recidivism for black defendants and to underpredict recidivism for white defendants. Thus, COMPAS was favouring white defendants over black defendants. Northpointe argued that the analysis of ProPublica had overlooked more standard fairness measures, like predictive parity: COMPAS differentiates recidivists from non-recidivists equally well for black and white defendants. This discussion leads to questions about which fairness metrics to use in a particular situation. There are more than twenty different notions of fairness and it is impossible to satisfy all of them; correcting for one fairness metric means a larger imbalance for another fairness metric. Since the future of a defendant depends on which of these notions of fairness are satisfied, it is important to investigate these notions to find the best fairness metric for the situation. IBM created a tool that helps with finding the best fairness metric for each situation. It also provides a platform for researchers to advance and to ease research concerning this topic. This tool is the subject of the next section.

2.3 AI Fairness 360 and bias mitigation

As mentioned in the introduction, AI Fairness 360 (AIF360) is developed by IBM and is an open-source Python toolkit that can examine, report and mitigate bias to prevent algorithmic unfairness. Moreover, it helps fairness researchers with sharing and evaluating their work. The toolkit contains more than 70 fairness metrics and it includes 9 bias handling (mitigation) algorithms. In addition, it contains metric explanations for consumers to understand the detected bias. AIF360 is compatible with the scikit-learn package of Python. IBM experimented with three classifiers, namely Logistic Regression, Random Forest and Neural Network. Since Logistic Regression is well suited to binary classification problems, this classifier is used in this thesis. The 9 bias mitigation algorithms included in the toolkit each address a certain stage of the modelling process and can be categorised as pre-, in- or post-processing bias mitigation algorithms. The bias mitigation technique used in this thesis is a pre-processing algorithm called reweighing. The reweighing algorithm does not change labels in the training dataset, but carefully assigns weights to the training examples in order to remove the bias from the dataset.[4]
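To make this concrete, the sketch below shows how a fairness metric could be reported and reweighing applied with AIF360. The toy DataFrame, the column names risk and gender_male, and the choice of privileged group follow the setting of this thesis; they are illustrative assumptions rather than the author's actual code.

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Toy one-hot encoded data: 'risk' is the label, 'gender_male' the protected attribute.
df = pd.DataFrame({"gender_male": [1, 1, 1, 0, 0, 0],
                   "risk":        [1, 1, 0, 1, 0, 0]})

dataset = BinaryLabelDataset(df=df, label_names=["risk"],
                             protected_attribute_names=["gender_male"])

privileged_groups = [{"gender_male": 0}]    # females, as chosen in this thesis
unprivileged_groups = [{"gender_male": 1}]  # males

# Report one of the available fairness metrics before mitigation.
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=unprivileged_groups,
                                  privileged_groups=privileged_groups)
print("Statistical parity difference:", metric.statistical_parity_difference())

# Reweighing leaves the labels untouched and only assigns instance weights.
rw = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
dataset_transf = rw.fit_transform(dataset)
print(dataset_transf.instance_weights)
```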

2.4 Causal diagrams

The AIF360 toolkit of IBM focuses on bias indication and mitigation. However, it does not take into account relations between attributes. The imbalance indicated by AIF360 as bias can have a justified cause. Therefore, to make appropriate choices about bias mitigation, it is important to explore cause-and-effect relations between attributes, since this can help with distinguishing between bias and justified imbalance. A mathematical framework used to express what is known about the world is called Structural Causal Models (SCM).[5] An SCM consists of counterfactual and interventional logic, structural equations and graphical models. A graphical model of causal relations is called a causal graph, or causal diagram. A causal diagram is a causal network that contains multiple nodes, or attributes, that are connected via arrows in characteristic patterns called junctions.[6] There are three basic types of junctions, namely 1) chains, 2) forks and 3) colliders. See figure 2.1 for a visual representation of the three junctions. These junctions are important, since they illustrate the type of causal relations between attributes.

Figure 2.1: Basic junction types: (a) chain A → B → C, (b) fork A ← B → C, (c) collider A → B ← C.


Figure 2.2: A simple causal loop diagram.

Attributes can positively or negatively influence each other and form circular connections that are called feedback loops. These connections and effects can be illustrated by causal loop diagrams. Figure 2.2 illustrates a simple causal loop diagram. The plus near the head of the arrow connecting A to B implies that when A increases, B increases, and when A decreases, B also decreases.[7] The minus near the head of the arrow connecting B to A implies that they have an opposite effect: when B increases, A decreases, and when B decreases, A increases.

Interpreting the outcome of machine learning models can have an effect on the construction of the model itself. For example, suppose a risk assessment tool is trained on data that is biased towards men, since it contains a disproportionate number of men with high risk labels. Subsequently, the probability that the tool predicts a high risk score for a new male individual will be larger compared to a female individual. These predictions are analysed by authorities and they can act accordingly. If they conclude from these analyses that male individuals form a larger risk compared to women, they can, consciously or unconsciously, rearrange their surveillance tactics to focus more on male individuals. When new data is collected, this data will probably contain a larger group of high risk labeled male individuals. Thus, by interpreting the prediction tool, the dataset of the tool is altered. This amplifies the bias. However, to know whether the imbalance in the data is indeed bias, more information is necessary about the other involved attributes. By constructing causal (loop) diagrams, the cause-and-effect relations between attributes can be exposed. This can reveal relations between attributes that can cause bias or imbalance.

3 | Research Question

The usage of an assessment tool to predict recidivism can provoke survivorship bias. For example, figure 3.1(a) represents criminals, the group of interest, in reality. Not all of these criminals are going to be arrested; that depends on their visibility to the police. In this thesis the range of criminals that the police can observe is called the focus. When there is an equal number of male and female criminals, the focus on the group of interest should be like the magnifying glass in figure 3.1(b), where both genders are observed equally. However, if the police are biased towards men, their focus is directed at aspects of male criminals, incorrectly giving male criminals a higher chance to appear within their focus, see figure 3.1(c). Criminals can only be arrested if they are observed. Since men have a higher chance of being noticed compared to women, it is likely that more men will be arrested. Subsequently, the dataset will contain more male recidivists in relation to female recidivists. Being a man will become a risk factor for recidivism prediction, while in reality gender does not have a correlation with recidivism. Since the people that use this tool cannot observe reality, they do not take into account the features of the group of interest outside their focus. This causes survivorship bias, since the police erroneously concentrate on the male gender. Men are observed more and consequently, men are more prevalent in the dataset of the risk assessment tool. This leads to unreliable recidivism predictions.

The dataset of such a risk assessment tool is altered over the years, since every year new individuals are getting arrested. However, if the tool influences the focus, the effect of survivorship bias is increased. Debiasing the dataset used for the training of the risk assessment tool can avoid this, but debiasing can have a significant effect on the performance of the prediction tool. Therefore, it would be desirable to be able to simulate the effect of the bias and bias mitigation in different situations, fair and unfair. Observations of this effect can help making well-informed decisions about bias mitigation. Moreover, the observations can indicate whether making such a prediction tool is effective or whether it is a waste of time, money and effort.

Figure 3.1: The focus on the group of interest determines the dataset that is obtained, since only observed individuals can become part of the dataset: (a) the group of interest, (b) fair focus on the group of interest, (c) unfair focus on the group of interest. The red individuals are getting arrested. In the fair situation 5 men and 5 women get arrested and in the unfair situation 6 men and 3 women get arrested. Due to the incorrect focus on the male gender, the dataset will be biased.

This thesis explores the process of designing a simulation tool that can demonstrate the effect of correcting or not correcting for survivorship bias on the performance of a recidivism prediction tool in a fair and an unfair situation. Figure 3.2(a) is a causal diagram that demonstrates an unfair situation: being a man directly increases the chance of getting arrested. This provokes survivorship bias. The causal diagram of figure 3.2(b) illustrates a fair situation, since men are arrested more often for recidivism because they indeed recidivate more often. In an unfair situation, it is necessary to debias the dataset before training. But in a fair situation, debiasing is undesirable, since it unnecessarily decreases the accuracy of the prediction tool. Therefore, four situations are explored, namely the fair and unfair situations, each with and without bias mitigation. An overview of these four situations can be seen in figure 3.3. We expect a significant decrease of accuracy in the fair situation with bias mitigation, compared to the other situations.

Figure 3.2: Causal loop diagrams that explain the two situations: (a) the unfair situation, (b) the fair situation.

4 | Method

4.1 Technological requirements

This paragraph mentions the tools and packages that are used for the simulation prototype of this thesis. AIF360 is open-source and can be downloaded from https://github.com/ibm/aif360. Since the toolkit is freely accessible and widely used, it is a stable platform to build upon. This is important for probable future development of a similar simulation tool. Furthermore, the experiments of the creators of AIF360 demonstrated that the toolkit is compatible with the scikit-learn, or sklearn, Python package. The prototype uses its StandardScaler and LogisticRegression modules. Also, AIF360 contains a BinaryLabelDataset class that can accept pandas DataFrames and convert them into a base structure that can be used in AIF360. In order to do the necessary mathematics and to visualise the results, the packages numpy and matplotlib are also required. Finally, the prototype is designed in a Jupyter Notebook, since this medium is also used for the AIF360 tutorials and it is user-friendly. The Jupyter Notebook worked with Python version 3.6.5.
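For reference, a minimal sketch of the imports this setup implies; apart from Python 3.6.5, no package versions are stated in the thesis, so the installation line is an assumption.

```python
# Assumed installation of the packages named in this section:
#   pip install aif360 scikit-learn pandas numpy matplotlib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import Reweighing
```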

4.2 Generating the start dataset

Since there is no actual dataset to start with, the start dataset needs to be generated. But how to make a realistic dataset for this situation? The dataset has to be representative of both possible realities, fair and unfair. This means that the dataset needs to satisfy two requirements. Firstly, the number of men in the dataset must be larger than the number of women, because in either reality more men are arrested. Secondly, the male population must consist of a higher percentage of high risk labeled individuals compared to the female population, because in the fair reality men recidivate more. Taking into account these two requirements, the settings in table 4.1 are used to create the dataset. Figure 4.1 demonstrates these settings in a visual representation of the dataset. 60 percent of the total population in the dataset is male and 40 percent is female. Furthermore, 70 percent of the population of males is arrested for recidivism, against 50 percent of the population of females. These percentages meet the two requirements mentioned above. These high percentages of high risk labeled individuals are chosen in the hope of observing clearer effects of bias mitigation.

Setting                       Percentage [0-1]
Men in dataset                0.6
Men with high risk score      0.7
Women with high risk score    0.5

Table 4.1: Start dataset settings.

Figure 4.1: Visual representation of the dataset.

As mentioned in the technological requirements, the dataset is created as a DataFrame, since this type can easily be converted into a format compatible with AIF360. Initially, this DataFrame consists of two features, namely gender and risk. The possible values of gender are ‘male’ and ‘female’. The risk attribute indicates which individuals got arrested for recidivism, labeled as ‘1’, and which did not get arrested for recidivism, labeled as ‘0’. ‘High risk’ refers to individuals with a ‘1’ label and ‘low risk’ refers to individuals with a ‘0’ label. Gender will be the protected attribute, since discrimination based on this attribute is unacceptable. The recidivism prediction tool predicts the risk that an individual would recidivate, based on actual cases of recidivism (label ‘1’). Therefore, risk is the label attribute. Note the subtle difference between the risk label in the dataset and the predicted risk: the ‘1’ risk label represents a case of recidivism in the dataset, while it represents a predisposition for recidivism in the prediction model.

Normally, a dataset contains more than one attribute and a label. In an attempt to avoid “obvious” results and to make the situation more realistic, noise is added to the dataset. This adds four more attributes to the DataFrame, namely age, educational level, crime rate and criminal history. The attributes and the possible values they can take are shown in table 4.2. The last column, “Percentage”, represents the distribution of each attribute within each gender population. In order to avoid unreliable results due to correlation between gender and the noise attributes, the attributes are distributed equally across both genders. The attribute age is distributed according to a Gaussian distribution with a mean of 45 and a standard deviation of 8. Thus, most women and men are between 37 and 53 years old. The values of the noise attributes are randomly chosen, since they are considered trivial.

Noise               Options                   Percentage
Age                 0-100                     Gaussian distribution
Educational level   ’A’, ’B’, ’C’, ’D’        All 25%
Crime rate          1, 2, 3, 4                All 25%
Criminal history    ’Yes’ (1) or ’No’ (0)     50%

Table 4.2: Noise attributes: age, educational level, crime rate and criminal history.

The BinaryLabelDataset class of AIF360 accepts a one-hot encoded DataFrame and converts it into a format that is compatible with the toolkit. Besides the DataFrame, the class needs two other input arguments, namely label_names and protected_attribute_names. These are respectively ‘risk’ and ‘gender_male’. Also, AIF360 needs a specification of the privileged and unprivileged group. In this thesis, females are the privileged group and males are the unprivileged group.
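The sketch below shows how such a start dataset could be generated and converted, following the settings of tables 4.1 and 4.2. The prototype's actual generation code is not reproduced in the thesis, so the function name, the use of numpy's default_rng and the exact encoding choices are assumptions.

```python
import numpy as np
import pandas as pd
from aif360.datasets import BinaryLabelDataset

rng = np.random.default_rng(0)

def generate_start_dataset(n=10_000, p_male=0.6, p_high_male=0.7, p_high_female=0.5):
    """Generate gender, risk label and noise attributes (tables 4.1 and 4.2)."""
    male = rng.random(n) < p_male
    risk = np.where(male,
                    rng.random(n) < p_high_male,     # 70 percent of men labeled high risk
                    rng.random(n) < p_high_female)   # 50 percent of women labeled high risk
    return pd.DataFrame({
        "gender": np.where(male, "male", "female"),
        "risk": risk.astype(int),
        "age": rng.normal(45, 8, n).clip(0, 100).round().astype(int),
        "educational_level": rng.choice(list("ABCD"), n),
        "crime_rate": rng.choice([1, 2, 3, 4], n),
        "criminal_history": rng.choice([0, 1], n),
    })

df = generate_start_dataset()

# One-hot encode the categorical attributes; BinaryLabelDataset expects numeric columns.
df_encoded = pd.get_dummies(df, columns=["gender", "educational_level"], drop_first=True)

dataset = BinaryLabelDataset(df=df_encoded, label_names=["risk"],
                             protected_attribute_names=["gender_male"])
privileged_groups = [{"gender_male": 0}]    # females
unprivileged_groups = [{"gender_male": 1}]  # males
```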

4.3 Recidivism prediction tool

The dataset is split into three parts, namely the training set (50 percent), the validation set (30 percent) and the test set (20 percent). As described in the literature, AIF360 has been shown to be compatible with three classifiers, namely Logistic Regression, Random Forest and Neural Network. Logistic Regression is the most useful for binary recidivism prediction, so a Logistic Regression model is trained on the training set. As explained in the Research Question chapter, the effects are explored with and without debiasing. If debiasing is required, the training dataset is reweighed before training. The threshold is the minimum certainty with which the model must predict a high risk label before an individual is classified as high risk. For example, a threshold of 0.45 means that if the prediction model is 45 (or more) percent certain that the individual should be labeled as high risk, the individual will be classified as high risk. Usually, this threshold is set by the programmer based on the expectations of the outcome of the prediction model. However, since there are no supported expectations in this thesis, the validation set is used to determine the threshold that maximises the accuracy. The test set is used to calculate the accuracy of the prediction tool, given the best threshold obtained in the validation process. The total learning process is displayed in figure 4.2.

Figure 4.2: One learning cycle: (a) with debiasing, (b) without debiasing.
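A sketch of one learning cycle under these choices (50/30/20 split, optional reweighing, threshold tuned on the validation set). The helper below is an assumption of how the prototype could be structured, not the author's actual code; it reuses the privileged/unprivileged group definitions from section 4.2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from aif360.algorithms.preprocessing import Reweighing

def learning_cycle(dataset, privileged_groups, unprivileged_groups, debias=True, seed=0):
    """One cycle: optional reweighing, logistic regression, threshold tuning, test accuracy."""
    # 50% train, 30% validation, 20% test.
    train, rest = dataset.split([0.5], shuffle=True, seed=seed)
    valid, test = rest.split([0.6], shuffle=True, seed=seed)

    if debias:
        rw = Reweighing(unprivileged_groups=unprivileged_groups,
                        privileged_groups=privileged_groups)
        train = rw.fit_transform(train)   # only the instance weights change

    scaler = StandardScaler().fit(train.features)
    clf = LogisticRegression(solver="liblinear")
    clf.fit(scaler.transform(train.features), train.labels.ravel(),
            sample_weight=train.instance_weights)

    # Pick the threshold that maximises accuracy on the validation set.
    valid_scores = clf.predict_proba(scaler.transform(valid.features))[:, 1]
    thresholds = np.arange(0.01, 1.00, 0.01)
    accuracies = [np.mean((valid_scores >= t) == valid.labels.ravel()) for t in thresholds]
    best_threshold = thresholds[int(np.argmax(accuracies))]

    # Report the accuracy on the test set with the chosen threshold.
    test_scores = clf.predict_proba(scaler.transform(test.features))[:, 1]
    test_accuracy = np.mean((test_scores >= best_threshold) == test.labels.ravel())
    return clf, scaler, best_threshold, test_accuracy
```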

4.4 Total simulation

The idea is to be able to evaluate the effect of survivorship bias mitigation on the performance of the prediction tool over a certain period of time. Then it would be possible to predict the performance of the prediction tool after, for example, five years. In order to do this, we assume that one learning cycle represents one year. Every year new individuals will get arrested and therefore become part of the dataset. However, this is influenced by the composition of the people that offend, further referred to as the reality, and by possible bias. Subsequently, the composition of the new individuals will be different in the fair situation compared to the unfair situation. Assumptions about how the reality and the bias influence the composition of the new individuals have to be made. These are explained in this section.

The individuals that get arrested are part of a subset of reality. Therefore the realities can be represented by datasets and the new individuals will be a subset of such a dataset. In order to create the reality datasets, assumptions have to be made about what these realities would look like. It is assumed that in the unfair situation, visualised in figure 4.3(a), an equal number of men and women recidivate. Therefore, the dataset of the unfair reality consists of an equal number of male and female individuals. Moreover, half of the population recidivates and half offends for the first time. These percentages are not supported by real cases; they are chosen to ease further calculations. In the unfair reality, there exists a bias towards men, thus men have a larger probability of getting arrested compared to women. Consequently, it will seem like men recidivate more often than women, while in reality they recidivate equally often. In the fair situation, visualised in figure 4.3(b), the assumption that men recidivate more often than women is true. The percentage of men with a high risk label is larger than the percentage of women with a high risk label. In this case, the dataset still consists of an equal number of men and women, but 70 percent of the male population and 50 percent of the female population have a high risk label. However, men and women have the same chance of getting arrested.

Figure 4.3: Visual representation of the realities

The risk tool is used to predict the risk for each individual in the reality dataset. Individuals that receive a high predicted risk have a larger chance of getting arrested. Thus, the probability that an individual will get arrested depends on their gender and their predicted risk score. In order to calculate this conditional probability, equation 4.1 is used. A, G and R represent respectively Arrested, Gender and Risk. Further details about the calculation of the conditional probabilities can be found in Appendix A.

\[
P(A \mid G, R) = \frac{P(G, R \mid A) \times P(A)}{P(G, R)} \qquad (4.1)
\]

In both realities the chance that a criminal will get arrested is two thirds and the probability of being a man or a woman is 50 percent. Moreover, a high predicted risk score increases the probability of getting arrested to 70 percent, compared to 50 percent for individuals with a low predicted risk. The realities differ in the chance of getting arrested given that the individual is male: in the unfair situation men have a larger chance to be taken into custody compared to women, while in the fair situation their chances are equal. The probabilities used to calculate the conditional probabilities can be found in table 4.3.

            Fair situation   Unfair situation
P(A)        2/3              2/3
P(M)        1/2              1/2
P(F)        1/2              1/2
P(A|M)      1/2              3/5
P(A|F)      1/2              1/2
P(1|A)      7/10             7/10
P(0|A)      1/2              1/2

Table 4.3: Assumed probabilities used to calculate the conditional probabilities. These depend on M (male), F (female), A (arrested), 1 (high risk) and 0 (low risk).

The reality datasets are divided into four groups:

1. Men with a high predicted risk score
2. Men with a low predicted risk score
3. Women with a high predicted risk score
4. Women with a low predicted risk score

Figure 4.4: Every learning cycle, a subset of the new criminals is arrested and becomes part of the dataset. The composition of this subset depends on the predictions of the model trained in the most recent learning cycle and on the realities. In order to simulate five years, this process is executed five times.

The conditional probabilities determine how many individuals of each group are added to the subset of the reality. In table 4.4, the conditional probabilities can be found for both the fair and unfair situation. Thus, for example, in the fair situation 70 percent of the group of men with a high predicted risk score is added to the subset.

                 P(A|M,1)   P(A|M,0)   P(A|F,1)   P(A|F,0)
Unfair reality   0.841      0.692      0.700      0.500
Fair reality     0.700      0.500      0.700      0.500

Table 4.4: Conditional probability table representing the chance of being arrested according to gender and predicted risk for each reality. M is male and F is female, 0 represents a low predicted risk score and 1 represents a high predicted risk score.

Note the difference between the predicted risk scores and the risk scores that the individuals have in reality. The individuals are added to the dataset with their original risk score, not with the predicted risk score.
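A sketch of how the five-year simulation could be wired together, reusing the hypothetical helpers from the earlier sketches (generate_start_dataset from section 4.2 and learning_cycle from section 4.3). The helper generate_reality_dataset, which would build the fair or unfair reality of figure 4.3, is also hypothetical; the arrest probabilities are the values of table 4.4.

```python
import numpy as np
import pandas as pd
from aif360.datasets import BinaryLabelDataset

# Arrest probabilities per (gender, predicted risk) group, taken from table 4.4.
P_ARREST = {
    "unfair": {("male", 1): 0.841, ("male", 0): 0.692,
               ("female", 1): 0.700, ("female", 0): 0.500},
    "fair":   {("male", 1): 0.700, ("male", 0): 0.500,
               ("female", 1): 0.700, ("female", 0): 0.500},
}

def to_aif360(df):
    """One-hot encode a raw DataFrame and wrap it for AIF360 (hypothetical helper)."""
    enc = pd.get_dummies(df, columns=["gender", "educational_level"], drop_first=True)
    return BinaryLabelDataset(df=enc, label_names=["risk"],
                              protected_attribute_names=["gender_male"])

def simulate(reality="fair", debias=True, years=5, n_new=5000, seed=0):
    rng = np.random.default_rng(seed)
    data = generate_start_dataset(n=10_000)              # sketch in section 4.2
    accuracies = []
    for year in range(years):
        clf, scaler, threshold, accuracy = learning_cycle(   # sketch in section 4.3
            to_aif360(data),
            privileged_groups=[{"gender_male": 0}],
            unprivileged_groups=[{"gender_male": 1}],
            debias=debias, seed=seed + year)
        accuracies.append(accuracy)

        # New offenders are drawn from the chosen reality; the tool predicts their risk.
        new = generate_reality_dataset(reality, n=n_new)  # hypothetical helper
        scores = clf.predict_proba(scaler.transform(to_aif360(new).features))[:, 1]
        predicted = (scores >= threshold).astype(int)

        # Each individual is arrested with P(A | gender, predicted risk) from table 4.4.
        p = np.array([P_ARREST[reality][(g, r)]
                      for g, r in zip(new["gender"], predicted)])
        arrested = rng.random(len(new)) < p
        # Arrested individuals keep their original risk label when added to the dataset.
        data = pd.concat([data, new[arrested]], ignore_index=True)
    return accuracies
```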

5 | Results and Evaluation

The main situation explored in this thesis is based on the dataset described in the method section (see figure 4.1). This default dataset contains 10,000 individuals. About 60 percent of this population is male; 70 percent of the male population and half of the female population have a high risk label. Also, the default dataset contains noise. The performance of the tool is evaluated over a period of five years. This means that data is collected during five learning cycles. The collected information concerns the accuracy and the threshold of the prediction tool of the corresponding learning cycle and the numbers of high and low risk individuals in the test set. The latter compares the real quantities with the predicted quantities. Each learning cycle, a new dataset of 5000 individuals is evaluated and, based on their gender and predicted risk, individuals are added to the dataset.

Besides the default dataset, four other situations are explored:

1. The default dataset without noise, in order to observe the impact of the noise attributes on the results.

2. The default dataset, but with 500 new individuals instead of 5000, to explore whether the same effect occurs with lower quantities.

3. Two datasets like the default dataset, but with other high/low risk ratios:

   (a) 40 percent of the male population and 30 percent of the female population recidivates, instead of respectively 70 and 50 percent.

   (b) Even lower percentages, namely 10 and 3 percent of the male and female population recidivates.

Each situation is examined with and without debiasing. All values are calculated a hundred times and the means are reported in the results. This approach is taken to minimise the chance of evaluating outliers. These results are available in Appendix B and will be evaluated in the following sections.
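A sketch of how this repetition could look, reusing the hypothetical simulate helper from section 4.4; the number of repetitions (100) and the five-year horizon are from the text, the rest is an assumption.

```python
import numpy as np

def mean_accuracies(reality, debias, repetitions=100, years=5):
    """Average the per-year accuracies over many runs to smooth out outliers."""
    runs = np.array([simulate(reality=reality, debias=debias, years=years, seed=s)
                     for s in range(repetitions)])
    return runs.mean(axis=0)

for reality in ("fair", "unfair"):
    for debias in (True, False):
        label = "with debiasing" if debias else "without debiasing"
        print(reality, label, mean_accuracies(reality, debias))
```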

5.1 70 percent males and 50 percent females recidivate

The result tables for this situation can be found in Appendix B (the 70%/50% tables).

All men are classified as high risk and about one third of the women are classified as low risk. However, if the dataset is reweighed before training, the threshold becomes very small. Subsequently, all individuals are classified as high risk. This is the case for both the fair and the unfair situation. The accuracy of the first round of each situation varies around 62 percent. This is logical, since almost all individuals are classified as high risk and about 62 percent of the population has a high risk label in reality.

Most interesting is the fair situation with debiasing. We expected a significant decrease of the accuracy. However, both with and without debiasing, the accuracy barely varies: it decreases by 1.2 and 1.5 percent, respectively. In the unfair situation these values are respectively 7.3 and 7.7 percent. This can be explained by the number of new individuals with high risk labels that are added to the dataset. In the fair situation about 68.9 percent (2100/3050 ≈ 0.689) of the newly added individuals have a high risk label; this is about 56.4 percent (1926/3416 ≈ 0.564) in the unfair situation. Probably the composition of the new individuals of the fair situation is in balance with the precision of the prediction tool. This requires more research, since the precision is not included in this report. In conclusion, reweighing lowers the threshold and therefore all individuals are labeled as high risk. Moreover, the accuracy of the prediction tool seems directly dependent on the number of high risk labeled individuals.

5.1.1 Without noise

The result tables for this situation can be found in Appendix B (the ‘No noise’ tables).

The results of the situation without noise are similar to the results of the situation with noise. They only differ in the number of women labeled as low risk without debiasing; this number is larger without noise. This seems logical, since the noise attributes should make it less obvious that gender is linked to risk labels.

5.1.2 500 new individuals instead of 5000

The result tables for this situation can be found in Appendix B (the ‘500’ tables).

The accuracy barely varies in all situations. This seems logical, since the percentage of added individuals is small: the test set increases by about 70 individuals each round, which in the first round is about 3 percent of the 2000 individuals. The numbers of high and low risk labeled individuals and the threshold are comparable with the results of the default situation.

5.2 40 percent males and 30 percent females recidivate

The result tables for this situation can be found in Appendix B (the 40%/30% tables).

In the unfair situation, almost all individuals have a low predicted risk. Without debiasing, the few people that receive a high risk label are male. With debiasing, a few women also receive a high risk score. Reweighing ensures that men and women have an equal chance of receiving a high risk score, which is in line with the observed results. The accuracy with and without debiasing decreases by about 8 percent.

In the fair situation with debiasing, still almost all individuals are labeled as low risk. However, an increasing number of women and men are classified as high risk individuals. Interestingly, the accuracy decreases by 12.4 percent. Without debiasing, the number of men classified as high risk increases quickly, from 5 to 2326 in five rounds, while almost no women receive a high risk label. The accuracy decreases by 8.4 percent.

Thus, in the 40/30 situation, the accuracy of the fair situation with debiasing meets our expectation: it decreases faster compared to the other three situations (the unfair situation with and without debiasing and the fair situation without debiasing). This suggests that the composition of the start dataset influences the possibility of observing the effect of bias mitigation on performance.

5.3 10 percent males and 3 percent females recidivate

The result tables for this situation can be found in Appendix B (the 10%/3% tables).

Except for one or two individuals, all individuals are labeled as low risk. The accuracy decreases significantly, namely by about 23 percent in the unfair and about 26 percent in the fair situation. This can be explained by the significant addition of new high risk labeled individuals to the dataset each round. In the start dataset, about 720 individuals have a high risk label. In the fair situation about 2100 high risk labeled individuals and in the unfair situation about 1926 high risk labeled individuals are added to the dataset. In both the fair and the unfair situation, the threshold increases during the learning cycles. This increase is probably caused by the one or two individuals that are correctly labeled as high risk; all other individuals are labeled as low risk.

6 | Conclusion

All situations met the requirements described in section 4.2 about the dataset: in all situations the dataset initially contained more men than women, and the percentage of high risk labeled individuals was larger for men than for women. Even though they met these requirements, all situations gave different results.

The percentage of high risk labeled individuals in the total dataset has a clear effect on the predictions of the recidivism prediction tool. When initially more than half of the population recidivates, the prediction tool will classify all individuals as high risk. On the other hand, when a small percentage has a high risk label, as seen in the 10/3 situation, each individual will be labeled as low risk. The 40/30 situation was remarkable, since there was a notable difference between the fair and unfair situation: in the unfair situation almost no individuals had a high risk score prediction, while in the fair situation this number increased significantly. The hypothesised effect, where the accuracy decreases specifically in the fair situation with debiasing, was best seen in the 40/30 situation.

The noise attributes were added to the dataset so that the prediction scores were not only based on gender. The situation without noise supports this: more women had a low risk prediction in the situation without noise compared to the situation with noise. This means that the noise attributes had an effect on the predictions and thereby the prediction scores were not only based on gender. The reweighing algorithm also had an effect on the predicted risk labels. In all situations, when the dataset was debiased, either all individuals were classified as high or low risk, or an equal, or at least proportional, number of men and women were classified as high or low risk.

The conclusions are mainly about dataset settings and less about the effect of survivorship bias mitigation itself. In order to be able to draw conclusions about this effect, first more research has to be done on the dataset settings to find the most representative situation in which to explore the effect of survivorship bias mitigation. This thesis is a start of that first step.

7 | Discussion and future work

A lot of information can be found on detecting bias, notions of fairness and bias mitigation. However, less information can be found on exploring the causes and effects of bias. Therefore, the design choices of the simulation tool of this thesis cannot be supported by research or real cases. Consequently, the correctness of the design choices is open to discussion. This opens up new possibilities for further research. For that reason, the discussion and future work of this thesis are combined.

Data settings and their effects.

As mentioned in the Conclusion chapter, this thesis eventually explored the effects of data settings rather than the effect of survivorship bias. If these design choices cannot be supported by research, it is impossible to draw conclusions about the effect of survivorship bias mitigation, because the observations may have been influenced by incorrect design choices. Therefore, we should start with asking questions about the effects of the input settings on the performance and fairness of the prediction tool. When we have an adequate understanding of the input settings and their effects, we can determine the correct and realistic settings that are required for the simulation.

There are multiple ways to investigate the effects of input settings. For example, this thesis only contains results about the numbers of high and low risk individuals, the accuracy and the threshold, but there are many more measures that can be explored, such as the precision and recall of the prediction tool and the fairness metrics available in the AIF360 toolkit. Also, the method used to collect the results can vary. In this thesis, the “five year” process is repeated a hundred times and the mean of each value is documented. Moreover, the dataset specifications can be altered and compared, like the 70/50, 40/30 and 10/3 situations explored in this thesis. New insights can be used to create a simulation tool whose design choices can be supported by research. Besides altering design details, the simulation can also be compared to real situations. However, details about prediction tools are sometimes not released to prevent the misuse of this information. Comparing the real versus the simulated situation can give insight into the correctness of the simulations.
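As an illustration of such additional measures, a sketch using AIF360's ClassificationMetric; it assumes dataset_test is a BinaryLabelDataset with the true labels and dataset_pred is a copy of it in which the model's predicted labels have been filled in (both are assumed to exist, they are not defined in the thesis).

```python
from aif360.metrics import ClassificationMetric

# dataset_test holds the true labels, dataset_pred the predicted labels (assumed to exist).
metric = ClassificationMetric(dataset_test, dataset_pred,
                              unprivileged_groups=[{"gender_male": 1}],
                              privileged_groups=[{"gender_male": 0}])

print("Precision:", metric.precision())
print("Recall:", metric.recall())
print("Disparate impact:", metric.disparate_impact())
print("Equal opportunity difference:", metric.equal_opportunity_difference())
```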


The bias, its cause and effects.

AIF360 of IBM helps with indicating, investigating and handling bias in data. It can indicate whether individuals are not treated equally. However, AIF360 cannot explore the cause of this imbalance in the data. The imbalance can be justified in certain situations; in other situations it will be called bias. By correcting for imbalances in the dataset or the outcome, the prediction tool is consciously altered to comply with definitions of fairness, but when the imbalance represents the real situation this alteration is undesirable. Moreover, as seen in the literature chapter, the correct notions of fairness for a situation are open for debate. The simulation tool described in the previous paragraph can help with exploring the effect of bias. Besides, research done on the dataset settings can give an adequate understanding of a dataset and its limits. By combining the information about the limits of datasets and the exploration of the effects of bias, we may learn to indicate the cause of the bias.

The ideal tool

Ideally, there would exist a toolkit that, given some information about the situation, detects bias in a dataset, can give explanations about its cause, for example survivorship bias, and can demonstrate the long-run effect of the detected bias when this bias is or is not mitigated. Such a tool could not only help policy makers and data scientists to indicate bias, it would also demonstrate the consequences of bias mitigation for the performance of their product. Since AIF360 can already be used to indicate and explore bias, this tool can be built upon this framework.

To conclude, this thesis demonstrated the challenges of creating a tool that can simulate bias mitigation over a period of five years to explore the effect of survivorship bias mitigation on the performance of a recidivism prediction tool. A lot more research still has to be done on this topic. Hopefully this thesis can contribute to this research by indicating the challenges involved in creating such a simulation tool. And maybe, eventually, AIF360 will not only be able to examine, report and mitigate bias, but it will also be able to demonstrate the effect of bias mitigation on product performance to policy makers and data scientists, so that they can make well-informed product design decisions.


Bibliography

[1] Rachel K. E. Bellamy et al. “AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias”. In: CoRR abs/1810.01943 (2018). arXiv: 1810.01943. url: http://arxiv.org/abs/1810.01943.

[2] Gonzalo Ferreiro Volpi. “Survivorship bias in Data Science and Machine Learning”. In: Towards Data Science (2019). url: https://towardsdatascience.com/survivorship-bias-in-data-science-and-machine-learning-4581419b3bca.

[3] Julia Dressel and Hany Farid. “The accuracy, fairness, and limits of predicting recidivism”. In: (2017). issn: 2375-2548. url: https://doi.org/10.1126/sciadv.aao5580.

[4] Faisal Kamiran and Toon Calders. “Data preprocessing techniques for classifi-cation without discrimination”. In: Knowledge and Information Systems 33.1 (2012), pp. 1–33.

[5] Judea Pearl. “The seven tools of causal inference, with reflections on machine learning”. In: Communications of the ACM 62.3 (2019), pp. 54–60.

[6] Judea Pearl and Dana Mackenzie. The Book of Why. The New Science of Cause and Effect. Penguin Books, 2019. isbn: 978-0-141-98241-0.

[7] Hördur Haraldsson. Introduction to system thinking and causal loop diagrams. Jan. 2004, pp. 4, 20–27.


A | Conditional Probability table

A = Arrested, Ac = Not arrested, G = Gender, R = Risk

\begin{align*}
P(A \mid G, R) &= \frac{P(G, R \mid A) \times P(A)}{P(G, R)} \\
               &= \frac{P(G, R \mid A) \times P(A)}{P(G, R \mid A) \times P(A) + P(G, R \mid A^c) \times P(A^c)} \\[6pt]
P(G, R \mid A) &= P(R \mid G, A) \times P(G \mid A) \\
               &= \big(P(G) \times P(A \mid G) \times P(R \mid A)\big) \times \frac{P(A \mid G) \times P(G)}{P(A)} \\[6pt]
P(A^c) &= 1 - P(A) \\[6pt]
P(G, R \mid A^c) &= P(R \mid G, A^c) \times P(G \mid A^c) \\
                 &= \big(P(G) \times P(A^c \mid G) \times P(R \mid A^c)\big) \times \frac{P(A^c \mid G) \times P(G)}{P(A^c)}
\end{align*}
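A small sketch that evaluates this derivation with the assumed probabilities of table 4.3; the step P(R | A^c) = 1 − P(R | A) is an assumption that is implicit in the numbers of table 4.4. Under these assumptions the sketch yields values very close to table 4.4 (for example 0.84 for the unfair P(A | M, 1), reported there as 0.841).

```python
from fractions import Fraction as F

def p_arrest(p_a_given_g, p_r_given_a):
    """P(A | G, R) following Appendix A; the common factor P(G)^2 cancels in the ratio."""
    p_r_given_not_a = 1 - p_r_given_a          # assumption: P(R | A^c) = 1 - P(R | A)
    num = p_a_given_g ** 2 * p_r_given_a
    den = num + (1 - p_a_given_g) ** 2 * p_r_given_not_a
    return num / den

# Unfair reality: P(A|M) = 3/5; fair reality: P(A|M) = 1/2 (table 4.3).
for reality, p_a_m in (("unfair", F(3, 5)), ("fair", F(1, 2))):
    for risk, p_r_a in ((1, F(7, 10)), (0, F(1, 2))):
        print(reality, "male, predicted risk", risk, float(p_arrest(p_a_m, p_r_a)))
```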

B | Results

Unfair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 757 0 1139 0 1529 0 1912 0 2296 0

Low risk women 401 0 571 0 749 0 923 0 1096 0

Low risk men 357 0 568 0 780 0 989 0 1200 0

Total high risk 1243 2000 1632 2770 2012 3541 2398 4310 2784 5080
High risk women 392 793 571 1141 742 1491 920 1844 1092 2188
High risk men 850 1207 1061 1629 1270 2050 1478 2467 1692 2892

1 2 3 4 5

Threshold 0.010 0.010 0.025 0.029 0.049

Unfair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 754 268 1123 359 1491 482 1858 519 2228 693
Low risk women 399 268 558 359 714 482 873 519 1036 693

Low risk men 355 0 566 0 777 0 985 0 1192 0

Total high risk 1246 1732 1612 2377 1981 2990 2349 3688 2718 4253
High risk women 394 525 553 752 709 941 866 1220 1025 1368
High risk men 852 1207 1060 1626 1272 2049 1483 2468 1692 2885


Fair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 757 0 1038 0 1315 0 1599 0 1873 0

Low risk women 400 0 574 0 750 0 926 0 1101 0

Low risk men 357 0 464 0 564 0 673 0 771 0

Total high risk 1243 2000 1662 2700 2085 3400 2501 4100 2927 4800
High risk women 392 792 568 1143 742 1493 917 1843 1092 2194
High risk men 851 1208 1094 1557 1343 1907 1584 2257 1835 2606

1 2 3 4 5

Threshold 0.010 0.015 0.010 0.010 0.010

Fair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 755 309 1018 378 1280 457 1544 562 1802 660
Low risk women 401 309 558 378 713 457 874 562 1025 660

Low risk men 354 0 460 0 567 0 671 0 778 0

Total high risk 1245 1691 1643 2282 2045 2869 2447 3429 2854 3997
High risk women 394 486 550 730 700 957 866 1178 1020 1384
High risk men 851 1205 1093 1553 1344 1911 1580 2251 1835 2612


No Noise - Unfair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 774 0 1136 0 1548 0 1932 0 2316 0

Low risk women 398 0 576 0 752 0 922 0 1097 0

Low risk men 376 0 586 0 796 0 1010 0 1219 0

Total high risk 1226 2000 1608 2770 1992 3540 2379 4311 2765 5081
High risk women 392 789 562 1139 741 1492 912 1834 1086 2184
High risk men 834 1211 1045 1632 1252 2048 1467 2477 1678 2897

1 2 3 4 5

Threshold 0.010 0.010 0.010 0.010 0.010

No noise - Unfair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 777 330 1138 445 1501 598 1865 683 2225 737
Low risk women 400 330 549 445 704 598 861 683 1016 737

Low risk men 376 0 590 0 797 0 1004 0 1209 0

Total high risk 1223 1670 1590 2284 1957 2859 2320 3502 2690 4178
High risk women 388 458 543 646 699 804 851 1029 1015 1294
High risk men 835 1212 1048 1637 1258 2055 1468 2473 1675 2884


No Noise - Fair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 772 0 1055 0 1337 0 1613 0 1899 0

Low risk women 399 0 574 0 746 0 923 0 1105 0

Low risk men 373 0 481 0 591 0 690 0 794 0

Total high risk 1228 2000 1645 2700 2063 3400 2487 4100 2901 4800
High risk women 391 790 565 1139 739 1485 913 1835 1087 2192
High risk men 837 1210 1080 1561 1324 1915 1574 2265 1814 2608

1 2 3 4 5

Threshold 0.010 0.010 0.010 0.010 0.010

No Noise - Fair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 773 380 1029 554 1290 539 1545 623 1801 810
Low risk women 397 380 547 554 702 539 856 623 1009 810

Low risk men 376 0 481 0 588 0 689 0 793 0

Total high risk 1227 1620 1623 2099 2011 2762 2417 3339 2824 3816
High risk women 392 410 540 534 690 853 847 1080 1007 1206
High risk men 834 1210 1083 1565 1321 1910 1570 2259 1817 2610


500 - Unfair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 772 0 813 0 851 0 890 0 926 0

Low risk women 399 0 415 0 434 0 452 0 468 0

Low risk men 373 0 397 0 417 0 438 0 458 0

Total high risk 1228 2000 1265 2077 1303 2154 1341 2231 1382 2308
High risk women 391 790 406 821 422 856 442 894 460 928
High risk men 837 1210 858 1256 881 1298 900 1338 923 1381

1 2 3 4 5

Threshold 0.010 0.010 0.010 0.010 0.010

500 - Unfair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 776 279 810 292 846 281 883 298 922 306
Low risk women 401 279 413 292 431 281 446 298 464 306

Low risk men 376 0 397 0 415 0 438 0 458 0

Total high risk 1224 1721 1264 1782 1301 1866 1338 1922 1372 1989
High risk women 389 511 406 526 420 571 438 585 452 610
High risk men 834 1210 858 1255 881 1296 900 1338 921 1379


500 - Fair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 776 0 802 0 828 0 857 0 889 0

Low risk women 398 0 418 0 433 0 452 0 472 0

Low risk men 377 0 384 0 396 0 405 0 417 0

Total high risk 1224 2000 1268 2070 1312 2140 1353 2210 1391 2280
High risk women 391 789 405 823 424 857 443 895 458 931
High risk men 833 1211 863 1247 887 1283 910 1315 933 1349

1 2 3 4 5

Threshold 0.010 0.010 0.010 0.010 0.010

500 - Fair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 775 268 801 257 826 297 853 295 883 293
Low risk women 399 268 416 257 430 297 446 295 465 293

Low risk men 375 0 385 0 397 0 407 0 419 0

Total high risk 1225 1732 1266 1810 1308 1837 1347 1905 1384 1974
High risk women 392 523 405 564 421 553 436 587 453 625
High risk men 834 1209 861 1246 887 1284 912 1319 931 1350


Unfair situation - 40%/30% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1293 1999 1587 2595 1893 3190 2187 3783 2485 4381
Low risk women 554 789 677 1040 806 1289 930 1537 1052 1785
Low risk men 739 1210 909 1554 1087 1901 1257 2246 1434 2595

Total high risk 707 1 1010 2 1299 3 1602 6 1900 5

High risk women 235 0 363 1 483 1 609 2 735 2

High risk men 472 1 647 1 816 2 993 4 1165 3

1 2 3 4 5

Threshold 0.390 0.423 0.441 0.452 0.459

Unfair situation - 40%/30% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1291 1995 1588 2588 1886 3185 2185 3780 2480 4371
Low risk women 555 790 680 1040 805 1294 927 1537 1046 1784
Low risk men 736 1205 909 1548 1081 1892 1259 2243 1433 2587

Total high risk 709 5 1008 9 1307 8 1604 9 2906 15

High risk women 235 0 360 0 488 0 610 0 738 0

High risk men 473 5 648 9 819 8 994 9 1169 15


Fair situation - 40%/30% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1289 1999 1492 2496 1691 2989 1889 3469 2094 3791
Low risk women 553 790 675 1033 800 1282 930 1527 1054 1697
Low risk men 736 1209 817 1463 891 1707 959 1942 1040 2094
Total high risk 711 1 1008 4 1310 11 1613 32 1910 213

High risk women 238 0 360 2 487 5 612 14 736 93

High risk men 473 1 649 3 823 7 1001 18 1174 120

1 2 3 4 5

Threshold 0.390 0.438 0.468 0.486 0.496

Fair situation - 40%/30% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1290 1995 1490 2480 1688 2706 1892 1646 2125 1786
Low risk women 552 789 676 1038 802 1287 926 1538 1052 1786
Low risk men 738 1206 814 1442 886 1418 966 108 1074 0
Total high risk 710 5 1011 21 1314 297 1627 1874 1989 2328

High risk women 238 0 362 0 485 0 614 2 737 3

High risk men 473 5 649 21 829 297 1014 1872 1252 2326


Unfair situation - 10%/3% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1851 2000 2148 2596 2448 3129 2746 3788 3051 4384
Low risk women 765 788 893 1042 1015 1288 1140 1538 1271 1791
Low risk men 1087 1212 1256 1554 1433 1904 1607 2250 1780 2594

Total high risk 149 0 448 0 745 0 1043 0 1334 1

High risk women 24 0 150 0 273 0 399 0 520 0

High risk men 125 0 299 0 472 0 644 0 814 1

1 2 3 4 5

Threshold 0.103 0.205 0.265 0.307 0.334

Unfair situation - 10%/3% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1849 2000 2148 2596 2448 3192 2745 3788 3045 4384
Low risk women 768 791 890 1039 1018 1292 1142 1542 1262 1785
Low risk men 1082 1208 1258 1557 1431 1900 1603 2246 1782 2600

Total high risk 151 0 448 0 744 0 1044 1 1341 1

High risk women 24 0 149 0 274 0 399 0 522 0

High risk men 127 0 299 0 470 0 644 1 818 1


Fair situation - 10%/3% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1850 2000 2051 2500 2251 3000 2451 3500 2657 3999
Low risk women 761 784 891 1039 1015 1291 1141 1541 1274 1798
Low risk men 1089 1215 1160 1461 1236 1709 1310 1959 1384 2201

Total high risk 150 0 449 0 749 0 1050 1 1343 1

High risk women 24 0 148 0 276 0 401 0 525 1

High risk men 126 0 301 0 473 0 649 0 818 1

1 2 3 4 5

Threshold 0.103 0.213 0.282 0.332 0.368

Fair situation - 10%/3% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1850 2000 2051 2499 2250 3000 2452 3499 2653 3998
Low risk women 767 791 889 1038 1015 1288 1142 1541 1264 1785
Low risk men 1083 1209 1162 1461 1235 1712 1309 1959 1389 2213

Total high risk 150 0 449 1 751 1 1049 1 1348 2

High risk women 24 0 149 0 273 0 398 0 522 0

High risk men 126 0 300 1 477 1 650 1 826 2
