
Survivorship bias

Exploring the effect of survivorship bias mitigation on the performance of a recidivism prediction tool

Layout: typeset by the author using LaTeX.

Exploring the effect of survivorship bias mitigation on the performance of a recidivism prediction tool

Method and challenges

Ninande Y. Vermeer 11269057

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors
UvA: dhr. dr. R.G.F. Winkels, Leibniz Center for Law, Faculty of Law, University of Amsterdam, Roeterseilandcampus, building A, 6th floor, 1018 WV Amsterdam
KPMG: dr. A.W.F. Boer

June 26, 2020


Abstract

Survivorship bias is the fallacy of focusing on entities that survived a certain selection process and overlooking the entities that did not. This common form of bias can lead to wrong conclusions. AI Fairness 360 is an open-source toolkit, made by IBM, that can detect and handle bias. The toolkit detects bias according to different notions of fairness and contains several bias mitigation techniques. However, what if the bias in the dataset is not bias, but rather a justified imbalance? Mitigating “bias” that is in fact justified is undesirable, since it can have a serious negative impact on the performance of a prediction tool based on machine learning. The tool can become unusable. In order to make well-informed product design decisions, it would be valuable to be able to run simulations of bias mitigation in several situations to explore its impact. This thesis attempts to create a tool that can run such a simulation to explore the effect of survivorship bias mitigation on the performance of a recidivism prediction tool. The results reveal a lack of research on this subject. For that reason, the main contribution of this thesis is an indication of the challenges that come with the creation of such a simulation tool.


Contents

1 Introduction
2 Literature
   2.1 The origin of survivorship bias
   2.2 Recidivism
   2.3 AI Fairness 360 and bias mitigation
   2.4 Causal diagrams
3 Research Question
4 Method
   4.1 Technological requirements
   4.2 Generating the start dataset
   4.3 Recidivism prediction tool
   4.4 Total simulation
5 Results and Evaluation
   5.1 70 percent males and 50 percent females recidivate
       5.1.1 Without noise
       5.1.2 500 new individuals instead of 5000
   5.2 40 percent males and 30 percent females recidivate
   5.3 10 percent males and 3 percent females recidivate
6 Conclusion
7 Discussion and future work
Appendices
A Conditional Probability table
B Results

1 | Introduction

The substitution of human decision making with Artificial Intelligence (AI) technologies increases the dependence on trustworthy datasets. A dataset forms the foundation of a product. When the dataset contains bias, the outcome will be biased too. Therefore, the subject of bias is studied extensively in the fields of Artificial Intelligence and Data Science. All datasets are subsets of reality. If this subset contains systematic errors, for example because it incorrectly concentrates on entities with a certain trait, the dataset will be biased. In this case some entities belong to the group of interest, but they are not observed and therefore they will not become part of the dataset. Subsequently, valuable and even essential information can be missed. In effect, some individuals of the group of interest have not “survived” a certain selection process, which makes them unobservable. The resulting bias is called survivorship bias. Survivorship bias and the effects of its mitigation are the subjects of this thesis.

Survivorship bias is common in all kinds of domains. A clear example concerns studying success. Such studies are mostly aimed at successful people. However, studying people that tried to become successful and failed can be at least as important. Apparently, the people that did not succeed had habits, traits or less luck that kept them from becoming successful. Studying both the people that “survived” and the people that “did not survive” the process of becoming successful is essential to understand how one can truly become successful. Another example is disease prediction. If a disease prediction tool is trained only on features of people that were diagnosed with the disease, survivorship bias arises: the features of people with the disease but without a diagnosis are overlooked.

Sometimes, survivorship bias is difficult to recognise, since it requires counter-intuitive thinking: one has to look at both the visible and the invisible, the entities that took the same path but did not “survive”. This awareness is necessary to collect trustworthy datasets which lead to less biased prediction tools. After having a reasonable understanding of the concept of survivorship bias, it is desirable to explore the effect of its mitigation on product performance, since this can help with making design decisions. For example, if a dataset is debiased while the bias was actually an explainable imbalance, the accuracy of the predictions can decrease substantially. The prediction tool might be fair, but it will be unusable. However, if the bias is not justifiable, the prediction tool can discriminate against individuals and give unreliable results. It is a challenge to balance performance and fairness. A simulation of bias mitigation in prediction tools can help with taking into account all possible consequences of design decisions to make a well-informed choice. This thesis aims to raise awareness of survivorship bias and to explore the challenges that come with creating a simulation tool to explore the effect of survivorship bias mitigation on product performance. In order to do that, a prototype is created that investigates the effect of survivorship bias in the domain of recidivism prediction. Recidivism is the propensity for relapsing into unjust behaviour. It is valuable to be able to predict recidivism for criminals, since it helps with making decisions in their trial processes. This can benefit the criminal or contribute to the safety of other citizens.

AI Fairness 360 (AIF360), developed by IBM, is an open-source Python toolkit that aims to examine, report and mitigate bias to prevent algorithmic unfairness in an industrial setting, and to help fairness researchers with sharing and evaluating their work.[1] The toolkit is used for the prototype since it is widely utilised and provides a general framework to build upon. Hopefully, in the future, AIF360 will also be able to simulate the effect of design decisions concerning bias mitigation to improve product designs. This thesis aims to contribute to this goal by proposing new research questions, based on the challenges faced during this project and the obtained results.

This document is organised as follows. Section 2 describes the theoretical foundation: it explains the origin of survivorship bias, current developments in recidivism prediction, the AI Fairness 360 toolkit of IBM and causal diagrams. Section 3 addresses the research context and question. Sections 4, 5 and 6 respectively report the approach, the results and the conclusions. The discussion and future work are combined in section 7.

2 | Literature

2.1 The origin of survivorship bias

The term “survivorship bias” originates from the Second World War.[2] America needed new technologies and mathematical insights to stay ahead of its enemies. Therefore, the United States formed the Applied Mathematics Panel, a group of eleven specialised research groups aimed at finding mathematical insights to enhance strategies and armour. The research group of Abraham Wald investigated aircraft survivability in order to reinforce warplanes. Only 50 percent of the warplanes survived a battle, and even a small increase of this probability would save a significant number of lives. Their first strategy was to examine every plane that returned from battle and to reinforce the parts that were heavily damaged. However, apparently, the planes that returned from battle were able to return safely with those heavily damaged areas. This insight led them to a new approach. They reinforced all parts that were not heavily damaged, because the damage to those parts was probably the reason the other planes did not survive. This was the start of the concept of survivorship bias. First, they overlooked the attributes of the non-survivors, leading to incorrect conclusions. Then, they changed their perspective and saved a significant number of lives.

2.2 Recidivism

This thesis examines survivorship bias in recidivism prediction. Recidivism is the tendency to relapse into undesirable behaviour despite the experience of negative consequences of previous similar behaviour. It is desirable to have an assessment tool that can indicate whether there is a risk that a criminal will repeat the crime after release, because this information can be used to make decisions in their trial processes. For example, the police can observe this person more carefully after his or her release. However, if this assessment tool is trained on biased data, the predictions will be biased too. This can lead to unreliable results and discrimination based on protected social attributes like race and gender. An example of a criminal risk assessment tool is COMPAS, an abbreviation of “Correctional Offender Management Profiling for Alternative Sanctions”.[3] The company Northpointe developed the tool in 1998 and since 2000 it has been widely used to predict the risk that a criminal will recidivate within two years of assessment. These predictions are based on 137 features about the individual and their past. The performance and fairness of COMPAS have been thoroughly explored.

Julia Dressel and Hany Farid did research on the performance of COMPAS.[3] They compared the accuracy of COMPAS with the overall accuracy of human assessment. The participants of the study had to predict the risk of recidivism for a defendant, based on a short description of the defendant’s gender, age and criminal history. Although these participants had little to no expertise in criminal justice, they achieved an accuracy of 62.8 percent. This is comparable with the 65.2 percent accuracy of COMPAS. Thus, non-experts were almost as accurate as the COMPAS software. They also compared the widely used recidivism prediction tool with a logistic regression model based on the same seven features as the human participants. That classifier achieved an accuracy of 66.6 percent. This suggests that a classifier based on seven features performs better than a classifier based on 137 features. Comparing the accuracies of COMPAS, human assessment and a simple classifier casts serious doubt on the effort of predicting the recidivism risk algorithmically.

Moreover, the protected attribute race was not included in the training dataset of COMPAS, but ProPublica, a nonprofit organisation involved in investigative journalism, indicated racial disparity in its predictions. The assessment tool seemed to overpredict recidivism for black defendants and to underpredict recidivism for white defendants. Thus, COMPAS was favouring white defendants over black defendants. Northpointe argued that the analysis of ProPublica had overlooked more standard fairness measures, like predictive parity: COMPAS differentiates recidivists from non-recidivists equally well for black and white defendants. This discussion leads to questions about which fairness metrics to use in a particular situation. There are more than twenty different notions of fairness and it is impossible to satisfy all of them; correcting for one fairness metric means a larger imbalance for another fairness metric. Since the future of a defendant depends on which of these notions of fairness are satisfied, it is important to investigate these notions to find the best fairness metric for the situation. IBM created a tool that helps with finding the best fairness metric for each situation. It also provides a platform for researchers to advance and to ease research concerning this topic. This tool is the subject of the next section.

2.3 AI Fairness 360 and bias mitigation

As mentioned in the introduction, AI Fairness 360 (AIF360) is developed by IBM and is an open-source Python toolkit that can examine, report and mitigate bias to prevent algorithmic unfairness. Moreover, it helps fairness researchers with sharing and evaluating their work. The toolkit contains more than 70 fairness metrics and it includes 9 bias handling (mitigation) algorithms. In addition, it contains metric explanations for consumers to understand the detected bias. AIF360 is compatible with the scikit-learn package of Python. IBM experimented with three classifiers, namely Logistic Regression, Random Forest and Neural Network. Since Logistic Regression is well suited to binary classification problems, this classifier is used in this thesis. The 9 bias mitigation algorithms included in the toolkit each address a certain stage of the modelling process and can be categorised as pre-, in- or post-processing bias mitigation algorithms. The bias mitigation technique used in this thesis is a pre-processing algorithm called reweighing. The reweighing algorithm does not change labels in the training dataset, but carefully assigns weights to the training examples in order to remove the bias from the dataset.[4]
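To make this concrete, the sketch below shows how a fairness metric could be reported and reweighing applied with AIF360. The toy DataFrame, the column names risk and gender_male, and the choice of privileged group follow the setting of this thesis; they are illustrative assumptions rather than the author's actual code.

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Toy one-hot encoded data: 'risk' is the label, 'gender_male' the protected attribute.
df = pd.DataFrame({"gender_male": [1, 1, 1, 0, 0, 0],
                   "risk":        [1, 1, 0, 1, 0, 0]})

dataset = BinaryLabelDataset(df=df, label_names=["risk"],
                             protected_attribute_names=["gender_male"])

privileged_groups = [{"gender_male": 0}]    # females, as chosen in this thesis
unprivileged_groups = [{"gender_male": 1}]  # males

# Report one of the available fairness metrics before mitigation.
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=unprivileged_groups,
                                  privileged_groups=privileged_groups)
print("Statistical parity difference:", metric.statistical_parity_difference())

# Reweighing leaves the labels untouched and only assigns instance weights.
rw = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
dataset_transf = rw.fit_transform(dataset)
print(dataset_transf.instance_weights)
```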

2.4 Causal diagrams

The AIF360 toolkit of IBM focuses on bias indication and mitigation. However, it does not take into account relations between attributes. The imbalance indicated by AIF360 as bias can have a justified cause. Therefore, to make appropriate choices about bias mitigation, it is important to explore cause-and-effect relations between attributes, since this can help with distinguishing between bias and justified imbalance. A mathematical framework used to express what is known about the world is called Structural Causal Models (SCM).[5] An SCM consists of counterfactual and interventional logic, structural equations and graphical models. A graphical model of causal relations is called a causal graph, or causal diagram. A causal diagram is a causal network that contains multiple nodes, or attributes, that are connected via arrows in characteristic patterns called junctions.[6] There are three basic types of junctions, namely 1) chains, 2) forks and 3) colliders. See figure 2.1 for a visual representation of the three junctions. These junctions are important, since they illustrate the type of causal relations between attributes.

Figure 2.1: Basic junction types: (a) chain A → B → C, (b) fork A ← B → C, (c) collider A → B ← C.


Figure 2.2: A simple causal loop diagram.

Attributes can positively or negatively influence each other and form circular connections that are called feedback loops. These connections and effects can be illustrated by causal loop diagrams. Figure 2.2 illustrates a simple causal loop diagram. The plus near the head of the arrow connecting A to B implies that when A increases, B increases, and when A decreases, B also decreases.[7] The minus near the head of the arrow connecting B to A implies that they have an opposite effect: when B increases, A decreases, and when B decreases, A increases.

Interpreting the outcome of machine learning models can have an effect on the construction of the model itself. For example, suppose a risk assessment tool is trained on data that is biased towards men, since it contains a disproportionate number of men with high risk labels. Subsequently, the probability that the tool predicts a high risk score for a new male individual will be larger compared to a female individual. These predictions are analysed by authorities and they can act accordingly. If they conclude from these analyses that male individuals form a larger risk compared to women, they can, consciously or unconsciously, rearrange their surveillance tactics to focus more on male individuals. When new data is collected, this data will probably contain a larger group of high risk labeled male individuals. Thus, by interpreting the prediction tool, the dataset of the tool is altered. This amplifies the bias. However, to know whether the imbalance in the data is indeed bias, more information is necessary about the other involved attributes. By constructing causal (loop) diagrams, the cause-and-effect relations between attributes can be exposed. This can reveal relations between attributes that can cause bias or imbalance.

3 | Research Question

The usage of an assessment tool to predict recidivism can provoke survivorship bias. For example, figure 3.1(a) represents criminals, the group of interest, in reality. Not all of these criminals are going to be arrested; that depends on their visibility to the police. In this thesis the range of criminals that the police can observe is called the focus. When there is an equal number of male and female criminals, the focus on the group of interest should be like the magnifying glass in figure 3.1(b), where both genders are observed equally. However, if the police are biased towards men, their focus is directed at aspects of male criminals, incorrectly giving male criminals a higher chance to appear within their focus, see figure 3.1(c). Criminals can only be arrested if they are observed. Since men have a higher chance of being noticed compared to women, it is likely that more men will be arrested. Subsequently, the dataset will contain more male recidivists in relation to female recidivists. Being a man will become a risk factor for recidivism prediction, while in reality gender does not have a correlation with recidivism. Since the people that use this tool cannot observe reality, they do not take into account the features of the group of interest outside their focus. This causes survivorship bias, since the police erroneously concentrate on the male gender. Men are observed more and consequently, men are more prevalent in the dataset of the risk assessment tool. This leads to unreliable recidivism predictions.

The dataset of such a risk assessment tool is altered over the years, since every year new individuals are getting arrested. However, if the tool influences the focus, the effect of survivorship bias is increased. Debiasing the dataset used for the training of the risk assessment tool can avoid this, but debiasing can have a significant effect on the performance of the prediction tool. Therefore, it would be desirable to be able to simulate the effect of the bias and bias mitigation in different situations, fair and unfair. Observations of this effect can help making well-informed decisions about bias mitigation. Moreover, the observations can indicate whether making such a prediction tool is effective or whether it is a waste of time, money and effort.

Figure 3.1: The focus on the group of interest determines the dataset that is obtained, since only observed individuals can become part of the dataset: (a) the group of interest, (b) fair focus on the group of interest, (c) unfair focus on the group of interest. The red individuals are getting arrested. In the fair situation 5 men and 5 women get arrested and in the unfair situation 6 men and 3 women get arrested. Due to the incorrect focus on the male gender, the dataset will be biased.

This thesis explores the process of designing a simulation tool that can demonstrate the effect of correcting or not correcting for survivorship bias on the performance of a recidivism prediction tool in a fair and an unfair situation. Figure 3.2(a) is a causal diagram that demonstrates an unfair situation: being a man directly increases the chance of getting arrested. This provokes survivorship bias. The causal diagram of figure 3.2(b) illustrates a fair situation, since men are arrested more often for recidivism because they indeed recidivate more often. In an unfair situation, it is necessary to debias the dataset before training. But in a fair situation, debiasing is undesirable, since it unnecessarily decreases the accuracy of the prediction tool. Therefore, four situations are explored, namely the fair and unfair situations, each with and without bias mitigation. An overview of these four situations can be seen in figure 3.3. We expect a significant decrease of accuracy in the fair situation with bias mitigation, compared to the other situations.

Figure 3.2: Causal loop diagrams that explain the two situations: (a) the unfair situation, (b) the fair situation.

4 | Method

4.1 Technological requirements

This paragraph mentions the tools and packages that are used for the simulation prototype of this thesis. AIF360 is open-source and can be downloaded from https://github.com/ibm/aif360. Since the toolkit is freely accessible and widely used, it is a stable platform to build upon. This is important for probable future development of a similar simulation tool. Furthermore, the experiments of the creators of AIF360 demonstrated that the toolkit is compatible with the scikit-learn, or sklearn, Python package. The prototype uses its StandardScaler and LogisticRegression modules. Also, AIF360 contains a BinaryLabelDataset class that can accept pandas DataFrames and convert them into a base structure that can be used in AIF360. In order to do the necessary mathematics and to visualise the results, the packages numpy and matplotlib are also required. Finally, the prototype is designed in a Jupyter Notebook, since this medium is also used for the AIF360 tutorials and it is user-friendly. The Jupyter Notebook worked with Python version 3.6.5.
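For reference, a minimal sketch of the imports this setup implies; apart from Python 3.6.5, no package versions are stated in the thesis, so the installation line is an assumption.

```python
# Assumed installation of the packages named in this section:
#   pip install aif360 scikit-learn pandas numpy matplotlib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import Reweighing
```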

4.2 Generating the start dataset

Since there is no actual dataset to start with, the start dataset needs to be generated. But how to make a realistic dataset for this situation? The dataset has to be representative of both possible realities, fair and unfair. This means that the dataset needs to satisfy two requirements. Firstly, the number of men in the dataset must be larger than the number of women, because in either reality more men are arrested. Secondly, the male population must consist of a higher percentage of high risk labeled individuals compared to the female population, because in the fair reality men recidivate more. Taking into account these two requirements, the settings in table 4.1 are used to create the dataset. Figure 4.1 demonstrates these settings in a visual representation of the dataset. 60 percent of the total population in the dataset is male and 40 percent is female. Furthermore, 70 percent of the population of males is arrested for recidivism, against 50 percent of the population of females. These percentages meet the two requirements mentioned above. These high percentages of high risk labeled individuals are chosen in the hope of observing clearer effects of bias mitigation.

Setting                       Percentage [0-1]
Men in dataset                0.6
Men with high risk score      0.7
Women with high risk score    0.5

Table 4.1: Start dataset settings.

Figure 4.1: Visual representation of the dataset.

As mentioned in the technological requirements, the dataset is created as a DataFrame, since this type can easily be converted into a format compatible with AIF360. Initially, this DataFrame consists of two features, namely gender and risk. The possible values of gender are ‘male’ and ‘female’. The risk attribute indicates which individuals got arrested for recidivism, labeled as ‘1’, and which did not get arrested for recidivism, labeled as ‘0’. ‘High risk’ refers to individuals with a ‘1’ label and ‘low risk’ refers to individuals with a ‘0’ label. Gender will be the protected attribute, since discrimination based on this attribute is unacceptable. The recidivism prediction tool predicts the risk that an individual would recidivate, based on actual cases of recidivism (label ‘1’). Therefore, risk is the label attribute. Note the subtle difference between the risk label in the dataset and the predicted risk: the ‘1’ risk label represents a case of recidivism in the dataset, while it represents a predisposition for recidivism in the prediction model.

Normally, a dataset contains more than one attribute and a label. In an attempt to avoid “obvious” results and to make the situation more realistic, noise is added to the dataset. This adds four more attributes to the DataFrame, namely age, educational level, crime rate and criminal history. The attributes and the possible values they can take are shown in table 4.2. The last column, “Percentage”, represents the distribution of each attribute within each gender population. In order to avoid unreliable results due to correlation between gender and the noise attributes, the attributes are distributed equally across both genders. The attribute age is distributed according to a Gaussian distribution with a mean of 45 and a standard deviation of 8. Thus, most women and men are between 37 and 53 years old. The values of the noise attributes are randomly chosen, since they are considered trivial.

Noise               Options                   Percentage
Age                 0-100                     Gaussian distribution
Educational level   ’A’, ’B’, ’C’, ’D’        All 25%
Crime rate          1, 2, 3, 4                All 25%
Criminal history    ’Yes’ (1) or ’No’ (0)     50%

Table 4.2: Noise attributes: age, educational level, crime rate and criminal history.

The BinaryLabelDataset class of AIF360 accepts a one-hot encoded DataFrame and converts it into a format that is compatible with the toolkit. Besides the DataFrame, the class needs two other input arguments, namely label_names and protected_attribute_names. These are respectively ‘risk’ and ‘gender_male’. Also, AIF360 needs a specification of the privileged and unprivileged group. In this thesis, females are the privileged group and males are the unprivileged group.
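The sketch below shows how such a start dataset could be generated and converted, following the settings of tables 4.1 and 4.2. The prototype's actual generation code is not reproduced in the thesis, so the function name, the use of numpy's default_rng and the exact encoding choices are assumptions.

```python
import numpy as np
import pandas as pd
from aif360.datasets import BinaryLabelDataset

rng = np.random.default_rng(0)

def generate_start_dataset(n=10_000, p_male=0.6, p_high_male=0.7, p_high_female=0.5):
    """Generate gender, risk label and noise attributes (tables 4.1 and 4.2)."""
    male = rng.random(n) < p_male
    risk = np.where(male,
                    rng.random(n) < p_high_male,     # 70 percent of men labeled high risk
                    rng.random(n) < p_high_female)   # 50 percent of women labeled high risk
    return pd.DataFrame({
        "gender": np.where(male, "male", "female"),
        "risk": risk.astype(int),
        "age": rng.normal(45, 8, n).clip(0, 100).round().astype(int),
        "educational_level": rng.choice(list("ABCD"), n),
        "crime_rate": rng.choice([1, 2, 3, 4], n),
        "criminal_history": rng.choice([0, 1], n),
    })

df = generate_start_dataset()

# One-hot encode the categorical attributes; BinaryLabelDataset expects numeric columns.
df_encoded = pd.get_dummies(df, columns=["gender", "educational_level"], drop_first=True)

dataset = BinaryLabelDataset(df=df_encoded, label_names=["risk"],
                             protected_attribute_names=["gender_male"])
privileged_groups = [{"gender_male": 0}]    # females
unprivileged_groups = [{"gender_male": 1}]  # males
```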

4.3 Recidivism prediction tool

The dataset is split into three parts, namely the training set (50 percent), the validation set (30 percent) and the test set (20 percent). As described in the literature, AIF360 has been shown to be compatible with three classifiers, namely Logistic Regression, Random Forest and Neural Network. Logistic Regression is the most useful for binary recidivism prediction, so a Logistic Regression model is trained on the training set. As explained in the Research Question chapter, the effects are explored with and without debiasing. If debiasing is required, the training dataset is reweighed before training. The threshold is the minimum certainty with which the model must predict a high risk label before an individual is classified as high risk. For example, a threshold of 0.45 means that if the prediction model is 45 (or more) percent certain that the individual should be labeled as high risk, the individual will be classified as high risk. Usually, this threshold is set by the programmer based on the expectations of the outcome of the prediction model. However, since there are no supported expectations in this thesis, the validation set is used to determine the threshold that maximises the accuracy. The test set is used to calculate the accuracy of the prediction tool, given the best threshold obtained in the validation process. The total learning process is displayed in figure 4.2.

Figure 4.2: One learning cycle: (a) with debiasing, (b) without debiasing.
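A sketch of one learning cycle under these choices (50/30/20 split, optional reweighing, threshold tuned on the validation set). The helper below is an assumption of how the prototype could be structured, not the author's actual code; it reuses the privileged/unprivileged group definitions from section 4.2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from aif360.algorithms.preprocessing import Reweighing

def learning_cycle(dataset, privileged_groups, unprivileged_groups, debias=True, seed=0):
    """One cycle: optional reweighing, logistic regression, threshold tuning, test accuracy."""
    # 50% train, 30% validation, 20% test.
    train, rest = dataset.split([0.5], shuffle=True, seed=seed)
    valid, test = rest.split([0.6], shuffle=True, seed=seed)

    if debias:
        rw = Reweighing(unprivileged_groups=unprivileged_groups,
                        privileged_groups=privileged_groups)
        train = rw.fit_transform(train)   # only the instance weights change

    scaler = StandardScaler().fit(train.features)
    clf = LogisticRegression(solver="liblinear")
    clf.fit(scaler.transform(train.features), train.labels.ravel(),
            sample_weight=train.instance_weights)

    # Pick the threshold that maximises accuracy on the validation set.
    valid_scores = clf.predict_proba(scaler.transform(valid.features))[:, 1]
    thresholds = np.arange(0.01, 1.00, 0.01)
    accuracies = [np.mean((valid_scores >= t) == valid.labels.ravel()) for t in thresholds]
    best_threshold = thresholds[int(np.argmax(accuracies))]

    # Report the accuracy on the test set with the chosen threshold.
    test_scores = clf.predict_proba(scaler.transform(test.features))[:, 1]
    test_accuracy = np.mean((test_scores >= best_threshold) == test.labels.ravel())
    return clf, scaler, best_threshold, test_accuracy
```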

4.4 Total simulation

The idea is to be able to evaluate the effect of survivorship bias mitigation on the performance of the prediction tool over a certain period of time. Then it would be possible to predict the performance of the prediction tool after, for example, five years. In order to do this, we assume that one learning cycle represents one year. Every year new individuals will get arrested and therefore become part of the dataset. However, this is influenced by the composition of the people that offend, further referred to as the reality, and by possible bias. Subsequently, the composition of the new individuals will be different in the fair situation compared to the unfair situation. Assumptions about how the reality and the bias influence the composition of the new individuals have to be made. These are explained in this section.

The individuals that get arrested are part of a subset of reality. Therefore the realities can be represented by datasets and the new individuals will be a subset of such a dataset. In order to create the reality datasets, assumptions have to be made about what these realities would look like. It is assumed that in the unfair situation, visualised in figure 4.3(a), an equal number of men and women recidivate. Therefore, the dataset of the unfair reality consists of an equal number of male and female individuals. Moreover, half of the population recidivates and half offends for the first time. These percentages are not supported by real cases; they are chosen to ease further calculations. In the unfair reality, there exists a bias towards men, thus men have a larger probability of getting arrested compared to women. Consequently, it will seem like men recidivate more often than women, while in reality they recidivate equally often. In the fair situation, visualised in figure 4.3(b), the assumption that men recidivate more often than women is true. The percentage of men with a high risk label is larger than the percentage of women with a high risk label. In this case, the dataset still consists of an equal number of men and women, but 70 percent of the male population and 50 percent of the female population have a high risk label. However, men and women have the same chance of getting arrested.

Figure 4.3: Visual representation of the realities

The risk tool is used to predict the risk for each individual in the reality dataset. Individuals that receive a high predicted risk have a larger chance of getting arrested. Thus, the probability that an individual will get arrested depends on their gender and their predicted risk score. In order to calculate this conditional probability, equation 4.1 is used. A, G and R represent respectively Arrested, Gender and Risk. Further details about the calculation of the conditional probabilities can be found in Appendix A.

\[
P(A \mid G, R) = \frac{P(G, R \mid A) \times P(A)}{P(G, R)} \qquad (4.1)
\]

In both realities the chance that a criminal will get arrested is two thirds and the probability of being a man or a woman is 50 percent. Moreover, a high predicted risk score increases the probability of getting arrested to 70 percent, compared to 50 percent for individuals with a low predicted risk. The realities differ in the chance of getting arrested given that the individual is male: in the unfair situation men have a larger chance to be taken into custody compared to women, while in the fair situation their chances are equal. The probabilities used to calculate the conditional probabilities can be found in table 4.3.

            Fair situation   Unfair situation
P(A)        2/3              2/3
P(M)        1/2              1/2
P(F)        1/2              1/2
P(A|M)      1/2              3/5
P(A|F)      1/2              1/2
P(1|A)      7/10             7/10
P(0|A)      1/2              1/2

Table 4.3: Assumed probabilities used to calculate the conditional probabilities. These depend on M (male), F (female), A (arrested), 1 (high risk) and 0 (low risk).

The reality datasets are divided into four groups:

1. Men with a high predicted risk score
2. Men with a low predicted risk score
3. Women with a high predicted risk score
4. Women with a low predicted risk score

Figure 4.4: Every learning cycle, a subset of the new criminals is arrested and becomes part of the dataset. The composition of this subset depends on the predictions of the model trained in the most recent learning cycle and on the realities. In order to simulate five years, this process is executed five times.

The conditional probabilities determine how many individuals of each group are added to the subset of the reality. In table 4.4, the conditional probabilities can be found for both the fair and unfair situation. Thus, for example, in the fair situation 70 percent of the group of men with a high predicted risk score is added to the subset.

                 P(A|M,1)   P(A|M,0)   P(A|F,1)   P(A|F,0)
Unfair reality   0.841      0.692      0.700      0.500
Fair reality     0.700      0.500      0.700      0.500

Table 4.4: Conditional probability table representing the chance of being arrested according to gender and predicted risk for each reality. M is male and F is female, 0 represents a low predicted risk score and 1 represents a high predicted risk score.

Note the difference between the predicted risk scores and the risk scores that the individuals have in reality. The individuals are added to the dataset with their original risk score, not with the predicted risk score.
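A sketch of how the five-year simulation could be wired together, reusing the hypothetical helpers from the earlier sketches (generate_start_dataset from section 4.2 and learning_cycle from section 4.3). The helper generate_reality_dataset, which would build the fair or unfair reality of figure 4.3, is also hypothetical; the arrest probabilities are the values of table 4.4.

```python
import numpy as np
import pandas as pd
from aif360.datasets import BinaryLabelDataset

# Arrest probabilities per (gender, predicted risk) group, taken from table 4.4.
P_ARREST = {
    "unfair": {("male", 1): 0.841, ("male", 0): 0.692,
               ("female", 1): 0.700, ("female", 0): 0.500},
    "fair":   {("male", 1): 0.700, ("male", 0): 0.500,
               ("female", 1): 0.700, ("female", 0): 0.500},
}

def to_aif360(df):
    """One-hot encode a raw DataFrame and wrap it for AIF360 (hypothetical helper)."""
    enc = pd.get_dummies(df, columns=["gender", "educational_level"], drop_first=True)
    return BinaryLabelDataset(df=enc, label_names=["risk"],
                              protected_attribute_names=["gender_male"])

def simulate(reality="fair", debias=True, years=5, n_new=5000, seed=0):
    rng = np.random.default_rng(seed)
    data = generate_start_dataset(n=10_000)              # sketch in section 4.2
    accuracies = []
    for year in range(years):
        clf, scaler, threshold, accuracy = learning_cycle(   # sketch in section 4.3
            to_aif360(data),
            privileged_groups=[{"gender_male": 0}],
            unprivileged_groups=[{"gender_male": 1}],
            debias=debias, seed=seed + year)
        accuracies.append(accuracy)

        # New offenders are drawn from the chosen reality; the tool predicts their risk.
        new = generate_reality_dataset(reality, n=n_new)  # hypothetical helper
        scores = clf.predict_proba(scaler.transform(to_aif360(new).features))[:, 1]
        predicted = (scores >= threshold).astype(int)

        # Each individual is arrested with P(A | gender, predicted risk) from table 4.4.
        p = np.array([P_ARREST[reality][(g, r)]
                      for g, r in zip(new["gender"], predicted)])
        arrested = rng.random(len(new)) < p
        # Arrested individuals keep their original risk label when added to the dataset.
        data = pd.concat([data, new[arrested]], ignore_index=True)
    return accuracies
```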

5 | Results and Evaluation

The main situation explored in this thesis is based on the dataset described in the method section (see figure 4.1). This default dataset contains 10,000 individuals. About 60 percent of this population is male; 70 percent of the male population and half of the female population have a high risk label. Also, the default dataset contains noise. The performance of the tool is evaluated over a period of five years. This means that data is collected during five learning cycles. The collected information concerns the accuracy and the threshold of the prediction tool of the corresponding learning cycle and the numbers of high and low risk individuals in the test set. The latter compares the real quantities with the predicted quantities. Each learning cycle, a new dataset of 5000 individuals is evaluated and, based on their gender and predicted risk, individuals are added to the dataset.

Besides the default dataset, four other situations are explored:

1. The default dataset without noise, in order to observe the impact of the noise attributes on the results.

2. The default dataset, but with 500 new individuals instead of 5000, to explore whether the same effect occurs with lower quantities.

3. Two datasets like the default dataset, but with other high/low risk ratios:

   (a) 40 percent of the male population and 30 percent of the female population recidivates, instead of respectively 70 and 50 percent.

   (b) Even lower percentages, namely 10 and 3 percent of the male and female population recidivates.

Each situation is examined with and without debiasing. All values are calculated a hundred times and the means are reported in the results. This approach is taken to minimise the chance of evaluating outliers. These results are available in Appendix B and will be evaluated in the following sections.
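A sketch of how this repetition could look, reusing the hypothetical simulate helper from section 4.4; the number of repetitions (100) and the five-year horizon are from the text, the rest is an assumption.

```python
import numpy as np

def mean_accuracies(reality, debias, repetitions=100, years=5):
    """Average the per-year accuracies over many runs to smooth out outliers."""
    runs = np.array([simulate(reality=reality, debias=debias, years=years, seed=s)
                     for s in range(repetitions)])
    return runs.mean(axis=0)

for reality in ("fair", "unfair"):
    for debias in (True, False):
        label = "with debiasing" if debias else "without debiasing"
        print(reality, label, mean_accuracies(reality, debias))
```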

5.1 70 percent males and 50 percent females recidivate

The result tables for this situation can be found in Appendix B (the 70%/50% tables).

All men are classified as high risk and about one third of the women are classified as low risk. However, if the dataset is reweighed before training, the threshold becomes very small. Subsequently, all individuals are classified as high risk. This is the case for both the fair and the unfair situation. The accuracy of the first round of each situation varies around 62 percent. This is logical, since almost all individuals are classified as high risk and about 62 percent of the population has a high risk label in reality.

Most interesting is the fair situation with debiasing. We expected a significant decrease of the accuracy. However, both with and without debiasing, the accuracy barely varies: it decreases by 1.2 and 1.5 percent, respectively. In the unfair situation these values are respectively 7.3 and 7.7 percent. This can be explained by the number of new individuals with high risk labels that are added to the dataset. In the fair situation about 68.9 percent (2100/3050 ≈ 0.689) of the newly added individuals have a high risk label; this is about 56.4 percent (1926/3416 ≈ 0.564) in the unfair situation. Probably the composition of the new individuals of the fair situation is in balance with the precision of the prediction tool. This requires more research, since the precision is not included in this report. In conclusion, reweighing lowers the threshold and therefore all individuals are labeled as high risk. Moreover, the accuracy of the prediction tool seems directly dependent on the number of high risk labeled individuals.

5.1.1 Without noise

The result tables for this situation can be found in Appendix B (the ‘No noise’ tables).

The results of the situation without noise are similar to the results of the situation with noise. They only differ in the number of women labeled as low risk without debiasing; this number is larger without noise. This seems logical, since the noise attributes should make it less obvious that gender is linked to risk labels.

5.1.2 500 new individuals instead of 5000

The result tables for this situation can be found in Appendix B (the ‘500’ tables).

The accuracy barely varies in all situations. This seems logical, since the percentage of added individuals is small: the test set increases by about 70 individuals each round, which in the first round is about 3 percent of the 2000 individuals. The numbers of high and low risk labeled individuals and the threshold are comparable with the results of the default situation.

5.2 40 percent males and 30 percent females recidivate

The result tables for this situation can be found in Appendix B (the 40%/30% tables).

In the unfair situation, almost all individuals have a low predicted risk. Without debiasing, the few people that receive a high risk label are male. With debiasing, a few women also receive a high risk score. Reweighing ensures that men and women have an equal chance of receiving a high risk score, which is in line with the observed results. The accuracy with and without debiasing decreases by about 8 percent.

In the fair situation with debiasing, still almost all individuals are labeled as low risk. However, an increasing number of women and men are classified as high risk individuals. Interestingly, the accuracy decreases by 12.4 percent. Without debiasing, the number of men classified as high risk increases quickly, from 5 to 2326 in five rounds, while almost no women receive a high risk label. The accuracy decreases by 8.4 percent.

Thus, in the 40/30 situation, the accuracy of the fair situation with debiasing meets our expectation: it decreases faster compared to the other three situations (the unfair situation with and without debiasing and the fair situation without debiasing). This suggests that the composition of the start dataset influences the possibility of observing the effect of bias mitigation on performance.

5.3 10 percent males and 3 percent females recidivate

The result tables for this situation can be found in Appendix B (the 10%/3% tables).

Except for one or two individuals, all individuals are labeled as low risk. The accuracy decreases significantly, namely by about 23 percent in the unfair and about 26 percent in the fair situation. This can be explained by the significant addition of new high risk labeled individuals to the dataset each round. In the start dataset, about 720 individuals have a high risk label. In the fair situation about 2100 high risk labeled individuals and in the unfair situation about 1926 high risk labeled individuals are added to the dataset. In both the fair and the unfair situation, the threshold increases during the learning cycles. This increase is probably caused by the one or two individuals that are correctly labeled as high risk; all other individuals are labeled as low risk.

6 | Conclusion

All situations met the requirements described in section 4.2 about the dataset: in all situations the dataset initially contained more men than women, and the percentage of high risk labeled individuals was larger for men than for women. Even though they met these requirements, all situations gave different results.

The percentage of high risk labeled individuals in the total dataset has a clear effect on the predictions of the recidivism prediction tool. When initially more than half of the population recidivates, the prediction tool will classify all individuals as high risk. On the other hand, when a small percentage has a high risk label, as seen in the 10/3 situation, each individual will be labeled as low risk. The 40/30 situation was remarkable, since there was a notable difference between the fair and unfair situation: in the unfair situation almost no individuals had a high risk score prediction, while in the fair situation this number increased significantly. The hypothesised effect, where the accuracy decreases specifically in the fair situation with debiasing, was best seen in the 40/30 situation.

The noise attributes were added to the dataset so that the prediction scores were not only based on gender. The situation without noise supports this: more women had a low risk prediction in the situation without noise compared to the situation with noise. This means that the noise attributes had an effect on the predictions and thereby the prediction scores were not only based on gender. The reweighing algorithm also had an effect on the predicted risk labels. In all situations, when the dataset was debiased, either all individuals were classified as high or low risk, or an equal, or at least proportional, number of men and women were classified as high or low risk.

The conclusions are mainly about dataset settings and less about the effect of survivorship bias mitigation itself. In order to be able to draw conclusions about this effect, first more research has to be done on the dataset settings to find the most representative situation in which to explore the effect of survivorship bias mitigation. This thesis is a start of that first step.

7 | Discussion and future work

A lot of information can be found on detecting bias, notions of fairness and bias mitigation. However, less information can be found on exploring the causes and effects of bias. Therefore, the design choices of the simulation tool of this thesis cannot be supported by research or real cases. Consequently, the correctness of the design choices is open to discussion. This opens up new possibilities for further research. For that reason, the discussion and future work of this thesis are combined.

Data settings and their effects.

As mentioned in the Conclusion chapter, this thesis eventually explored the effects of data settings rather than the effect of survivorship bias. If these design choices cannot be supported by research, it is impossible to draw conclusions about the effect of survivorship bias mitigation, because the observations may have been influenced by incorrect design choices. Therefore, we should start with asking questions about the effects of the input settings on the performance and fairness of the prediction tool. When we have an adequate understanding of the input settings and their effects, we can determine the correct and realistic settings that are required for the simulation.

There are multiple ways to investigate the effects of input settings. For example, this thesis only contains results about the numbers of high and low risk individuals, the accuracy and the threshold, but there are many more measures that can be explored, such as the precision and recall of the prediction tool and the fairness metrics available in the AIF360 toolkit. Also, the method used to collect the results can vary. In this thesis, the “five year” process is repeated a hundred times and the mean of each value is documented. Moreover, the dataset specifications can be altered and compared, like the 70/50, 40/30 and 10/3 situations explored in this thesis. New insights can be used to create a simulation tool whose design choices can be supported by research. Besides altering design details, the simulation can also be compared to real situations. However, details about prediction tools are sometimes not released to prevent the misuse of this information. Comparing the real versus the simulated situation can give insight into the correctness of the simulations.
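As an illustration of such additional measures, a sketch using AIF360's ClassificationMetric; it assumes dataset_test is a BinaryLabelDataset with the true labels and dataset_pred is a copy of it in which the model's predicted labels have been filled in (both are assumed to exist, they are not defined in the thesis).

```python
from aif360.metrics import ClassificationMetric

# dataset_test holds the true labels, dataset_pred the predicted labels (assumed to exist).
metric = ClassificationMetric(dataset_test, dataset_pred,
                              unprivileged_groups=[{"gender_male": 1}],
                              privileged_groups=[{"gender_male": 0}])

print("Precision:", metric.precision())
print("Recall:", metric.recall())
print("Disparate impact:", metric.disparate_impact())
print("Equal opportunity difference:", metric.equal_opportunity_difference())
```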


The bias, its cause and effects.

AIF360 of IBM helps with indicating, investigating and handling bias in data. It can indicate whether individuals are not treated equally. However, AIF360 cannot explore the cause of this imbalance in the data. The imbalance can be justified in certain situations; in other situations it will be called bias. By correcting for imbalances in the dataset or the outcome, the prediction tool is consciously altered to comply with definitions of fairness, but when the imbalance represents the real situation this alteration is undesirable. Moreover, as seen in the literature chapter, the correct notions of fairness for a situation are open for debate. The simulation tool described in the previous paragraph can help with exploring the effect of bias. Besides, research done on the dataset settings can give an adequate understanding of a dataset and its limits. By combining the information about the limits of datasets and the exploration of the effects of bias, we may learn to indicate the cause of the bias.

The ideal tool

Ideally, there would exist a toolkit that, given some information about the situation, detects bias in a dataset, can give explanations about its cause, for example survivorship bias, and can demonstrate the long-run effect of the detected bias when this bias is or is not mitigated. Such a tool could not only help policy makers and data scientists to indicate bias, it would also demonstrate the consequences of bias mitigation for the performance of their product. Since AIF360 can already be used to indicate and explore bias, this tool can be built upon this framework.

To conclude, this thesis demonstrated the challenges of creating a tool that can simulate bias mitigation over a period of five years to explore the effect of survivorship bias mitigation on the performance of a recidivism prediction tool. A lot more research still has to be done on this topic. Hopefully this thesis can contribute to this research by indicating the challenges involved in creating such a simulation tool. And maybe, eventually, AIF360 will not only be able to examine, report and mitigate bias, but it will also be able to demonstrate the effect of bias mitigation on product performance to policy makers and data scientists, so that they can make well-informed product design decisions.


Bibliography

[1] Rachel K. E. Bellamy et al. “AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias”. In: CoRR abs/1810.01943 (2018). arXiv: 1810.01943. url: http://arxiv.org/abs/1810.01943.

[2] Gonzalo Ferreiro Volpi. “Survivorship bias in Data Science and Machine Learning”. In: Towards Data Science (2019). url: https://towardsdatascience.com/survivorship-bias-in-data-science-and-machine-learning-4581419b3bca.

[3] Julia Dressel and Hany Farid. “The accuracy, fairness, and limits of predicting recidivism”. In: (2017). issn: 2375-2548. url: https://doi.org/10.1126/sciadv.aao5580.

[4] Faisal Kamiran and Toon Calders. “Data preprocessing techniques for classifi-cation without discrimination”. In: Knowledge and Information Systems 33.1 (2012), pp. 1–33.

[5] Judea Pearl. “The seven tools of causal inference, with reflections on machine learning”. In: Communications of the ACM 62.3 (2019), pp. 54–60.

[6] Judea Pearl and Dana Mackenzie. The Book of Why. The New Science of Cause and Effect. Penguin Books, 2019. isbn: 978-0-141-98241-0.

[7] Hördur Haraldsson. Introduction to system thinking and causal loop diagrams. Jan. 2004, pp. 4, 20–27.


A | Conditional Probability table

A = Arrested, Ac = Not arrested, G = Gender, R = Risk

\begin{align*}
P(A \mid G, R) &= \frac{P(G, R \mid A) \times P(A)}{P(G, R)} \\
               &= \frac{P(G, R \mid A) \times P(A)}{P(G, R \mid A) \times P(A) + P(G, R \mid A^c) \times P(A^c)} \\[6pt]
P(G, R \mid A) &= P(R \mid G, A) \times P(G \mid A) \\
               &= \big(P(G) \times P(A \mid G) \times P(R \mid A)\big) \times \frac{P(A \mid G) \times P(G)}{P(A)} \\[6pt]
P(A^c) &= 1 - P(A) \\[6pt]
P(G, R \mid A^c) &= P(R \mid G, A^c) \times P(G \mid A^c) \\
                 &= \big(P(G) \times P(A^c \mid G) \times P(R \mid A^c)\big) \times \frac{P(A^c \mid G) \times P(G)}{P(A^c)}
\end{align*}
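A small sketch that evaluates this derivation with the assumed probabilities of table 4.3; the step P(R | A^c) = 1 − P(R | A) is an assumption that is implicit in the numbers of table 4.4. Under these assumptions the sketch yields values very close to table 4.4 (for example 0.84 for the unfair P(A | M, 1), reported there as 0.841).

```python
from fractions import Fraction as F

def p_arrest(p_a_given_g, p_r_given_a):
    """P(A | G, R) following Appendix A; the common factor P(G)^2 cancels in the ratio."""
    p_r_given_not_a = 1 - p_r_given_a          # assumption: P(R | A^c) = 1 - P(R | A)
    num = p_a_given_g ** 2 * p_r_given_a
    den = num + (1 - p_a_given_g) ** 2 * p_r_given_not_a
    return num / den

# Unfair reality: P(A|M) = 3/5; fair reality: P(A|M) = 1/2 (table 4.3).
for reality, p_a_m in (("unfair", F(3, 5)), ("fair", F(1, 2))):
    for risk, p_r_a in ((1, F(7, 10)), (0, F(1, 2))):
        print(reality, "male, predicted risk", risk, float(p_arrest(p_a_m, p_r_a)))
```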

B | Results

Unfair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 757 0 1139 0 1529 0 1912 0 2296 0

Low risk women 401 0 571 0 749 0 923 0 1096 0

Low risk men 357 0 568 0 780 0 989 0 1200 0

Total high risk 1243 2000 1632 2770 2012 3541 2398 4310 2784 5080
High risk women 392 793 571 1141 742 1491 920 1844 1092 2188
High risk men 850 1207 1061 1629 1270 2050 1478 2467 1692 2892

1 2 3 4 5

Threshold 0.010 0.010 0.025 0.029 0.049

Unfair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 754 268 1123 359 1491 482 1858 519 2228 693
Low risk women 399 268 558 359 714 482 873 519 1036 693

Low risk men 355 0 566 0 777 0 985 0 1192 0

Total high risk 1246 1732 1612 2377 1981 2990 2349 3688 2718 4253
High risk women 394 525 553 752 709 941 866 1220 1025 1368
High risk men 852 1207 1060 1626 1272 2049 1483 2468 1692 2885


Fair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 757 0 1038 0 1315 0 1599 0 1873 0

Low risk women 400 0 574 0 750 0 926 0 1101 0

Low risk men 357 0 464 0 564 0 673 0 771 0

Total high risk 1243 2000 1662 2700 2085 3400 2501 4100 2927 4800
High risk women 392 792 568 1143 742 1493 917 1843 1092 2194
High risk men 851 1208 1094 1557 1343 1907 1584 2257 1835 2606

1 2 3 4 5

Threshold 0.010 0.015 0.010 0.010 0.010

Fair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 755 309 1018 378 1280 457 1544 562 1802 660
Low risk women 401 309 558 378 713 457 874 562 1025 660

Low risk men 354 0 460 0 567 0 671 0 778 0

Total high risk 1245 1691 1643 2282 2045 2869 2447 3429 2854 3997
High risk women 394 486 550 730 700 957 866 1178 1020 1384
High risk men 851 1205 1093 1553 1344 1911 1580 2251 1835 2612


No Noise - Unfair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 774 0 1136 0 1548 0 1932 0 2316 0

Low risk women 398 0 576 0 752 0 922 0 1097 0

Low risk men 376 0 586 0 796 0 1010 0 1219 0

Total high risk 1226 2000 1608 2770 1992 3540 2379 4311 2765 5081
High risk women 392 789 562 1139 741 1492 912 1834 1086 2184
High risk men 834 1211 1045 1632 1252 2048 1467 2477 1678 2897

1 2 3 4 5

Threshold 0.010 0.010 0.010 0.010 0.010

No noise - Unfair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 777 330 1138 445 1501 598 1865 683 2225 737
Low risk women 400 330 549 445 704 598 861 683 1016 737

Low risk men 376 0 590 0 797 0 1004 0 1209 0

Total high risk 1223 1670 1590 2284 1957 2859 2320 3502 2690 4178
High risk women 388 458 543 646 699 804 851 1029 1015 1294
High risk men 835 1212 1048 1637 1258 2055 1468 2473 1675 2884


No Noise - Fair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 772 0 1055 0 1337 0 1613 0 1899 0

Low risk women 399 0 574 0 746 0 923 0 1105 0

Low risk men 373 0 481 0 591 0 690 0 794 0

Total high risk 1228 2000 1645 2700 2063 3400 2487 4100 2901 4800
High risk women 391 790 565 1139 739 1485 913 1835 1087 2192
High risk men 837 1210 1080 1561 1324 1915 1574 2265 1814 2608

1 2 3 4 5

Threshold 0.010 0.010 0.010 0.010 0.010

No Noise - Fair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 773 380 1029 554 1290 539 1545 623 1801 810
Low risk women 397 380 547 554 702 539 856 623 1009 810

Low risk men 376 0 481 0 588 0 689 0 793 0

Total high risk 1227 1620 1623 2099 2011 2762 2417 3339 2824 3816
High risk women 392 410 540 534 690 853 847 1080 1007 1206
High risk men 834 1210 1083 1565 1321 1910 1570 2259 1817 2610


500 - Unfair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 772 0 813 0 851 0 890 0 926 0

Low risk women 399 0 415 0 434 0 452 0 468 0

Low risk men 373 0 397 0 417 0 438 0 458 0

Total high risk 1228 2000 1265 2077 1303 2154 1341 2231 1382 2308
High risk women 391 790 406 821 422 856 442 894 460 928
High risk men 837 1210 858 1256 881 1298 900 1338 923 1381

1 2 3 4 5

Threshold 0.010 0.010 0.010 0.010 0.010

500 - Unfair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 776 279 810 292 846 281 883 298 922 306
Low risk women 401 279 413 292 431 281 446 298 464 306

Low risk men 376 0 397 0 415 0 438 0 458 0

Total high risk 1224 1721 1264 1782 1301 1866 1338 1922 1372 1989
High risk women 389 511 406 526 420 571 438 585 452 610
High risk men 834 1210 858 1255 881 1296 900 1338 921 1379


500 - Fair situation - 70%/50% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred

Total low risk 776 0 802 0 828 0 857 0 889 0

Low risk women 398 0 418 0 433 0 452 0 472 0

Low risk men 377 0 384 0 396 0 405 0 417 0

Total high risk 1224 2000 1268 2070 1312 2140 1353 2210 1391 2280
High risk women 391 789 405 823 424 857 443 895 458 931
High risk men 833 1211 863 1247 887 1283 910 1315 933 1349

1 2 3 4 5

Threshold 0.010 0.010 0.010 0.010 0.010

500 - Fair situation - 70%/50% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 775 268 801 257 826 297 853 295 883 293
Low risk women 399 268 416 257 430 297 446 295 465 293

Low risk men 375 0 385 0 397 0 407 0 419 0

Total high risk 1225 1732 1266 1810 1308 1837 1347 1905 1384 1974
High risk women 392 523 405 564 421 553 436 587 453 625
High risk men 834 1209 861 1246 887 1284 912 1319 931 1350


Unfair situation - 40%/30% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1293 1999 1587 2595 1893 3190 2187 3783 2485 4381
Low risk women 554 789 677 1040 806 1289 930 1537 1052 1785
Low risk men 739 1210 909 1554 1087 1901 1257 2246 1434 2595

Total high risk 707 1 1010 2 1299 3 1602 6 1900 5

High risk women 235 0 363 1 483 1 609 2 735 2

High risk men 472 1 647 1 816 2 993 4 1165 3

1 2 3 4 5

Threshold 0.390 0.423 0.441 0.452 0.459

Unfair situation - 40%/30% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1291 1995 1588 2588 1886 3185 2185 3780 2480 4371
Low risk women 555 790 680 1040 805 1294 927 1537 1046 1784
Low risk men 736 1205 909 1548 1081 1892 1259 2243 1433 2587

Total high risk 709 5 1008 9 1307 8 1604 9 2906 15

High risk women 235 0 360 0 488 0 610 0 738 0

High risk men 473 5 648 9 819 8 994 9 1169 15


Fair situation - 40%/30% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1289 1999 1492 2496 1691 2989 1889 3469 2094 3791
Low risk women 553 790 675 1033 800 1282 930 1527 1054 1697
Low risk men 736 1209 817 1463 891 1707 959 1942 1040 2094
Total high risk 711 1 1008 4 1310 11 1613 32 1910 213

High risk women 238 0 360 2 487 5 612 14 736 93

High risk men 473 1 649 3 823 7 1001 18 1174 120

1 2 3 4 5

Threshold 0.390 0.438 0.468 0.486 0.496

Fair situation - 40%/30% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1290 1995 1490 2480 1688 2706 1892 1646 2125 1786
Low risk women 552 789 676 1038 802 1287 926 1538 1052 1786
Low risk men 738 1206 814 1442 886 1418 966 108 1074 0
Total high risk 710 5 1011 21 1314 297 1627 1874 1989 2328

High risk women 238 0 362 0 485 0 614 2 737 3

High risk men 473 5 649 21 829 297 1014 1872 1252 2326


Unfair situation - 10%/3% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1851 2000 2148 2596 2448 3129 2746 3788 3051 4384
Low risk women 765 788 893 1042 1015 1288 1140 1538 1271 1791
Low risk men 1087 1212 1256 1554 1433 1904 1607 2250 1780 2594

Total high risk 149 0 448 0 745 0 1043 0 1334 1

High risk women 24 0 150 0 273 0 399 0 520 0

High risk men 125 0 299 0 472 0 644 0 814 1

1 2 3 4 5

Threshold 0.103 0.205 0.265 0.307 0.334

Unfair situation - 10%/3% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1849 2000 2148 2596 2448 3192 2745 3788 3045 4384
Low risk women 768 791 890 1039 1018 1292 1142 1542 1262 1785
Low risk men 1082 1208 1258 1557 1431 1900 1603 2246 1782 2600

Total high risk 151 0 448 0 744 0 1044 1 1341 1

High risk women 24 0 149 0 274 0 399 0 522 0

High risk men 127 0 299 0 470 0 644 1 818 1


Fair situation - 10%/3% - With debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1850 2000 2051 2500 2251 3000 2451 3500 2657 3999
Low risk women 761 784 891 1039 1015 1291 1141 1541 1274 1798
Low risk men 1089 1215 1160 1461 1236 1709 1310 1959 1384 2201

Total high risk 150 0 449 0 749 0 1050 1 1343 1

High risk women 24 0 148 0 276 0 401 0 525 1

High risk men 126 0 301 0 473 0 649 0 818 1

1 2 3 4 5

Threshold 0.103 0.213 0.282 0.332 0.368

Fair situation - 10%/3% - Without debiasing

Rounds 1 2 3 4 5

real pred real pred real pred real pred real pred
Total low risk 1850 2000 2051 2499 2250 3000 2452 3499 2653 3998
Low risk women 767 791 889 1038 1015 1288 1142 1541 1264 1785
Low risk men 1083 1209 1162 1461 1235 1712 1309 1959 1389 2213

Total high risk 150 0 449 1 751 1 1049 1 1348 2

High risk women 24 0 149 0 273 0 398 0 522 0

High risk men 126 0 300 1 477 1 650 1 826 2
