
Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis, for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)

(c) Introduction

(d) Theoretical background

(e) Model

(f) Data

(g) Empirical Analysis

(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and the year of publication for the first reference and use the first name and et al. and the year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number

(d) Date of submission final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Master’s Thesis in Econometrics

Risk-profile Selection Bias

What are the effects, and how to correct it?

Vincent R. Dieduksman

Student number: 11395907
Date of final version: May 12, 2018
Master's programme: Econometrics

Supervisor: Prof. dr. M. Worring


Abstract

In real supervised learning scenarios, it is not uncommon that the available training data is subject to selection bias. In some practical settings the selection bias results from a data selection process where population samples are only selected if they meet specific characteristics of predetermined profiles. In fraud detection for example, one uses so-called risk-profiles in order to improve the control efforts by selecting the part of the population where most fraud risks are expected. However, such a risk-profile selection process leads to a specific bias in the available training data. In this thesis we first investigate how risk-profile selection bias affects the predictive performances of traditional classifiers, with the focus on the context of fraud detection. Secondly, we study and evaluate whether transfer learning could correct for risk-profile selection bias. The results of our empirical analysis suggest that risk-profile selection bias will decrease the overall predictive performances of a classifier. However, when considering the context of fraud detection, a surprising result is that risk-profile selection bias may improve the average fraud detection rate when only a top percentage of predicted risk-scores is considered. Furthermore, we propose that a feature-based transfer learning approach is most appropriate for the risk-profile selection bias problem. However, the correction method applied in the empirical analysis did not lead to any improvements.


Statement of Originality

This document is written by Vincent Dieduksman who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
  1.1 Context and Motivation
  1.2 Problem Statement and Research Questions
  1.3 Research Approach
  1.4 Thesis Outline

2 Literature Review
  2.1 Research Backgrounds
  2.2 Transfer Learning Settings
  2.3 Categorization of Risk-profile Selection Bias
  2.4 Transductive Transfer Learning Approaches
    2.4.1 Instance-based transfer
    2.4.2 Feature-based transfer
  2.5 Summary

3 Methodology and Techniques
  3.1 Monte Carlo Simulation
  3.2 The Risk-profiling Process
  3.3 Measure of Selection Bias
  3.4 Model Building Procedure
  3.5 Validation Criteria
  3.6 Correction Method: Transfer Component Analysis

4 Experimental Datasets
  4.1 Adult Census Income
  4.2 Bank Marketing

5 Empirical Analysis
  5.1 Models
  5.2 Results
    5.2.1 The Effects On Overall Predictive Performance
    5.2.2 The Effects On Fraud Detection Rates Within Top Risk-scores
    5.2.3 The Effects On Cost-benefit Trade-offs
    5.2.4 Evaluation of TCA

6 Discussion
  6.1 Conclusions
  6.2 Limitations & Further Research

Chapter 1

Introduction

1.1 Context and Motivation

Many machine learning algorithms rely on the assumption that the training- and target data are generated from the same probability distribution and feature space. However, in real-world applications this fundamental assumption is often violated, leading to deterioration of predictive performances. A prominent example is the case when machine learning models are trained on data that is subject to so-called selection bias. Selection bias is the problem where proper randomization is not feasible during the data selection process, causing some elements of the population to be less likely to be included than others. Without proper corrections, the available data will not be representative for the statistical population and therefore not appropriate for statistical inference on the population.

The urgency of understanding the effects and implications of selection bias has aroused interest in several research disciplines. In econometrics for example, a lot of attention has been paid to the effects of selection bias on causal inference, i.e. on how the reliability and consistency of model parameter estimation are affected [1][2]. In the field of machine learning, selection bias is seen in a broader perspective and considered as a specific type of the so-called dataset shift, a collective name for cases where the joint distributions differ between training and target data [3]. Machine learning literature has shown that the problem of dataset shift, like selection bias, affects the predictive performances of traditional machine learning models, and that improvements are possible by using appropriate correction methods [4].

In this thesis we will address a specific type of selection bias or dataset shift that has not yet been addressed in earlier literature. In some practical applications it is more natural to think that the available training data is generated from a selection process that is based on so-called profiles, where a profile can be seen as a set of conditions related to the input variables. In practice one might want to control the selection process by creating a large number of profiles and sampling only those elements of the population that match one of the profiles. Such a selection process will lead to selection bias since the sampled data will not be representative of the statistical population. Moreover, certain regions of the population domain will never be sampled under such a selection process, i.e. some elements in the statistical population have zero probability of being sampled. A profile-based selection process will therefore lead to a data generating process that not only differs in distribution, but also in sample space or feature space with respect to the statistical population. The latter two properties make profile-based selection bias fundamentally different from the traditional selection bias problem.
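To make the profile-based selection process concrete, the following minimal sketch (in Python, with hypothetical variable names and profile definitions that are not taken from this thesis) shows how sampling only profile matches excludes parts of the population entirely:

```python
import pandas as pd

# Hypothetical population with two categorical input variables and a fraud label.
population = pd.DataFrame({
    "sector": ["construction", "retail", "construction", "finance", "retail", "finance"],
    "size":   ["small", "large", "large", "small", "small", "large"],
    "fraud":  [1, 0, 0, 1, 0, 0],
})

# A profile is a set of conditions on the input variables.
profiles = [
    {"sector": "construction"},             # profile 1
    {"sector": "retail", "size": "large"},  # profile 2
]

def matches(row, profile):
    # A row matches a profile if it satisfies every condition of that profile.
    return all(row[var] == value for var, value in profile.items())

# Only elements matching at least one profile are ever selected; here all finance
# firms and small retail firms have zero probability of entering the training data.
selected = population[population.apply(lambda r: any(matches(r, p) for p in profiles), axis=1)]
print(selected)
```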

A prominent example where a profile-based selection process is consistently applied is the field of fraud detection. Fraud detection is an important topic that is applicable to many industries including banking and financial sectors, insurance, government agencies and law enforcement. When fraud control efforts are difficult, time-consuming and/or costly, one is confronted with the trade-off between optimizing fraud detection and minimizing control costs. In such contexts we need to make the fraud inspection process as efficient as possible. Instead of performing random inspections, one often tries to focus the inspections on the part of the population where most risks are anticipated. Based on domain knowledge and prior experiences, these risks are sometimes categorized using so-called risk-profiles, where risk-profiles correspond to those elements of the population that are assumed to be more risky than others. The use of risk-profiles will likely improve the efficiency of the inspection process.

Figure 1.1: A Flow Chart of a Profile-Based Inspection Process

Figure 1.1 depicts what a profile-based inspection process in the context of fraud detection generally looks like. The starting point of the inspection process is the presence of a statistical population that needs to be inspected. Risk-profiles are subsequently used to distinguish the risky and non-risky elements of the population. The former will potentially be inspected, whereas no further inspection is performed on the latter. Such a process alone has the limitation that one can avoid inspection by learning the risk-profiles that are used. In practice, risk-profiles are therefore regularly adapted, changed or replaced. In addition, one often combines the process with a number of random inspections in order to add a random component to the whole inspection process. Every inspection finally results in the knowledge of whether fraud was committed or not. The control results make it possible to build predictive machine learning models that can be used to estimate fraud risks. The problem, however, is that the available labeled data will be subjected to risk-profile selection bias. The question is whether the predictive models trained on the control results are generalizable to the statistical population.


1.2 Problem Statement and Research Questions

As traditional selection bias usually leads to a deterioration of predictive performances, it is reasonable to think that selection bias originating from a risk-profile selection process also affects the reliability of predictive models. What these effects really are, and whether there are existing correction methods that can correct for them, are questions this thesis aims to answer. The research efforts of this thesis can be summarized into two main research questions and several related sub-questions.

The first research question is formulated as follows:

RQ.1: What are the effects of risk-profile selection bias on the predictive performance and generalizability of traditional classifiers?

In order to answer RQ.1 we will measure the effects of risk-profile selection bias in two different ways, which leads to the following two related sub-questions:

SQ.1: Does risk-profile selection bias lead to a decrease in overall predictive performance?

SQ.2: What are the effects of risk-profile selection bias within the context of fraud detection?

The effects could also depend on the specific machine learning algorithm underlying the predictive model, and thus we will consider a third sub-question which is formulated as follows:

SQ.3: Are the effects of risk-profile selection bias different per machine learning algorithm?

The second research question concerns the investigation of possible correction theory, and is formulated as follows:

RQ.2: Are there effective methods that can make traditional machine learning models more robust to risk-profile selection bias?

In order to answer the second research question we will aim to answer the following two related sub-questions:

SQ.4: What is the theoretical nature of risk-profile selection bias, and how does this differ from the traditional selection bias problem?

SQ.5: Are there appropriate correction methods in the current machine learning literature that form a solution to the problem of risk-profile selection bias?


Before discussing the research approach, it is important to note that a distinction can be made between two types of risk-profile selection bias. Since the degree and type of risk-profile bias depend on the number, type and size of the profiles that are used, the risk-profile selection bias problem can be divided into two specific cases:

1) Every feature/item of each population variable is included in the risk-profile selection process.

2) There are features/items of population variables that are not included in the risk-profile selection process.

The first case corresponds to real-world settings where a sufficient number of profiles is used in the selection process so that the available training data contains all features of the population variables. This means that only the sample space of the underlying data generating process will be different with respect to the statistical population, and feature spaces will still be similar. The second case corresponds to an even higher degree of selection bias since some features of population variables are unknown in the training data, leading to a difference in feature space as well. This subtle difference between the two cases is illustrated in figure 1.2.

Figure 1.2: Illustration of Two Different Types of Risk-profile Selection Bias. A risk-profile selection process using profile 1 and 2 leads to a data-generating process that misses some elements of the statistical population, but contains all population features (type 1). A risk-profile selection process based on only profile 1 leads to a data-generating process that misses both elements and features of the statistical population (type 2): feature β of population variable X2 is not included in the selection process.

In this thesis we specifically focus on the first type of risk-profile selection bias, and we therefore consider the setting where all features of the population variables are contained in the available training data.
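As a rough illustration of the distinction, one could check for a given biased subset whether every category of every population variable still occurs; the helper below is a hypothetical sketch, not part of the thesis methodology:

```python
import pandas as pd

def bias_type(population: pd.DataFrame, biased_subset: pd.DataFrame) -> int:
    """Return 1 if the biased subset still contains every observed category of every
    variable (same feature space), and 2 if some categories are missing entirely."""
    for column in population.columns:
        missing = set(population[column].unique()) - set(biased_subset[column].unique())
        if missing:
            return 2  # feature space differs as well: type 2
    return 1          # only sample space / distribution differ: type 1
```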

1.3 Research Approach

In the previous two sections we gave an introduction to the risk-profile selection bias problem, and formulated the research questions that the thesis aims to answer. This section summarizes the approach that was taken in order to obtain answers to the research questions.

To analyze the effects of risk-profile selection bias it must be possible to compare the predictive performances of a model that is subjected to risk-profile selection bias to a model that is not. However, the main issue in situations where one faces the problem of risk-profile selection bias is that there is little or no labeled population data available to train an unbiased model that can serve as a benchmark. Due to this lack of labeled population data it becomes difficult or even impossible to analyze the effects by using data of real-world settings where risk-profile selection bias is present. To avoid this label problem we resorted to a Monte Carlo (MC) simulation approach based on empirical datasets where the risk-profile selection bias problem is initially not present, and combined this with a risk-profile selection process that we designed ourselves. By splitting the original dataset into several subsets we made it possible to train and validate both biased and unbiased models. By varying the degree of selection bias we considered multiple biased scenarios, ranging from low to high selection bias, where in each scenario the same model comparison approach was followed. The effects of risk-profile selection bias were finally measured using the average difference in model performances between the biased models and the unbiased model.

Several specific methods and techniques were used to answer the sub-questions related to research question 1. We made use of the Area Under the Receiver Operating Characteristic (AUROC) curve in order to measure the effects of risk-profile selection bias with respect to the overall predictive performances (SQ.1). To answer SQ.2 we followed two different approaches that are common for model validation in the context of fraud detection. The first approach is to measure the effects by analyzing whether risk-profile selection bias affects the fraud detection rates (or accuracy) within a top percentage of (risk-)scores. The second approach is based on the trade-off between the missed fraud rate and the reduction of costs, a cost-benefit measure that is commonly used in fraud detection. In order to analyze whether the effects differ between machine learning models (SQ.3), we considered two traditional classifiers: regularized logistic regression, a relatively simple model, and the random forest classifier, which is based on a more advanced algorithm.

Finally, by means of existing transfer learning literature we clarified the theoretical nature of risk-profile selection bias, and identified what kind of methods are appropriate for correction (RQ.2). Based on our findings, we chose to evaluate and test a recently proposed feature-based transfer method.

1.4 Thesis Outline

This section gives a brief overview of how the thesis is structured:

Chapter 2 (Literature Review) Starts with a description of earlier research on selection bias and other related problems. It then elaborates the discussion by giving a short overview of transfer learning theory. This helps to identify the specific nature of risk-profile selection bias, and to categorize it in one of the transfer learning settings. It ends by discussing two popular transfer approaches, and it concludes that a feature-based transfer approach is most appropriate to tackle the risk-profile selection bias problem.


Chapter 3 (Methodology & Techniques) Gives a detailed explanation of the approach that is taken to answer the research questions. This includes the discussion of all important methods and techniques that are used to analyze the effects of risk-profile selection bias. It ends by giving a short explanation of the correction method that is evaluated and tested in the empirical analysis.

Chapter 4 (Experimental Datasets) Introduces the two experimental datasets that are used in the empirical analysis. It contains a detailed description of all variables and features per dataset.

Chapter 5 (Empirical Analysis) Starts by formulating the different models that are contained in the empirical analysis. It then discusses the results of the empirical analysis, which are divided into four parts: 1) how risk-profile selection bias affects the overall model performance, 2) how risk-profile selection bias affects the average fraud detection rate within the top percentage of risk-scores, 3) how risk-profile selection bias affects the trade-off differences within the fraud detection context, and 4) how well the proposed correction method was able to improve the predictive performances.

Chapter 6 (Discussion) Contains the main conclusions with respect to the research questions. It then discusses the limitations of the research approach followed in this thesis, and ends by giving suggestions for further research.


Chapter 2

Literature Review

In this section we will give a theoretical background that will help us better understand the risk-profile selection bias problem, and to determine whether there exists correction theory that can form a solution to this problem. We start off by discussing earlier research efforts in the fields of econometrics, machine learning and transfer learning, and conclude that the latter offers the most comprehensive study on phenomena where joint distributions differ between training and target data. We will then give a short overview of transfer learning fundamentals and settings that will help us relate the risk-profile selection bias problem to the current transfer learning literature. Finally, we will discuss the so-called instance-based and feature-based transfer approaches and suggest that the latter is most appropriate to correct for risk-profile selection bias.

2.1 Research Backgrounds

The selection bias problem has received a lot of attention in the field of econometrics. In fact, the term selection bias has its origin in econometrics, and it is defined as the systematic error due to a non-random sample of a population. Literature has shown that the presence of selection bias leads to inconsistent parameter estimates, unless corrective measures are taken [5]. In order to assure consistent parameter estimation, many selection models have been proposed [6], of which the Heckman model [1] and Roy model [7] are leading examples. The drawback of the proposed correction methods is that they are only applicable to the linear regression models commonly used in econometrics, where consistent parameter estimation is the main focus. Recent machine learning literature, however, has studied the selection bias problem in a context where predictive performances are of most interest instead of the underlying mechanisms that generate the data [4][8][9]. This literature proposes so-called reweighting methods that can improve predictive performance under selection bias. Reweighting methods appear to be a specific class of correction methods that exist in the transfer learning literature, and will be discussed in more detail in Section 2.4.1.

Although selection bias has been studied in the field of econometrics as a subject in its own right, the machine learning literature has addressed the more general problem where the joint distribution differs between training and target domain, of which the selection bias problem is just one example. Several subfields in machine learning have studied this problem in parallel to each other. An example is the literature that uses the 'dataset shift' terminology [3], a collective name for cases where training and target domain have different probability distributions. Selection bias is then seen as a specific type of 'shift'. Other examples of dataset shifts are the covariate shift [10], the prior probability shift [10] and the imbalanced data problem. Another related area of study, known as transfer learning, deals with the even more general problem of transferring information from a variety of previous different domains to help with learning, inference and prediction in a new domain. Hence it includes the dataset shift, which covers the case where the transfer of knowledge involves only two domains: one source domain and one target domain. Transfer learning is a relatively young field within machine learning with applications in many different areas such as text classification [11], image and video recognition [12] and activity recognition [13]. Due to its rapid growth over the past years, the transfer learning field now offers the most comprehensive study on problems where the transfer of knowledge is necessary, including issues such as selection bias. Sinno Pan and Qiang Yang [14] published a thorough survey on transfer learning in which they reviewed the progress on transfer learning and discussed its relation to the independent efforts on subjects such as selection bias, dataset shift and domain adaptation.

2.2 Transfer Learning Settings

The question is how risk-profile selection bias relates to the existing transfer learning literature. To answer this question, we will briefly discuss transfer learning fundamentals that will help categorize the risk-profile selection bias problem in one of the transfer learning settings. We start off by giving the definition of transfer learning given by Pan and Yang:

Definition (Transfer Learning) Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T.

In the above definition a domain is considered as a pair D = {X, P(X)}, where X is the feature space, P(X) the marginal distribution, and X ∈ X. The condition D_S ≠ D_T means that the source and target domains are different and implies that either X_S ≠ X_T or P_S(X) ≠ P_T(X). In a similar way we have a task T = {Y, P(Y|X)}, where the condition T_S ≠ T_T implies that either Y_S ≠ Y_T or P(Y_S|X_S) ≠ P(Y_T|X_T).

There are many practical situations where one of the above transfer learning conditions holds. Consider for example the covariate shift problem, where source and target distributions are different due to the non-stationary character of the environment (e.g. location, time) in which the data is gathered. Another example is the spam-filtering problem, where the distribution and list of words change rapidly, leading to differences in both distribution and feature space across domains.

Depending on the different situations between domains and tasks, transfer learning can be categorized into three different settings: inductive transfer learning, transductive transfer learning, and unsupervised transfer learning. How these settings are related to traditional machine learning is summarized in table 2.1.

Learning Setting | Source and Target Domains | Source and Target Tasks
Traditional Machine Learning | The same | The same
Inductive Transfer Learning | The same | Different but related
Transductive Transfer Learning | Different but related | The same
Unsupervised Transfer Learning | Different but related | Different but related

Table 2.1: Relationship between Traditional Machine Learning and Various Transfer Learning Settings. Adapted from Sinno Jialin Pan & Qiang Yang (2010).

Inductive transfer learning

Inductive transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, while the tasks are different and the domains are similar (T_S ≠ T_T and D_S = D_T). It is required to have some labeled data in the target domain to induce the objective predictive function f_T(·). When some or no labeled data is available in the source domain, inductive transfer learning is similar to the multi-task and self-taught learning problems [15][16].

Transductive transfer learning

In the transductive transfer setting the source and target tasks are the same, while the source and target domains are different (T_S = T_T and D_S ≠ D_T). It is assumed that there are only labels available in the source domain, together with some unlabelled target-domain data that is used for transferring information. The degree to which the source and target domains differ is categorized into two different cases:

a) The marginal distributions of the input data are different, but the feature spaces are the same: P_S(X) ≠ P_T(X), but X_S = X_T.

b) The feature spaces between the source and target domains are different: X_S ≠ X_T.

The popular problems of selection bias, covariate shift and domain adaptation are related to the first case, where only the distribution is different across domains. Areas facing the problem of different feature spaces are for example text and image classification, where the input variables of the source and target domain are not perfectly overlapping.

Unsupervised transfer learning

Finally, in the unsupervised transfer learning setting both the tasks and the domains are different (T_S ≠ T_T and D_S ≠ D_T). Usually no labeled data is available in either the source or the target domain during training. The focus usually lies on the implementation of transfer learning to unsupervised problems such as clustering, dimensionality reduction and density estimation [17].

2.3 Categorization of Risk-profile Selection Bias

In order to assign the risk-profile selection bias problem to one of the aforementioned transfer learning settings, we have to answer the following two questions:

1) Are the source and target tasks different? (Condition T_S ≠ T_T)

2) And/or, are the source and target domains different? (Condition D_S ≠ D_T)

Since the risk-profile selection process is conditioned on values of the input variables and not on the dependent variable, the bias is introduced via differences in domains rather than differences in tasks. This means that the answers to the first and second question are negative and positive, respectively, and that risk-profile selection bias corresponds to the transductive transfer learning problem.

The next question is whether risk-profile selection bias corresponds to the first or the second case within the transductive transfer learning setting. Since proper randomization is not achieved under a risk-profile selection process, it is evident that there will be a difference in distribution, and thus P_S(X) ≠ P_T(X). This is also the case for the general problem of selection bias; however, a subtle and crucial difference between risk-profile selection bias and the traditional form of selection bias is that the support condition only holds in the latter case. The support condition implies that the supports of both domains coincide, meaning that all elements of one domain can be sampled out of the other domain with a non-zero probability. This indicates that risk-profile selection bias not only causes differences in distribution between the source and target domain, but also in sample space or even feature space (as illustrated earlier in figure 1.2). This motivates classifying the transductive transfer learning setting into three, rather than two, different sub-settings (see table 2.2).

Sub-setting | Source and Target Distribution | Source and Target Sample Space | Source and Target Feature Space | Examples
1 | Different | The same | The same | (Sample) Selection Bias; Covariate Shift; Domain Adaptation
2 | Different | Different | The same | Risk-Profile Selection Bias Type 1
3 | Different | Different | Different | Risk-Profile Selection Bias Type 2

Table 2.2: Categorization of Risk-profile Selection Bias within the Transductive Transfer Learning Setting

Depending on whether all features of the original population variables are included in the selection, the risk-profile selection bias problem can be assigned to either the second or the third sub-setting within the transductive transfer learning setting. Note that a difference in feature space implies a difference in sample space, and a difference in sample space implies a difference in distribution.


2.4 Transductive Transfer Learning Approaches

Knowing to which transfer learning setting the risk-profile selection bias problem belongs, it becomes more evident what kind of transfer learning method should be used for correction. It appears that there are two different transfer approaches that are commonly used to solve a transductive transfer learning problem: the instance-based transfer approach and the feature-based transfer approach. Both approaches are based on the idea that one can adapt the predictive function learned in the source domain through some unlabelled target-domain data, so that it can be used in the target domain. The way in which this is done differs between the two approaches.

2.4.1 Instance-based transfer

Instance-based transfer methods attempt to transfer knowledge by reweighting the empirical loss function on an instance level. The weights are the ratio between the marginal probabilities of the target and source domain, and can be seen as a penalty value for each instance or observation. The basic idea is to put relatively more weight on instances whose sampling probability is higher in the target domain than in the source domain.

In recent years several methods have been proposed to estimate these 'transfer weights'. One can calculate the transfer weights through direct density estimation on both the source and target domain, however this can be difficult due to the curse of dimensionality. Sugiyama et al. [18] proposed a more efficient algorithm known as the Kullback–Leibler Importance Estimation Procedure (KLIEP) to estimate the densities based on minimization of the Kullback–Leibler divergence. In addition, Huang et al. [19] proposed a nonparametric method known as Kernel Mean Matching (KMM) that avoids density estimation by calculating the transfer weights through the matching of means in a reproducing kernel Hilbert space (RKHS). A limitation of the instance-based transfer approach, however, is that the sampling probability for all possible instances should be non-zero in both the source and target domain (P_S(x) > 0 and P_T(x) > 0 for all x). Without this condition it is not possible to estimate the transfer weights for each instance. The aforementioned instance-based transfer methods are therefore only applicable to the first transductive transfer learning setting, where the supports of D_S and D_T coincide (X_S = X_T).
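For illustration, the sketch below obtains transfer weights with a simple discriminative density-ratio estimate (a domain classifier), which is a different estimator than the KLIEP and KMM methods discussed above but serves the same purpose; the array names are assumptions. Note that, as argued above, such weights are only meaningful when the support condition holds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def transfer_weights(X_source, X_target):
    """Estimate w(x) = P_T(x) / P_S(x) for every source instance with a domain classifier.

    A simple discriminative alternative to KLIEP/KMM: train a classifier to separate
    source (label 0) from target (label 1) samples and convert its probabilities into
    density ratios. Only meaningful when P_S(x) > 0 wherever P_T(x) > 0.
    """
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_target = clf.predict_proba(X_source)[:, 1]
    # w(x) is proportional to P(target | x) / P(source | x), rescaled by the sample-size ratio.
    return (p_target / (1.0 - p_target)) * (len(X_source) / len(X_target))

# The weights can then be passed to any estimator that accepts sample weights, e.g.:
# model = LogisticRegression().fit(X_source, y_source, sample_weight=transfer_weights(X_source, X_target))
```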

2.4.2 Feature-based transfer

The feature-based transfer approach, on the other hand, allows differences in feature spaces and relaxes the support condition. The assumption behind the feature-based transfer approach is that observed data are controlled by a set of latent factors. When two domains are different but related to each other, they may share some latent factors that cause the data distributions to be similar. This is illustrated in figure 2.1.

Figure 2.1: Illustration of Common Latent Factors of a Source and Target Domain. The blue dots represent the common latent factors of the source and target domain. The green (red) dots correspond to latent factors that are specific to the source (target) domain.

The main idea of a feature-based transfer approach is to exploit the common latent factors and to learn a feature transformation φ that maps the source and target data onto a latent subspace where the two data distributions are similar, while preserving the original data structures as much as possible. This is illustrated in figure 2.2.

Figure 2.2: Illustration of the Feature-based Transfer Approach. Both the source and target domain are embedded in a subspace of shared latent factors by means of a feature transformation φ. The success of the transformation depends on the degree to which the original data structure is preserved.

Standard learning algorithms can be applied on the latent subspace to train models based on labelled source domain data, and to make predictions on the unlabelled target domain data.

Recent literature proposes two different ways to learn φ. The first solution is to encode application-specific knowledge to learn the transformation. The idea is to use domain knowledge in order to identify so-called pivot features that can form a bridge between the source and target domain. These pivot features are then used to align all domain-specific features across domains by means of Structural Correspondence Learning (SCL). Such pivot-feature methods are especially applied in the context of natural language processing and text classification [20]. The second solution is to learn the transformation automatically without any domain knowledge interventions. Pan et al. have shown that this can be achieved via dimensionality reduction. They first introduced the Maximum Mean Discrepancy Embedding (MMDE) method [21], where a low-dimensional latent feature space is learned in which the distributions of the source and target domain are similar or close to each other. By projecting the data onto this latent feature space, domain differences decrease or diminish and standard learning algorithms can be applied. They subsequently proposed a more efficient version of MMDE, known as the Transfer Component Analysis (TCA) method [22]. Both methods have been verified by experiments in the context of cross-domain indoor Wi-Fi localization and cross-domain text classification.

2.5 Summary

Many research fields have studied problems where the available training data has a different distribution and/or a different feature space than the target data. The transfer learning field currently offers the most comprehensive study on this subject. Every transfer problem can be categorized into one of the three transfer learning settings: inductive, transductive and unsupervised transfer learning. Risk-profile selection bias is an example of the specific case within transductive transfer learning where the training and target data differ not only in distribution, but also in support and/or feature space. Due to the violation of the support condition, classical instance-based transfer methods are not appropriate to correct for risk-profile selection bias. To tackle the problem by means of transfer learning, one should follow a feature-based transfer approach. However, the empirical evidence for feature-based transfer methods is still limited to specific contexts such as cross-domain text classification.


Chapter 3

Methodology and Techniques

In this section we will give a detailed explanation of the methodology used to answer the research questions. In section 3.1 we start by discussing the general approach that was taken in order to investigate the effects of risk-profile selection bias on model performance. This includes the choice of using experimental datasets and performing Monte Carlo (MC) simulations. In sections 3.2-3.5 we will discuss several important aspects of the MC simulation approach, such as how we designed our risk-profiles and how we measured the degree of risk-profile selection bias. Finally, in section 3.6 we will briefly discuss Transfer Component Analysis, the correction method that we evaluated during the empirical analysis.

3.1 Monte Carlo Simulation

Our study on the effects of risk-profile selection bias is based on an MC simulation experiment. The MC simulation is performed on experimental datasets. The two experimental datasets that are used in the empirical analysis will be described in chapter 4. Within each simulation round the experimental dataset is split into multiple subsets so that both unbiased and biased models can be trained and validated. The data splitting procedure is depicted in figure 3.1.

The experimental dataset is the starting point during each simulation round, as it represents the statistical population. A subset of the data is then profiled as risky by means of a set of risk-profiles. Next, the total dataset is randomly split into two distinct subsets: the source domain and the target domain. Each domain is a random sample out of the original dataset, and thus both are representative of the statistical population. The source domain is used for model building and can be distinguished into two different training domains: the biased training domain and the unbiased training domain. The biased training domain is the part that contains all data points labeled as risky, and is used to build a biased model. The unbiased training domain on the other hand contains both risky and non-risky labeled data, and is used to build an unbiased model. This is done by taking a random sample with equal size to the biased training domain. Note that the unbiased training domain equals the source domain. Finally, both the biased model and the unbiased model are validated using the target domain.

Figure 3.1: Data Splitting Procedure of A Single Simulation Round
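A minimal sketch of one simulation round's splitting procedure could look as follows, assuming a DataFrame with a boolean column `risky` produced by the risk-profiling step; the column name and the 50/50 source-target split are illustrative assumptions rather than the thesis' exact settings:

```python
import pandas as pd

def split_one_round(data: pd.DataFrame, seed: int) -> dict:
    """One simulation round: split the population into a source and a target domain
    and derive the biased and unbiased training domains (cf. figure 3.1)."""
    # Source and target domains are random halves, both representative of the population.
    source = data.sample(frac=0.5, random_state=seed)
    target = data.drop(source.index)

    # Biased training domain: only the observations flagged by the risk-profiles.
    biased_train = source[source["risky"]]

    # Unbiased training domain: a random sample of the source domain of equal size.
    unbiased_train = source.sample(n=len(biased_train), random_state=seed)

    return {"biased": biased_train, "unbiased": unbiased_train, "target": target}
```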

By means of MC simulation the above model comparison experiment is performed a large number of times. The biased- and unbiased models will then be validated and compared based on the average results of the MC simulation. By letting the risk-profiles be randomly created in each simulation round, the MC simulation will cover a wide range of possible combinations of profiles instead of just one particular set of profiles. Due to this variation within the MC simulation, we assume that the average results can lead to general insights with respect to the effects of risk-profile selection bias. Furthermore, since we are interested in the relationship between the degree of risk-profile selection bias and the decrease in predictive performance, the MC simulation must be performed for different degrees of risk-profile selection bias. This can be achieved by considering multiple selection bias scenarios that range from low to high degrees of selection bias, where one MC simulation is performed for each scenario. This is illustrated in figure 3.2.

Figure 3.2: The MC simulations Range from Low to High Degrees of Selection Bias. The data splitting procedure explained in 3.1 is followed for multiple different selection bias scenarios that range from low to high degrees of bias. This leads to multiple MC simulations.

We expect the measured effects to be related to the degree of selection bias; the larger the degree of selection bias, the larger the average differences will be between the biased and unbiased model performances.

3.2 The Risk-profiling Process

The profiling process is a crucial part of the MC simulation described above. The way the profiling process is designed determines the type and degree of selection bias to which the biased models are subjected, and will ultimately determine the differences in model performances between the biased and unbiased models. It is therefore important that the profiling is done in such a way that it mimics risk-profile selection processes in real-world settings. This basically means that multiple (different) profiles should be created at the beginning of each simulation round, and that each profile should represent a conjunction of features that indicates high risk. The risk-profiling process we followed is described in Algorithm 1.

Algorithm 1 Risk-profiling Process in Pseudo-code

Require: A dataset that represents the statistical population.
Ensure: A set of risk-profiles S_RP that together produce a biased subset of the statistical population.

1: WHILE size of the biased subset < threshold 1, do:
2:   Randomly select M > 2 variables
3:   Randomly select 1 feature for each of the selected variables
4:   Create a profile based on the first two variables: P_new = DATA[variable 1 == feature 1 & variable 2 == feature 2]
5:   WHILE size of P_new > threshold 2, do:
6:     Filter P_new based on one more of the selected variables
7:   IF prior positive-class probability (ppp) of P_new > ppp of population × threshold 3:
       S_RP = S_RP ∪ P_new, making sure that there are no duplicates
8:   ELSE: return to step 2

There are three thresholds that govern the risk-profiling process. Threshold 1 is an important parameter that controls the size of the biased subset, and thus the ratio between the sizes of the total dataset and the biased subset. This makes it possible to consider different degrees of risk-profile selection bias. Threshold 2 is important as it limits the size of a profile, which makes it possible to avoid the creation of too large and too general profiles. Furthermore, in real-world applications it is often the case that profiles are designed in such a way that they indicate more risk than other elements of the population. To incorporate this risky character, each profile should have a prior class probability that is higher than in the total dataset. The difference in prior class probability is determined by threshold 3. Finally, as we only consider the case where the feature spaces are similar across domains, we can repeat the above procedure until the feature spaces of the biased subset and total dataset are equal.
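A possible Python translation of Algorithm 1 is sketched below; the threshold parameters, the fixed choice of M = 3 variables, and the data layout are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def build_risky_subset(data, target_col, thr1, thr2, thr3, rng):
    """Sketch of Algorithm 1: create random profiles until the union of all profiled
    rows reaches the desired size.

    thr1: maximum size of the biased subset as a fraction of the data (threshold 1).
    thr2: maximum size of a single profile as a fraction of the data (threshold 2).
    thr3: minimum risk factor relative to the population positive-class probability (threshold 3).
    """
    variables = [c for c in data.columns if c != target_col]
    base_rate = data[target_col].mean()
    risky_index = pd.Index([])

    while len(risky_index) < thr1 * len(data):
        # Steps 2-3: pick M = 3 variables and one feature (value) per variable.
        chosen = list(rng.choice(variables, size=3, replace=False))
        values = {v: rng.choice(data[v].unique()) for v in chosen}

        # Step 4: profile based on the first two variables.
        profile = data[(data[chosen[0]] == values[chosen[0]]) &
                       (data[chosen[1]] == values[chosen[1]])]

        # Steps 5-6: narrow the profile with the remaining variable(s) while it is too large.
        for v in chosen[2:]:
            if len(profile) <= thr2 * len(data):
                break
            profile = profile[profile[v] == values[v]]

        # Steps 7-8: keep the profile only if it is sufficiently risky; the index
        # union automatically avoids duplicate observations.
        if len(profile) > 0 and profile[target_col].mean() > base_rate * thr3:
            risky_index = risky_index.union(profile.index)

    return data.loc[risky_index]

# Example call: risky = build_risky_subset(data, "fraud", 0.2, 0.05, 1.5, np.random.default_rng(0))
```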


3.3 Measure of Selection Bias

For measuring the degree of risk-profile selection bias we used the ratio between the size of the biased subset and the size of the total dataset. We will call this measure the Biased/Total Ratio (BTR). In the previous section we have seen that when we increase the size of the biased subset, more profiles will be used in the selection process. Furthermore, when more profiles are used, more elements of the population are included. This means that there is a positive correlation between the BTR and the number of elements of the population that are included in the selection process. This is shown in figure 3.3, where the percentage of included population elements is plotted against the BTR.

Figure 3.3: The Relation Between BTR and The Number of Population Elements Included in the Selection Process. The points plotted are the average values over S = 50 simulations for 7 different BTR values. In each simulation the selection process described in section 3.2 is re-used to create new profiles and to select a new biased subset. One of the experimental datasets is used as the starting dataset. The standard deviations are shown for both the horizontal and vertical axis.

The absence of population elements within the selection process results in risk-profile selection bias, and thus based on the relation seen in figure 3.3, we assume that BTR is negatively correlated with the degree of risk-profile selection bias.

Besides this measure of selection bias, there are several methods proposed in the literature that measure the differences between data distributions. A large group of these methods are so-called statistical distance measures, of which the Kullback–Leibler divergence is a popular example. A limitation of these parametric distance measures, however, is that they all require some intermediate density estimation, leading to the curse of dimensionality in case of large empirical datasets. A non-parametric alternative is the Maximum Mean Discrepancy (MMD) distance measure. MMD measures the distance between two empirical distributions based on the distance between their means in a reproducing kernel Hilbert space (RKHS). The main application of MMD is the two-sample problem, where MMD is used to test whether two data distributions are significantly different [23][24]. However, the downside of MMD is that it is not a normalized measure, in the sense that its value cannot be compared across different data domains. For example, if one wants to compare two MMD distances (where each MMD distance is based on two datasets), all four datasets should live in the same sample space. The MMD measure is therefore not suitable for the MC simulation approach, as the sample spaces of the biased and unbiased domain are different in every simulation round.
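For reference, a minimal sketch of the (biased) empirical MMD estimate with an RBF kernel is given below; the bandwidth parameter gamma is a free choice:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Squared Maximum Mean Discrepancy between two samples, using an RBF kernel.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)], estimated with empirical means.
    """
    def rbf(A, B):
        sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq_dists)

    return rbf(X, X).mean() + rbf(Y, Y).mean() - 2 * rbf(X, Y).mean()
```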

3.4 Model Building Procedure

We will now discuss the model building procedure that is used to build the biased and unbiased models. This procedure is often followed in the context of fraud detection, and can be summarized in two parts: 1) the estimation of the model (hyper)parameters and 2) a final threshold determination based on cost-benefit trade-offs (see figure 3.4).

Figure 3.4: Model Building Procedure

The first step is to split the available labeled data into two subsets: the training set (large%) and the test set (small%). The training set is then used for model fitting and parameter optimization, for example by means of cross-validation. During this training phase, model parameters are tuned and optimized in such a way that they correspond to a convenient balance between the true positive- and true negative rates (sensitivity and specificity). Once the models are trained and optimized, it is preferred in fraud detection contexts to have an extra test set at hand in order to validate the tuned model a final time on out-of-sample data. This final validation makes it possible to create a cost-benefit curve that visualizes the true trade-offs between costs and benefits in a reliable way. This cost-benefit curve can then be used to adjust or choose the threshold that corresponds to the desires of decision makers.

In this thesis we use the trade-off between the missing fraud rate and the reduction of costs, as this trade-off is often used in fraud detection contexts. See figure 3.5 for an illustration. The threshold value must be chosen such that it corresponds to the desired trade-off between the missing fraud rate and the reduction of costs. Management or other decision makers can for example choose to accept 19% missing fraud cases so that controls can be reduced by 58%, which means that the threshold value must be set at 0.4.

Figure 3.5: Trade-off Between Missing Fraud Rate and the Reduction of Costs. Some points on the trade-off curve are annotated with the corresponding threshold values.
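One way such a trade-off curve could be computed is sketched below, where cost reduction is taken as the share of cases that falls below the threshold and is therefore not inspected; the exact definitions used in the thesis may differ, and the function and argument names are assumptions:

```python
import numpy as np

def tradeoff_curve(y_true, risk_scores, thresholds):
    """For every threshold: inspect only cases whose risk-score is at or above it.

    Returns (missing fraud rate, cost reduction) pairs, where cost reduction is the
    share of cases that no longer needs to be inspected and the missing fraud rate is
    the share of all fraud cases that falls below the threshold.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(risk_scores)
    curve = []
    for t in thresholds:
        not_inspected = scores < t
        missing_fraud_rate = y_true[not_inspected].sum() / max(y_true.sum(), 1)
        cost_reduction = not_inspected.mean()
        curve.append((missing_fraud_rate, cost_reduction))
    return curve

# Example call: tradeoff_curve(y_test, model.predict_proba(X_test)[:, 1], np.linspace(0, 1, 101))
```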

3.5 Validation Criteria

The model performances of the biased and unbiased models are validated following three different approaches. The first approach is to compare the model performances based on the Area Under the Receiver Operating Characteristic curve (AUROC). The AUROC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. The AUROC is threshold independent and is suitable to measure the overall predictive power of a binary classifier. The second approach is to compare the model performances based on the accuracy within the top prediction scores, i.e. top risk-scores. This approach is common in the field of fraud detection. Due to limited resources one usually focuses the control efforts on those samples where most risk is predicted. Models are therefore successful if the accuracy or fraud detection rates are high within the top risk-scores that are predicted by the model.
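The first two validation criteria could be computed along the following lines; the top fraction of 5% and the variable names are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def top_fraction_detection_rate(y_true, risk_scores, top_fraction=0.05):
    """Fraud detection rate (precision) within the top fraction of predicted risk-scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(risk_scores)
    k = max(int(top_fraction * len(scores)), 1)
    top_idx = np.argsort(scores)[::-1][:k]  # indices of the k highest risk-scores
    return y_true[top_idx].mean()

# Overall predictive performance and fraud detection rate within the top 5% of risk-scores:
# auroc = roc_auc_score(y_target, scores)
# rate = top_fraction_detection_rate(y_target, scores, top_fraction=0.05)
```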

Finally, models will be validated based on the trade-off between the missing fraud rate and the reduction of costs. When cost-benefit analysis is based on this trade-off, the question is whether the chosen and anticipated percentages of the missing fraud rate (and/or cost reduction) coincide with the realized percentages when the model is applied on target data. Considering the example given earlier (see figure 3.5), even though a threshold is chosen such that the model will accept 19% missed fraud cases (permitting 58% cost reduction), the realized percentages could be very different when the model is subjected to risk-profile selection bias. To analyze the effects of risk-profile selection bias we will therefore consider the difference between the anticipated trade-off that is chosen during model building, and the real trade-off that results from the final model validation on the target data (population). This application-specific measure will be called the trade-off difference, and consists of two parts: the difference with respect to the missing fraud rate (%), and the difference with respect to the reduction in costs (%).

3.6 Correction Method: Transfer Component Analysis

The approach and methods discussed in the previous sections provide a good basis for analyzing the effects of risk-profile selection bias. Another main goal of this thesis is to know whether there are effective methods that can correct for the effects of risk-profile selection bias. In this final section we will discuss the correction method that we applied and evaluated during the empirical analysis. Our literature study has shown that risk-profile selection bias can be seen as a special case within the transductive transfer learning setting for which a feature-based transfer approach is most appropriate. As the experimental datasets do not come from the areas of NLP and text-mining (see chapter 4), we chose to follow an automatic feature-based transfer approach rather than using a pivot-feature method. We chose to evaluate the Transfer Component Analysis method, as it is more efficient than its predecessors. TCA is based on the dimensionality reduction framework proposed by Pan et al. [21][25], of which the step-wise algorithm is shown in Algorithm 2.

Algorithm 2 A Dimensionality Reduction Framework for Transductive Transfer Learning. Adapted from Jialin Pan (2010).

Require: A labeled source domain data set D_S = {(x_{S_i}, y_{S_i})}, an unlabeled target domain data set D_T = {x_{T_i}}
Ensure: Predicted labels Y_T of the unlabeled data X_T in the target domain.

1: Learn a transformation mapping φ such that Dist(φ(X_S), φ(X_T)) is small, and such that φ(X_S) and φ(X_T) preserve properties of X_S and X_T respectively.
2: Train a classification or regression model f on φ(X_S) with the corresponding labels Y_S.
3: Map the unlabeled data x_{T_i} in D_T to the latent space to obtain the new representations φ(x_{T_i}), and use the model f to make the predictions f(φ(x_{T_i})).
4: return φ and f(φ(x_{T_i}))

The first step is the key step, because after learning the transformation mapping φ, one can follow the normal model training procedure by using the transformed data φ(XS) and φ(XT).

Predictions can then be made for the target domain by only using labeled data from the source domain. This dimensionality reduction framework has two great advantages: 1) most existing machine learning methods can easily be integrated within the framework, and 2) it can be applied to both classification and regression tasks.

The main idea behind TCA is to minimize the MMD distance between the empirical means of the two domains D_S = {(x_{S_i}, y_{S_i})}_{i=1}^{n_S} and D_T = {x_{T_i}}_{i=1}^{n_T}. This minimization problem is translated into a corresponding kernel learning problem, since it is hard to solve directly. By means of certain conditions in the kernel learning problem it is made possible to preserve most of the variation in the original data, and thus a balance is created between minimizing domain differences and preserving the original data structure. The solution to the kernel learning problem is a linear transformation matrix W that is constructed out of the m leading factors (transfer components). The matrix W is finally used to embed the source and target domain data into an m-dimensional latent subspace (where m << n_S, n_T), on which classic machine learning algorithms can be applied. For a more detailed mathematical discussion the reader is referred to the original papers.
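A compact sketch of the unsupervised TCA variant, following the eigendecomposition described above, is given below; it is not the authors' implementation, and the kernel choice and hyperparameters (mu, gamma, number of components) are assumptions that would normally be tuned:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def tca_embedding(X_source, X_target, n_components=5, mu=1.0, gamma=1.0):
    """Sketch of (unsupervised) Transfer Component Analysis.

    Builds a kernel on the pooled data and finds the directions that minimize the MMD
    between the two domains while preserving variance, following Pan et al. [22].
    """
    ns, nt = len(X_source), len(X_target)
    n = ns + nt
    X = np.vstack([X_source, X_target])

    K = rbf_kernel(X, X, gamma=gamma)  # kernel matrix on the pooled data

    # MMD coefficient matrix L and centering matrix H.
    e = np.vstack([np.full((ns, 1), 1.0 / ns), np.full((nt, 1), -1.0 / nt)])
    L = e @ e.T
    H = np.eye(n) - np.ones((n, n)) / n

    # Transfer components: leading eigenvectors of (K L K + mu I)^{-1} K H K.
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    W = eigvecs[:, np.argsort(-eigvals.real)[:n_components]].real

    Z = K @ W  # embedded data in the m-dimensional latent subspace
    return Z[:ns], Z[ns:]

# A classifier trained on the embedded source data can then be applied to the embedded target data.
```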


Chapter 4

Experimental Datasets

In this chapter we will give a description of the two experimental datasets that are used in the empirical analysis. These experimental datasets were carefully chosen and consist of mixed data types (mixture of numeric, ordinal and nominal variables), a binary target variable, and reasonable data sizes. Note that the datasets do not correspond to contexts where risk-profile selection bias most often occurs, such as risk-analysis or fraud detection. By ensuring that the data type, data structure and the applied selection process are similar to real-world settings, we expect that the average results of the MC simulations will give general insights on how risk-profile selection bias can affect model performance.

4.1 Adult Census Income

The Adult Census Income (ACI) dataset is taken from the UCI Machine Learning Repository. The UCI ML Repository is an online archive of databases that are used and generated by the machine learning community for the empirical analysis of machine learning algorithms. The ACI dataset is an extracted and pre-processed version of the raw data available in the 1994 Census database. Part of this pre-processing concerns the deletion of missing values, the removal of cases where the age is lower than 16, and the discretization of gross income into two ranges with threshold $50,000. Table 4.1 shows the descriptive statistics of the ACI dataset. The target variable is separated into two classes: low and high census income (low = 0, high = 1). Low or high census income corresponds to earnings of less or more than $50K per year. The input variables consist of binary, categorical and numeric data types. The categorical and binary variables are: education, marital status, native country, occupation, race, relationship and sex. The numeric variables are: age, capital gain, capital loss, final weight (fnlwgt) and working hours per week.


Type N % Type N %

All 48759 100.0 Divorced 6632 13.6

Target variable Separated 1530 3.1

LABEL Binary Widowed 1517 3.1

0 37093 76.1 Married spouse absent 626 1.3

1 11666 23.9 Native country Binary

Input variables United States 44608 91.5

Age Numeric Restcategory 4151 8.5

<20 2510 5.1 Occupation Categorical

20 - 29 10782 22.1 Prof-specialty 8962 18.4

30 - 39 12946 26.6 Craft-repair 6107 12.5

40 - 49 11083 22.7 Exec-managerial 6082 12.5

>49 11466 23.5 Adm-clerical 5602 11.5

Capital gain Numeric Sales 5498 11.3

0 44730 91.7 Other-service 4916 10.1

>0 4029 8.3 Machine-op-inspct 3019 6.2

(Mean = 1078.0) Transport-moving 2353 4.8

(Max = 999999) Handlers-cleaners 2069 4.2

Capital loss Numeric Farming-fishing 1481 3.0

0 46481 95.3 Tech-support 1446 3.0

>0 2278 4.7 Restcategory 1224 2.5

(Mean = 87.6) Race Binary

(Max = 4356.0) White 41690 85.5

Education Categorical Black 4677 9.6

HS-grad 15748 32.3 Asian-Pac-Islander 1517 3.1

Some-college 10860 22.3 Restcategory 875 1.8

Bachelors 8018 16.4 Relationship Categorical

Restcategory 6398 13.1 Husband 19690 40.4

Masters 2655 5.4 Not-in-family 12577 25.8

Assoc-voc 2058 4.2 Own-child 7563 15.5

Assoc-acdm 1597 3.3 Unmarried 5123 10.5

Prof-school 832 1.7 Wife 2304 4.7

Doctorate 593 1.2 Other-relative 1502 3.1

fnlwgt Numeric Sex Binary

<100000 8544 17.5 Male 32602 66.9

100000 - 200000 21686 44.5 Female 16157 33.1

>200000 18529 38.0 Work label Categorical

Working hours p/w Numeric Private 36676 75.2

<40 11657 23.9 Self-emp-not-inc 3859 7.9

40 22772 46.7 Local-gov 3136 6.4

>40 14330 29.4 State-gov 1979 4.1

Marital Status Categorical Self-emp-inc 1695 3.5

Married 22358 45.9 Federal-gov 1414 2.9

Never married 16096 33.0

Table 4.1: Descriptive Statistics - Adult Census Income Dataset. All categories with size <1% of total were put into a rest-category.


4.2 Bank Marketing

The Bank Marketing (BM) dataset was also obtained from the UCI Machine Learning Repository and contains information related to a direct marketing campaign of a Portuguese banking institution and its attempts to get its clients to subscribe to a term deposit. The target variable is a binary variable that indicates whether the client has subscribed to a term deposit. A term deposit is a deposit held at a financial institution that has a fixed term, ranging anywhere from a month to a few years. Table 4.2 provides information about the 12 input variables. A subset of these is related to the last contact of the current campaign, such as the month in which the last contact was made and the number of days since the client was last contacted in a previous campaign. Other variables consist of general information about each client, such as age, marital status and education level.

Type N % Type N %

All 40769 100.0 Basic 6y 2264 5.6

Target variable Unknown 1596 3.9

Subscription Binary Housing Categorical

0 36179 88.7 Yes 21366 52.4

1 4590 11.3 No 18419 45.2

Input variables Unknown 984 2.4

Age Numeric Loan Categorical

<30 5635 13.8 No 33605 82.4

30 - 40 16827 41.3 Yes 6180 15.2

40 - 50 10403 25.5 Unknown 984 2.4

>50 7045 17.3 Contact Binary

Job Categorical Cellular 25913 63.6

Admin 10407 25.5 Telephone 14856 36.4

Blue-collar 9232 22.6 Month (10x) Categorical

Technician 6731 16.5 Mar - Apr 3159 7.7

Services 3963 9.7 May - Aug 32077 78.7

Management 2921 7.2 Sept - Dec 5533 13.6

Retired 1712 4.2 Campaign Numeric

Entrepreneur 1451 3.6 1 17444 42.8

Self-employed 1413 3.5 1 - 5 19988 49.0

Housemaid 1056 2.6 >5 3337 8.2

Unemployed 1009 2.5 Last contact (days) Numeric

Student 874 2.1 Not prev. contact 39673 97.3

Marital status Categorical <=7 1177 2.9

Married 24679 60.5 >7 338 0.8

Single 11493 28.2 Previously contacted Binary

Divorced 4597 11.3 0 35201 86.3

Education Categorical 1 5568 13.7


Table 4.2 continued from previous page

High school 9464 23.2 Nonexistent 35201 86.3

Basic 9y 6006 14.7 Failure 4220 10.4

Professional course 5225 12.8 Success 1348 3.3

Basic 4y 4118 10.1

Table 4.2: Descriptive Statistics - Bank Marketing Dataset.


Chapter 5

Empirical Analysis

The goal of the empirical analysis is to study the effects of risk-profile selection bias on model performance, and to test whether the TCA method can correct for this type of bias. The empirical analysis consists of multiple MC simulations (ranging from scenarios with a low to a high degree of selection bias), where each simulation consists of the training and validation of both biased and unbiased models. In section 5.4 we also evaluate TCA-corrected models. The effects of risk-profile selection bias will be measured based on the validation criteria discussed in section 3.5.

We start off by introducing the classification algorithms and models, and in section 5.2 we provide a thorough discussion of the final results.

5.1 Models

In the empirical analysis we consider two traditional classifiers: (L2-regularized) Logistic Regression (LR), a relatively simple model, and the Random Forest Classifier (RF), which is based on a more advanced algorithm. This leads to six different types of models, which are shown in table 5.1:

               Logistic Regression   Random Forest
Biased         LR-biased             RF-biased
Unbiased       LR-unbiased           RF-unbiased
TCA-corrected  LR-TCA                RF-TCA

Table 5.1: The Six Different Types of Models that are contained in the Empirical Analysis

These six models are trained and validated in each simulation round within each of the MC simulations. The model performances are finally compared based on the average results.
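To make the simulation set-up concrete, the sketch below shows what a single simulation round could look like under stated assumptions: the risk-profile selection is represented by a hypothetical boolean mask profile_mask, the unbiased model is trained on an equally sized random sample (an assumption made for this sketch), and the hyperparameters are illustrative rather than the ones used in this thesis; the TCA-corrected models are omitted here:

import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def simulation_round(X_train, y_train, X_target, y_target, profile_mask, seed=0):
    """One simulation round: fit biased and unbiased LR/RF models and validate
    both on the same (unbiased) target data. X_* and y_* are numpy arrays;
    profile_mask is a boolean array marking the risk-profile selected samples."""
    rng = np.random.default_rng(seed)
    base_models = {
        "LR": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=200, random_state=seed),
    }
    results = {}
    for name, base in base_models.items():
        # Biased model: trained only on the risk-profile selected samples.
        biased = clone(base).fit(X_train[profile_mask], y_train[profile_mask])
        # Unbiased model: trained on an equally sized random sample (assumption).
        idx = rng.choice(len(X_train), size=int(profile_mask.sum()), replace=False)
        unbiased = clone(base).fit(X_train[idx], y_train[idx])
        results[f"{name}-biased"] = roc_auc_score(y_target, biased.predict_proba(X_target)[:, 1])
        results[f"{name}-unbiased"] = roc_auc_score(y_target, unbiased.predict_proba(X_target)[:, 1])
    return results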


5.2 Results

5.2.1 The Effects On Overall Predictive Performance

Figure 5.1 shows the results of the MC simulations for the ACI dataset, where model performance was validated using the AUROC measure. In figure 5.1a) we see that the predictive performances of all models increase in BTR. This is probably mainly because the number of training samples increases when BTR increases. However, the logistic regression model seems to be more affected by the risk-profile selection bias than the random forest classifier, since the AUROC score of the LR-biased model is much lower than that of the other three models.

Figure 5.1: AUROC Results - Adult Census Income. The AUROC scores are the average results of the six MC simulations, where the number of simulations S = 750. The standard deviations per MC simulation are included in the left graph.

We can also see this in figure 5.1b), where the percentage deviation in performance between the unbiased and biased models is plotted against BTR. This deviation is approximately zero for the random forest classifier, meaning that the risk-profile selection bias causes little change in predictive performance when a random forest classifier is used. However, an apparent deviation can be seen for the logistic regression model. For example, in the first MC simulation, where the degree of selection bias is the highest, the AUROC score of the LR-biased model is 12% lower than that of the LR-unbiased model. This negative deviation between the LR-biased and LR-unbiased model fades out when BTR increases, meaning that the predictive performance of the LR-biased model converges to that of the LR-unbiased model as BTR increases.

The AUROC results for the BM dataset are shown in figure 5.2 in a similar fashion. In 5.2a) we see that the AUROC scores of all four models increase in BTR, and that the worst performance is shown by the LR-biased model. However, this time we also see an apparent difference between the RF-biased and RF-unbiased model. This means that the risk-profile selection bias also affected the predictive performance of the random forest classifier. We can also see this in 5.2b), where the negative deviation between the unbiased and biased model clearly decreases in BTR in the case of the random forest classifier. This negative correlation between the AUROC deviation and BTR is not consistently seen for the logistic regression model, which is in contrast to the results for the ACI dataset.

Figure 5.2: AUROC Results - Bank Marketing. The AUROC scores are the average results of the six MC simulations, where the number of simulations S = 750. The standard deviations per MC simulation are included in the left graph.


5.2.2 The Effects On Fraud Detection Rates Within Top Risk-scores

We will now discuss the results concerning the prediction accuracy of the biased and unbiased models within the top risk scores. In figure 5.3 the average results for the top 5% are shown for the ACI dataset. A fraud rate of 0.9 indicates that an average fraud detection rate of 0.9 is achieved within the top 5% of risk scores. A surprising result is that the biased model performs better (on average) than the unbiased model in the case of the random forest classifier. This means that the average fraud detection rate within the top 5% of risk scores actually improves when risk-profile selection bias is present. In the second MC simulation for example, where BTR equals 0.06, the average fraud detection rate of the RF-biased model is almost 2% higher than that of the RF-unbiased model (see figure 5.3b). Note that the fraud detection rates of the biased and unbiased models are approximately the same under logistic regression.

Figure 5.3: Fraud Rates Within the Top 5% Risk Scores - Adult Census Income. The fraud detection rates are the average results of the four MC simulations. The number of simulations S = 400. The standard deviations per MC simulation are included in the left graph.
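Interpreting the fraud detection rate within the top k% as the fraction of actual positives among the observations with the highest k% of predicted risk scores (our reading of the measure), it can be computed along the following lines; the names y_target and clf are hypothetical:

import numpy as np

def fraud_rate_top_k(y_true, risk_scores, k=0.05):
    """Fraction of actual positives among the top-k fraction of predicted risk scores."""
    y_true = np.asarray(y_true)
    risk_scores = np.asarray(risk_scores)
    n_top = max(1, int(np.ceil(k * len(risk_scores))))
    top_idx = np.argsort(-risk_scores)[:n_top]        # indices of the highest risk scores
    return float(y_true[top_idx].mean())

# Example: fraud_rate_top_k(y_target, clf.predict_proba(X_target)[:, 1], k=0.05)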

In a similar fashion the results for the top 10% and top 20% are shown in figure 5.4. It is interesting to see that the higher the percentage of top risk scores considered, the more the relative performance of the biased model decreases. This is especially the case for the logistic regression model. Consider for example the first MC simulation (BTR = 0.03), where the percentage deviation is -1.4%, -6.8% and -12.0% for the top 5%, 10% and 20% respectively. The random forest classifier, however, seems to be quite robust to the risk-profile selection process, as it still performs better than the unbiased model based on the top 20%.


Figure 5.4: Fraud Rates Within the Top 10% and 20% Risk Scores - Adult Census Income. The fraud detection rates are the average results of the four MC simulations. The number of simulations S = 400. The standard deviations per MC simulation are included in the two graphs at the left.

Finally we discuss the results for the BM dataset. This time we consider the top 2%, 5% and 10% of risk scores, as the BM dataset is quite imbalanced (class ratio BM ≈ 1:7.8, class ratio ACI ≈ 1:3.2). The results are shown in figure 5.5. A surprising result is that, based on the top 2%, both the LR-biased and RF-biased models perform better than their unbiased counterparts. This is directly seen in 5.5b, as there are only positive values. This suggests that under relatively highly imbalanced data, risk-profile selection bias can actually improve the average accuracy within the top risk scores for both logistic regression and the random forest classifier. When considering the top 5% and top 10% of risk scores, we again see the pattern that the relative performance of the biased models decreases as higher top percentages are considered. Thus the average improvement due to risk-profile selection bias is only seen within a sufficiently small top percentage of risk scores.


Figure 5.5: Fraud Rates Within the Top 2%, 5% and 10% Risk Scores - Bank Marketing. The fraud detection rates are the average results of the four MC simulation experiments. The number of simulations S = 400. The standard deviations per MC simulation are included in the three graphs at the left.


5.2.3 The Effects On Cost-benefit Trade-offs

We will now discuss the results of the MC simulations where the trade-off difference was taken as the validation criterion. The missing fraud rate (MFR) is targeted at 10% during the model building phase. Figures 5.6 and 5.8 show the results for the ACI dataset, where the true and targeted percentages of the MFR and the reduced costs are plotted against BTR. In each sub-figure, target-LR and target-RF correspond to the targeted percentages during threshold determination, whereas true-LR and true-RF correspond to the realized percentages resulting from the final model validation on the target domain. LR and RF refer to the logistic regression model and the random forest classifier respectively.
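A sketch of how the threshold determination and the resulting trade-off difference can be reproduced, assuming the MFR is defined as the fraction of actual positives whose risk score falls below the chosen decision threshold (our reading of the measure); all function and variable names are illustrative:

import numpy as np

def threshold_for_target_mfr(train_scores, train_labels, target_mfr=0.10):
    """Choose the score threshold such that target_mfr of the training frauds
    fall below it (i.e. would be missed)."""
    fraud_scores = np.asarray(train_scores)[np.asarray(train_labels) == 1]
    return np.quantile(fraud_scores, target_mfr)

def realized_mfr(scores, labels, threshold):
    """Realized missing fraud rate: fraction of actual frauds scored below the threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    return float((scores[labels == 1] < threshold).mean())

# Trade-off difference: realized_mfr(target_scores, y_target, t) - 0.10,
# with t = threshold_for_target_mfr(train_scores, y_train).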

Figure 5.6: Trade-off Differences, Biased Models - Adult Census Income. The percentages are the average results of the six MC simulations, where the number of simulations S = 750. The standard deviations per MC simulation are also included.

We start off by discussing the results of the biased models, i.e. the case where the logistic regression model and the random forest classifier are trained on data that is subject to risk-profile selection bias. In figure 5.6a) the average MFR percentages of the biased models (LR-biased and RF-biased) are plotted against BTR for each of the six MC simulations. Since the MFR was targeted at 10% during the cost-benefit analysis, the targeted percentages for both LR-biased and RF-biased are obviously on average 10%, regardless of the degree of selection bias. However, a striking result is that the realized percentages true-LR and true-RF are much higher than the expected value of 10%. Considering the first MC simulation for example, we see that the average realized MFR for LR-biased and RF-biased is 28% and 22% respectively. Again we see that the logistic regression model is more affected by selection bias than the random forest classifier, and that the effects decrease in BTR. In figure 5.6b) the corresponding average percentages of reduced costs are plotted against BTR. We see here that the realized reduced costs are also much higher than the targeted reduced costs for both the logistic regression model and the random forest classifier. Considering the first MC simulation for example (highest degree of selection bias), we see that the targeted reduced costs are around 40%, whereas the realized reduced costs are around 63%. Thus, both the realized MFR and the realized reduced costs are much higher than the MFR and reduced costs that were targeted during the cost-benefit analysis, and this trade-off difference slightly decreases in BTR. The trade-off difference caused by risk-profile selection bias is illustrated in figure 5.7.

Figure 5.7: Illustration of the Trade-off Difference, based on the ACI results where BTR equals 0.04.

In a similar fashion we discuss the results of the unbiased models, which are shown in figure 5.8. Note that the BTR values on the horizontal axis correspond to those of the biased models, and that the unbiased models are always trained on unbiased data. We see here that the targeted and realized (average) percentages are approximately the same for both the MFR and the reduced costs. This is the case for both logistic regression and the random forest classifier. This means that the realized trade-off between the MFR and the reduced costs does not differ from the targeted trade-off when the models are trained on unbiased data.


Figure 5.8: Trade-off Differences, Unbiased Models - Adult Census Income. The percentages are the average results of the six MC simulations, where the number of simulations S = 750. The standard deviations per MC simulation are also included.

Similar results are found for the Bank Marketing dataset, which are shown in figure 5.9. The realized percentages of the MFR and reduced costs are again on average much higher than the targeted percentages in the case of the biased models (a & b), whereas no apparent differences are seen for the unbiased models (c & d). However, this time the effects of (risk-profile) selection bias are larger for the random forest classifier than for the logistic regression model, which is in contrast to the results for the ACI dataset.
