

APPLICATION OF MACHINE LEARNING TO

FRAUD DETECTION

Address fraud within municipal data

Master Thesis

Ondrej Molnar

University of Groningen

Faculty of Economics and Business

MSc. Marketing Intelligence

July 2, 2018

Student number: S3239012

Roer 35

9733 AE Groningen

+31616296578

ondrejmolnar4@gmail.com


ABSTRACT

This thesis evaluates whether machine learning techniques can improve fraud detection and, by doing so, improve database quality. The paper addresses fraud within the municipal data of Groningen, a city in the north of the Netherlands. The data used for the analysis contain over 500,000 entries of movement history of individuals who lived in the city over a span of more than seven years. By applying eight different algorithms (Logistic Regression, Decision Tree, Bagged Decision Tree, Boosted Decision Tree, Support Vector Machine, Random Forest, K-Nearest Neighbor, and Neural Network), we conclude that the predictive power of these models on the available data does not help predict fraud. In the future, more effort regarding the data is needed prior to applying these techniques in order to obtain results.


Contents

1. INTRODUCTION
2. LITERATURE REVIEW
2.1. Data Quality
2.1.1. Data Quality in CRM
2.1.2. Data quality and criminology
2.2. Fraud
2.2.1. Definition of fraud
2.2.2. Fraud triangle
2.2.3. Address fraud
2.2.4. Variables that are found to drive fraudulent behavior
2.3. Machine learning
2.4. Conceptual model
3. METHODOLOGY
3.1. Research Design
3.2. Data
3.3. Validation
3.4. Statistical tools
3.5. Outliers
3.6. Missing values
3.7. Variable description
3.8. Models
3.8.1. Decision Tree
3.8.2. Bagged Decision Trees
3.8.3. Boosted Decision Trees
3.8.4. Logistic Regression
3.8.5. Random Forest
3.8.6. KNN
3.8.7. Artificial Neural Network
3.8.8. Support Vector Machine
3.9. Performance measures
3.9.1. Gini Coefficient
3.9.3. Top Decile Lift
3.9.4. Computational time
3.9.5. False Positives/False Negatives
3.10. Software used
4. RESULTS
5. DISCUSSION
6. CONCLUSION AND RECOMMENDATIONS
7. LIMITATIONS


1. INTRODUCTION

The quality of data is of great importance to companies in the era of digitalization. Many companies use systems that work with large amounts of data and apply the outcomes of the data analysis to decision making. Customer Relationship Management (CRM) is one example, where individual data is used by a company to serve every customer as efficiently as possible. However, what happens when the data about the customer is not correct?

Imagine a company that wants to send a newsletter to consumers. If the data are incorrect, the wrong selection of products may be chosen for a given consumer, and the impact of the direct marketing will be nil, if not negative. What if the address in the system is wrong or missing? A leaflet with the latest products for an existing customer will end up in someone else's mailbox, and the chance of growing the intended account is lost.

Inaccuracy in data causes companies to spend their marketing budget inefficiently and to miss out on sales because they rely on imprecise data. On the other hand, if used correctly, the data can reveal useful patterns that lead companies to increase their performance and sales. If the data are correct, they can even predict consumer behavior and trends. A well-known example is the direct marketing of Target, where the data indicated that a young female customer was pregnant; the retailer therefore selected a variety of pregnancy-related items and sent the suggestions in the form of discount coupons. The only problem was that the girl had not yet told her parents about the pregnancy, which led to a rather shocking discovery for them. This illustrates the power of data when the input is correct.

Similarly, apart from companies, other bodies also work with databases, but within different contexts. One area that depends on data quality is fraud detection, where the available data are used to detect or even predict fraud.

Fraudulent behavior has been around for centuries, persisting and evolving over time into the forms we see today. The word "fraud" evokes different things for different people. In general, fraud can be defined on different levels, such as:

- Identity fraud: involves stealing financial or other private information (identity theft), or using entirely invented information, to make purchases or gain access to financial accounts (Hinde, 2005).

- Corporate fraud: according to Rubin (2007), this may include fraudulent financial reporting.

- Personal fraud: defined by Titus (2001) as "the intentional deception or attempt to deception of an individual with the promise of goods, services, or things of value that do not exist or in other ways are misrepresented".

- Click fraud: the practice of deceptively clicking on search ads with the intention of either increasing third-party website revenues or exhausting an advertiser's budget (Wilbur & Zhu, 2009).

There are countless situations where fraudulent behavior can be observed. Its extent is even captured by popular movies such as "Catch Me If You Can" or "The Wolf of Wall Street", which shows how embedded this phenomenon is within society. Another, perhaps less known type of fraud is so-called address fraud, which is also the focus of this report.

Many countries, such as the Netherlands, have social plans, programs, and policies in place that take care of their citizens in the form of social support and benefits when needed. These benefits are granted to people based on their situation, which is evaluated using the official data available to the government. However, these data are not always accurate, be it intentionally or unintentionally. One piece of information that is necessary for evaluating whether a person is entitled to benefits, and to what extent, is the address of residence. Every citizen is obliged by law to register a change of address up to four weeks before or within five days after the change (Gemeente Groningen, n.d.), but not everyone does so, and not everyone reports the correct address.

Why is having a correctly registered address for each individual an important issue for municipalities? The main answer, in terms of this report, is that without correctly registered data, potential for fraud arises, which means that benefits are not distributed where they should be. This form of wrongful registration is called address fraud. It happens when a person is officially registered at an address but lives somewhere else (e.g., a student registers as living alone but in reality still lives with their parents). The data show that 400,000 to 550,000 people are wrongfully registered within the Netherlands (Rijksoverheid, 2017). Considering the country's population of around 17 million, this is a rather large number. In 2016, 9.7 million euros were spent on address checks and visits; however, these efforts returned 11 million euros in reclaimed allowances. The issue therefore receives high priority, as the costs of address fraud are quite large and successful detection saves significant amounts of money.


The municipality is responsible for maintaining the quality of the BRP (Basisregistratie Personen, the Dutch personal records database), in which the data on citizens are stored and from which other governmental bodies acquire information for their operations. Judge and Schechter (2009) discovered that data accuracy and completeness are important aspects of data quality. Therefore, the quality of the BRP database is a main focus for municipalities in general. In this study, the focus will be on the municipality of Groningen.

Multiple studies indicate that organizational databases contain substantial errors (Klein, 2000; Morgenstern, 1963; Musgrove, 1974). This can be due to several reasons, such as inaccuracies in data collection and design, or information hiding and lying (Bagus, 2006).

There are other advantages to having correctly registered information. In case of an emergency or a fire, firefighters know how many people live at an address. Correct information also prevents citizens from being victimized, for example by receiving bills created by somebody else or being stripped of the right to housing benefits. From a social perspective, correct information would prevent people from subletting social housing to others, making sure that people in need of those houses are granted the opportunity. Addressing address fraud is especially important within the municipality of Groningen: since Groningen is a student city, the housing situation is tense, which results in high tenant turnover and repeated address changes. Therefore, there is a high potential for fraud to occur (be it intentional or unintentional) and also to go unnoticed due to the sheer volume of records.

Currently, fraud detection is based on manual work, where cases are evaluated only if a complaint is filed. However, this technique is rather tiresome and time consuming, and it is not possible to analyze all cases. Furthermore, if a complaint is not filed, there is no investigation, which means that many cases potentially go undetected. A growing field in data analytics called machine learning can help with case detection without a complaint arising. In this thesis, advanced machine learning techniques will be applied to the database in order to improve the efficiency of the system and to detect risky cases of address fraud without a complaint having been filed. While this procedure will add workload for the municipality, as more cases need to be investigated, it also improves the quality of the BRP, as more cases can be detected and corrected.


For the purposes mentioned above, there is potential for improving database quality. This report includes a specific case of fraud detection. The municipality is fully aware of the importance of data quality and has approached us to develop machine learning models to tackle the issue. There is already a national model in place, but the municipality of Groningen believes that a local model, tailored to the city's specific characteristics, will provide better results. Therefore, the research question is as follows:

Can Machine Learning be used to improve the quality of databases?

To answer this question, municipal data with more than 500,000 observations are used, and multiple machine learning techniques are explored to compare their performance. The findings indicate that, with the available data, the predictive power of the models does not outperform a random model, and that machine learning in the context of municipal data does not help detect fraud or improve database quality. This has important managerial implications, as it urges managers to invest in better data extraction and data preparation in order to be able to exploit advanced computing techniques.

The report is divided into the following sections. Chapter 2 summarizes the overview of the literature. Chapter 3 describes machine learning techniques in general and introduces the research design and how the models are constructed. Chapter 4 summarizes the results of the models used and the research. These are followed by a discussion in chapter 5. Lastly, chapter 6 includes recommendations and final conclusions, followed by the limitations and propositions for further research.

2. LITERATURE REVIEW

In this section, relevant concepts and theories are described together with the existing literature on the included topics. First, data quality is covered, because machine learning and predictive measures require accurate data in order to perform well. Secondly, the literature on fraud and its different types is summarized, followed by an extraction of variables that might have an effect on predicting fraud. Last but not least, machine learning efforts within the area of fraud detection are summarized.

2.1. Data Quality


Quality of data is one of the main issues in database management. But what is quality? There are many different definitions in the literature. Probably the most useful is provided by Loshin (2001), who describes quality as "fitness for use". Courtheoux (2003) derives a number of general points on managing marketing databases: (1) the costs linked to an erroneous selection of names for the targeting of marketing efforts tend to exceed the costs of quality management programs, and (2) at each step of the process, additional expense builds up, causing the damages from a defect to grow. Furthermore, the economic costs of defects in a marketing database tend to escalate the longer the defects remain undetected: from the moment incorrect data enter the system, all subsequent reports, analyses, models, and selections will be inaccurate. Courtheoux (2003) also mentions that customer name and address processing is a core database building block, because marketing databases depend mostly on this information in order to link together information from different sources and times.

Within the data field, Strong et al. (1997) identify three roles within data manufacturing systems: (1) data producers (the individuals, organizations, or other sources who generate data), (2) data custodians (those who manage computing resources for storing and processing the gathered data), and (3) data consumers (those who use the data). They also define high-quality data as "data that is fit for use by data consumer", which is a generally adopted definition. Strong et al. (1997) further define four categories of data quality: (1) intrinsic data quality – objectivity, accuracy, believability, and reputation; (2) accessibility data quality – accessibility and access security; (3) contextual data quality – relevancy, value-added, timeliness, completeness, and amount of data; and (4) representational data quality – interpretability, ease of understanding, and concise and consistent representation. If the accuracy of a source is low, its believability and credibility will decrease, leading data consumers to prefer another source of data. This can result in a loss of influence as a data source. Completeness refers to missing values in the data, which can result in biased analyses or the inability to conduct an analysis. Timeliness refers to access to the data in a timely manner, when required. If timely data are presented, decisions based on the data can be made without delay, possibly leading to improved performance for the data consumers.


2.1.1. Data Quality in CRM

Data quality is a well-known issue to marketeers, especially those who work with CRM (Customer Relationship Management) systems. For CRM to be successful, customer data need to be accurate and complete. This is crucial when minimizing the operational expenses caused by redundant data, or when aiming to improve revenue through better customer targeting and retention (Ramesan, 2004; Peikin, 2003). High-quality data enables CRM strategies to be efficient and effective; however, many organizations do not invest in improving data quality (Abbott et al., 2001). Abbott (2000) also found, in her study on customer data held within databases in the UK, that participants agreed that more data helps marketeers increase the efficiency of their jobs, but that the quality of the data needs to be high. If the quality is insufficient, campaigns are hindered and eventually lead to mistrust – linking to the study of Strong et al. (1997) – driving the data user to shift to another data source. Reid and Catterall (2005) mention a business-specific statement by one of the business consultants in the telecom industry they worked with: "Somewhere between 30 and 40 per cent of all customer addresses are non-unique and only made unique by the person who lives there. So, if a person moves out and you have two addresses that are alike then you cannot trace the customer." Having incorrect or mismatching data makes it impossible to identify a consumer, which in turn leads to an inability to customize communications and offers.

In the article by Foss et al. (2002), the main trends in the use and management of customer data are identified. Among others, the identified reasons include (1) risk and resilience, and (2) the spread of customer data across the organization. Risk and resilience relate to a series of events, including commercial risks and corporate fraud, which cause companies to focus more on poor-quality data management and exposure to manipulation. The spread of customer data across the organization refers to the fact that the use of data shifts from within-departmental to cross-organizational use. Correctness of the data is crucial, as more involved parties rely on its quality and completeness.

2.1.2. Data quality and criminology


In the research of Wulff (2017) on crime analysis, the author argues that the success of an analysis depends upon the quality of the data or gathered information and the capability to interpret its meaning. Canter (2000) states that, in order to design a data collection process that is as useful as possible, the process should fulfill the following: (1) timeliness, (2) relevancy, and (3) reliability. Relevancy focuses on whether the information is an accurate representation of the facts needed for the analysis, pointing to the necessity of high quality and accuracy of data.

2.2. Fraud

2.2.1. Definition of fraud

Any individual can be susceptible to committing fraud. Albrecht et al. (2011) mention that "individuals involved in fraud are typically people just like you and me but have compromised their integrity and become entangled in fraud" (p. 33) and also state that "most fraud perpetrators have profiles that look like those of other honest people" (p. 53).

The term "fraud" applies in many contexts and has many definitions, depending on the context. The Oxford Dictionary (n.d.) provides a general description and defines fraud as "wrongful or criminal deception intended to result in financial or personal gain". Another definition is provided by the Fraud and Corruption Control standard AS 8001 (2008), where it is defined as "any dishonest activity causing actual or potential financial loss to any person or entity including theft of monies or other property by employees or persons external to the entity and where deception is used at the time, immediately before or immediately following the activity".

The focus of this thesis is on address fraud. Recent literature has not focused on address fraud, and there is no established definition that can be referenced here. However, a relatively similar type of fraud is benefit fraud, or welfare fraud. According to McKinney and Johnston (1986), fraud involves "dishonesty, illegality and the intentional wrongful obtaining of either money or benefits from governmental programs". This type of fraud is generally known as "welfare fraud", a term connected to individuals who illegally claim benefits they are not entitled to (Chunn & Gavigan, 2004). For the purpose of this thesis, address fraud can be defined as an individual providing incorrect information about their address of stay to governmental organizations and the BRP, whereby the mismatch causes the BRP database to be inaccurate, worsening the overall quality of the database.


Address fraud can also occur unintentionally, for example in the case of a student who is not aware of the obligation to register the new address. According to Golden et al. (2006), four elements are necessary for a fraudulent act: a material false statement (meaning that incorrect information has been provided by the source), knowledge that the statement was false when it was uttered, reliance on the false statement by the victim, and damage as a result (as cited in Mangala & Kumari, 2015, p. 52). To give an example: a person knowingly provides an incorrect address statement to the municipality; the municipality relies on the statement and registers it in the system; due to the incorrect address, the fraudster gains access to financial support, thereby causing unlawful financial damage to the municipality. Under the above definition, however, unintentional cases would not be regarded as fraud, as the knowledge-of-the-false-statement condition would not be met.

In the Netherlands, however, it is an obligation of every citizen under Dutch law to keep the database up to date by registering at the municipality. The Latin maxim "Ignorantia juris non excusat" ("ignorance of the law is not an excuse") can be applied here: even if a person breaks the law unknowingly, the records in the database are invalid. Whether intentional or not, an undetected wrongful address record can potentially lead to benefits being paid to individuals who are not entitled to them. Since providing an up-to-date BRP is one of the core responsibilities of Gemeente Groningen and also the focus of this thesis, both types are defined here as fraudulent cases.

2.2.2. Fraud triangle


Figure 1. The Fraud Diamond

2.2.3. Address fraud

The definition of address fraud has already been given in section 2.2.1. The phenomenon is not yet represented by much scientific literature in this specific context; however, fraud and unlawful behavior have been widely researched in many other contexts. This research will therefore be exploratory in nature, selecting variables mostly based on related research within the fraud context and other related fields, expert opinions, and availability. The summary of the existing research related to address fraud is addressed below and in the following sub-sections.

The article by Roach (2010) looks at how some people cognitively generate a wrong address when stopped by police. Roach (2010) presumes that "those who give false addresses to police lie to avoid some sanction or punishment, although to benefit another could also be a reason". While the article focuses on giving a false address when in contact with police in order to avoid punishment, the author also mentions that this behavior can occur in other situations where one may benefit from the wrong details, such as receiving extra welfare benefits.

2.2.4. Variables that are found to drive fraudulent behavior


2.2.4.1. Motivation

In the Strain theory of Merton (1957), crime happens when there is a gap between the cultural goals of society (e.g., wealth) and the means to achieve these (e.g., education or employment). This mismatch can cause individuals to use illegal means to achieve success. Therefore, variables such as employment status and achieved education can be of importance in the address fraud model.

Considering the motivation factors, it is practical to discuss possible predictors with industry experts, who have long-term experience within the area and know which variables are effective in practice. Within the address fraud context, several experts within Gemeente Groningen (Tamara Klamer, Roxane Bansema) have been interviewed to gather input and insights into which variables can be used to detect flawed data within the address database. Several things were mentioned repeatedly: (1) whether the person is a student – students try to get extra resources for their time at university by forging their address records, or they simply forget to register or deregister; (2) people with kids – it is financially challenging to secure all that is needed, so data may be forged in order to gain access to additional financial resources from the government; (3) people who already receive benefits – it is potentially the case that people who receive benefits are the ones who knowingly forged address data in order to gain those benefits, making this a strong predictor; (4) life events such as the birth of a child, divorce, marriage, a death in the family, loss of job, unemployment, or retirement – these variables prove to be good predictors as well. With the birth of a child, new expenses occur, leading young families to require additional resources. When divorced, family members may end up on the losing side of the income if their former spouse was the one financing the family; not having enough money to support the former lifestyle, divorced people may turn to obtaining resources in other, potentially illegal, ways. The same applies to a death in the family, which also adds anger and/or a breaking of ties, as described in the social disorganization theory of Shaw and McKay (1942), leading to an increased chance of committing fraud. Hence, we hypothesize:

H1: Life events contribute to detection of the address fraud.

Based on the subcultural theory of Cohen (1955), social class can be a strong predictor of criminal activities. He argues that the lower social class cannot aspire to middle-class cultural goals; therefore, people in this class reject them and create their own system in which meeting peers' expectations is more important. These expectations include criminal activities. Therefore, class can be considered predictive in fraud prediction research. Furthermore, it is argued that the lower class is also less active administratively and therefore might not truthfully – whether intentionally or unintentionally – provide facts within the system.

H2: Social class factors contribute to detection of the address fraud.


2.2.4.2. Opportunity

The checks of official documents required for a change of address were shifted from pre-control to post-control: documents are only required if suspicion is raised as to whether the address was correctly provided. A larger opportunity is therefore created to input an incorrect address, since no documents are required for the change in advance and may never be required. Since the data are collected throughout the period of years in which the change from pre- to post-control occurred, a dummy variable can account for the change.
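As a minimal sketch of deriving such a dummy variable (assuming pandas is available; the column name, the toy registration dates, and the policy-change date are illustrative placeholders, not the actual BRP values):

```python
import pandas as pd

# Hypothetical registration records; nothing here reflects the real data.
df = pd.DataFrame({
    "registration_date": pd.to_datetime(
        ["2012-03-01", "2014-07-15", "2016-01-10", "2017-11-02"]
    ),
})

# Assumed cut-over date from pre-control to post-control checks (placeholder).
POLICY_CHANGE = pd.Timestamp("2015-01-01")

# 1 = registration handled under post-control, 0 = under pre-control.
df["post_control"] = (df["registration_date"] >= POLICY_CHANGE).astype(int)

print(df["post_control"].tolist())  # [0, 0, 1, 1]
```

The resulting 0/1 column can then enter the models alongside the other predictors.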

This links to the social control theory of Hirschi (1969), which states that people conform to social norms due to social bonds; if these bonds are weak or broken, people are more likely to engage in criminal activities. The components of the theory are: (1) attachment – whether others expect someone to behave in a certain way; (2) commitment – if somebody commits to a lifestyle (e.g., marriage, parenting, children), they will be less likely to break the norms; (3) involvement; and (4) belief. Based on this theory, the social environment has an impact. In terms of address fraud, the following variables could therefore have an impact: family size, marital status, number of children, number of people living in the house, number of family members who committed fraud in the past, and so on.

H3: Social surroundings factors contribute to detection of the address fraud.

2.2.4.3. Rationalization

In the social disorganization theory of Shaw and McKay (1942), the focus is on neighborhood dynamics rather than on individual characteristics. They found that delinquency levels were higher in poorer neighborhoods, areas with poor health, or areas with socio-economic disadvantage. Therefore, geographical areas and spatial variables can have an impact in the address fraud model.

H4: Geographical and location related factors contribute to detection of the address fraud.

2.2.4.4. Capability


2.2.4.5. Other variables

Within credit card transaction data, variables are often kept confidential, but they typically consist of date/time stamps, the current transaction (amount, geographical location, merchant industry code, and validity code), transactional history, payment history, and other account information such as the age of the account (Chan et al., 1999; Ghosh & Reilly, 1994). From this we derive that credit history is an important predictor of credit card fraud. Adapted to address fraud, the history of past behavior, such as movements, can have an impact on the current situation and the probability of fraud. This can include the number of moves in the past, geographical changes (locations of movements), lengths of stay at an address, etc. Within telecommunications, the data can include individual call information such as call duration, type of call, and geographical origin or destination (Cortes et al., 2003; Fawcett & Provost, 1997). Based on this, in the case of address fraud, the origin and destination of moves can be useful predictors.

Telecom fraud detection also uses past data such as account summary information, containing, for example, the average monthly bill, the average time between calls, or daily and weekly summaries of individual calls (Cahill et al., 2002; Taniguchi et al., 1998). We therefore expect that, when these past-behavior variables are applied to the context of address fraud detection, the average time between moves, the number of moves, or the average length of stay at an address might be important predictors.
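As an illustration (a sketch assuming pandas; the identifiers, column names, and toy records are invented for the example, not taken from the municipal data), such past-behavior features could be derived from a movement history as follows:

```python
import pandas as pd

# Toy movement history for two hypothetical persons.
moves = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "move_date": pd.to_datetime(
        ["2012-01-01", "2013-01-01", "2015-01-01", "2014-06-01", "2014-12-01"]
    ),
})

per_person = moves.sort_values("move_date").groupby("person_id")["move_date"]

features = pd.DataFrame({
    # total number of registered moves per person
    "n_moves": per_person.count(),
    # average number of days between consecutive moves
    "avg_days_between_moves": per_person.apply(
        lambda d: d.diff().dt.days.mean()
    ),
})

print(features)
```

Here person 1 has three moves with an average of 548 days between them, and person 2 has two moves 183 days apart; the same aggregation would yield one feature row per citizen in the full data.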

H5: Variables focusing on past behavior contribute to detection of the address fraud.

Rosset et al. (1999) include de-identified demographic data such as age and ethnicity for telecommunications customers, indicating that demographics such as ethnicity and age have some predictive power.

In medical insurance, Williams (1999) mentions that the data used for fraud detection contain – amongst others – records on gender. Also, criminologist Messerschmidt (1993) argues that some men, in certain groups, express their masculinity by committing criminal acts. This suggests that being male can be predictive of criminal activity within this context and that differences between genders exist. Therefore, if these theories are applied to address fraud, gender can be expected to have an effect on predicting address fraud.

H6: Variables regarding demographics contribute to detection of the address fraud.

2.3. Machine learning

[…] as well as insurance fraud (Guha et al., 2015). The same principles can be applied in the context of this thesis: address fraud. Before continuing with this section, we briefly explain and classify the different machine learning techniques and how they can be applied.

Machine learning is an application of artificial intelligence (AI) that uses statistical techniques to provide computer systems with the ability to progressively improve performance on a specific task with data, without being explicitly programmed. The aim is to allow computers to learn automatically, without human intervention or assistance. There are two main types of machine learning methods:

Supervised learning. These problems are often referred to by data scientists as "classification" and "regression" problems. Performing this task requires labeled training data, meaning that the dependent variable to be predicted is available in the training dataset. Examples include churn/no-churn cases in telecommunication data, or fraud/no-fraud cases in insurance data. Often the input consists of a vector of independent variables (predictors) and the output (dependent variable) is binary – 0/1. Based on this information, the model trains itself on the available data using the available dependent variable, hence the name "supervised learning". Techniques include logistic regression, decision trees, support vector machines, neural networks, naïve Bayes classifiers, etc.

Unsupervised learning. These problems cover cases where the data does not include the desired output, hence the name “unsupervised learning”. The purpose here is to discover structure, for example by creating clusters (e.g., customer segmentation, crime type segmentation). With unsupervised learning, we can explore hidden structure in unlabeled data. Since the data is unlabeled, we cannot directly infer the accuracy of the output. Popular techniques in this area include clustering and dimensionality reduction (e.g., factor analysis).

Risselada et al. (2010) apply machine learning to insurance and telecom datasets to investigate which models perform better in terms of churn prediction, how long a model remains valid, and how often it needs to be re-estimated. They apply four types of models: a classification tree, logistic regression, and bagged versions of both. They find that, overall, classification trees combined with a bagging procedure provide the best predictive performance. Furthermore, the bagging procedure improved the performance of trees but not of logistic regression.

In Guha et al. (2015), machine learning is applied to four different datasets within the vehicle insurance context. They apply logistic regression, a Modified Multivariate Gaussian (MVG), boosting (Modified Randomized Under-sampling (MRU) and Adjusted Minority Oversampling (AMO)), and bagging using an Adjusted Random Forest (ARF). The findings indicate that the Adjusted Random Forest and MRU perform best, while logistic regression underperforms. This is in line with the findings of Risselada et al. (2010). However, there are also studies with contradicting findings. Perols (2011) compares the performance of six statistical and machine learning models for financial statement fraud detection under different assumptions of misclassification costs and ratios of fraud firms to non-fraud firms. The results indicate, contrary to the previously mentioned articles, that logistic regression and support vector machines perform relatively well compared to an artificial neural network, C4.5, bagging, and stacking. Such differences in outcomes can occur for multiple reasons, such as differences in data, context, and specifications.

H7: Machine Learning techniques contribute to data quality improvements within the database.

2.4. Conceptual model

Figure 2 summarizes the hypotheses based on the literature review in the conceptual model.

Figure 2.


3. METHODOLOGY

3.1. Research Design

The thesis aims to improve database quality. More specifically, it aims to discover fraudulent cases in the BRP database by applying machine learning techniques. In order to test these techniques, a list of variables that can be used to predict fraudulent behaviour needs to be compiled. Since most fraud-related studies do not reveal which variables were used or how they were used for predictive purposes, this research is exploratory in nature: variables are selected based on the literature but also on availability.

3.2. Data

The original data consist of 532.811 mutations (data points) from the database of the Municipality of Groningen, covering the period January 2010 – May 2018. A mutation is defined as a change of the address of a person registered in the system. The database includes details on persons and their moving history as well as some personal and address-related variables which serve as input for the machine learning algorithms. In total, the initial dataset includes 41 variables. Since this thesis focuses on supervised learning, the data used for the analysis also need to include a variable stating whether a mutation is fraudulent or not. Because not all mutations within the database are checked for fraud, only a subset of the overall data is used: in total, 20.003 mutations include the fraud check and hence the dependent variable, and these constitute the dataset that was used. An additional 10 variables have been engineered from the existing ones. The final dataset contains 19 usable variables that serve as input for the machine learning algorithms. Note that since the provider of the data wishes the information to remain confidential, the variables and their levels will not be named.

3.3. Validation

In order to validate the models, we split the final dataset into a training set and an evaluation set. One data issue relevant to machine learning applications is sample balance. In fraud datasets it is often the case that only a small portion of the data is fraud positive, meaning that the ratio of fraudulent to non-fraudulent cases is rather low. In this case, however, among the checked mutations the majority of cases, 13.397 (66.9%), is fraud positive. This matters because certain machine learning techniques handle imbalanced samples well while others do not. As the dataset contains enough cases of both levels of the DV, a rebalanced dataset does not need to be constructed. For validation purposes, the data have been split into a training set (80%) and a test set (20%). The data are summarized in Table 1 below.

Table 1.

Structure of the datasets


                        Total    Final dataset   Training set (80%)   Test set (20%)

Fraudulent cases        13.397   13.397          10.710               2.687
Non-fraudulent cases     6.606    6.606           5.292               1.314
Labeled cases (DV)      20.003   20.003          16.002               4.001
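The 80/20 split can be sketched as follows. This is a minimal Python illustration (the analysis in this thesis was carried out in R), and `stratified_split` is a hypothetical helper that preserves the observed fraud ratio in both sets:

```python
import random

def stratified_split(cases, labels, train_frac=0.8, seed=42):
    """Split cases into train/test sets, preserving the class ratio."""
    rng = random.Random(seed)
    by_label = {}
    for case, label in zip(cases, labels):
        by_label.setdefault(label, []).append(case)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train += [(c, label) for c in group[:cut]]
        test += [(c, label) for c in group[cut:]]
    return train, test

# Toy data mimicking the 66.9% fraud-positive ratio of the checked mutations
cases = list(range(100))
labels = [1] * 67 + [0] * 33
train, test = stratified_split(cases, labels)
print(len(train), len(test))  # roughly an 80/20 split
```

Stratifying per label ensures that the fraud ratio in the test set matches the training set, so performance measures such as the hit rate are comparable across sets.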

3.4. Statistical tools

Based on the outcomes of previous research, this thesis considers the following machine learning techniques: Logistic Regression, Classification Trees, Bagging, Boosting, Random Forest, Support Vector Machine, Artificial Neural Network, and KNN. Each technique is applied to the dataset and its performance is compared and evaluated based on pre-selected criteria: the GINI coefficient, hit rate, TDL, computational time (relevant in a business and organizational environment), and the rates of false positives and false negatives.

3.5. Outliers

Usually, in order to proceed with analysis of the data, outliers need to be detected. These are values that stand out from the rest of the data. As an example, Figure 3 shows two continuous variables and their outliers. However, since outliers are often likely to signal data inconsistency and therefore fraud, it is important to keep them in the dataset. For this reason, the outliers are not excluded from the analysis. Furthermore, more flexible machine learning techniques such as Decision Trees handle outliers rather well.

Figure 3.

Outliers within the data

3.6. Missing values

Given the small number of observations where this occurs relative to the size of the dataset, the method is sufficient and less time-consuming than other imputation methods.

3.7. Variable description

Based on the literature and previous research, variables have been extracted from the available database. As the dependent variable, a binary variable “Fraud” is defined, with 1 representing a fraudulent case and 0 a non-fraudulent case. This is necessary in order to perform the supervised learning techniques mentioned in section 2.3. Furthermore, several independent variables have been selected to explore which of them are potential predictors of fraud occurrence and hence data inconsistency. Table 2 shows the nature of these variables. Most of them have been derived from the literature review; however, given the exploratory nature of this study, several additional variables were included based on availability. In total, the study uses 19 variables.

Table 2.

List of selected Independent Variables used for the analysis

Variable group          Variable   Variable Name   Description   Variable Type

Social Factors          -          -               -             -
Economic Factors        -          -               -             -
Demographic Factors     Var_A      Var_01          -             Integer
                        Var_B      Var_02          -             Integer
                        Var_C      Var_03          -             Binary
Geographic Factors      Var_D      Var_04          -             Binary
                        Var_E      Var_05          -             Binary
Life Events             -          -               -             -
Past Behavior Factors   Var_F      Var_06          -             Date Format
                        Var_G      Var_07          -             Factor
                        Var_H      Var_08          -             Integer
                        Var_I      Var_09          -             Integer
                        Var_J      Var_10          -             Integer
                        Var_K      Var_11          -             Factor
                        Var_L      Var_12          -             Factor
                        Var_M      Var_13          -             Factor
                        Var_N      Var_14          -             Factor
                        Var_O      Var_15          -             Factor
                        Var_P      Var_16          -             Factor
Other factors           Var_Q      Var_17          -             Factor
                        Var_R      Var_18          -             Binary


3.8. Models

In the following sections, a selection of models is introduced and briefly explained.

3.8.1. Decision Tree

Decision Trees are popular in marketing for multiple reasons. First, the outcome is intuitive to understand, and various software packages can visualize it well. The flow-chart-like graph classifies every observation into a bucket with a certain probability of being classified as 0 or 1, based on the characteristics specified by the observation’s IVs. Since the issue at hand requires classification, the CART method (Classification and Regression Trees) is the most suitable.

The tree model is based on dividing the observations into groups that are as similar as possible within groups and as different as possible between groups; this property is called purity, as described in the book Database Marketing (Blattberg et al., 2008).

Tree development proceeds as follows. A root node splits into child nodes so as to create sub-groups that are as pure as possible, based on a splitting criterion such as the Gini index. These child nodes split further, creating additional sub-groups, and so on. The process continues until a specified stopping criterion is met (usually a minimum sub-group size or a certain probability). The quality of a split is measured by the improvement in purity of the resulting groups (Blattberg et al., 2008).
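The Gini index used for splitting can be made concrete with a small sketch (shown in Python for illustration; the thesis itself uses R’s rpart). A node’s impurity is one minus the sum of squared class proportions, and a candidate split is scored by the size-weighted impurity of its children:

```python
def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(left, right):
    """Weighted Gini impurity of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# A pure node has impurity 0; a 50/50 node has impurity 0.5
print(gini_impurity([1, 1, 1, 1]))     # 0.0
print(gini_impurity([1, 1, 0, 0]))     # 0.5
print(split_impurity([1, 1], [0, 0]))  # 0.0 (a perfectly pure split)
```

The algorithm evaluates every candidate split and keeps the one with the lowest weighted child impurity, i.e. the largest improvement in purity.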

Trees are also popular due to the robustness of the method and its ability to work well with large datasets. However, as the “no free lunch” theorem states, no single model performs best in every situation; other models are therefore required to compare performance. Another issue is that trees that grow too large will overfit: they perform perfectly on the training dataset, but once applied to new data, performance drops rapidly (Kubler et al., 2016). It is therefore important to find a balanced tree size so that the model generalizes.

The problem of overfitting can be addressed in several ways. Breiman et al. (1984) propose so-called pruning, which removes the branches that add the least to the model, making it simpler. Another approach involves so-called ensemble methods, which produce multiple models and draw conclusions from the aggregate result. An ensemble of models often delivers better results than individual models (Song et al., 2014; Risselada et al., 2010).

3.8.2. Bagged Decision Trees

an average or majority vote (Breiman, 1996). While individual trees are highly sensitive to noise in the data, such as outliers, which can cause overfitting, the aggregate of multiple trees is not, as long as the trees are uncorrelated. This process increases performance and generalizability, as the model is trained on multiple datasets that differ from each other. If correlation between trees does occur, a Random Forest can be used instead.
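The two building blocks of bagging, bootstrap resampling and majority voting, can be sketched as follows (a Python illustration; the thesis uses the adabag package in R, and the stub “trees” below are placeholders for fitted models):

```python
import random

def bootstrap_sample(data, rng):
    """Draw a sample of the same size as the data, with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_predict(models, x):
    """Aggregate the individual model predictions by majority vote."""
    votes = [model(x) for model in models]
    return max(set(votes), key=votes.count)

# Each bootstrap sample has the original size but repeats some observations
rng = random.Random(0)
print(len(bootstrap_sample([1, 2, 3, 4, 5], rng)))  # 5

# Three stub "trees" that disagree; the ensemble follows the majority
trees = [lambda x: 1, lambda x: 1, lambda x: 0]
print(bagged_predict(trees, x=None))  # 1
```

In full bagging, each model would be a tree trained on its own bootstrap sample; averaging their votes smooths out the noise sensitivity of the individual trees.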

3.8.3. Boosted Decision Trees

Boosting differs from bagging in the following ways. While in bagging the trees are built individually and averaged, in boosting the trees are built sequentially and are not averaged (Kubler et al., 2016). Instead, each tree assigns different importance to the errors made by the prior tree and uses weights in the subsequent estimation. Unlike in bagging, the dataset used is always the same; only the observation weights are updated in each iteration, based on the misclassifications of the previous tree. Misclassified observations receive higher weights so that the next tree focuses on classifying them correctly. The process repeats until a final estimation is reached. Benefits of this model include the ability to work with noisy data and the fact that no transformation of variables is required.
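The reweighting step can be sketched as an AdaBoost-style update (a Python illustration, not the thesis’s adabag R code): misclassified observations are multiplied up, correctly classified ones down, and the weights are renormalized.

```python
import math

def update_weights(weights, correct, error_rate):
    """AdaBoost-style update: upweight misclassified observations."""
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

# Four equally weighted observations, one misclassified:
# after the update, its weight grows relative to the others
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
updated = update_weights(weights, correct, error_rate=0.25)
print(updated[3] > updated[0])  # True
```

The next tree is then trained against these updated weights, so it concentrates on the cases the previous tree got wrong.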

3.8.4. Logistic Regression

Many research problems can be framed as dichotomous questions: whether a customer buys or not, whether a customer churns or not, whether a person commits fraud or not, whether the data contains an inconsistency or not. Such issues can be addressed with binary logistic regression, which explores the relation between a binary dependent variable and continuous or categorical independent variables. It resembles a linear regression model, but the regression is performed on the logit of the dependent variable rather than on the dependent variable itself. A further condition for using binary logistic regression is that the dependent variable is a dummy variable.

Logistic Regression also allows the interpretation of the odds of an event happening: the probability that the event will occur divided by the probability that it will not. This provides the researcher with the importance of each of the IV’s in predicting the event of interest.
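The logit link and the odds interpretation can be made concrete with a small sketch (Python, for illustration only; the intercept and coefficients below are made up, not estimates from the thesis data):

```python
import math

def predicted_probability(intercept, coefs, x):
    """Logistic regression: probability via the inverse logit (sigmoid)."""
    logit = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-logit))

def odds(p):
    """Odds of the event: P(event) / P(no event)."""
    return p / (1.0 - p)

p = predicted_probability(intercept=-1.0, coefs=[0.5, 2.0], x=[1.0, 0.25])
print(round(p, 3))        # 0.5
print(round(odds(p), 3))  # 1.0
```

A one-unit increase in an IV multiplies the odds by exp(coefficient), which is why the exponentiated coefficients are read as variable importance.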

3.8.5. Random Forest


3.8.6. KNN

KNN is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point; the outcome is a class membership, where the class is a discrete value. An object is classified by a majority vote of its neighbors and the classes to which they belong, with the object being assigned to the class most common among its k nearest neighbors.
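The majority-vote mechanism can be sketched in a few lines (Python, illustrative; the thesis uses knn3() from the caret package in R):

```python
def knn_predict(train, query, k=3):
    """Classify query by the majority label among its k nearest neighbours."""
    # train is a list of (feature_vector, label) pairs
    def dist(a, b):
        # squared Euclidean distance is enough for ranking neighbours
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)

# Two well-separated clusters; queries fall clearly into one or the other
train = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0),
         ([5, 5], 1), ([5, 6], 1), ([6, 5], 1)]
print(knn_predict(train, [0.5, 0.5]))  # 0
print(knn_predict(train, [5.5, 5.5]))  # 1
```

Note the “lazy” aspect: there is no training step at all; the whole dataset is consulted at prediction time, which is why KNN trains fast but can predict slowly on large databases.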

3.8.7. Artificial Neural Network

A neural network is a computational approximation of the human brain (Ngai et al., 2011), consisting of a number of interconnected elements that transform inputs into outputs. A neural network consists of three types of layers. (1) The input layer contains the front-line neurons that process the input information; these are often specified by the researcher in the form of the independent variables. (2) The hidden layer consists of neurons that process the inputs from the previous layer and pass their output on to the following layer. (3) The output layer uses the incoming output of the previous layer as input to determine the final outcome. The layers also differ in observability: while the input and output layers are directly observable, the hidden layers are not. The more hidden layers a network has, the deeper it is. Usually, the deeper the network, the better the outcome; however, more hidden layers result in longer training times, which is not always feasible. The network operates by modelling the input variables and assigning a weight to each connection into the next layer. The weighted sum is processed by an activation function, which results in a binary outcome.

To train the network, the algorithm chooses values for the weights and the threshold b so as to classify the outcome variable correctly. Different threshold functions can be applied to the model; refer to Leeflang et al. (2015).
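The weighted-sum-plus-activation mechanics described above can be sketched as follows (Python, illustrative; the weights are arbitrary, not trained values, and the thesis fits its network via nnet/caret in R):

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden, output):
    """One hidden layer: each (weights, bias) pair defines a neuron."""
    h = [neuron(x, w, b) for w, b in hidden]
    return neuron(h, *output)

# Two input variables, a hidden layer of two neurons, one output neuron
hidden = [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0)]
output = ([2.0, 2.0], -2.0)
p = forward([0.5, 0.25], hidden, output)
print(0.0 < p < 1.0)  # True: the sigmoid output is a probability-like score
```

Training then consists of adjusting the weights and biases (typically by backpropagation) so that the output score classifies the training cases correctly.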

3.8.8. Support Vector Machine


FINAL MODEL SELECTION

In order to arrive at the optimal model, all variables are first run through an algorithm that can account for non-linearities; for this purpose, a decision tree is used to visualize and understand the data better. Following the tree, a logistic regression is run to examine the significance of the individual variables and to reduce the model to a reasonable size, using the AIC (Akaike Information Criterion). The AIC compares the quality of a set of statistical models relative to each other. Initially, the whole set of variables is entered into the logistic regression. Following a stepwise deletion process, the least significant variable is removed and the model fit is re-evaluated; this is repeated until the AIC no longer decreases. The remaining variables are used as input for the remaining algorithms. For more information on the AIC criterion, refer to Leeflang et al. (2015).
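One backward-deletion step of this procedure can be sketched as follows (Python, illustrative; the stub `fit` function is hypothetical and stands in for re-estimating the logistic regression, returning a log-likelihood and a parameter count):

```python
import math

def aic(log_likelihood, n_params):
    """AIC = 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

def backward_step(current_vars, fit):
    """Drop the variable whose removal lowers the AIC the most, if any."""
    best_vars, best_aic = current_vars, aic(*fit(current_vars))
    for v in current_vars:
        candidate = [x for x in current_vars if x != v]
        cand_aic = aic(*fit(candidate))
        if cand_aic < best_aic:
            best_vars, best_aic = candidate, cand_aic
    return best_vars, best_aic

# Stub fit: the likelihood only improves when "a" and "b" are included,
# so the uninformative variable "c" merely costs a parameter
def fit(variables):
    ll = -100.0 + (5.0 if "a" in variables else 0.0) + (5.0 if "b" in variables else 0.0)
    return ll, len(variables) + 1  # +1 for the intercept

vars_, score = backward_step(["a", "b", "c"], fit)
print(vars_)  # ['a', 'b']: dropping the uninformative variable lowers the AIC
```

Repeating `backward_step` until no deletion lowers the AIC yields the final reduced model.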

3.9. Performance measures

In order to evaluate which of the proposed models delivers the best results, several criteria are considered: the GINI coefficient, hit rate, Top Decile Lift (TDL), computational time (due to its relevance in a business and organizational environment), and the rates of false positives and false negatives.

3.9.1. Gini Coefficient

This measure focuses on model performance across the whole sample. The Gini coefficient is based on the cumulative lift curve, which shows the cumulative percentage of captured cases when observations are ordered by their predicted probability of the event (in our case, fraud) occurring. The curve is compared to a base model in which each data point has the same probability of the event happening. The more the lift curve deviates from the base model, the better the model separates fraudsters from non-fraudsters. The coefficient ranges up to 1, with 1 representing the best possible model, where all fraudsters are identified at the beginning. For more information on the Gini coefficient, refer to Leeflang et al. (2015).
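The coefficient can be computed as the area between the cumulative gains curve and the diagonal baseline, normalised by the same area for a perfect ranking (a Python sketch, illustrative only; a worse-than-random ranking yields a negative value):

```python
def gini_coefficient(scores, labels):
    """Area between the cumulative gains curve and the diagonal baseline,
    normalised so a perfect ranking scores 1 and a random one about 0."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    n = len(labels)
    cum, model_area = 0, 0.0
    for i in order:                       # walk down the ranking
        cum += labels[i]
        model_area += cum / total_pos     # cumulative share of 1's captured
    base_area = sum((k + 1) / n for k in range(n))   # diagonal baseline
    perfect_area = sum(min(k + 1, total_pos) / total_pos for k in range(n))
    return (model_area - base_area) / (perfect_area - base_area)

# Perfect ranking: all fraud cases receive the highest scores
print(gini_coefficient([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```

With this normalisation, scores near zero mean the ranking is no better than chance, and negative scores mean the model ranks fraud cases below non-fraud cases.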

3.9.2. Hit Rate


3.9.3. Top Decile Lift

This measure divides the whole set into deciles (10% blocks) and then looks at the fraction of 1’s in the top decile divided by the fraction of 1’s in the whole set. This allows evaluating a model’s ability to detect cases with a high probability of the event (in our case, fraud) occurring. For the calculation and further information, refer to Leeflang et al. (2015).
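The calculation can be sketched directly from that definition (Python, illustrative):

```python
def top_decile_lift(scores, labels):
    """Fraction of 1's in the top-scoring 10% divided by the overall fraction."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    top = order[: max(1, len(order) // 10)]
    top_rate = sum(labels[i] for i in top) / len(top)
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate

# 20 cases, 50% fraud overall; the top decile (2 cases) is all fraud
scores = [1.0 - i / 20 for i in range(20)]
labels = [1] * 10 + [0] * 10
print(top_decile_lift(scores, labels))  # 2.0
```

A lift of 1 means the top decile contains no more fraud than a random 10% sample, which is why the values around 0.96–1.03 reported for the models later in this thesis indicate no ranking ability.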

3.9.4. Computational time

This measure captures how long it takes to train the model. It is relevant because the model needs to be retrained continuously: the longer training takes, the less feasible it becomes for the organization.

3.9.5. False Positives/False Negatives

No model performs 100% correctly. In some cases the model predicts that an event occurs when in reality it does not; this is a false positive. Conversely, if the model predicts that the event does not occur when in reality it does, this is a false negative. Depending on the situation, minimizing one or the other can have higher priority. For example, for a model predicting earthquakes, a false negative can have devastating consequences while a false positive does not. Within fraud detection, Perols (2011) argues that false negatives are costlier, as fraudsters get away with the fraud.
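The four cells of the confusion matrix can be computed as rates over all cases (Python, illustrative). The toy example below, a model that predicts every case as fraud, shows how such a model produces no false negatives while turning every non-fraud case into a false positive:

```python
def confusion_rates(predicted, actual):
    """Rates of each confusion-matrix cell over all cases (binary labels)."""
    n = len(actual)
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual)) / n
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, actual)) / n
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual)) / n
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual)) / n
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Two-thirds fraud, and a model that always predicts fraud:
# TP = 2/3, FP = 1/3, FN = TN = 0
actual = [1, 1, 0]
predicted = [1, 1, 1]
print(confusion_rates(predicted, actual))
```

With a fraud share of roughly two-thirds, this degenerate strategy already yields a hit rate of about 67%, which is why the hit rate alone is an unreliable performance measure here.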

3.10. Software used

As a tool for processing the data and carrying out the analysis, R (version 3.5.0 for Windows 32/64 bit) with the RStudio environment has been used. Table 3 summarizes the packages used for the different machine learning techniques.

Table 3.

Packages and functions used in this paper

Technique                   R Package      Functions used

Decision Tree               rpart          rpart()
Bagged Decision Trees       adabag         bagging()
Boosted Decision Trees      adabag         boosting()
Logistic Regression         -              glm()
Random Forest               randomForest   randomForest()
KNN                         caret          knn3()
Artificial Neural Network   nnet, caret    caret::train()


4. RESULTS

In this section, the results of the individual machine learning methods are presented, followed by a comparison of the models based on the measurement criteria specified earlier. The initial step was to include all available variables in a decision tree algorithm for descriptive purposes, with the aim of observing at which variables splits occur and how large the tree can grow. The tree method was selected for its flexibility: the algorithm can take interactions and non-linearities in the data into account, in contrast to, for example, logistic regression, where the output is a linear solution. To apply the decision tree algorithm, we used the rpart() function from the rpart package in R. Figure 4 below shows the outcome of a decision tree including all 18 variables.

Figure 4.

Lift curve of the complete Decision Tree and its visualization

therefore achieving a seemingly high hit rate only by predicting every case as fraudulent, which is not useful.

From this observation, we aimed to gain an overview of important variables that could be used as input for the remaining models. Since the tree does not produce any splits, the whole set of variables was used as input for the logistic regression, in order to inspect the detailed output and reduce the model to the most optimal one based on the evaluation criterion (AIC). The initial logistic regression again included all variables; see the output in Appendix A. After the estimation, we checked for multicollinearity between the variables. Appendix B shows the VIF scores. Based on these VIF scores, collinear variables were excluded until all VIF scores were below four; the remaining variables are interpreted as the joint effect. Following this, the stepwise deletion of the least significant variables led to the final logistic regression model based on the AIC (see Appendix C). This step decreased the AIC from 20396 to 20333 and left only six variables as predictors. A VIF check shows no multicollinearity in this model (see Appendix D).

However, there are still no significant variables in the model. The GINI coefficient of -0.01 and TDL of 0.96 tell the same story as the decision tree algorithm, but the model is reduced to a more manageable size by removing the least useful variables.

The variables remaining in the final logistic regression model were used as input for the remaining algorithms. The results of each model are summarized in Table 4 below.

Table 4.

Performance measure of all models

Model                       GINI    TDL    Hit Rate   Time Lapsed   False Positive   False Negative   True Positive   True Negative
Decision Tree               -0.01   0.96   67%        29.5          33%              0%               67%             0%
Bagged Decision Trees       -       -      -          -             -                -                -               -
Boosted Decision Trees      -       -      -          -             -                -                -               -
Logistic Regression         -0.01   0.96   67%        14.8          33%              0%               67%             0%
Random Forest               0.00    0.99   66%        36.6          32%              2%               65%             1%
KNN                         0.01    1.03   62%        6.9           29%              8%               59%             4%
Artificial Neural Network   -0.01   0.00   67%        156.6         33%              0%               67%             0%
Support Vector Machine      0.00    0.98   67%        224.8         33%              0%               67%             0%

where the hit rate even falls below the null model. Concerning the GINI coefficients and TDLs, all models perform on the same level, scoring from -0.01 to 0.01 and from 0.96 to 1.03 respectively. For the curves of the models, see Appendix E. Considering the time required for each model, KNN needs the least (6.9), followed by Logistic Regression (14.8), Decision Tree (29.5), Random Forest (36.6), ANN (156.6), and the longest to train was SVM (224.8). The Bagging and Boosting algorithms required so much time that it was impossible to execute them in our setting.

Variables and their impact

Based on the outcomes of the models, none of the included variables holds predictive power. While the literature review argues that the included variables should be able to predict fraud, this is not the case in our research. The six variables in the final models cover geographic, demographic, and past behaviour factors; none of them improves the model. In the logistic regression, Variable 1 is marginally significant (p = .055), indicating that some differences might exist. However, this effect is not large or significant enough for the algorithms to make predictions, as can be seen in the decision tree, where no split occurs, and in the other models, where the majority class is predicted for the whole test set. Possible reasons why the variables did not perform as expected are covered in the discussion section.

5. DISCUSSION

In this chapter, the outcome of the analysis is presented and discussed. The variables used and their effects (if any) are explained. Lastly, the hypotheses are discussed. For a short overview, refer to the summary of the hypotheses in Table 5 below.

Table 5.

Hypothesis overview

Hypothesis   Subject                   Outcome
H1           Life events               Not tested
H2           Economic factors          Not tested
H3           Social factors            Not tested
H4           Geographical factors      Not confirmed
H5           Past behaviour            Not confirmed
H6           Demographics              Not confirmed
H7           Address fraud detection   Not confirmed

Previous research shows that machine learning techniques perform rather well in predicting outcomes in various contexts, such as churn prediction in the telecom industry (Risselada et al., 2010; Vafeiadis et al., 2015) or financial statement fraud (Perols, 2011). Within fraud detection, multiple studies (Guha et al., 2015; Song et al., 2014) indicate the success of these methods when applied to the data. This study shows that, contrary to previous findings, machine learning methods do not improve fraud detection/prediction in the context of municipal data and therefore do not have a positive impact on database quality. Whether life events and economic or social factors have an effect on fraud detection has not been tested in this paper. What has been tested are the effects of past behaviour, demographics, and geographical factors. The results indicate that these factors hold no predictive power when predicting fraud within municipal data. Most models perform almost identically, and none outperforms the random model.

Song et al. (2014) conclude that ensemble methods outperform other algorithms in the context of financial statement fraud detection. Vafeiadis et al. (2015) find that Neural Networks and Decision Trees outperform other methods. Risselada et al. (2010) apply machine learning algorithms to churn prediction and find that Decision Trees and Bagged Decision Trees work well compared to Logistic Regression and its bagged version. These findings suggest that the more flexible techniques, namely Decision Trees and Bagging, should be superior in our research as well. However, this is not confirmed, since all methods deliver almost identical results. This, again, is caused by the absence of good predictors in the data, which results in all models predicting the majority class and hence in similar outcomes. The insignificant outcomes of the multiple analyses indicate that the variables available for our research have no effect on the prediction and detection of fraudulent cases within the database. Therefore, the importance of variables was not investigated further, as none of them serves as an important predictor. Despite the lack of significant predictors, the models may seem to perform well on the hit rate measure. While the 67% hit rate is relatively high, this is due to the fact that the models predict the outcome according to the majority of cases in the data. Since the proportion of fraud cases within the test set is 67%, this phenomenon is well explained.

There are several reasons why the model performance did not live up to the expectations based on previous research. The main issue is that the number and quality of the variables were highly insufficient to predict an issue as complex as fraud. In the research of Guha et al. (2015) on insurance claim fraud, 34 to 62 attributes were available. In their comprehensive survey, Phua et al. (2010) compare the attribute availability of multiple studies and show that most studies using a data-mining approach to fraud detection have access to 10–49 attributes.

Despite having access to 19 variables in our research, it was only possible to test three hypotheses on the effects of variables on fraud detection, since all variables fell into the demographic (n = 3), geographic (n = 2), and past behaviour (n = 11) categories. What is more, eight of those variables essentially captured the same thing, just at different levels of granularity. Variables for the remaining three hypotheses, addressing life events, social, and economic factors, were not available. This is mainly due to the short time span of the project: the data, contained in different data sources and in different formats, could not be accessed in time.

Another aspect is that all variables specified in the literature review were taken from different fields and previous research and applied to the topic of address fraud. Since the performance of machine learning models, and also the importance of variables, is often context specific, it may be that the selected variables are not the true predictors of address fraud, despite having proven significant in other fields. However, since there is no existing research on the topic of address fraud, using these “borrowed” variables can be regarded as a good starting point. We also assume that including more of the variables mentioned in our conceptual model would allow for better prediction, since it would make the model more complete, which is one of the important criteria of every model according to Little (1975).

Another potential problem could be incorrect inputs in the database. The final dataset (n = 20.003) consists of all mutations that have been investigated by employees of Gemeente Groningen. However, it may be the case that, on occasion, the investigators did not arrive at the correct conclusion and mislabeled a mutation as non-fraudulent when it was in fact fraudulent, or vice versa. We can assume, however, that this is a rare event, since the investigation is carried out by skilled people trained specifically for this task.

6. CONCLUSION AND RECOMMENDATIONS

The results suggest that machine learning, within the given context of municipal data, does not perform well with regard to fraud detection and hence does not help improve database quality. While multiple models were applied to the data in this study, the limited set of available variables held no predictive power, and hence the models performed rather poorly; in some cases a null model or a random model even performed better. Our findings contrast with existing research, where machine learning has proven to be a useful tool for prediction and detection. Figure 5 below summarizes the model performance in our research.

Figure 5.

Model performance comparison

As Guha et al. (2015) state, the success of machine learning depends more on the quality of the data than on the choice of model. This research confirms that without proper data, the models do not perform well. Based on our outcomes, we conclude that there is still work to be done on the data and the processes before machine learning algorithms can be applied to fraud detection within municipal data.

Since multiple studies show that machine learning delivers good results, the difference with our research could be explained by the rather low availability of data in our models. Companies and organizations need access to enough data in order to engage in machine learning applications for detective or predictive purposes. With regard to model selection, this research does not provide an answer on which algorithm performs optimally.

of supercomputers and possibly owns only small office desktops. While most techniques do not require long training times, in our study we were not able to train the Bagging and Boosting algorithms due to the lack of computing power. This needs to be considered when deciding which techniques to use. Figure 6 below shows the differences in training time in this paper.

Figure 6.

Model Training times

All in all, using machine learning techniques to uncover fraud within municipal data, and hence correct the database for inconsistencies, is not yet achievable with the data currently available. We suggest first examining what data are available and structuring them properly, deriving useful variables, before running the machine learning models. By doing this, the outcomes might prove more useful and serve their purpose.

7. LIMITATIONS

In this paper, we have stated several hypotheses, of which only a subset has been addressed due to the availability of resources and data. The literature review specifies a large number of variables as potential predictors; however, only a small fraction of them was available to include in the analysis. Furthermore, there might be other variables outside the scope of the reviewed literature which could be predictive in nature, as the fraud environment evolves rapidly with new technology. Fraudsters constantly adapt to outsmart those who try to detect them, leading to new trends and patterns that are not captured by the existing literature.


Furthermore, the analysis is performed on data from a specific time period, January 2010 – May 2018. According to Risselada et al. (2010), the predictive power of models does not last long, and the significance of the variables included changes over time. Given the changing fraud environment, data covering such a long period may dilute the effects: a variable may be predictive in the last year of observations, yet appear insignificant overall because it was ineffective in all previous years. Future research could focus on recent cases only, to capture the relevant effects and build a model that predicts well.
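The suggestion to restrict the analysis to recent cases amounts to a simple time-window filter before model training. The sketch below is illustrative only; the column names (`mutation_date`, `fraud`) are assumptions, not the study's actual variables.

```python
# Hypothetical sketch: keep only observations from the most recent
# window, relative to the newest record in the data.
import pandas as pd

def recent_window(df, date_col, years=1):
    """Return rows whose date falls within the last `years` years."""
    cutoff = df[date_col].max() - pd.DateOffset(years=years)
    return df[df[date_col] > cutoff]

# Illustrative data spanning 2010-2018, mirroring the study's period.
df = pd.DataFrame({
    "mutation_date": pd.to_datetime(
        ["2010-03-01", "2014-06-15", "2017-09-30", "2018-04-20"]),
    "fraud": [0, 1, 0, 1],
})

recent = recent_window(df, "mutation_date", years=1)
print(len(recent))  # → 2 (only the last year of observations remains)
```

A model trained on `recent` would then reflect only the current fraud patterns, at the cost of a smaller sample.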

While several machine learning techniques are included in this report, there are many other options and variations that could be more suitable or deliver different results; future researchers could investigate such alternatives. Moreover, all the labeled cases at our disposal are subject to a prior selection based on criteria unknown to us, so it is reasonable to assume the sample is biased. A mutation may have been investigated because of a suspicion, and the grounds for that suspicion may be the underlying reason these cases were selected. Only a small subset of the fraudulent mutations is selected, and if they were preselected on a specific criterion, they may share a common trait, resulting in bias.

In the future, researchers might consider applying unsupervised learning to the data to determine how the cases would be grouped. Another option is to use the whole dataset rather than only the labeled data: with a dependent variable coded as checked (1) or not checked (0), the algorithms might find significant differences between these cases. Since selection bias is present, the outcomes of such an approach might differ significantly from ours.
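The proposed unsupervised route could be sketched as follows: cluster all records, labeled and unlabeled alike, and afterwards inspect how the known fraud or checked cases distribute over the clusters. The example below is a minimal sketch assuming numeric features; synthetic data stands in for the confidential municipal variables.

```python
# Hypothetical sketch of the unsupervised approach: cluster all records
# and later cross-tabulate cluster membership against checked/fraud flags.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # stand-in feature matrix

# Scaling matters for distance-based clustering such as k-means.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# With real data, the next step would be to check whether any cluster
# is enriched in known fraud cases.
print(np.bincount(labels))                # cluster sizes
```

A cluster strongly enriched in known fraud cases would suggest a pattern worth investigating, without requiring a (possibly biased) labeled sample for training.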

The overall informativeness of this paper is limited by our inability to disclose the nature and details of the variables used. The outcome is therefore rather general and does not reveal which variables, or which levels, were used. This confidentiality is deliberate: once a model is known, it starts losing efficiency. Since fraud is an ongoing battle between fraudsters and those who try to detect them, disclosing the factors would make it much easier to design new ways to commit fraud without being detected. By keeping the variables confidential, the predictive power of the models prevails longer.
