
PyWash: a Data Cleaning Assistant for Machine Learning

Final Bachelor Project

Laurens Castelijns

Department of Mathematics and Computer Science
Architecture of Information Systems Research Group

Supervisor: Dr. ir. Joaquin Vanschoren

Eindhoven, July 2019


Contents

List of Figures
List of Tables

1 Introduction
  1.1 Problem Statement
  1.2 Scope
  1.3 Method
  1.4 Outline

2 The ABC of Data
  2.1 The Framework
  2.2 Data Quality Scores

3 PyWash: Web Interface
  3.1 The Data Table
    3.1.1 Difficulties encountered using Dash
  3.2 Visualizations
    3.2.1 Box Plot
    3.2.2 Stacked Bar Chart
    3.2.3 Distribution Plot
    3.2.4 Parallel Coordinates

4 PyWash: design process and decisions
  4.1 The SharedDataFrame
  4.2 Band A: Outlier Detection
    4.2.1 Linear Models
    4.2.2 Proximity-Based
    4.2.3 Probabilistic
    4.2.4 Outlier Ensemble
    4.2.5 Re-training the Meta-Learner
    4.2.6 Casting an Outlier Ensemble
    4.2.7 Benchmark
    4.2.8 Implementation in PyWash UI
  4.3 Band A: Feature Scaling
    4.3.1 Min-Max Normalization
    4.3.2 Standardization
    4.3.3 Implementation in PyWash UI
  4.4 Band B: Missing Values
    4.4.1 Implementation in PyWash UI
  4.5 Band B: Data Types
    4.5.1 Heuristic Approach
    4.5.2 Implementation in PyWash UI

5 Conclusions
  5.1 Future Work

Bibliography

A Callback Graph

List of Figures

2.1 The framework bands with weights and deficiencies
3.1 The PyWash web interface
3.2 The Dash DataTable
3.3 Parallel box plots
3.4 Automatic rescaling after hiding features
3.5 Stacked bar chart
3.6 Distribution plot
3.7 Parallel coordinates plot
3.8 Parallel coordinates plot interactivity
4.1 Different best performers after correcting F1 score mistake
4.2 Example of stubborn outlier detectors
4.3 User interface of outlier detection: input
4.4 User interface of outlier detection: output
4.5 Scaling UI component
4.6 Missing Values UI
4.7 Heuristic data type inference
4.8 Data types UI
A.1 Dash Callback graph

List of Tables

4.1 F1 scores on the ODDS library before and after correcting the F1 score mistake
4.2 Comparison of average F1 and ROC-AUC scores on the ODDS library (5 independent trials)

Chapter 1

Introduction

The main focus of this report is to guide the reader through the decision-making process and to showcase the functionality of PyWash. Even though this is an individual final bachelor project, it required a lot of collaboration within my circle. Whenever ‘we’ is used in this report, I am referring to the other student in my circle and a good friend of mine: Yuri Maas.

1.1 Problem Statement

High-quality data is required for effective training of machine learning models. However, raw data is rarely of sufficient quality, so data cleaning and (pre)processing is needed. Wrangling and cleaning your data until it is sufficiently clean to be trusted and suits your needs involves executing a lot of separate operations. Data cleaning is therefore not our favourite pastime: it can be a time-consuming and frustrating process. The iterative nature of data cleaning, paired with the absence of an evaluation methodology, is also alarming. Without rigorous validation of the data cleaning performed, machine learning models will be unreliable at best and plain wrong at worst. This process could, therefore, be done in a more structured and automatic manner that streamlines most of this work. There are a few cleaning packages and tools available, but they are fragmented, with most of them focusing on a single issue. In this final bachelor project, I investigate this problem by designing a data quality framework and (partly) developing a tool that assists in all aspects of data cleaning. To summarize, I aim to answer the following questions:

• What are the typical problems encountered when working with raw data?

• How can we assign a data quality score to a dataset?

• How can we solve these problems in a more streamlined and structured manner?

1.2 Scope

This project is divided into five sub-components. A big part of the project is about making data cleaning more streamlined by (semi)automating the tedious work, while the remaining part is about communicating issues and solutions to humans in an effective way. The initial components as described in the BEP proposal are: measuring data quality, data cleaning, preprocessing, data encoding, and a web interface. We still ended up with a division into five components, but we deviated a bit from the original five. Measuring data quality and the web interface are still part of the project. The remaining components have been renamed and modified because of the data quality framework we have set up.


1.3 Method

A data quality detection framework is designed to help define data cleaning and to get a clear idea of the scope of PyWash. This framework results in a sequence of (data cleaning) operations that solve deficiencies, and an evaluation metric that measures how ready your data currently is for machine learning; in other words, ”is the data sufficiently clean to trust the analysis?”. Currently, no such extensive framework or metric exists. Therefore, we propose our own data quality detection framework as a part of this final bachelor project. Based on this framework, a vertical slice -a bit of every layer- of PyWash [6] is developed. PyWash offers a solution to the deficiencies of the framework and communicates the data quality score in an effective manner.

1.4 Outline

In chapter 2, the data quality framework is described. Chapter 3 describes the web interface. In chapter 4, the decisions and design process concerning the features of PyWash are provided. In chapter 5, I summarize the project and present some possible future work.


Chapter 2

The ABC of Data

In this chapter, I aim to answer the first two research questions. The data quality framework is described in the following sections. I refer to the long paper we wrote for a more comprehensive description of the framework and its deficiencies [5].

2.1 The Framework

In order to (semi)automate data cleaning and preprocessing, we need a clear and measurable definition of data quality. Data readiness levels have been proposed to fit this need, but they require a more detailed and measurable definition than is given in prior works. In our framework, datasets are classified within bands, and each band introduces more fine-grained terminology and processing steps. Scores are assigned to each step, resulting in a data quality score. This allows teams of people, as well as automated processes, to track and reason about the cleaning process, and communicate the current status and deficiencies in a more structured, well-documented manner.

Figure 2.1: The framework bands with weights and deficiencies


A dataset will be ready for certain operations to be performed on it depending on its band. The bands consist of several weighted dataset deficiencies, which reflect the most important deficiencies that need to be addressed in the current band. An overview of the bands and their deficiencies can be found in figure 2.1. The weights given in figure 2.1 are advisory weights for a generic dataset without a specific target analysis. A description of the bands and the functionality they unlock is given below. Band C (Conceive) refers to the stage in which the data is still being ingested. If there is information about the dataset, it comes from the data collection phase and how the data was collected. The data has not yet been introduced to a programming environment or tool in a way that allows operations to be performed on the dataset. The possible analyses to be performed on the dataset in order to gain value from the data may not have been conceived yet, as this can often only be determined after inspecting the data itself.

Band B (Believe) refers to the stage in which the data is loaded into an environment that allows cleaning operations. However, the correctness of the data is not fully assessed yet, and there may be errors or deficiencies that would invalidate further analysis. Therefore, analyses performed in this stage are often more cursory and exploratory, such as an exploratory data analysis with visualization methods to ascertain the correctness of the data. Skipping these checks might lead to errors or ‘wrong’ results and conclusions.

In band A (Analyze), the data is ready for deeper analysis. However, even if there are no more factual errors in the data, the quality of an analysis or machine learning model is greatly influenced by how the data is represented. For instance, operations such as feature selection and normalization can greatly increase the accuracy of machine learning models. Hence, these operations need to be performed before arriving at accurate and adequate machine learning models or analyses. In many cases, these operations can already be automated to a significant degree.

In band AA (Allow Analysis), we consider the context in which the dataset is allowed to be used. Operations in this band detect, quantify, and potentially address any legal, moral or social issues with the dataset, since the consequences of using illegal, immoral or biased datasets can be enormous. Hence, this band is about verifying whether analysis can be applied without (legal) penalties or negative social impact. One may argue that legal and moral implications are not part of data cleaning, but rather distinct parts of the data process. However, we argue that readiness is about learning the ins and outs of your dataset and detecting and solving any potential problems that may occur when analyzing and using a dataset.

Band AAA is the terminus of our framework. Getting into AAA would mean that the dataset is clean. The data is self-contained and no further input is needed from the people that collected or created the data.

2.2 Data Quality Scores

A dataset has a score between 0 and 1 for each band of our framework, so a dataset can have a score of 0.9 for band C, 0.8 for band B, 0.10 for A, 0.20 for AA and possibly 0 for band AAA. Datasets start with an initial band score of 0 for every band, as we generally do not know (for certain) which issues the dataset is suffering from that could potentially jeopardize machine learning methods.

The dataset is classified in band C at this stage. We then proceed to check and solve all issues from band C. Each band deficiency that is solved or non-existent contributes to the band score of the dataset. Partially checked or solved deficiencies can grant partial weight scores. A dataset will move to the next band only when it has surpassed a certain threshold score. This also means that a dataset cannot get a band label of A or above when it has a B.60 score (i.e., a band B score of 0.60, below the threshold), even if the dataset fulfills all band A requirements.

The threshold scores can be determined by the framework users to decide how thoroughly the dataset has to be cleaned before it is able to proceed to further bands. We have set the default threshold for all bands at 0.85 to allow a dataset to advance while it is not totally perfect, since striving for a perfect dataset may not be achievable or cost-effective in general. The dataset might not be entirely clean when the thresholds are less than 1, as a dataset could advance to the next band (including band AAA) while not every issue has been checked or fixed yet. That said, the thresholds cannot be set too low (e.g., < 0.65), as datasets would not be checked properly, which could seriously impact machine learning methods and dataset usability, causing errors and false predictions or estimates.
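To make the scoring scheme concrete, the sketch below computes a band score as a weighted sum of solved deficiencies and checks it against the default 0.85 threshold. The deficiency names, weights, and helper functions are hypothetical illustrations, not the actual PyWash implementation.

```python
# Hypothetical sketch of a weighted band score with a promotion threshold.
# Deficiency names and weights are illustrative, not the actual framework values.
DEFAULT_THRESHOLD = 0.85  # default promotion threshold mentioned in the text

def band_score(deficiencies):
    """deficiencies: dict mapping deficiency name -> (weight, solved_fraction).

    solved_fraction is 1.0 for solved or non-existent issues, 0.0 for unchecked
    ones, and anything in between for partially solved deficiencies."""
    total_weight = sum(w for w, _ in deficiencies.values())
    earned = sum(w * frac for w, frac in deficiencies.values())
    return earned / total_weight if total_weight else 0.0

def advances(score, threshold=DEFAULT_THRESHOLD):
    """A dataset only moves to the next band once it surpasses the threshold."""
    return score >= threshold

# Example: band B with two deficiencies, one solved and one half-checked.
band_b = {"missing values": (0.6, 1.0), "data types": (0.4, 0.5)}
score = band_score(band_b)         # 0.8
print(score, advances(score))      # 0.8 False -> dataset stays in band B
```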


Chapter 3

PyWash: Web Interface

A web app based on Flask and React was advised in the BEP proposal. Ultimately, building one from scratch would give us the most flexibility, but time and manpower were too scarce, especially because we had very little experience with building such a web interface using React. Therefore, we decided to leverage Dash, a Python framework for web-based analytics applications. Dash is built on top of Plotly, React and Flask. This means that advanced visualizations from Plotly can be used in combination with the React components baked into Dash. Dash callbacks make the web interface interactive; Appendix A contains the callback graph of PyWash. The interface, shown in Figure 3.1, groups the operations for each band into separate color-coded tabs. Initially, the only accessible band is band C. Band C includes interactions to import datasets and merge them together if applicable. Once a dataset is successfully parsed, PyWash will assign it a tab at the top of the screen with its name; subsequently, the other bands will be unlocked and the data will be shown in an interactive data table. The interface supports multiple datasets being loaded at the same time. When a dataset is selected, a description is shown which includes the data quality score and the dimensions of the dataset. The data quality score is a placeholder for now and is therefore not operational. Below this description, there is a row of color-coded buttons which are used to switch between the bands from the framework. The colors correspond with figure 2.1 and are chosen to be reminiscent of a traffic light.


Figure 3.1: The PyWash web interface

3.1 The Data Table

We wanted a data table to be the key piece of the web interface. Dash provides a fully interactive DataTable which offers live editing, sorting and filtering. Figure 3.2 shows some of these options in action. In this particular instance, I filtered both string (Sex and Smokstat) and numeric (Age and Alcohol) features. I also sorted cholesterol in descending order. You can highlight a cell by clicking on it, and double-clicking on a cell allows you to edit its value. Every row and column also has a small cross; clicking on it will remove that row or column from the data.

Saving the current dataset to disk is possible at any time using the export options available below the DataTable. Exporting the data will return the data ‘as is’, thus including any modifications made in the table (filtering, sorting, etc.). Currently, data can be exported as either a csv or arff file.

Figure 3.2: The Dash DataTable
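The following is a minimal sketch of how such an interactive table can be configured with Dash's DataTable component; the toy columns and the exact option values are illustrative assumptions rather than the PyWash configuration.

```python
# Minimal, illustrative Dash DataTable with editing, sorting, filtering and
# row/column deletion enabled; not the exact PyWash configuration.
import pandas as pd
from dash import Dash, dash_table

df = pd.DataFrame({"Age": [23, 45, 31], "Sex": ["male", "female", "female"]})

app = Dash(__name__)
app.layout = dash_table.DataTable(
    data=df.to_dict("records"),
    columns=[{"name": c, "id": c, "deletable": True} for c in df.columns],
    editable=True,            # double-click a cell to edit it
    sort_action="native",     # client-side sorting
    filter_action="native",   # client-side filtering
    row_deletable=True,       # the small cross in front of every row
    export_format="csv",      # export button below the table
)

if __name__ == "__main__":
    app.run(debug=True)
```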


3.1.1 Difficulties encountered using Dash

The Dash DataTable is still in alpha, and I noticed that during development. One of the most important features to us, virtualization, made the web interface very unstable. This is unfortunate, because it would have added a lot to the scalability of the interface. Virtualization saves browser resources by only rendering a subset of the data at any given instant. However, if virtualization becomes more stable in the future, PyWash would only need a few lines changed in order to use it. Another annoyance was that the selection feature is not that rich yet. In figure 3.2 there is a checkbox in front of every row, but upon selecting a row nothing can be done out of the box. I tried to implement a function that, if activated, removes every selected row from the dataset. This worked well at first use, but after that the indexes of the DataTable and the data frame in the background became misaligned, making the function unstable after the initial use. Another issue is the method Dash uses to represent empty cells in the data table. I wanted to highlight the missing values by changing the background color of the cells. However, Dash does not provide the syntax needed to conditionally highlight these empty values. I also encountered countless small to large bugs. For example, boolean values are hidden in the data table; however, when a cell is selected, it does show the boolean value.

Dash also used to have the limitation that every callback could only have a single output, which resulted in seemingly redundant code. However, during the project Dash received an update which allows multiple outputs to be used with one callback. Because this update was introduced late into the project, it is not used to its full potential. Another feature that was added to Dash late into the project was Dash Testing, a set of testing APIs which can be used in combination with a Selenium WebDriver to automatically test the interaction inside a web browser. This would have been very useful if it had been available at the start of the project.
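As an illustration of the multiple-output callbacks mentioned above, the sketch below updates two components from a single callback; the component ids and layout are hypothetical and not taken from PyWash.

```python
# Illustrative Dash callback with multiple outputs; ids and layout are hypothetical.
from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="dataset-name", value="titanic.csv"),
    html.Div(id="status"),
    html.Div(id="dimensions"),
])

@app.callback(
    Output("status", "children"),
    Output("dimensions", "children"),
    Input("dataset-name", "value"),
)
def describe(name):
    # One callback updating two components at once by returning a tuple.
    return f"Loaded {name}", "rows x columns unknown until parsed"

if __name__ == "__main__":
    app.run(debug=True)
```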

3.2 Visualizations

The purpose of visualizations is to interpret the data efficiently. Visualizations are a powerful tool that can be used during different phases of an analysis, and having an interactive element in visualizations only improves the ability to explore, interpret and understand the data [13]. One of the advantages of a web interface is that visualizations can be made more interactive than in a Jupyter Notebook. Some of the plots also interact with the data table: selecting points in the plot will also select this data in the data table. The plots are generated using Plotly. I have created multiple (interactive) visualizations for PyWash that are easily generated from the ‘Plots’ tab in figure 3.1. After a plot is selected, a Dash loading component is shown while the plot is being generated in the background.

3.2.1 Box Plot

A box plot is a visualization of numeric data. It shows the levels of the quartiles, a horizontal line depicts the level of the median, and it has ‘whiskers’ at the top and bottom; points beyond the whiskers indicate possible outliers. It is commonly used to detect outliers.

Figure 3.3: Parallel box plots


Figure 3.3 shows what this visualization looks like in PyWash. Hovering over a feature with your cursor will display the quartile levels and the minimum and maximum values, as shown in the figure. It has to be noted that, because of the distinct ranges of the features, it is hard to see any outliers in the features with a smaller range in figure 3.3. There is an interactive solution for this problem: in the legend to the right of the visualization, users can hide features by clicking on the name of the feature. The visualization will automatically re-scale its axes to the new subset of the data (Figure 3.4).

Figure 3.4: Automatic rescaling after hiding features

3.2.2 Stacked Bar Chart

The stacked bar chart is primarily used to visualize the distribution of values in categorical features.

Figure 3.5 below shows the stacked bar chart generated on a subset of the Titanic dataset. As the ‘Cabin’ feature demonstrates, the stacked bar chart can also be used to highlight missing values. Hovering across a feature with your cursor will show the labels and frequency, as seen in the ‘Sex’ feature in figure 3.5.

Figure 3.5: Stacked bar chart

3.2.3 Distribution Plot

In this visualization, shown in figure 3.6, a single numeric feature can be explored. The plot combines a histogram, a kernel density estimation and a rug plot (one-dimensional scatter plot) into a single interactive plot. This plot can be helpful in determining the distribution of a feature. Sturges’ formula [24] is used to determine the bin size of the histogram.

Figure 3.6: Distribution plot
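A possible way to generate such a combined plot with Plotly's figure factory is sketched below, including the Sturges-based bin size; the synthetic data and the 'Age' label are placeholders, not PyWash's actual code.

```python
# Sketch: distribution plot (histogram + KDE + rug) with a Sturges-based bin size.
# The data and the column label are illustrative.
import math
import numpy as np
import plotly.figure_factory as ff

values = np.random.normal(loc=50, scale=10, size=500)   # stand-in numeric feature

# Sturges' formula gives the number of bins; create_distplot expects a bin
# *width*, so convert the bin count into a width over the observed range.
n_bins = int(math.ceil(math.log2(len(values)))) + 1
bin_size = (values.max() - values.min()) / n_bins

fig = ff.create_distplot(
    hist_data=[values],
    group_labels=["Age"],
    bin_size=bin_size,
    show_rug=True,          # the one-dimensional scatter plot along the x-axis
)
fig.show()
```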


3.2.4 Parallel Coordinates

The parallel coordinates plot (or PCP) is a popular technique to visualize multivariate data [11]. In a parallel coordinates plot, each feature has its own axis and these axes are placed parallel to each other. Records are then plotted on these axes, creating a polygonal chain across the features. This way, relationships can be explored. When most of the lines between two axes are somewhat parallel to each other, it might indicate a positive relationship between the two examined features. However, when most lines cross (creating X-shapes), it indicates a negative relationship [12].

Figure 3.7: Parallel coordinates plot

Figure 3.7 is an example of a parallel coordinates plot in PyWash. The plot is richly interactive in order to explore relationships and outliers. Users can select a feature to be color-coded, reorder the axes, and drag the lines along the axes to filter regions. Figure 3.8 has reordered axes and is filtered (pink lines along the axes), showing a clear relationship between age and annual income.

Figure 3.8: Parallel coordinates plot interactivity
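A parallel coordinates plot of this kind can be generated with Plotly Express along the following lines; the toy columns are placeholders and the styling options PyWash uses are not reproduced.

```python
# Sketch: an interactive parallel coordinates plot with Plotly Express.
# Column names are illustrative, not an actual PyWash dataset.
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "age": [23, 35, 47, 52, 29],
    "annual_income": [21000, 43000, 61000, 72000, 33000],
    "household_size": [1, 2, 4, 3, 2],
})

# Each feature gets its own vertical axis; records become polygonal chains.
fig = px.parallel_coordinates(df, color="age")
fig.show()
```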


Chapter 4

PyWash: design process and decisions

The framework described in chapter 2 is the theoretical foundation PyWash is built on. Implementing a solution for every deficiency in figure 2.1, in combination with presenting the data quality score in an effective manner, is the end goal of PyWash. However, reaching this goal within the limited time-frame would be ambitious, so choices had to be made. PyWash currently handles column types and missing values from band B, and both outlier detection and feature scaling from band A. A case can be made that some of the other deficiencies can also be (partly) solved by utilizing the visualizations and the interactive DataTable of PyWash, as seen in the previous chapter. In order to offer solutions for the described deficiencies, I was asked to leverage existing packages and improve them where needed. Some of the packages I used are staples of a data scientist’s toolkit, such as scikit-learn [4] and pandas [19]. Others are lesser-known packages such as PyOD [28] and datacleanbot, a package that is part of a recent MSc thesis by Ji Zhang [25]. This chapter explains the design process and justifies the decisions that have been made.

4.1 The SharedDataFrame

As mentioned in chapter 2, before deficiencies can be solved the data has to be ingested; in other words, introduced to PyWash in a way that allows operations to be performed on the dataset. To facilitate this, we created a ‘SharedDataFrame’. The SharedDataFrame is an abstract data type that wraps around a pandas DataFrame. This enables PyWash to maintain a data quality score and to call functions on the SharedDataFrame. A dataset has passed band C when it is loaded successfully into a SharedDataFrame. The SharedDataFrame is designed as an API, which means that the SharedDataFrame, together with its properties and functions, can be used independently from the web interface.
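To illustrate the idea of such a wrapper, the sketch below tracks band scores and exposes a cleaning operation as a method; the attribute and method names are assumptions for illustration and do not reflect the real SharedDataFrame API.

```python
# Hypothetical sketch of a DataFrame wrapper that tracks band scores;
# attribute and method names are illustrative, not the real SharedDataFrame API.
import pandas as pd

class SharedDataFrame:
    def __init__(self, path):
        # Successful parsing is what moves a dataset past band C.
        self.data = pd.read_csv(path)
        self.band_scores = {"C": 1.0, "B": 0.0, "A": 0.0, "AA": 0.0, "AAA": 0.0}

    def dimensions(self):
        return self.data.shape

    def scale(self, columns, method="standardize"):
        # Cleaning operations are exposed as methods so the web interface and
        # plain Python scripts can call the same API.
        from sklearn.preprocessing import MinMaxScaler, StandardScaler
        scaler = StandardScaler() if method == "standardize" else MinMaxScaler()
        self.data[columns] = scaler.fit_transform(self.data[columns])
        return self

# Usage: sdf = SharedDataFrame("titanic.csv"); sdf.scale(["Age", "Fare"])
```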

4.2 Band A: Outlier Detection

An outlier is a data point that differs significantly from other observations. Outliers can have a significant influence on the reliability of models, because not every machine learning model is robust to outliers. In this section, I focus on unsupervised outlier detection in multivariate data. I took the implementation of automatic outlier detection from datacleanbot [26] as a starting point.

In this implementation, landmarking meta-features are used to recommend an approach. Pairs of F1 scores from outlier detection methods and meta-features are used to set up a regression learner. The regression learner then predicts the accuracy of each algorithm based on the meta-features of a new dataset. After using the package for the first time, I mostly encountered user experience problems, such as the lack of clear visual feedback on which records are marked as outliers in the table that is displayed. I also did not get the option to keep a subset of the detected outliers; it was either drop them all or keep them all. The visualizations that are provided are also hard to work with, especially with high-dimensional data, since there is no filter or zoom functionality available. Furthermore, only three outlier detection algorithms are considered: iForest, LOF, and OCSVM. This is acknowledged by Zhang in the future work section, where it is also stated that a further project should search for more meaningful meta-features for outlier detection. The training set on which the recommender is trained is also too small. Zhang also mentioned that you can let users choose to run multiple algorithms and then present the results in an appropriate manner [25]. I started searching for more outlier detectors and came across the relatively new package PyOD, which offers a wide variety of models to detect outliers [28]. After trying all of them out, I selected ten algorithms that were the least prone to spewing out errors. The algorithms can be separated into four groups; the following sections will shortly describe the groups and the algorithms that belong to them.

4.2.1 Linear Models

Features in data are generally correlated. This correlation provides the capacity to make predictions from one another. The concepts of prediction and detection of outliers are closely linked. This is because outliers are values that, based on a specific model, deviate from anticipated (or predicted) values. Linear models concentrate on using correlations (or lack of correlation) to spot possible outliers [1].

One-Class Support Vector Machines (OCSVM)

OCSVM learns a decision boundary (hyperplane) that separates the majority of the data from the origin, such that the distance from the origin to this hyperplane is maximal. Only a user-defined fraction of data points may lie outside the hyperplane; these data points are regarded as outliers. Compared to inliers, outliers should contribute less to the decision boundary [2].

Principal Component Analysis (PCA)

PCA is a dimension reduction technique; it uses singular value decomposition to project the data into a lower-dimensional space. The covariance matrix of the data is decomposed into eigenvectors. These eigenvectors capture most of the variance of the data, and outliers can therefore be deduced from them [23, 1].

Minimum Covariance Determinant (MCD)

MCD detects outliers by making use of the Mahalanobis distance. The Mahalanobis distance is the distance between a point P and a distribution D. It takes the idea of computing the number of standard deviations that P is away from the mean of D and applies it to multivariate data [9].

4.2.2 Proximity-Based

In proximity-based methods, the idea is to model outliers as isolated points based on similarity or distance functions. Methods based on proximity are among the most prevalent approaches used in outlier detection [1]. However, because of the curse of dimensionality, some of these approaches are known to deteriorate [29].

Local Outlier Factor (LOF)

LOF measures the local deviation of density, in other words: the relative degree of isolation. The anomaly score of LOF depends on how isolated the data point is in relation to the neighborhood surrounding it. The locality is given by measuring the distance to the k-nearest neighbors. A data point that has a substantially lower density than its neighbours is considered to be an outlier [3].

Clustering-Based Local Outlier Factor (CBLOF)

CBLOF requires a cluster model generated by a clustering algorithm and combines it with the input data set to calculate an anomaly score. The clusters are classified into small and large clusters based on the (user-defined) parameters alpha and beta. The anomaly score is then calculated based on the size of the cluster to which the point belongs and the distance to the closest big cluster [10].

Histogram-based Outlier Score (HBOS)

HBOS builds a histogram for each feature individually and then calculates the degree of ‘outlyingness’. HBOS is significantly faster than other unsupervised outlier detectors because it assumes independence of the features. However, this does come at the expense of precision [8].

k Nearest Neighbors (kNN)

In kNN, each point is ranked based on its distance to its k-th nearest neighbor. The top n points of this ranking are then considered to be outliers [20]. Because of its simplicity, kNN is an efficient algorithm.

4.2.3 Probabilistic

In probabilistic models, data is modeled as a closed-form probability distribution. The parameters of this closed-form probability distribution are then learned. Data points that do not fit well with the probability distribution are then selected as possible outliers [1].

Angle-Based Outlier Detection (ABOD)

Where most of the approaches discussed above are based on the distance between points, ABOD focuses on comparing the angle between pairs of distance vectors. If the spectrum of angles observed for a point is wide, the point is enclosed in all directions by other points, suggesting that the point is part of a cluster. However, if the spectrum is small, then it suggests that the point is positioned outside the clusters and the point is regarded as an outlier [15].

4.2.4 Outlier Ensemble

In an outlier ensemble, the results from different algorithms are combined to create a more robust model. A single algorithm with heterogeneous hyperparameters can also be used.

Isolation Forest (IForest)

The Isolation Forest technique is based on decision trees. Isolation Forest ‘isolates’ data points by randomly selecting a feature and then randomly selecting a split value from the range of that feature. The idea is that outliers need fewer partitions in order to be isolated [17].

Feature Bagging

Feature Bagging fits several base detectors on sub-samples of the dataset and uses combination methods to improve the predictive accuracy and lower the impact of overfitting. The sub-samples contain the same number of records as the input dataset, but use a randomly selected subset of the features [16].


4.2.5 Re-training the Meta-Learner

As mentioned earlier, a regression learner is trained. Luckily, the code that was used to train this recommender was available in the datacleanbot repository [26], so I started training using the new selection of algorithms. The recommender is trained on a collection of about 30 datasets from the ODDS (Outlier Detection DataSets) library [21]. At this point I still had an 11th algorithm from PyOD in the mix: a new technique leveraging neural networks called Generative Adversarial Active Learning (GAAL) for unsupervised outlier detection [18]. This technique was the best performer on 15 of the 30 ODDS datasets. However, this is where I noticed what I assume is a mistake in Zhang’s work: upon closer inspection, I had not used GAAL correctly and it predicted zero outliers for each dataset, yet it still had the best F1 score for fifteen out of the thirty datasets.

It turned out that during the benchmarking phase Zhang relabeled the outliers from the ODDS library to -1 and the inliers to 1. This was probably done because the implementations used for the three outlier detection algorithms also use these exact labels. The scikit-learn package is used to compute the F1 scores. However, the implementation from scikit-learn returns the ”F1 score of the positive class in binary classification” [4], which in this case is actually the inliers. Rectifying this mistake completely changed the outcome of the benchmark (Figure 4.1) and the F1 scores took a nosedive (Table 4.1). This is where I fully realized that outlier detection is just not that good yet; it is really about providing your best effort.
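The labeling pitfall can be reproduced in a few lines with scikit-learn's f1_score; the toy labels below follow the convention described above (outliers relabeled to -1, inliers to 1) and the numbers are purely illustrative.

```python
# Illustration of the F1 labeling pitfall: with outliers relabeled to -1 and
# inliers to 1, sklearn's default pos_label=1 scores the *inliers*.
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 1, 1, 1, 1, 1, -1, -1]   # 8 inliers, 2 outliers
y_pred = [1] * 10                            # detector that flags no outliers

print(f1_score(y_true, y_pred))                  # ~0.89, looks great
print(f1_score(y_true, y_pred, pos_label=-1))    # 0.0, the score that matters
```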

Figure 4.1: Different best performers after correcting F1 score mistake

Figure 4.1 also shows that each of the candidate algorithms except feature bagging is the best performer for at least one of the 30 datasets examined.

Algorithm/Model                           Faulty F1 score    Correct F1 score
Isolation Forest                          0.838360           0.341635
Local Outlier Factor                      0.808331           0.255489
One-Class Support Vector Machines         0.735338           0.138838
k Nearest Neighbors                       0.854541           0.318051
Angle-Based Outlier Detection             0.832736           0.250102
Clustering-Based Local Outlier Factor     0.841360           0.305549
Feature Bagging                           0.813383           0.236490
Histogram-based Outlier Score             0.834360           0.301485
Minimum Covariance Determinant            0.844349           0.321214
Principal Component Analysis              0.834046           0.332901

Table 4.1: F1 scores on the ODDS library before and after correcting the F1 score mistake

This meant that I had a big problem in the training of my new recommender, because I had so few labeled datasets. I searched for more labeled outlier datasets and scrambled together 50 datasets in total. This is obviously still not enough to get a reasonable recommender. I was reluctant to create my own labeled datasets by injecting noise into existing datasets, because of the danger of overfitting on this specific kind of outlier. I also thought sampling from the existing 50 datasets would suffer from the same problem. Many of the landmarking meta-features used in datacleanbot also require the class label in order to be computed. This means that for any machine learning task that does not use class labels, this way of recommending an approach is not suitable. For these reasons, I dropped the recommender based on meta-features and explored other options.

4.2.6 Casting an Outlier Ensemble

My next idea was to simply show the predictions of all ten models and state that there is definitely an outlier if all ten models flag a record as an outlier. In practice, this did not work. Figure 4.2 shows a fabricated example where I manually added some monstrous outliers (-999999 in columns with only positive values), yet both HBOS and OCSVM do not recognize the record as an outlier.

Figure 4.2: Example of stubborn outlier detectors

In figure 4.2 there is also the column ‘label score’. This was part of the follow-up idea to use a threshold score, i.e. when 7 or more algorithms flag a record, the record is flagged as an outlier. However, this was also unreliable, and at this point I was basically trying to make my own outlier ensemble without fully realizing it. As mentioned in section 4.2.4, the general idea of an outlier ensemble is to combine the results from different algorithms to create a more robust model. In the end, I used the Locally Selective Combination in Parallel Outlier Ensembles (LSCP) framework, because it is also part of the PyOD package and has been published very recently [27].

Locally Selective Combination in Parallel Outlier Ensembles (LSCP)

LSCP is similar to Feature Bagging in many ways. LSCP combines outlier detectors by emphasizing data locality. The idea is that certain types of outliers are better detected when examining them at a local scale instead of a global scale. LSCP defines the local region of a test instance by finding clusters of similar points in randomly selected subspaces. It then identifies the most suitable detector by measuring similarity relative to a pseudo ground truth. This pseudo ground truth is generated by taking the maximum outlier (anomaly) scores across detectors. The Pearson correlation is used to compute similarity scores between the base detectors and the pseudo ground truth scores in order to evaluate the local competency of the base detectors. Note that the pseudo outlier scores are not converted to binary labels. The detector with the highest similarity is selected as the most competent local detector for that specific region [27].
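A small sketch of building an LSCP ensemble with PyOD is shown below; the detector pool mirrors the PCA/IForest/kNN combination discussed in the following sections, while the data and contamination value are placeholders.

```python
# Sketch: LSCP ensemble over heterogeneous PyOD base detectors.
# Contamination and data are placeholders, not the benchmark settings.
import numpy as np
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.pca import PCA
from pyod.models.lscp import LSCP

X = np.random.randn(500, 8)            # stand-in numeric dataset

detector_list = [PCA(), IForest(), KNN()]
clf = LSCP(detector_list, contamination=0.05)
clf.fit(X)

labels = clf.labels_                    # 0 = inlier, 1 = outlier
scores = clf.decision_scores_           # raw anomaly scores
print(labels.sum(), "records flagged as outliers")
```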

4.2.7 Benchmark

Before benchmarking this outlier ensemble, I also decided to change the benchmarking process a bit compared to the benchmark that Zhang used [26]. All of the models used have a ‘contamination’ parameter that should represent the proportion of outliers in the data set. This parameter is used to define the threshold on the decision function. Zhang argued that it is most fair to set the contamination at the actual outlier percentage, because then the F1 scores are optimal [25]. However, I think this produces an unrealistic setting: the actual outlier proportion is rarely known in unsupervised outlier detection. Therefore, I benchmarked using the same contamination estimate that will be used in the package itself. The estimate that is used is relatively simple: it is the count of rows that contain one or more outliers in any feature according to the conventional method (a value more than 3 standard deviations away from the mean), divided by the total number of rows. This estimation works well with normally distributed data, but should not be taken for granted. I also computed ROC-AUC scores, since that metric is widely used in outlier research.
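The contamination estimate described above could look roughly like this; the helper name and example data are illustrative, not the actual PyWash code.

```python
# Sketch of the simple contamination estimate: fraction of rows that contain at
# least one value more than 3 standard deviations from its column mean.
import numpy as np
import pandas as pd

def estimate_contamination(df: pd.DataFrame) -> float:
    numeric = df.select_dtypes(include=np.number)
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    outlier_rows = (z.abs() > 3).any(axis=1)     # row has any extreme value
    return outlier_rows.mean()                    # proportion of such rows

df = pd.DataFrame(np.random.randn(200, 3), columns=["a", "b", "c"])
df.loc[0, "a"] = 50.0                             # inject one extreme value
print(estimate_contamination(df))                 # roughly 1/200 plus chance hits
```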

Evaluation metrics used in benchmark

Before I provide the equations for both the ROC-AUC and the F1 score, some definitions need to be laid down. A binary classifier can have four outcomes:

• True Positive (TP): Correctly classified outlier.

• True Negative (TN): Correctly classified inlier.

• False Positive (FP): Inlier classified as outlier (Type I error).

• False Negative (FN): Outlier classified as inlier (Type II error).

The count of these outcomes can be used to compute precision and recall.

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \]

The F1-score combines these rates and calculates the harmonic mean [7]:

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

The second evaluation metric is the (ROC-)AUC score. A ROC graph is a two-dimensional graph in which the true positive rate (also known as recall) is plotted on the Y-axis and the false positive rate (FPR) is plotted on the X-axis. The false positive rate is given by:

\[ FPR = \frac{FP}{FP + TN} \]

Comparing two different ROC graphs with each other is impractical. ROC graphs are therefore often reduced to scalar values. A popular approach is to calculate the area under the ROC curve, abbreviated as AUC. The AUC is the probability that an outlier detector will assign a higher score to a randomly selected outlier than to a randomly selected inlier [7].


Benchmark results

Algorithm/Model                                      F1 score    ROC-AUC score
Isolation Forest                                     0.239277    0.582948
Local Outlier Factor                                 0.175718    0.553066
One-Class Support Vector Machines                    0.136554    0.537597
k Nearest Neighbors                                  0.240187    0.583309
Angle-Based Outlier Detection                        0.151532    0.553013
Clustering-Based Local Outlier Factor                0.218542    0.563856
Feature Bagging                                      0.173795    0.554503
Histogram-based Outlier Score                        0.209788    0.572330
Minimum Covariance Determinant                       0.221578    0.558286
Principal Component Analysis                         0.238771    0.582671
LSCP (LOF(5, 10, 20, 30, 40, 50, 100, 150, 200))     0.207018    0.572485
LSCP (PCA, IForest, kNN)                             0.244237    0.587088

Table 4.2: Comparison of average F1 and ROC-AUC scores on the ODDS library (5 independent trials)

The results of the benchmark are in table 4.2. We can see that the F1 scores are slightly worse, which was to be expected since the contamination estimate is not that accurate. I first tried the LSCP configuration as described in the paper. The paper uses a pool of 50 LOF detectors with a randomly selected number of neighbours in the range of [5, 200] [27]. Because of the long run time of running 50 detectors, I lowered it to 9 handpicked numbers from this range. However, as seen in table 4.2, this did not result in significantly better results. The paper also mentions that working with heterogeneous base classifiers has proven to be a success in different classification problems and that a performance improvement is expected when diverse base detectors are used [27].

Therefore, I tried to give heterogeneous base detectors a shot by combining the best-performing algorithms from table 4.2. Three algorithms seem to have significantly better performance in the benchmark: Isolation Forest, k Nearest Neighbours, and Principal Component Analysis. I selected these algorithms as my outlier ensemble for LSCP. Luckily, the three best-performing algorithms are also some of the most efficient algorithms in my pool of outlier detectors. As table 4.2 demonstrates, this configuration got the highest F1 and ROC-AUC scores of them all, although only by a small margin. This small margin means that, even though this may be one of the more robust approaches, I cannot recommend a specific outlier detection method with much confidence. Also, many of the algorithms examined depend on one or more user-defined parameters. The algorithms have been tested using their default parameter values, resulting in sub-optimal performance. In the end, the algorithm selection problem depends too much on the context and characteristics of the dataset. Therefore, there is no one-size-fits-all approach, and recommending based on the dataset is also complicated, as I have found.

4.2.8 Implementation in PyWash UI

However, we can still assist the user of PyWash by making it easy to experiment with a multitude of algorithms in order to discover possible outliers. Figure 4.3 shows the input fields that are used for outlier detection. Users can input their own contamination score or estimate it using the earlier-mentioned method. Then it is time to select an outlier detection algorithm; there are two presets available, which are the two LSCP configurations that have been benchmarked in the previous section. The user can also select any algorithm from the dropdown menu; when two or more algorithms are selected, LSCP is leveraged to detect the outliers.


Figure 4.3: User interface of outlier detection: input

When ‘Detect Algorithms’ is clicked (hidden by the dropdown menu in figure 4.3), the settings are handed to the backend. The backend returns a data frame with three columns added to the original data: ‘anomaly score’, ‘prediction’ and ‘probability’. Figure 4.4 is an example of such an output. The prediction column holds the binary classification made; outliers are highlighted with a red background. Probability is defined as the unified outlier score described in the paper by Kriegel et al. [14]. The visualizations can now also make use of the newly added columns, which can lead to some insightful visualizations.

Figure 4.4: User interface of outlier detection: output

4.3 Band A: Feature Scaling

In feature scaling, the data is transformed such that it is within a specific range. Some scaling methods, such as standardization, also change the distribution of the data. For some machine learning models, it is beneficial to scale the data. Feature scaling has been implemented by writing a wrapper around the scikit-learn package for easy access. Both min-max normalization and standardization are available.

4.3.1 Min-Max Normalization

Min-max normalization rescales the data to a new range, usually [0, 1]. Min-max normalization brings the values closer to the mean, which lowers the impact of outliers; note that this can be either a bad or a good thing depending on the machine learning model used. The formula is:

\[ x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)} \qquad (4.1) \]

Where a and b are the left and right bounds of the given range, x is the original data and x' is the normalized data.

4.3.2 Standardization

Standardization transforms the data such that it has zero mean and unit variance. In some machine learning methods, it is an issue when a feature has a variance that is larger than that of the other features, because it then dominates the objective function. Standardization can therefore be applied in order to make the data appear more like standard normally distributed data. The formula is:

\[ x' = \frac{x - \bar{x}}{\sigma} \qquad (4.2) \]

Where x is the original data, x̄ is the mean of the data, σ is the standard deviation and x' is the standardized data.
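Since the scaling functionality is described as a thin wrapper around scikit-learn, a minimal sketch of such a wrapper is given below; the function name and defaults are assumptions, not the actual PyWash code.

```python
# Hypothetical thin wrapper around scikit-learn scalers; it only illustrates
# the two methods described above, not the real PyWash implementation.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def scale_features(df, columns, method="minmax", feature_range=(0, 1)):
    if method == "minmax":
        scaler = MinMaxScaler(feature_range=feature_range)   # eq. (4.1)
    else:
        scaler = StandardScaler()                            # eq. (4.2)
    out = df.copy()
    out[columns] = scaler.fit_transform(out[columns])
    return out

df = pd.DataFrame({"Age": [23, 35, 47], "Alcohol": [0.0, 2.5, 12.0]})
print(scale_features(df, ["Age", "Alcohol"], method="minmax", feature_range=(0, 1)))
```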

4.3.3 Implementation in PyWash UI

Figure 4.5 shows how scaling is handled in the UI. There is a radio button to choose the scaling method. Below it is a dropdown menu that automatically holds only numerical features as items; multiple features can be selected. Then the user can input a range; if no range is given, it defaults to [0, 1]. Finally, the ‘scale!’ button has to be clicked to execute the scaling.

Figure 4.5: Scaling UI component

4.4 Band B: Missing Values

Whenever a record lacks a value for one or more features, we speak of missing values. Missing values are encoded as a variety of characters, such as null, N/A, na, ?, and -1. The data points or features with missing data should be identified and possibly either removed or repaired (e.g. imputed). Missing data can have multiple underlying causes. These causes are often called missing data mechanisms.

There are three types of missing data mechanisms [22]:

• Missing completely at random (MCAR): Data points are missing completely at random.

• Missing at random (MAR): Data points that are missing are related to other observed variables in the dataset.

• Missing not at random (MNAR): Data points that are missing are related to the values of that variable itself.

Identifying the missing data mechanism is important, because the performance of the chosen handling method depends on the missing data mechanism. I leveraged the missing data handling from datacleanbot [26] for PyWash, but modified it in a way that allows it to be used as an API. This makes it easier for the web interface to leverage this prior work. Datacleanbot cannot accurately infer which missing data mechanism is present and requires a target class. Therefore, the user still has to decide which missing data mechanism is in place. However, it does recommend an approach to deal with the missing values once a missing data mechanism is chosen.

4.4.1 Implementation in PyWash UI

Figure 4.6: Missing Values UI

Figure 4.6 shows the implementation in the UI. At the top of this section, there is an indicator that shows the number of empty cells (or the absence of empty cells). Below it is a field where users can add specific values to be identified as missing by the detection. Then there is a set of radio buttons, where users can select the missing data mechanism or choose to simply remove all of the missing values. Finally, there is a button that activates the process of handling the missing values. Depending on the chosen mechanism, datacleanbot will handle the missing values using one of the following methods [25] (a rough sketch of this detection-and-repair step is given after the list):

• MCAR: list deletion, mean, mode, k nearest neighbor, matrix factorization, multiple im- putation.

• MAR: k nearest neighbor, matrix factorization, multiple imputation.

• MNAR: multiple imputation.
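As referenced above, the sketch below illustrates the detection-and-repair step: it marks common missing-value encodings as NaN and applies k-nearest-neighbour imputation, one of the listed methods. The marker list and column names are illustrative, and this is not datacleanbot's actual code.

```python
# Sketch: mark common missing-value encodings as NaN, then impute with kNN.
# Illustrates one of the repairs listed above, not datacleanbot itself.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

MISSING_MARKERS = ["null", "N/A", "na", "?", ""]

df = pd.DataFrame({"Age": [23, "?", 47, 31], "Alcohol": [0.0, 2.5, "N/A", 1.0]})

df = df.replace(MISSING_MARKERS, np.nan).apply(pd.to_numeric, errors="coerce")
print(df.isna().sum().sum(), "empty cells detected")     # the UI indicator

imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                       columns=df.columns)
print(imputed)
```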

4.5 Band B: Data Types

Columns should have the correct data type. A counterexample would be a column labeled as integer while it is effectively a non-ordinal encoding of categorical values (e.g. 1=blue, 2=green, 3=red). This hampers the performance of machine learning models. As a starting point for this deficiency, I again took datacleanbot [26]. In datacleanbot, both a Bayesian method and a more heuristic approach are used. The Bayesian method determines the statistical types of features. I felt the Bayesian approach went a bit too much into detail and, on top of that, it required some heavy dependencies. Besides, a statistical type such as categorical can also be discovered in a heuristic manner. Therefore, I stripped the heuristic approach from it and adapted it to be more suitable for the SharedDataFrame we have made. This means that I only infer data types that are valid in pandas. These are: object, float64, int64, bool, category, and datetime64.

4.5.1 Heuristic Approach

Figure 4.7: Heuristic data type inference

Figure 4.7 displays the heuristic method used. At line 36, some weights are initialized for the types in the line above it. Lines 39 and 40 initialize a sample of 10 percent of all records, to be used later. The inference starts at line 43: the feature is assigned the boolean type when there are only 2 unique values and these values are of type integer or can be converted to an integer. Note how I do not convert to a boolean value when the values cannot be converted to integers, by making use of the try/except block. This is done because you can lose track of what the boolean values truly represent (i.e. the strings male and female). When there are fewer than 10 unique values present, I assign the categorical type; this is done at lines 49-50. From line 51 onwards, the weights and sample that were initialized earlier are used in a similar way as in datacleanbot. This is where the remaining data types are inferred: datetime64, float64, int64, and object.
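Because Figure 4.7 is a code listing that is not reproduced here, the following is a rough re-creation of the heuristic described above; the two-unique-value boolean rule and the fewer-than-ten-values categorical rule come from the text, while the sampling-based numeric/datetime fallback is a simplified assumption.

```python
# Rough re-creation of the described heuristic (Figure 4.7 is not reproduced);
# the boolean and categorical rules follow the text, the rest is simplified.
import pandas as pd

def infer_dtype(series: pd.Series) -> str:
    unique = series.dropna().unique()
    if len(unique) == 2:
        try:
            [int(v) for v in unique]        # only cast to bool if values are ints
            return "bool"
        except (TypeError, ValueError):
            pass                             # e.g. 'male'/'female' keep their labels
    if len(unique) < 10:
        return "category"
    sample = series.sample(frac=0.1) if len(series) >= 10 else series
    numeric = pd.to_numeric(sample, errors="coerce")
    if numeric.notna().all():
        return "int64" if (numeric % 1 == 0).all() else "float64"
    try:
        pd.to_datetime(sample)               # date-like strings
        return "datetime64"
    except (ValueError, TypeError):
        return "object"

dates = pd.Series(pd.date_range("2019-07-01", periods=20).astype(str))
print(infer_dtype(dates))                    # -> datetime64
```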

4.5.2 Implementation in PyWash UI

Figure 4.8: Data types UI

Upon loading a dataset in PyWash, the data types will automatically be inferred using the described method and presented in a table. Figure 4.8 is an example of such a table. The table has a dropdown for every cell; this way, users can still easily change a data type if the inference was wrong or a different type is preferred.


Chapter 5

Conclusions

In this final bachelor project, we identified common problems of raw data, set up a data quality framework around these problems, and developed a (semi)automated data cleaning tool called PyWash that includes an interactive web interface. By doing this, I have succeeded in finding answers to the research questions stated in the introduction. PyWash can currently assist with a multitude of data deficiencies: parsing data, merging data, data types, missing values, feature scaling, outlier detection, and feature selection. It improves upon earlier work by being based on a data quality framework, having a more user-friendly, interactive interface, and being more broadly applicable by not being limited to supervised machine learning. PyWash is available on GitHub [6].

5.1 Future Work

PyWash is not even remotely complete. The most straightforward addition is to add the remaining deficiencies from the framework. Additionally, a logger should be implemented. The logger would allow PyWash to compute the data quality score and is necessary to document what changes have been made to the dataset. From a development perspective, it needs more automatic testing and documentation. Also, there are still many edge cases and bugs left to iron out. Furthermore, the features that are currently implemented can be extended and improved.

Web Interface

The web interface could use extensive descriptions for each section to lower the barrier for inexperienced users. More visualizations can be added. As mentioned earlier, Dash is a constantly evolving platform; therefore, we should also be on the lookout for new functionality to add to the web interface.

Outlier Detection

The outlier detection section should give the user the option to change the parameters of the chosen outlier detector. More outlier detectors can be made available. The contamination estimation can be improved. It can be researched whether it is possible to accurately recommend a group of outlier detectors (linear model, proximity-based, etc.).

Feature Scaling

More elaborate scaling methods can be added.


Missing Values

Improvements can be made in the way missing values are highlighted. The user should also get more feedback and input on how the missing values are repaired.

Data Types

The heuristic detector can be improved to catch more edge cases. PyWash should allow the user to easily change the parameters of the heuristic detection.


Bibliography

[1] Charu C. Aggarwal. Linear models for outlier detection. In Outlier Analysis, pages 65–110. Springer International Publishing, December 2016.

[2] Mennatallah Amer, Markus Goldstein, and Slim Abdennadher. Enhancing one-class support vector machines for unsupervised anomaly detection. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, ODD '13, pages 8–15, New York, NY, USA, 2013. ACM.

[3] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF. ACM SIGMOD Record, 29(2):93–104, June 2000.

[4] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.

[5] Laurens Castelijns and Yuri Maas. The ABC of Data, A Classifying Framework for Data Readiness. 2019.

[6] Laurens Castelijns and Yuri Maas. PyWash. https://github.com/pywash/pywash, 2019.

[7] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.

[8] Markus Goldstein and Andreas Dengel. Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm. September 2012.

[9] Johanna Hardin and David M. Rocke. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics & Data Analysis, 44(4):625–638, January 2004.

[10] Zengyou He, Xiaofei Xu, and Shengchun Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641–1650, June 2003.

[11] Julian Heinrich and Daniel Weiskopf. State of the Art of Parallel Coordinates. In M. Sbert and L. Szirmay-Kalos, editors, Eurographics 2013 - State of the Art Reports. The Eurographics Association, 2013.

[12] Alfred Inselberg. Multidimensional detective. In Proceedings of VIZ '97: Visualization Conference, Information Visualization Symposium and Parallel Rendering Symposium, pages 100–107, October 1997.

[13] Muzammil Khan and Sarwar Shah. Data and information visualization methods, and interactive mechanisms: A survey. International Journal of Computer Applications, 34:1–14, December 2011.

[14] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, pages 13–24. Society for Industrial and Applied Mathematics, April 2011.

[15] Hans-Peter Kriegel, Matthias Schubert, and Arthur Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 444–452, New York, NY, USA, 2008. ACM.

[16] Aleksandar Lazarevic and Vipin Kumar. Feature bagging for outlier detection. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining - KDD '05. ACM Press, 2005.

[17] Fei T. Liu, Kai M. Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422, December 2008.

[18] Yezheng Liu, Zhe Li, Chong Zhou, Yuanchun Jiang, Jianshan Sun, Meng Wang, and Xiangnan He. Generative adversarial active learning for unsupervised outlier detection. CoRR, abs/1809.10816, 2018.

[19] Wes McKinney et al. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, volume 445, pages 51–56. Austin, TX, 2010.

[20] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. SIGMOD Record, 29(2):427–438, May 2000.

[21] Shebuti Rayana. ODDS library, 2016.

[22] Donald B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.

[23] Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and Liwu Chang. A novel anomaly detection scheme based on principal component classifier. January 2003.

[24] Herbert A. Sturges. The choice of a class interval. Journal of the American Statistical Association, 21(153):65–66, 1926.

[25] Ji Zhang. Automatic Data Cleaning. Master's thesis, Technische Universiteit Eindhoven, the Netherlands, November 2018.

[26] Ji Zhang. datacleanbot. https://github.com/Ji-Zhang/datacleanbot, 2018.

[27] Yue Zhao, Maciej K. Hryniewicki, Zain Nasrullah, and Zheng Li. LSCP: Locally selective combination in parallel outlier ensembles. CoRR, abs/1812.01528, 2018.

[28] Yue Zhao, Zain Nasrullah, and Zheng Li. PyOD: A Python toolbox for scalable outlier detection. Journal of Machine Learning Research, 20:1–7, 2019.

[29] Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5):363–387, August 2012.


Appendix A

Callback Graph


Figure A.1: Dash Callback graph

