
Eindhoven University of Technology

BACHELOR

A Data Cleaning Assistant

Quadt, Thomas J.

Award date:

2020

Link to publication

Disclaimer

This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.


Eindhoven University of Technology

Bachelor End Project A Data Cleaning Assistant

Thomas Quadt

Department of Mathematics and Computer Science

Supervised by

Dr. ir. Joaquin Vanschoren
Data Mining

30 June 2020


Declaration

I hereby certify that the material, which I now submit for assessment on the programmes of study leading to the award of Bachelor of Science, is entirely my own work and has not been taken from the work of others except to the extent that such work has been cited and acknowledged within the text of my own work.

No portion of the work contained in this thesis has been submitted in support of an application for another degree or qualification to this or any other institution.

———————————–

Thomas Quadt 30 June 2020


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Project
  1.3 Outline

2 Column Type Inference
  2.1 Method
  2.2 Evaluation
  2.3 Implementation

3 Outlier Detection Ensemble
  3.1 Method
  3.2 Evaluation
  3.3 Implementation

4 Event Logger

5 Conclusion
  5.1 Future Work


1 Introduction

1.1 Problem Statement

Raw data is usually riddled with a variety of imperfections. Missing values, anomalies, inconsistencies and many more problems are inherent to nearly any dataset. Some imperfections are easily noticed and solved. More often, however, they lie hidden in the data. Untreated, they can greatly undermine the performance of data mining techniques. The results and efficiency of data analyses thus rely not only on the data mining techniques applied, but also on how well the raw dataset has been preprocessed. It is therefore no surprise that data preprocessing is seen as a mandatory part of data analysis [1].

Data preprocessing is defined as the process of cleaning, transforming, reducing and imputing data, and it can take an astounding amount of time [2]: over 50% of the time and effort spent on a data analysis goes into the preprocessing phase [3]. Besides being extremely time- and labor-intensive, data preprocessing is also very tedious. According to a study reported by Forbes, 57% of data scientists find data preprocessing the most tedious part of their job [4]. The combination of being very tedious and very time-intensive can incline data scientists to perform only some elementary preprocessing steps. This means the full power of the various data preprocessing techniques is not utilized, resulting in sub-optimal data and correspondingly sub-optimal results.

Nowadays, with the rise of big data, datasets have become even more prone to all of the aforementioned imperfections. Along with an increase in imperfections, the amount of data has increased immensely: of all data in existence in 2018, 90% had been generated in the preceding two years [5]. Combined with a huge shortage of data scientists (in the United States of America alone there was a shortage of nearly 152,000 data scientists in 2018 [6]), the increasing amounts of data and imperfections call for the automation of various phases of the data analysis workflow, with data preprocessing being the most promising one.


1.2 Project

My final bachelor project is based on fulfilling the need for automated data preprocessing. Last year a group of two students worked on a very similar project, creating a tool named PyWash. This year four students, including myself, will improve this tool. PyWash is a data cleaning assistant package for Python. It is meant to be an easy-to-use, automated data cleaning tool that improves and speeds up the data preprocessing phase of data analysis.

My three colleagues and I will focus on different subtasks in improving this tool. All of these subtasks need to be integrated and combined seamlessly, which requires a great deal of interaction and cooperation between the four of us.

The subtasks I will be working on are the following: data type inference, outlier detection, and providing the user with a way to reproduce results. My fellow students will work on improving the user interface, missing value imputation, and automatic data readiness level detection. The research questions I would like to answer regarding these subtasks are the following:

• How can we detect the column types of a dataset with minimal user interaction?

• How can we detect outliers in a dataset with minimal user interaction?

• How can we provide the user a way of replicating the results of the automated data preprocessing?

1.3 Outline

This report is divided into multiple sections. In Section 2 the data type inference is discussed; in Section 3 outlier detection is described; in Section 4 the event logger is described; and finally, in Section 5 a summary is presented as well as a discussion of future work.


2 Column Type Inference

An important part of data preprocessing is column type inference. Which data mining technique is applicable depends, among other things, on the types of the independent variables present [7]. For example, a k-nearest-neighbour algorithm cannot handle categorical data, usually represented as strings and booleans. The type of the dependent variable matters a great deal as well. Generally speaking, there are two classes of supervised machine learning problems: regression problems, where you predict a continuous variable, and classification problems, where you predict a categorical variable [8]. The appropriateness of a data mining technique varies greatly between these two classes of problems. In an artificial neural network, a single output neuron would make sense for a regression problem, whereas a final softmax layer is used for classification problems. Besides the final layer of a neural network, the choice of loss function also depends on the problem class.

Wrongly interpreted data types can have detrimental effects on the results of a data analysis.

Inferring the data type of each column might seem simple at first. However, given the erroneous nature of raw data, specifically missing values and anomalies, the task becomes considerably harder. In the context of inferring data types, missing data is defined as any data value that is supposed to denote the absence of a value, whereas anomalies are defined as values whose data type differs from the column type [9]. Missing values are encoded in a variety of ways: they can be encoded with values like −1 or strings such as 'missing'. When not taken into account, missing values can negatively influence the inference of data types. For an integer column with missing values denoted as 'missing', the inferred type might be non-numerical. Besides posing problems for inferring data types, a missing value denoted as −1 in a numerical column could also greatly skew the results of a data analysis. Anomalies interfere with the type inference process in a similar way. Note that anomalies manifest differently depending on the data type of the column. Most current approaches to inferring data types fail to deal with missing values and anomalies in a satisfactory way [9]. The method implemented in the data cleaning tool, on the contrary, does take missing values and anomalies into account. In the following subsections I will introduce the proposed method, evaluate its performance and show the implementation.

2.1 Method

The implemented method, ptype, was created by Ceritli et al. [9]. It uses a probabilistic, robust type inference approach that infers data types accurately while detecting missing values and anomalies at the same time. The technique allows for the detection of the following five data types:

• Integers

• Floats

• Booleans

• Strings/Category

• Dates

Missing values and anomalies are annotated with their respective annotators. Note that the technique actually detects three subclasses of strings, namely 'string', 'gender' and 'all identical'. In the implementation of our tool these subclasses are all treated as categorical variables, so no further distinction will be made. Furthermore, various different date formats are detected; however, they all generalize to the same datetime object in pandas, as illustrated below.
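For illustration, here is a minimal pandas snippet (my own example, not part of ptype) showing how differently formatted date strings all collapse into the same datetime64 value:

```python
import pandas as pd

# Three different surface encodings of the same date.
raw = ["2020-06-30", "30 June 2020", "06/30/2020"]

# Each string parses to the same underlying Timestamp.
parsed = [pd.to_datetime(s) for s in raw]
print(parsed[0] == parsed[1] == parsed[2])  # True
print(pd.Series(parsed).dtype)              # datetime64[ns]
```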

The method ptype works much like the general principle of maximum likelihood estimation: a model of the data is created, the likelihood of each data type given the model and the data is calculated, and the most probable data type is inferred as the column type. In the following sections I will first give an in-depth explanation of how the model of the data is created; then I will discuss how the likelihood of a certain data type given the data can be calculated. Note that Ceritli et al. [9] also describe their method; to enhance understanding and replicability, I intend to give a more intuitive and in-depth explanation of their technique.


Model: Defining the generative process. ptype assumes a column of data is generated in three steps. First, the data type of the column is generated from the set of possible data types. Secondly, the data type of each row is generated; a row can have three different types: it can be an actual value (thus having the same type as the column type), a missing value, or an anomaly. Finally, each data value is generated according to its type. Each of these steps uses a different likelihood function to generate the respective components.

If there are K different data types in the possible set, the column type t ∈ {1, ..., K} is generated by a discrete Uniform distribution with K possible values. There is no prior knowledge about which column type is more likely to occur, and therefore every column type is assumed to be equally likely.

In mathematical terms this means that

$$t \sim \mathcal{U}(1, K).$$

The row type z_i of row i can take one of three different values: the column type t, a missing value denoted m, or an anomaly denoted a. z_i thus has a discrete distribution defined by three probabilities, one for each possible row type. Note that the probability of each row type differs per column type: for example, a date column might be more prone to anomalies, whereas a gender column might be more prone to missing values. Therefore, there is a probability for each combination of column type and row type. In mathematical terms this means that:

$$z_i = \begin{cases} t & \text{with probability } \pi_t^t \\ m & \text{with probability } \pi_t^m \\ a & \text{with probability } \pi_t^a \end{cases}$$

Note that the subscript denotes that these probabilities depend on the column type t, while the superscript denotes the row type whose probability is given.

Finally, the data value x_i of each row i has a distribution that depends on the row type z_i. In other words, the likelihood of a value x_i occurring differs per row type. For example, the likelihood of the value −1 occurring is very small if the row type is a date value (z_i = t = date); on the contrary, when the row type is a missing value (z_i = m), the likelihood that the data value −1 occurs is far higher. In mathematical terms, x_i has the following distribution:

$$x_i \sim p(x_i \mid z_i)$$

This conditional probability is estimated by a collection of Probabilistic Finite State Machines (PFSMs). A PFSM consists of a finite set of states S, an alphabet Σ, initial probabilities I, transition probabilities P(s_{j−1}, x_j, s_j) and final state probabilities F [10]. Each PFSM generates a set of strings. The probability that a string θ of k symbols is generated by the PFSM can be calculated as [10]:

$$\Pr(\theta) = I(s_0) \cdot \prod_{j=1}^{k} P(s_{j-1}, x_j, s_j) \cdot F(s_k)$$

Each string corresponds to a possible data value. For example, the PFSM used to represent the integer data type (see Figure 1) can generate strings in the domain {−∞, ..., 0, 1, ..., ∞}. The associated probability of generating each string s is p(s | t = integer), which is obtained using the formula above. The probability of generating the string 707 with the PFSM in Figure 1, with equal initial probabilities, is calculated by multiplying the initial state probability with the transition probabilities and the final state probability, which corresponds to 0.5 · 0.1 · p(0) · p(7) · F(s_2). Ceritli et al. [9] have created a PFSM for each data type, as well as for missing values and anomalies, so each row type has its own PFSM representing it. Therefore, using the appropriate PFSM, the conditional probability of x_i given any row type z_i can be estimated. The use of PFSMs for missing values and anomalies, along with estimating the column type by considering all rows, makes the system robust against both. Anomalies and missing values are detected when a row value has a very low probability of belonging to the column type; for example, a string has a low probability of occurring in a column with column type integer and is therefore likely to be an anomaly. Finally, note that the initial, transition and final state probabilities are first handcrafted and then improved using discriminative learning.

Figure 1: Illustration of the PFSM implemented by Ceritli et al. [9] representing the integer data type, where p = (1 − F(s_2))/10
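To make the scoring concrete, the following sketch implements the string-probability formula for a toy integer-like PFSM. The states, symbols and probabilities below are made up for illustration; they are not the actual machines of Ceritli et al.:

```python
import math

# Toy PFSM: initial probabilities I, final-state probabilities F, and
# transitions[state][symbol] = (next_state, probability). The numbers are
# illustrative only; from s2, ten digits at 0.09 each plus F(s2)=0.1 sum to 1.
I = {"s1": 1.0}
F = {"s2": 0.1}
transitions = {
    "s1": {str(d): ("s2", 0.1) for d in range(10)},   # first digit
    "s2": {str(d): ("s2", 0.09) for d in range(10)},  # subsequent digits
}

def pfsm_log_prob(string, I, F, transitions, start="s1"):
    """log Pr(string) = log I(s0) + sum_j log P(s_{j-1}, x_j, s_j) + log F(s_k)."""
    state, logp = start, math.log(I[start])
    for symbol in string:
        step = transitions[state].get(symbol)
        if step is None:                 # symbol cannot be emitted -> probability 0
            return float("-inf")
        state, p = step
        logp += math.log(p)
    return logp + math.log(F.get(state, 1e-12))  # 1e-12 stands in for F = 0

print(pfsm_log_prob("707", I, F, transitions))
```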

In conclusion, the generative process of this model is as follows:

$$t \sim \mathcal{U}(1, K)$$

$$z_i = \begin{cases} t & \text{with probability } \pi_t^t \\ m & \text{with probability } \pi_t^m \\ a & \text{with probability } \pi_t^a \end{cases}$$

$$x_i \sim p(x_i \mid z_i),$$

where p(x_i | z_i) is estimated using the appropriate PFSM.
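As a minimal illustration, the three-step generative process could be sampled as follows. The type list mirrors the one above, but the π probabilities and value generators are stand-ins of my own, not ptype's learned PFSMs:

```python
import numpy as np

rng = np.random.default_rng(0)

TYPES = ["integer", "float", "boolean", "string", "date"]
# Illustrative row-type probabilities (pi^t_t, pi^m_t, pi^a_t) per column type;
# ptype handcrafts/learns these, the numbers below are made up.
PI = {t: {"value": 0.9, "missing": 0.07, "anomaly": 0.03} for t in TYPES}

def sample_column(n_rows):
    # Step 1: column type t ~ Uniform over the K possible types.
    t = rng.choice(TYPES)
    rows = []
    for _ in range(n_rows):
        # Step 2: row type z_i ~ Discrete(pi_t).
        z = rng.choice(list(PI[t]), p=list(PI[t].values()))
        # Step 3: row value x_i ~ p(x_i | z_i); crude stand-ins for the PFSMs.
        if z == "value":
            rows.append(str(rng.integers(0, 100)) if t == "integer" else f"<{t}>")
        elif z == "missing":
            rows.append(rng.choice(["", "NA", "-1"]))
        else:
            rows.append("#anomaly#")
    return t, rows

print(sample_column(10))
```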

Model: Defining the likelihood of having column type t = k. Now that we have a model describing how a column is generated, we can estimate the likelihood of a specific column type t = k given the data. This probability is defined as p(t = k | x⃗). The probability of x_i given z_i is defined as p(x_i | z_i) and can be estimated using the initial probability, transition probabilities and final state probability of the appropriate PFSM [10]. For the specific column type t = k, we know that z_i can take three different values, k, m and a, with probabilities π_k^k, π_k^m and π_k^a respectively. Under the assumption that data values are conditionally independent given t, we thus know that

$$p(\vec{x} \mid t = k) = \prod_{i=1}^{N} \left( \pi_k^k\, p(x_i \mid z_i = k) + \pi_k^m\, p(x_i \mid z_i = m) + \pi_k^a\, p(x_i \mid z_i = a) \right)$$

Furthermore, note that

$$p(t = k \mid \vec{x}) \propto p(t = k)\, p(\vec{x} \mid t = k)$$

All that is left is to estimate p(t = k). By our assumption the column type is Uniformly distributed, so p(t = k) is simply 1/K. In conclusion, the likelihood of having column type t = k is calculated using the following formula:

$$p(t = k \mid \vec{x}) \propto p(t = k) \prod_{i=1}^{N} \left( \pi_k^k\, p(x_i \mid z_i = k) + \pi_k^m\, p(x_i \mid z_i = m) + \pi_k^a\, p(x_i \mid z_i = a) \right) \tag{1}$$

Finally, to infer the data type t of a column, we calculate p(t = k | x⃗) for each data type k and take the most probable one (the maximum of the calculated probabilities).
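Assuming the per-row PFSM log-likelihoods have already been computed, the inference step of Equation (1) reduces to a few lines. The dictionary layout below is my own hypothetical organization of those quantities, not ptype's internal structure:

```python
import numpy as np

def infer_column_type(log_like, log_pi):
    """Pick the most probable column type from per-row, per-type PFSM scores.

    log_like[z]  : length-N array of log p(x_i | z_i = z), for z in the data
                   types plus "missing" and "anomaly" (precomputed by PFSMs).
    log_pi[k][z] : log pi^z_k for column type k and row type z.
    """
    posteriors = {}
    for k in log_pi:  # the uniform prior p(t=k) = 1/K cancels in the argmax
        # Per-row log of (pi^k_k p(x|k) + pi^m_k p(x|m) + pi^a_k p(x|a)),
        # computed in log space to avoid underflow over many rows.
        per_row = np.logaddexp.reduce(
            [log_pi[k]["value"]   + log_like[k],
             log_pi[k]["missing"] + log_like["missing"],
             log_pi[k]["anomaly"] + log_like["anomaly"]], axis=0)
        posteriors[k] = per_row.sum()  # conditional independence over rows
    return max(posteriors, key=posteriors.get)
```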

Predicting categories from integer columns. ptype predicts the data types of columns; however, a seemingly numerical column could actually be categorical. For example, a column denoting the number of times a patient has visited the hospital might be predicted as numerical, yet is actually categorical. Several algorithms require different procedures for categorical columns; in our case, the missing value handling treats categorical columns differently from numerical columns. A missing value in a count column, as described in the example, should not get the value 1.5 or any other decimal value. Therefore, it is important to know which columns are categorical. After all column types have been predicted, a small and simple heuristic, derived from last year's implementation [11], is applied to filter out some categorical columns. The heuristic is applied to every column with data type integer: every such column with fewer than 10 unique values is marked as categorical. This might not be as accurate as sophisticated statistical procedures; however, the heuristic is meant to mark the most obvious columns as categorical. Users can, if desired, use the dropdown menus to change columns to categorical themselves.
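A minimal sketch of this heuristic, assuming the predicted types are kept in a simple column-to-type dictionary (the actual PyWash data structures may differ):

```python
import pandas as pd

def mark_likely_categoricals(df, predicted_types, max_unique=10):
    """Re-label integer columns with few unique values as categorical.

    predicted_types: dict mapping column name -> type string from the
    type-inference step. The 10-unique-values cutoff mirrors the heuristic
    described above.
    """
    for col, col_type in predicted_types.items():
        if col_type == "integer" and df[col].nunique(dropna=True) < max_unique:
            predicted_types[col] = "category"
    return predicted_types
```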

2.2 Evaluation

The method just described has been implemented by Ceritli et al. [9] in a Python project available at https://github.com/tahaceritli/ptype-dmkd. They tested the method extensively on 43 test datasets with a total of 610 columns, achieving an accuracy of 92% with hand-crafted weights and probabilities. After training their model, the accuracy increased by one percentage point to 93%. Using the same 43 test sets, I have tested their implementation to ensure the code on their GitHub page works well enough to integrate into PyWash.

When cleaning the 43 datasets, I ended up with 613 columns instead of 610, and the distribution of column types in the test set was slightly different. This discrepancy, however, was negligible. The overall accuracy of the ptype model I tested was 92%. A normalized confusion matrix can be found in Figure 2; it shows that the method works very well for all data types. As can be expected, due to the large variety in encoding styles, predicting columns of type boolean is the least accurate.


Figure 2: Normalized confusion matrix of ptype

When trying to train the model on my laptop using the provided training datasets, I ran into memory issues. After careful consideration, I decided that a one percentage point increase was negligible enough to use the ptype model with handcrafted values instead.

Besides predicting data types, ptype also detects missing values and anomalies. Ceritli et al. [9] tested this using the 43 datasets, which on average contained 11% anomalies and missing values. Around 6.7% of the data was annotated as a missing value or anomaly with a certainty of over 50% by ptype. Of this 6.7% of the data, 5.3% was actually a missing value or anomaly and had thus been annotated correctly. This means that around half of all missing values and anomalies were detected by this method. I have decided not to test these numbers myself for two reasons. First, detecting missing values and anomalies is not the main purpose of my implementation in PyWash2. Secondly, my test results for data type inference were extremely similar to those of Ceritli et al. [9], so there is no reason to expect these numbers to differ either.

2.3 Implementation

This method has been implemented in PyWash and requires little to no user interaction. As soon as a user uploads a dataset to PyWash, the column types, anomalies and missing values are predicted. The ideal scenario would be to instantly set each column of the dataset to its predicted type; however, since anomalies can be present, this is not always possible. In order to change a column type, we first need to remove all anomalies present in the dataset. Removing data points is something that should not be done lightly. Therefore, we have implemented a mechanism to let the user review whether predicted anomalies are indeed anomalies. Figure 3 shows all of the possibilities PyWash offers a user to interact with and review the predicted anomalies. The figure shows a double dropdown menu, where the user can select a column in the first menu and see all of its anomalies in the second menu. The second menu can be used to select individual anomalies corresponding to the column selected in the first menu. A user can either manually select anomalies, or all anomalies can be selected at once by pressing the 'select all' button. Anomalies can also be deselected by pressing the 'x'. After selecting the anomalies, there are two options for dealing with them: the user can either unmark them or handle them by replacing the anomalies with missing value annotators. These options correspond to the two buttons beneath the double dropdown menu.


Figure 3: User interface for dealing with anomalies

Besides changing predicted anomalies, a user can also change the predicted column types. Whenever a user believes a predicted column type is wrong, another double dropdown menu, as depicted in Figure 4, can be used. The first menu allows the user to select a column, whereas the second menu allows the user to change the predicted type to a new type. Note that the actual type of the column will not be changed immediately, giving the user the option to revert changes. Furthermore, all anomalies should be removed before the actual data types are changed. After the user presses the start cleaning button, all remaining anomalies are replaced with missing value annotators, and only then are the column types changed accordingly.

Figure 4: User interface for changing the predicted column type


Finally, the predicted missing values are not shown at this stage. However, if time allows, my fellow students will be able to use these predictions in the missing value handling stage. The concept is that predicted missing value annotators will be set as missing value annotators; the user can either remove or accept these and use them to handle the missing values. Furthermore, the predicted certainty of every column type is also stored. This could be communicated to the user to inform the choice of changing column types. It could also be used to determine the data score in the data readiness levels. At the moment I am not sure whether my fellow students will use this, hence I will refrain from discussing it any further.

In conclusion, our implementation of ptype is largely automated, yet at the same time allows the user, if desired, to correct faulty predictions with great ease.


3 Outlier Detection Ensemble

Although the exact definition of an outlier depends on the detection algorithm used and on hidden assumptions in the data, and is often very case specific, there exist some general definitions [12]. A widely known and accepted definition is: an outlier is any data point that appears to deviate markedly from other members of the sample in which it occurs [13]. The detection of these outliers plays a prominent part in any data mining application. The effect of outliers can be detrimental to the results of data mining applications; for example, an outlier can greatly skew the coefficients of a linear regression. In other cases, such as fraud detection, outliers are points of great significance and the goal of the data analysis is to detect them.

The field of outlier detection is well studied, and numerous algorithms have been designed and proposed. Last year a small subset of those algorithms was implemented in PyWash [11]. The implemented algorithms can be separated into four groups: linear models, proximity-based methods, a probabilistic method and two ensemble methods. The implementation focused on finding the most appropriate algorithm for a specific dataset and recommending that algorithm or combination of algorithms to the user. For large datasets such an approach works well. However, when the dataset is on the smaller side, an advanced ensemble approach could combine the results of all of the implemented algorithms and greatly improve the performance of the outlier detection.

In the following subsections I will introduce the proposed method, evaluate the performance and show the implementation.

3.1 Method

An ensemble consists of three components: the set of algorithms used, the organization (modular or ensemble), and the method of combining the results [14]. The set of algorithms consists of those implemented last year by Castelijns (2019) [11]:

• One Class Support Vector Machine (OCSVM) [15]

• Principal Components Analysis (PCA) [16]


• Minimum Covariance Determinant (MCD) [17]

• Local Outlier Factor (LOF) [18]

• Histogram-based Outlier Score (HBOS) [19]

• k-Nearest Neighbours (kNN) [20]

• Angle-Based Outlier Detection (ABOD) [21]

• IForest [22]

The ensemble method Feature Bagging [23] is not included in the set of algorithms used by the proposed ensemble, for the sake of reducing time complexity; furthermore, our algorithm already samples features at random. The method Clustering-based Local Outlier Factor (CBLOF) [24] has not been included since it is extremely prone to run into errors. For automated tasks this can halt the automation, which would degrade the user experience.

The organization will be an ensemble: multiple algorithms are run on the dataset and their results are combined afterwards.

The proposed method to combine scores consists of a few steps. First, the scores of each algorithm are determined on a subset of the data. Then, to increase the score difference between outliers and inliers, the algorithms vote on scores likely to belong to an outlier. Afterwards, with the intention of increasing the importance of capable algorithms, weights are determined based on the individual capacity of each algorithm on the specific dataset. Finally, all of these components are used to combine the scores of each algorithm; a threshold is computed, and observations exceeding the threshold are marked as outliers. A more in-depth explanation of each step is provided below, followed by a code sketch of the full pipeline.

• Computing the scores F: First, for each algorithm a random subset of the d features, of size between d/2 and d − 1, is chosen. Next, the algorithm is applied to the subset. The outlier scores of all algorithms are standardized and stored together in a matrix F, where each column represents an algorithm and each row an observation. An entire column of F thus corresponds to all outlier scores produced by a single outlier detector.


• Determining the voting matrix V: The algorithms all vote on which scores belong to outliers. This voting is done by calculating the interquartile range (IQR) of the scores produced by an algorithm and assigning a vote to each score greater than 1.5 · IQR. Voting is repeated for each algorithm and thus for each column of F. The paper describing this method [25] left me a bit confused as to whether a score column in F should receive votes from all algorithms or just from the algorithm that produced those scores. I implemented both approaches, and voting on all scores performed better in nearly every case. Furthermore, the author of the paper [25] confirmed my experiments and notified me that indeed all algorithms should vote on each score. The process of determining votes results in a matrix with a column for each algorithm and a row for each observation; note that this matrix has the same size as F. Essentially, each generated score has a corresponding number of votes. This voting matrix will be used to increase the difference between outliers and inliers.

• Computing the weight vector w: Besides a voting matrix, a weight vector w is created with the intention of measuring the individual capacity of each algorithm on the specific dataset. Pasillas-Díaz and Ratté [25] propose two different ways of creating the weights. I have chosen the ensemble of detectors with correlated votes (EDCV) approach, because the performance of the two approaches was similar and this one is elaborated the most in their paper. EDCV calculates a correlation matrix by computing the correlations between the standardized scores of the algorithms: the correlation of each score column in F with every other score column is calculated, excluding the correlation of a column with itself. The resulting weight vector w is obtained by taking the average of each row of the correlation matrix. These weights therefore correspond to the average correlation of each algorithm with all the other algorithms. Note that an algorithm that produces scores close to those of the other algorithms will have a high average correlation and therefore a high weight.

• Combining the scores of each algorithm: Using the standardized scores F, the vote matrix V and the weight vector w, we can combine the scores of all algorithms into a single score. The exact formula for calculating the final score of one observation i is:

$$F_{\text{final}}(i) = \frac{\sum_{j=1}^{T} F(i, j) \cdot V(i, j) \cdot w(j)}{T}$$

Note that T is the number of algorithms used and that this final score is calculated for each observation. The use of the votes differentiates the scores of outliers and inliers, since outliers receive more votes and these votes increase the final outlier score.

• Computing a threshold: Nearly all outlier detection techniques use one of the following three thresholding methods: standard deviation, median absolute deviation, or interquartile range (IQR) [26]. All of these methods calculate a statistic from the anomaly scores and use it to compute the threshold. For example, the IQR method marks all scores higher than Q3 + c · IQR, where c usually is 1.5, as outliers. All of these methods calculate statistics from the scores including the outliers, so the threshold itself is affected by the outlier scores. Yang et al. (2019) [26] propose a solution to this problem: a two-stage thresholding technique that first filters out all of the obvious outliers by calculating a threshold using any of the aforementioned statistics. Next, the scores lower than this first threshold are used to compute a second threshold with the same statistic. The resulting threshold is used as the final threshold, and any score higher than it is marked as an outlier. Yang et al. [26] noted that for all three statistics, their two-stage thresholding technique significantly improved the performance of the outlier detection algorithms. The statistic I have chosen for thresholding the outlier scores is the IQR, which performed best in their experiments. Furthermore, it is more robust to high outlier scores than the standard deviation, since it does not use all values to compute a threshold.
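The sketch below strings these steps together, including the two-stage IQR threshold. It assumes detectors with a PyOD-style fit/decision_scores_ interface and implements the voting rule in the "all algorithms vote on every score column" reading discussed above; the variable names and exact details are mine, not the paper's:

```python
import numpy as np

def ensemble_outliers(X, detectors, rng=None):
    """Sketch of the score-combination ensemble (after Pasillas-Diaz & Ratte [25])."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    T = len(detectors)

    # Step 1: scores F on random feature subsets of size d/2 .. d-1, standardized.
    F = np.empty((n, T))
    for j, det in enumerate(detectors):
        cols = rng.choice(d, size=rng.integers(max(1, d // 2), d), replace=False)
        det.fit(X[:, cols])
        s = det.decision_scores_
        F[:, j] = (s - s.mean()) / (s.std() + 1e-12)

    # Step 2: votes V; every algorithm's 1.5*IQR cutoff is applied to every column.
    q1, q3 = np.percentile(F, [25, 75], axis=0)
    cutoffs = 1.5 * (q3 - q1)                              # one cutoff per algorithm
    V = (F[:, :, None] > cutoffs[None, None, :]).sum(axis=2)

    # Step 3: weights w = average correlation of each score column with the others.
    C = np.corrcoef(F, rowvar=False)
    w = (C.sum(axis=1) - 1.0) / (T - 1)                    # exclude self-correlation

    # Step 4: combine into a single final score per observation.
    final = (F * V * w).sum(axis=1) / T

    # Step 5: two-stage IQR threshold (Yang et al. [26]).
    def iqr_cut(s):
        a, b = np.percentile(s, [25, 75])
        return b + 1.5 * (b - a)
    second = iqr_cut(final[final < iqr_cut(final)])
    return final > second, final
```

With PyOD, for instance, this could be called as ensemble_outliers(X, [KNN(), LOF(), IForest()]); the detector set actually used in PyWash is the one listed above.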


3.2 Evaluation

To test the performance of the ensemble method I have used datasets acquired from Outlier Detection DataSets (ODDS). ODDS provides a large collection of datasets with ground truth outlier labels acquired from a wide variety of domains.

The datasets used can be found in Table 1.

Dataset # points # dimensions # outliers (%)

Annthyroid 7200 6 534 (7.42%)

Arrhythmia 452 274 66 (15%)

BreastW 683 9 239 (35%)

Cardio 1831 21 176 (9.6%)

Glass 214 9 9 (4.2%)

Ionosphere 351 33 126 (36%)

Letter Recognition 1600 32 100 (6.25%)

Lympho 148 18 6 (4.1%)

Mammography 11183 6 260 (2.32%)

Mnist 7603 100 700 (9.2%)

Musk 3062 166 97 (3.2%)

Optdigits 5216 64 150 (3%)

Pendigits 6870 16 156 (2.27%)

Pima 768 8 268 (35%)

Satellite 6435 36 2036 (32%)

Satimage-2 5803 36 71 (1.2%)

Shuttle 49097 9 3511 (7%)

Speech 3686 400 61 (1.65%)

Thyroid 3772 6 93 (2.5%)

Vertebral 240 6 30 (12.5%)

Vowels 1456 12 50 (3.4%)

WBC 278 30 21 (5.6%)

Wine 129 13 10 (7.7%)

Table 1: List of all datasets used in measuring the performance of outlier detection algorithms.

There are many different ways to measure the performance of outlier detection algorithms. One of the most common is the area under the curve (AUC). In the context of outliers, the AUC is the probability that an outlier detection algorithm assigns a higher outlier score to a randomly chosen outlier than to a randomly chosen inlier [27]. This score thus ranges from 0 to 1, where higher scores correspond to better performance. Note that an AUC of 1 by no means indicates that the outlier detection method will be able to perfectly predict all outliers. To evaluate the performance of the ensemble algorithm and to decide on which implementation to use, I created two binary options. The ensemble algorithm could use either the algorithms recommended in the old implementation of PyWash, namely PCA, kNN and IForest, or every algorithm implemented in the old version of PyWash. Furthermore, feature sampling, as explained earlier, could be applied to reduce running time; alternatively, the entire feature set could be used. These four implementation options for the ensemble method are also compared to the IForest algorithm, since this was the best performing algorithm last year. Furthermore, the LOF algorithm has been tested as well, to compare the results with a simpler outlier detection method. The experimental setup consists of running each algorithm ten times on each of the datasets listed in Table 1; running each algorithm multiple times accounts for the randomness of the algorithms. The results of the experiment can be found in Table 2.
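For instance, with scikit-learn the AUC of a set of outlier scores against ground-truth labels is a one-liner (the data below is a toy example, not one of the ODDS datasets):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Ground-truth labels (1 = outlier) and hypothetical final ensemble scores.
y_true = np.array([0, 0, 0, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.2, 0.9, 0.4, 0.7])

# AUC: probability a random outlier is scored above a random inlier.
print(roc_auc_score(y_true, scores))  # 1.0 here, since outliers rank highest
```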

The results show that using all algorithms greatly increases the running time, but also greatly increases the performance of the ensemble method: the running time more than quadruples, yet the AUC increases from 0.767 to 0.835 for the ensemble without sampling. Generally speaking, the performance differences between outlier detection methods are quite small, so I believe this increase in performance more than justifies the increased running time. Furthermore, taking a subset of the features decreases the running time slightly while decreasing the AUC by only 0.001. Given the decrease in time at essentially constant performance, I believe that feature sampling should be used.

Comparing the proposed ensemble method to the LOF algorithm, you can see that LOF performs drastically worse than the chosen ensemble method (an AUC of 0.674 compared to 0.835); the drastic increase in performance definitely justifies the increase in running time. The IForest algorithm, on the contrary, performs reasonably well with a low running time. For this implementation, however, I have chosen to recommend the ensemble method over the IForest method: the increase in running time still comes with a significant increase in performance. Furthermore, PyWash aims to automate aspects of data cleaning and preparation as much as possible. The outliers will be presented to the user, who can choose how to deal with them. Better performance means the user has fewer tedious and arbitrary tasks to find outliers themselves, which is worth a slightly better yet slower outlier detection algorithm, in this case the proposed ensemble method. In conclusion, the ensemble method with all algorithms and feature sampling suits the task best compared to the other outlier detection methods.

Ensemble          Sampling       avg AUC   avg Time (s)
Top Algorithms    No Sampling    0.767     6.36
Top Algorithms    Sampling       0.763     5.39
All Algorithms    No Sampling    0.835     27.7
All Algorithms    Sampling       0.834     25.3
LOF               N/A            0.674     7.56
IForest           N/A            0.818     5.94

Table 2: Average AUC and average running time of a variety of ensemble implementations on all datasets

3.3 Implementation

The outlier detection ensemble method has been integrated into PyWash seamlessly and requires no user interaction whatsoever. If the user chooses the advanced ensemble method, all algorithms will be used, as well as feature sampling. However, since this method can be quite time-consuming for very large datasets, I have chosen to give the user the option to use the outlier detection methods as implemented by Castelijns [11] last year. The advanced ensemble method is therefore the default, but if desired, the user can select the quicker, sloppier method instead. Besides the algorithm, the user also has a choice in how to deal with outliers: the outliers can be marked in an additional column, removed from the data immediately, or no outlier detection can be applied at all. The user is presented with a dropdown menu with these options and can choose according to their preferences. This user interface can be seen in Figure 5.

Figure 5: User interface for choosing the outlier detection method


In Figure 6 you can see how a user can review which rows are outliers: the last column, 'prediction', is 1 when the row is predicted to be an outlier and 0 otherwise.

Furthermore, visualizations can be made to review the outliers.

Figure 6: User Interface showing how a user can inspect outliers


4 Event Logger

Replicability of results is a vital part of data analysis: the user should be able to retrace their steps and get the same results. Obviously, since there is a level of randomness to some of the algorithms employed in PyWash, getting the exact same result will be difficult. However, to give the user the possibility to retrace their steps, I have built an event logger. The event logger traces two major components of PyWash: first, it records the versions of all necessary packages, and secondly, all of the major subroutines are tracked. Both components are written to a text file in real time. The versions of all packages are required since different versions of packages can produce different outputs; tracking subroutines is necessary to allow the user to retrace their steps.

Saving the versions of all main packages can be done in multiple ways. The version of each package could be stored when the package is imported; however, this hurts the workflow of PyWash, as some packages are only imported very far down the line, and in between package imports a lot of subroutines might have been executed. To maximize the readability of the event log, it is of the utmost importance that the component showing the package versions is separated from the component showing the subroutines. Therefore, I have chosen to record the version of every required main package at the beginning of a PyWash user cycle: whenever a user starts PyWash, all package versions are stored and printed to the event log. Checking the versions of all packages is done without importing the packages, to minimize the running time. The event log showing the package versions can be seen in Figure 7.
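A minimal sketch of such version logging, using importlib.metadata so that installed package metadata is read without importing the packages themselves; the package list and file name are hypothetical, not PyWash's actual configuration:

```python
from importlib.metadata import version, PackageNotFoundError

# Hypothetical list of main dependencies whose versions should be logged.
PACKAGES = ["pandas", "numpy", "scikit-learn", "dash"]

def log_versions(log_file="event_log.txt"):
    """Write package versions to the event log without importing the packages."""
    with open(log_file, "a") as log:
        log.write("=== Package versions ===\n")
        for name in PACKAGES:
            try:
                log.write(f"{name}=={version(name)}\n")
            except PackageNotFoundError:
                log.write(f"{name}: not installed\n")
```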


Figure 7: Example output of the event logger

Tracking all subroutines is done in real time: whenever a subroutine finishes, the corresponding entry is written to the event log. There were two options for writing the subroutines to the event log. First, all of the subroutines could have been written as function calls with their corresponding arguments. Secondly, textual descriptions of the subroutines could be used to give an intuitive, yet less precise, description of each subroutine. I have chosen the second option, since a user will have no knowledge of PyWash's function calls, and some arguments are not human-readable. Storing subroutines as textual descriptions, including only the necessary arguments of the function calls, allowed me to give an intuitive yet precise description of the process a user has gone through while using PyWash. An example of the logging of subroutines can be found in Figure 7.
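A sketch of this second option: each finished subroutine appends a timestamped, human-readable description. The example entries are invented and do not reproduce PyWash's actual log format:

```python
from datetime import datetime

def log_event(description, log_file="event_log.txt"):
    """Append a timestamped, human-readable entry for a finished subroutine."""
    with open(log_file, "a") as log:
        log.write(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] {description}\n")

# Textual descriptions carry only the user-relevant arguments.
log_event("Changed column type of 'age' from string to integer")
log_event("Removed 3 anomalies from column 'income'")
```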


5 Conclusion

In this bachelor project, a group of four students, including myself, implemented an automated data cleaning tool, PyWash. My contribution to this tool is threefold: predicting the data type of a column, detecting outliers, and logging events. In dealing with these problems I aimed to answer the following research questions:

• How can we detect the column types of a dataset with minimal user interaction?

• How can we detect outliers in a dataset with minimal user interaction?

• How can we provide the user a way of replicating the results of the automated data preprocessing?

For the first research question I implemented a method called ptype [9], which can accurately and automatically predict column types along with missing value annotators and anomalies. The second research question is answered by a method that combines multiple outlier detection algorithms, weighing them based on their expected performance and increasing the distance between outliers and inliers [25]. This method also requires no user interaction whatsoever.

Finally, the third research question was answered by building an event log that tracks subroutines in real time and also records the versions of all main packages. For each of these solutions I have presented a method, evaluated it where necessary, and provided examples of the implementation in PyWash. Compared to last year, I have improved PyWash in a variety of ways: data type prediction has been implemented in a sophisticated way, whereas last year a simple and rudimentary heuristic was used; outlier detection has been improved significantly; and a way to retrace a user's steps has been added. PyWash is available on GitHub [28].

5.1 Future Work

There is still a long way to go before PyWash is finished and can accurately preprocess any dataframe in an automated fashion. Although most basic preprocessing techniques have been implemented, there are a variety of additions to be made. For example, most of our algorithms are unable to deal with time data. Furthermore, the user interface could be made more intuitive. Besides these general remarks, I will present some possibilities for future work for each of my three subtasks.

Data type prediction. At the moment the data type prediction can accurately predict a variety of basic data types. On top of these, a method to predict statistical data types could be implemented; I would refer to the article by Valera & Ghahramani (2017) [7] for this. Their method uses a Bayesian approach to predict column types such as count, interval, or real-valued numerical. Combining this method with the basic data type prediction could improve the results of subsequent algorithms, such as missing value handling.

Outlier detection. There are two directions in which the outlier detection could be improved. First, the current implementation could be revised and improved; for example, it is quite slow and could be sped up by using vectorized algorithms. Secondly, after an email conversation with the authors of the advanced ensemble method [25], a different, supposedly even better, outlier detection method designed by them was recommended to me. There was not enough time to switch to this method [29], but in the future it could be implemented and compared to the current method.

Event logger. The event logger could be improved by adding additional information. For example, in between subroutines, datasets could be stored in CSV files to allow the user to go back a step quickly. I have chosen not to implement this because of the extensive storage space required; however, allowing the user to opt into storing intermediate datasets could improve the user experience.


References

[1] S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak and F. Herrera, 'A survey on data preprocessing for data stream mining: Current status and future directions', Neurocomputing, vol. 239, pp. 39–57, 2017. doi: 10.1016/j.neucom.2017.01.078.

[2] S. García, J. Luengo and F. Herrera, 'Data Preprocessing in Data Mining', Intelligent Systems Reference Library, vol. 72, 2015. doi: 10.1007/978-3-642-04898-2_51.

[3] S. Zhang, C. Zhang and Q. Yang, 'Data preparation for data mining', vol. 17, no. 5–6, 2003. doi: 10.1080/713827180.

[4] Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. [Online]. Available: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#22d8cd606f63.

[5] How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read. [Online]. Available: https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#2d5f8aff60ba.

[6] August Workforce Report 2018. [Online]. Available: https://economicgraph.linkedin.com/resources/linkedin-workforce-report-august-2018.

[7] I. Valera and Z. Ghahramani, 'Automatic discovery of the statistical types of variables in a dataset', in 34th International Conference on Machine Learning, ICML 2017, vol. 7, pp. 5380–5388, 2017.

[8] Types of Machine Learning Algorithms You Should Know. [Online]. Available: https://towardsdatascience.com/types-of-machine-learning-algorithms-you-should-know-953a08248861.

[9] T. Ceritli, C. K. I. Williams and J. Geddes, 'ptype: Probabilistic Type Inference', 2019. [Online]. Available: http://arxiv.org/abs/1911.10081.

[10] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta and R. C. Carrasco, 'Probabilistic Finite-State Machines — Part I', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 7, pp. 1013–1025, 2005.

[11] L. Castelijns, 'PyWash: a Data Cleaning Assistant for Machine Learning', July 2019.

[12] I. Ben-Gal, 'Outlier Detection', in Data Mining and Knowledge Discovery Handbook, Springer-Verlag, May 2006, pp. 131–146. doi: 10.1007/0-387-25465-x_7.

[13] V. Barnett and T. Lewis, 'Outliers in Statistical Data, 2nd ed.', Biometrical Journal, vol. 30, no. 7, pp. 866–867, Jan. 1988. doi: 10.1002/bimj.4710300725.

[14] A. M. Canuto, M. C. Abreu, L. de Melo Oliveira, J. C. Xavier and A. d. M. Santos, 'Investigating the influence of the choice of the ensemble members in accuracy and diversity of selection-based and fusion-based methods for ensembles', Pattern Recognition Letters, vol. 28, no. 4, pp. 472–486, 2007. doi: 10.1016/j.patrec.2006.09.001.

[15] M. Amer, M. Goldstein and S. Abdennadher, 'Enhancing one-class Support Vector Machines for unsupervised anomaly detection', in Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description (ODD 2013), 2013. doi: 10.1145/2500853.2500857.

[16] M. L. Shyu, S. C. Chen, K. Sarinnapakorn and L. Chang, 'A Novel Anomaly Detection Scheme Based on Principal Component Classifier', in 3rd IEEE International Conference on Data Mining, 2003.

[17] J. Hardin and D. M. Rocke, 'Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator', Computational Statistics and Data Analysis, 2004. doi: 10.1016/S0167-9473(02)00280-3.

[18] M. M. Breunig, H. P. Kriegel, R. T. Ng and J. Sander, 'LOF: Identifying density-based local outliers', SIGMOD Record, 2000. doi: 10.1145/335191.335388.

[19] M. Goldstein and A. Dengel, 'Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm', 2012.

[20] S. Ramaswamy, R. Rastogi and K. Shim, 'Efficient algorithms for mining outliers from large data sets', SIGMOD Record, 2000. doi: 10.1145/335191.335437.

[21] H. P. Kriegel, M. Schubert and A. Zimek, 'Angle-based outlier detection in high-dimensional data', in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008. doi: 10.1145/1401890.1401946.

[22] F. T. Liu, K. M. Ting and Z. H. Zhou, 'Isolation forest', in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2008. doi: 10.1109/ICDM.2008.17.

[23] A. Lazarevic and V. Kumar, 'Feature bagging for outlier detection', in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005. doi: 10.1145/1081870.1081891.

[24] Z. He, X. Xu and S. Deng, 'Discovering cluster-based local outliers', Pattern Recognition Letters, 2003. doi: 10.1016/S0167-8655(03)00003-5.

[25] J. R. Pasillas-Díaz and S. Ratté, 'An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures', Electronic Notes in Theoretical Computer Science, vol. 329, pp. 61–77, 2016. doi: 10.1016/j.entcs.2016.12.005.

[26] J. Yang, 'Outlier Detection: How to Threshold Outlier Scores?', Dec. 2019. doi: 10.1145/3371425.3371427.

[27] T. Fawcett, 'An introduction to ROC analysis', Pattern Recognition Letters, 2006. doi: 10.1016/j.patrec.2005.10.010.

[28] Pywash2/Pywash2. [Online]. Available: https://github.com/Pywash2/Pywash2.

[29] J. R. Pasillas-Díaz and S. Ratté, 'Bagged Subspaces for Unsupervised Outlier Detection', Computational Intelligence, vol. 33, no. 3, pp. 507–523, Aug. 2017. doi: 10.1111/coin.12097.
