
An outlier is a data point that differs significantly from other observations. Outliers can strongly influence the reliability of models, because not every machine learning model is robust to them. In this section, I focus on unsupervised outlier detection in multivariate data. I took the implementation of automatic outlier detection from datacleanbot [26] as a starting point.

In this implementation, landmarking meta-features are used to recommend an approach. Pairs of F1 scores from outlier detection methods and meta-features are used to set up a regression learner. The regression learner then predicts the accuracy of each algorithm based on the meta-features of a new dataset. After using the package for the first time, I mostly encountered user experience problems: for example, there was no clear visual feedback on which records are marked as outliers in the table that is displayed. I also did not get the option to keep a subset of the detected outliers; it was either drop them all or keep them all. The visualizations that are provided are also hard to work with, especially with high-dimensional data, since there is no filter or zoom functionality available. Furthermore, only three outlier detection algorithms are considered: iForest, LOF, and OCSVM. This is acknowledged by Zhang in the future work section, where it is also stated that a further project should search for more meaningful meta-features for outlier detection. The training set on which the recommender is trained is also too small. Zhang also mentioned that users could be given the choice to run multiple algorithms, with the results then presented in an appropriate manner [25]. I started searching for more outlier detectors and came across the relatively new package PyOD, which offers a wide variety of models to detect outliers [28]. After trying all of them out, I selected the ten algorithms that were the least prone to raising errors. The algorithms can be separated into four groups; the following sections briefly describe the groups and the algorithms that belong to them.

4.2.1 Linear Models

Features in data are generally correlated. This correlation provides the capacity to predict feature values from one another. The concepts of prediction and outlier detection are closely linked, because outliers are values that, based on a specific model, deviate from anticipated (or predicted) values. Linear models concentrate on using correlations (or the lack thereof) to spot possible outliers [1].

One-Class Support Vector Machines (OCSVM)

OCSVM learns a decision boundary (hyperplane) that separates the majority of the data from the origin, such that the distance from the origin to this hyperplane is maximal. Only a user-defined fraction of data points may lie on the wrong side of the hyperplane; these data points are regarded as outliers. Compared to inliers, outliers should contribute less to the decision boundary [2].
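All detectors in PyOD share the same fit/predict interface, so a minimal sketch of fitting OCSVM looks roughly as follows (the synthetic data and the contamination value are placeholders, not the actual PyWash code):

```python
import numpy as np
from pyod.models.ocsvm import OCSVM

# Synthetic two-dimensional data with a few far-away points (illustrative only)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(8, 1, size=(5, 2))])

# 'contamination' is the user-defined fraction of points to be flagged as outliers
detector = OCSVM(contamination=0.05)
detector.fit(X)

labels = detector.labels_            # 0 = inlier, 1 = outlier (on the training data)
scores = detector.decision_scores_   # raw anomaly scores (higher = more abnormal)
print(int(labels.sum()), "points flagged as outliers")
```

The other detectors discussed below expose the same `labels_` and `decision_scores_` attributes after fitting.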

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that uses singular value decomposition to project the data into a lower-dimensional space. The covariance matrix of the data is decomposed into eigenvectors. These eigenvectors capture most of the variance of the data; points that deviate strongly from the subspace they span can therefore be flagged as outliers [23, 1].

Minimum Covariance Determinant (MCD)

MCD detects outliers by making use of the Mahalanobis distance. The Mahalanobis distance is the distance between a point P and a distribution D. It takes the idea of computing how many standard deviations P lies away from the mean of D and generalizes it to multivariate data [9].
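A minimal numpy sketch of the Mahalanobis distance itself (the robust covariance estimation that MCD performs is left out here; the function name and data are my own):

```python
import numpy as np

def mahalanobis_distance(point, data):
    """Distance of `point` from the distribution estimated from `data`."""
    mean = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = point - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.RandomState(0)
data = rng.normal(0, 1, size=(500, 3))
print(mahalanobis_distance(np.array([0.1, -0.2, 0.0]), data))  # small distance
print(mahalanobis_distance(np.array([6.0, 6.0, 6.0]), data))   # large distance
```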

4.2.2 Proximity-Based

In proximity-based methods, the idea is to model outliers as isolated points based on similarity or distance functions. Proximity-based methods are among the most prevalent approaches used in outlier detection [1]. However, because of the curse of dimensionality, some of these approaches are known to deteriorate in high-dimensional data [29].

Local Outlier Factor (LOF)

LOF measures the local deviation of density, in other words, the relative degree of isolation. The anomaly score of LOF depends on how isolated a data point is in relation to the neighborhood surrounding it. The locality is given by measuring the distance to the k nearest neighbors. A data point that has a substantially lower density than its neighbors is considered to be an outlier [3].

Clustering-Based Local Outlier Factor (CBLOF)

CBLOF requires a cluster model that was generated by a clustering algorithm and combines it with the input data set to calculate an anomaly score. The clusters are classified into small and large clusters based on the (user-defined) parameters alpha and beta. The anomaly score is then calculated based on the size of the cluster to which the point belongs and the distance to the closest large cluster [10].

Histogram-based Outlier Score (HBOS)

HBOS builds a histogram for each feature individually and then calculates the degree of ‘outlyingness’. HBOS is significantly faster than other unsupervised outlier detectors because it assumes independence of the features. However, this does come at the expense of precision [8].

k Nearest Neighbors (kNN)

In kNN, each point is ranked based on its distance to its k-th nearest neighbor. The top n points of this ranking are then considered to be outliers [20]. Because of its simplicity, kNN is an efficient algorithm.
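A minimal sketch of this ranking idea with scikit-learn (k, n, and the synthetic data are arbitrary choices of mine; PyOD’s KNN detector wraps the same principle):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),
               rng.normal(10, 1, size=(3, 2))])   # a few obvious outliers

k, n = 5, 3
# distance to the k-th nearest neighbor (k + 1 because the point itself is included)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
kth_distance = distances[:, -1]

# the n points with the largest k-th neighbor distance are ranked as outliers
outlier_idx = np.argsort(kth_distance)[-n:]
print(outlier_idx)
```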

4.2.3 Probabilistic

In probabilistic models, the data is modeled as a closed-form probability distribution, and the parameters of this distribution are then learned. Data points that do not fit the probability distribution well are then selected as possible outliers [1].

Angle-Based Outlier Detection (ABOD)

Where most of the approaches discussed above are based on the distance between points, ABOD focuses on comparing the angles between pairs of difference vectors to other points. If the spectrum of angles observed for a point is wide, the point is enclosed in all directions by other points, suggesting that the point is part of a cluster. However, if the spectrum is small, this suggests that the point is positioned outside the clusters, and the point is regarded as an outlier [15].
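A simplified illustration of this angle-spectrum idea (unweighted angle variance; the actual ABOD score additionally weights by distances, and the function name and data here are my own):

```python
import numpy as np
from itertools import combinations

def angle_spectrum_width(point, others):
    """Variance of the angles between difference vectors from `point` to all
    pairs of other points. A small variance suggests the point lies outside
    the data (simplified, unweighted version of the ABOD idea)."""
    diffs = others - point
    angles = []
    for i, j in combinations(range(len(diffs)), 2):
        a, b = diffs[i], diffs[j]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.var(angles))

rng = np.random.RandomState(0)
cluster = rng.normal(0, 1, size=(30, 2))
inside = np.zeros(2)               # point within the cluster
outside = np.array([10.0, 10.0])   # point far outside the cluster
print(angle_spectrum_width(inside, cluster))   # relatively large variance
print(angle_spectrum_width(outside, cluster))  # relatively small variance
```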

4.2.4 Outlier Ensemble

In an outlier ensemble, the results from different algorithms are combined to create a more robust model. A single algorithm with heterogeneous hyperparameters can also be used.

Isolation Forest (IForest)

The Isolation Forest technique is based on decision trees. An Isolation Forest ‘isolates’ data points by randomly selecting a feature and then randomly selecting a split value from the range of that feature. The idea is that outliers need fewer partitions in order to be isolated [17].

Feature Bagging

Feature Bagging fits several base detectors on sub-samples of the dataset and uses combination methods to improve the predictive accuracy and lower the impact of overfitting. The sub-samples are the same size as the input dataset, but take a randomly selected subset of features [16].


4.2.5 Re-training the Meta-Learner

As mentioned earlier, a regression learner is trained. Luckily, the code that was used to train this recommender was available in the datacleanbot repository [26]. I started training using the new selection of algorithms. The recommender is trained on a collection of about 30 datasets from the ODDS (Outlier Detection DataSets) library [21]. At this point I still had an eleventh algorithm from PyOD in the mix: a new technique leveraging neural networks, called Generative Adversarial Active Learning (GAAL) for unsupervised outlier detection [18]. This technique was the best performer on 15 of the 30 ODDS datasets. However, this is where I noticed what I assume is a mistake in Zhang’s work: upon closer inspection, I had not used GAAL correctly and it predicted zero outliers for every dataset, yet it still had the best F1 score for fifteen out of the thirty datasets.

It turned out that during the benchmarking phase Zhang relabeled the outliers from the ODDS library to -1 and the inliers to 1. This was probably done because the implementations used for the three outlier detection algorithms also use these exact labels. The scikit-learn package is used to compute the F1 scores. However, the implementation from scikit-learn returns the “F1 score of the positive class in binary classification” [4], which in this case is actually the inlier class. Rectifying this mistake completely changed the outcome of the benchmark (Figure 4.1) and the F1 scores took a nosedive (Table 4.1). This is where I fully realized that outlier detection is just not that good yet; it is really about providing your best effort.
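The pitfall can be reproduced with a few lines of scikit-learn (the labels below are purely illustrative): with outliers encoded as -1 and inliers as 1, the default pos_label=1 makes f1_score report the F1 of the inliers, so a detector that predicts no outliers at all still appears to score well.

```python
from sklearn.metrics import f1_score

# Illustrative labels: -1 = outlier, 1 = inlier (the relabeling used in the benchmark)
y_true = [1, 1, 1, 1, 1, 1, 1, 1, -1, -1]
y_pred = [1] * 10                              # a detector that predicts zero outliers

print(f1_score(y_true, y_pred))                # high score, but this is the inlier F1
print(f1_score(y_true, y_pred, pos_label=-1))  # 0.0, the F1 of the outlier class
```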

Figure 4.1: Different best performers after correcting F1 score mistake

Figure 4.1 also shows that each of the candidate algorithms except Feature Bagging is the best performer for at least one of the 30 datasets examined.

Algorithm/Model     Faulty F1 score    Correct F1 score
Isolation Forest    0.838360           0.341635

Table 4.1: F1 scores on the ODDS library before and after correcting the F1 score mistake

This meant that I had a big problem in the training of my new recommender, because I had so few labeled datasets. I searched for more labeled outlier datasets and scrambled together 50 datasets in total. This is obviously still not enough to get a reasonable recommender. I was reluctant to create my own labeled datasets by injecting noise into existing datasets because of the danger of overfitting on this specific kind of outlier. I also thought sampling from the existing 50 datasets would suffer from the same problem.

Many of the landmarking meta-features used in datacleanbot also require the class label in order to be computed. This means that for any machine learning task that does not use class labels, this way of recommending an approach is not suitable. For these reasons, I dropped the recommender based on meta-features and explored other options.

4.2.6 Casting an Outlier Ensemble

Now my idea was to simply show the prediction of all ten models and say that there is definitely an outlier if all ten models flag a record as an outlier. In practice, this did not work. Figure 4.2 shows a fabricated example where I manually added some monstrous outliers (a value of -999999 in columns with only positive values), yet both HBOS and OCSVM do not recognize the record as an outlier.

Figure 4.2: Example of stubborn outlier detectors

In Figure 4.2 there is also the column ‘label score’. This was part of the follow-up idea to use a threshold score, i.e. when 7 or more algorithms flag a record, the record is flagged as an outlier.

However, this was also unreliable, and at this point I was basically trying to build my own outlier ensemble without fully realizing it. As mentioned in Section 4.2.4, the general idea of an outlier ensemble is to combine the results from different algorithms to create a more robust model. In the end, I settled on the Locally Selective Combination in Parallel Outlier Ensembles (LSCP) framework, because it is also part of the PyOD package and had been published very recently [27].
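For reference, a minimal sketch of the vote-threshold idea described above (the synthetic data, the reduced detector pool, and the threshold value are placeholders; with the full pool of ten detectors the threshold used was 7):

```python
import numpy as np
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 3)),
               [[-999999, -999999, -999999]]])   # one monstrous outlier

detectors = [HBOS(), IForest(random_state=0), KNN()]
for d in detectors:
    d.fit(X)

# a record is flagged when at least `threshold` detectors vote for it
votes = np.sum([d.labels_ for d in detectors], axis=0)
threshold = 2
print(np.where(votes >= threshold)[0])
```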

Locally Selective Combination in Parallel Outlier Ensembles (LSCP)

LSCP is similar to Feature Bagging in many ways. LSCP combines outlier detectors by emphasizing data locality. The idea is that certain types of outliers are better detected when examining them at a local scale instead of a global scale. LSCP defines the local region of a test instance by finding clusters of similar points in randomly selected subspaces. It then identifies the most suitable detector by measuring similarity relative to a pseudo ground truth. This pseudo ground truth is generated by taking the maximum outlier (anomaly) scores across detectors. The Pearson correlation is used to compute similarity scores between the base detectors and the pseudo ground truth scores in order to evaluate the local competency of the base detectors. Note that the pseudo outlier scores are not converted to binary labels. The detector with the highest similarity is selected as the most competent local detector for that specific region [27].
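A minimal sketch of using LSCP through PyOD with a homogeneous pool of LOF base detectors (the neighbour counts mirror the reduced pool used in the benchmark below; the synthetic data is my own placeholder):

```python
import numpy as np
from pyod.models.lof import LOF
from pyod.models.lscp import LSCP

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(300, 4)),
               rng.normal(6, 1, size=(10, 4))])

# homogeneous pool of LOF base detectors with different neighbourhood sizes
base_detectors = [LOF(n_neighbors=k) for k in (5, 10, 20, 30, 40, 50, 100, 150, 200)]

ensemble = LSCP(base_detectors, random_state=0)
ensemble.fit(X)

print(int(ensemble.labels_.sum()), "records flagged as outliers")
```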

4.2.7 Benchmark

Before benchmarking this outlier ensemble, I also decided to change the benchmarking process a bit compared to the benchmark that Zhang used [26]. All of the models used have a ‘contamination’ parameter that should represent the proportion of outliers in the data set. This parameter is used to define the threshold on the decision function. Zhang argued that it is most fair to set the contamination to the actual outlier percentage, because the F1 scores are then optimal [25]. However, I think this produces an unrealistic setting: the actual outlier proportion is rarely known in unsupervised outlier detection. Therefore, I benchmarked using the same contamination estimate that is used in the package itself. The estimate is relatively simple: the number of rows that contain one or more outlying values in any feature according to the conventional method (a value more than 3 standard deviations away from the mean), divided by the total number of rows. This estimation works well with normally distributed data, but should not be taken for granted. I also computed ROC-AUC scores, since that metric is widely used in outlier research.
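A minimal sketch of this 3-standard-deviation contamination estimate (the function name and the handling of non-numeric columns are my own assumptions, not necessarily the exact PyWash code):

```python
import numpy as np
import pandas as pd

def estimate_contamination(df: pd.DataFrame) -> float:
    """Fraction of rows with at least one value more than 3 standard
    deviations away from its column mean (numeric columns only)."""
    numeric = df.select_dtypes(include=[np.number])
    z_scores = (numeric - numeric.mean()) / numeric.std()
    outlier_rows = (z_scores.abs() > 3).any(axis=1)
    return float(outlier_rows.sum() / len(df))

# illustrative example: normal data with one injected extreme value
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(0, 1, size=(500, 3)), columns=list("abc"))
df.loc[0, "a"] = 50
print(estimate_contamination(df))   # roughly 1/500 plus the natural 3-sigma hits
```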

Evaluation metrics used in benchmark

Before I provide the equations for both ROC-AUC and the F1-score, some definitions need to be laid down. A binary classifier can have four outcomes:

• True Positive (TP): Correctly classified outlier.

• True Negative (TN): Correctly classified inlier.

• False Positive (FP): Inlier classified as outlier (Type I error).

• False Negative (FN): Outlier classified as inlier (Type II error).

The count of these outcomes can be used to compute precision and recall.

\[
\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}
\]

The F1-score combines these two measures by calculating their harmonic mean [7]:

\[
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]

The second evaluation metric is the (ROC-)AUC score. A ROC graph is a two-dimensional graph in which the true positive rate (also known as recall) is plotted on the Y-axis and the false positive rate (FPR) is plotted on the X-axis. The false positive rate is given by:

\[
FPR = \frac{FP}{FP + TN}
\]

Comparing two different ROC graphs with each other is impractical. ROC graphs are therefore often reduced to scalar values. A popular approach is to calculate the area under the ROC curve, abbreviated as AUC. The AUC is the probability that an outlier detector will assign a higher score to a randomly selected outlier than to a randomly selected inlier [7].
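Both metrics can be computed directly with scikit-learn, as long as the outlier class is treated as the positive class (the labels and scores below are purely illustrative):

```python
from sklearn.metrics import f1_score, roc_auc_score

# 1 = outlier, 0 = inlier (PyOD's label convention)
y_true   = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred   = [0, 0, 1, 0, 0, 0, 0, 1, 1, 0]             # binary predictions
y_scores = [.1, .2, .6, .1, .3, .2, .1, .9, .8, .4]   # raw anomaly scores

print(f1_score(y_true, y_pred, pos_label=1))   # F1 of the outlier class
print(roc_auc_score(y_true, y_scores))         # uses scores, not binary labels
```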


Algorithm/Model                                     F1 score    ROC-AUC score
LSCP (LOF(5, 10, 20, 30, 40, 50, 100, 150, 200))    0.207018    0.572485
LSCP (PCA, IForest, kNN)                            0.244237    0.587088

Table 4.2: Comparison of average F1 and ROC-AUC scores on the ODDS library (5 independent trials)

The results of the benchmark are shown in Table 4.2; we can see that the F1 scores are slightly worse, which was to be expected since the contamination estimate is not that accurate. I first tried the LSCP configuration as described in the paper. The paper uses a pool of 50 LOF detectors with a randomly selected number of neighbours in the range [5, 200] [27]. Because of the long run time of running 50 detectors, I lowered this to 9 handpicked values from this range. However, as seen in Table 4.2, this did not result in significantly better results. The paper also mentions that working with heterogeneous base classifiers has proven to be a success in different classification problems and that a performance improvement is expected when diverse base detectors are used [27].

Therefore, I gave heterogeneous base detectors a shot by combining the best-performing algorithms from Table 4.2. Three algorithms seem to have significantly better performance in the benchmark: Isolation Forest, k Nearest Neighbours, and Principal Component Analysis. I selected these algorithms as my outlier ensemble for LSCP. Luckily, these three best-performing algorithms are also among the most efficient algorithms in my pool of outlier detectors. As Table 4.2 demonstrates, this configuration achieved the highest F1 and ROC-AUC scores of them all, although only by a small margin. This small margin means that, even though this may be one of the more robust approaches, I cannot recommend a specific outlier detection method with much confidence. Also, many of the algorithms examined depend on one or more user-defined parameters; the algorithms have been tested using their default parameter values, resulting in sub-optimal performance. In the end, the algorithm selection problem depends too much on the context and characteristics of the dataset. Therefore, there is no one-size-fits-all approach, and recommending an algorithm based on the dataset is also complicated, as I have found.
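A sketch of this final configuration with PyOD (the helper name, contamination value, and synthetic data are my own placeholders; in PyWash the estimated contamination would be passed in):

```python
import numpy as np
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.pca import PCA
from pyod.models.lscp import LSCP

def build_lscp_ensemble(contamination=0.1):
    """LSCP over the three best-performing base detectors (PCA, IForest, kNN)."""
    base_detectors = [PCA(), IForest(random_state=0), KNN()]
    return LSCP(base_detectors, contamination=contamination, random_state=0)

# illustrative use on synthetic data
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(300, 4)), rng.normal(7, 1, size=(6, 4))])
ensemble = build_lscp_ensemble(contamination=6 / 306)
ensemble.fit(X)
print(int(ensemble.labels_.sum()), "records flagged as outliers")
```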

4.2.8 Implementation in PyWash UI

However, we can still assist the user of PyWash by making it easy to experiment with a multitude of algorithms in order to discover possible outliers. Figure 4.3 shows the input fields that are used for outlier detection. Users can input their own contamination score or estimate it using the method mentioned earlier. Then it is time to select an outlier detection algorithm; there are two presets available, namely the two LSCP configurations that were benchmarked in the previous section. The user can also select any combination of algorithms from the dropdown menu; when two or more algorithms are selected, LSCP is leveraged to detect the outliers.


Figure 4.3: User interface of outlier detection: input

When ‘Detect Algorithms’ is clicked (hidden by the dropdown menu in Figure 4.3), the settings are handed to the backend. The backend returns a data frame with three columns added to the original data: ‘anomaly score’, ‘prediction’ and ‘probability’. Figure 4.4 is an example of such an output. The prediction column holds the binary classification that was made; outliers are highlighted with a red background. Probability is defined as the unified outlier score as described in the paper by Kriegel et al. [14]. The visualizations can now also make use of the newly added columns, which can lead to some insightful plots.
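Roughly, this output can be reproduced with PyOD as follows (the detector choice and data are placeholders, not the actual backend code; to my knowledge, PyOD’s predict_proba with method='unify' is based on the same score unification):

```python
import numpy as np
import pandas as pd
from pyod.models.knn import KNN

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(0, 1, size=(200, 3)), columns=list("abc"))

detector = KNN(contamination=0.05)
detector.fit(df.values)

out = df.copy()
out["anomaly score"] = detector.decision_scores_     # raw outlier scores
out["prediction"] = detector.labels_                 # 0 = inlier, 1 = outlier
out["probability"] = detector.predict_proba(df.values, method="unify")[:, 1]
print(out[out["prediction"] == 1].head())
```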

Figure 4.4: User interface of outlier detection: output