
To (semi-)automate data cleaning and preprocessing, we need a clear and measurable definition of data quality. Data readiness levels have been proposed to fit this need, but they require a more detailed and measurable definition than prior works provide. In our framework, datasets are classified into bands, and each band introduces more fine-grained terminology and processing steps. Scores are assigned to each step, resulting in a data quality score. This allows teams of people, as well as automated processes, to track and reason about the cleaning process and to communicate the current status and remaining deficiencies in a structured, well-documented manner.

Figure 2.1: The framework bands with weights and deficiencies

CHAPTER 2. THE ABC OF DATA

Depending on its band, a dataset is ready for certain operations to be performed on it.

The bands consist of several weighted dataset deficiencies, which reflect the most important deficiencies to address in the current band. An overview of the bands and their deficiencies can be found in figure 2.1. The weights given in figure 2.1 are advisory weights for a generic dataset without a specific target analysis. A description of the bands and the functionality they unlock is given below.

Band C (Conceive) refers to the stage in which the data is still being ingested. If there is information about the dataset, it comes from the data collection phase and how the data was collected. The data has not yet been introduced to a programming environment or tool in a way that allows operations to be performed on the dataset. The analyses to be performed on the dataset in order to gain value from the data may not even have been conceived yet, as this can often only be determined after inspecting the data itself.

Band B (Believe) refers to the stage in which the data is loaded into an environment that allows cleaning operations. However, the correctness of the data is not fully assessed yet, and there may be errors or deficiencies that would invalidate further analysis. Therefore, analyses performed in this stage are often more cursory and exploratory, such as an exploratory data analysis with visualization methods to ascertain the correctness of the data. Skipping these checks might lead to errors or ‘wrong’ results and conclusions.

In band A (Analyze), the data is ready for deeper analysis. However, even if there are no more factual errors in the data, the quality of an analysis or machine learning model is greatly influenced by how the data is represented. For instance, operations such as feature selection and normalization can greatly increase the accuracy of machine learning models. Hence, these operations need to be performed before arriving at accurate and adequate machine learning models or analyses. In many cases, these operations can already be automated to a significant degree.

In band AA (Allow Analysis), we consider the context in which the dataset is allowed to be used. Operations in this band detect, quantify, and potentially address any legal, moral or social issues with the dataset, since the consequences of using illegal, immoral or biased datasets can be enormous. Hence, this band is about verifying whether analysis can be applied without (legal) penalties or negative social impact. One may argue that legal and moral implications are not part of data cleaning, but rather distinct parts of the data process. However, we argue that readiness is about learning the ins and outs of your dataset and detecting and solving any potential problems that may occur when analyzing and using a dataset.

Band AAA is the terminus of our framework. Getting into AAA would mean that the dataset is clean. The data is self-contained and no further input is needed from the people that collected or created the data.

2.2 Data Quality Scores

A dataset has a score between 0 and 1 for each band of our framework; for example, a dataset can have a score of 0.9 for band C, 0.8 for band B, 0.1 for band A, 0.2 for band AA, and 0 for band AAA. Datasets start with an initial score of 0 for every band, as we generally do not know (for certain) which issues a dataset suffers from that could potentially jeopardize machine learning methods.

At this stage, the dataset is classified in band C. We then proceed to check and solve all issues from band C. Each band deficiency that is solved or non-existent contributes to the band score of the dataset, and partially checked or solved deficiencies can grant partial weight scores. A dataset moves to the next band only when it has surpassed a certain threshold score. This also means that a dataset cannot get a band label of A or above when it has a band B score of 0.60, even if the dataset fulfills all band A requirements.

The threshold scores can be set by the framework users to determine how thoroughly the dataset has to be cleaned before it can proceed to further bands. We have set the default threshold for all bands at 0.85, which allows a dataset to advance while it is not yet perfect, since striving for a perfect dataset may not be achievable or cost-effective in general. With thresholds below 1, the dataset might not be entirely clean, as a dataset could advance to the next band (including band AAA) while not every issue has been checked or fixed yet. That said, the thresholds cannot be set too low (e.g., < 0.65), as datasets would not be checked properly, which could seriously impact machine learning methods and dataset usability, causing errors and false predictions or estimates.

12 PyWash: a Data Cleaning Assistant for Machine Learning
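The scoring and threshold bookkeeping described above can be sketched in a few lines of Python. The deficiency names, weights, and progress values below are hypothetical examples, not the actual weights from figure 2.1:

```python
# Illustrative sketch of band scores; deficiency names, weights, and
# progress values are hypothetical examples, not the real framework weights.

def band_score(deficiencies):
    """Score in [0, 1] from {name: (weight, progress)} pairs, where
    progress is the fraction of a deficiency that is checked/solved."""
    total = sum(w for w, _ in deficiencies.values())
    earned = sum(w * p for w, p in deficiencies.values())
    return earned / total if total else 0.0

def current_band(scores, order=("C", "B", "A", "AA", "AAA"), threshold=0.85):
    """A dataset sits in the first band that has not yet met the threshold."""
    for band in order:
        if scores.get(band, 0.0) < threshold:
            return band
    return order[-1]

band_b = {
    "missing values": (0.4, 1.0),  # solved
    "column types":   (0.4, 1.0),  # solved
    "duplicates":     (0.2, 0.5),  # partially checked
}
print(round(band_score(band_b), 3))                   # 0.9
print(current_band({"C": 1.0, "B": 0.9}))             # A
print(current_band({"C": 1.0, "B": 0.60, "A": 1.0})) # stuck in B
```

The last call illustrates the rule above: a perfect band A score does not help while the band B score of 0.60 is below the threshold.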

Chapter 3

PyWash: Web Interface

A web app based on Flask and React was advised in the BEP proposal. Building one from scratch would have given us the most flexibility, but time and manpower were too scarce, especially because we had very little experience building such a web interface with React. We therefore decided to leverage Dash, a Python framework for web-based analytics applications. Dash is built on top of Plotly, React, and Flask, which means that advanced visualizations from Plotly can be used in combination with the built-in React components from Dash. Dash callbacks make the web interface interactive; appendix A contains the callback graph of PyWash.

The interface, shown in figure 3.1, groups the operations for each band into separate color-coded tabs. Initially, the only accessible band is band C, which includes interactions to import datasets and, if applicable, merge them. Once a dataset is successfully parsed, PyWash assigns it a tab at the top of the screen with its name; the other bands are then unlocked and the data is shown in an interactive data table. The interface supports loading multiple datasets at the same time. When a dataset is selected, a description is shown that includes the data quality score and the dimensions of the dataset. The data quality score is currently a placeholder and therefore not yet operational. Below this description, a row of color-coded buttons is used to switch between the bands of the framework. The colors correspond with figure 2.1 and are chosen to be reminiscent of a traffic light.



Figure 3.1: The PyWash web interface

3.1 The Data Table

We wanted a data table to be the key piece in the web interface. Dash provides a fully interactive DataTable which offers live editing, sorting, and filtering. Figure 3.2 shows some of these options in action. In this particular instance, I filtered both string (Sex and Smokstat) and numeric (Age and Alcohol) features, and sorted cholesterol in descending order. You can highlight a cell by clicking on it, and double-clicking a cell allows you to edit its value. Every row and column also has a small cross; clicking it removes that row or column from the data.

Saving the current dataset to disk is possible at any time using the export options below the DataTable. Exporting returns the data ‘as is’, thus including any modifications made in the table (filtering, sorting, etc.). Currently, data can be exported as either a CSV or ARFF file.
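The ‘as is’ behavior can be illustrated with plain pandas. The column names and values below are made up, and PyWash’s actual export code may differ:

```python
import pandas as pd

# Made-up data in the spirit of the table shown in figure 3.2.
df = pd.DataFrame({"Age": [34, 51, 28], "Cholesterol": [221, 189, 254]})

# Reproduce an in-table filter (Age > 30) and a descending sort,
# then export exactly this view rather than the raw data.
view = df[df["Age"] > 30].sort_values("Cholesterol", ascending=False)
csv_text = view.to_csv(index=False)
print(csv_text)
```

CSV is shown here because pandas supports it directly; an ARFF export would go through a dedicated writer.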

Figure 3.2: The Dash DataTable


3.1.1 Difficulties encountered using Dash

The Dash DataTable is still in alpha, and I noticed that during development. One of the most important features to us, virtualization, made the web interface very unstable. This is unfortunate because it would have added a lot to the scalability of the interface: virtualization saves browser resources by only rendering a subset of the data at any given instant. However, should virtualization become more stable in the future, PyWash would only need a few changed lines in order to use it. Another annoyance was that the selection feature is not that rich yet. In figure 3.2 there is a checkbox in front of every row, but upon selecting a row, nothing can be done out of the box. I tried to implement a function that, when activated, removes every selected row from the dataset.

This worked well on first use, but afterwards the indexes of the DataTable and the data frame in the background became misaligned, making the function unstable after the initial use. Another issue is the method Dash uses to represent empty cells in the data table. I wanted to highlight missing values by changing the background color of their cells, but Dash does not provide the syntax needed to conditionally highlight empty values. I also encountered countless small to large bugs. For example, boolean values are hidden in the data table, yet when such a cell is selected, its boolean value does show.

Dash also had the limitation that every callback could only have a single output, which resulted in seemingly redundant code. During the project, however, Dash received an update that allows multiple outputs per callback. Because this update was introduced late in the project, it is not used to its full potential. Another feature added to Dash late in the project was Dash Testing, a set of testing APIs that can be used in combination with a Selenium WebDriver to automatically test interactions inside a web browser. This would have been very useful had it been available at the start of the project.

3.2 Visualizations

The purpose of visualizations is to interpret the data efficiently. Visualizations are a powerful tool that can be used during different phases of analysis, and having an interactive element in visualizations only improves the ability to explore, interpret, and understand the data [13]. One of the advantages of a web interface is that visualizations can be made more interactive than in a Jupyter Notebook. Some of the plots also interact with the data table: selecting points in a plot will also select the corresponding data in the data table. The plots are generated using Plotly. I have created multiple (interactive) visualizations for PyWash that can easily be generated from the ‘Plots’ tab in figure 3.1. After a plot is selected, a Dash loading component is shown while the plot is being generated in the background.

3.2.1 Box Plot

A box plot is a visualization of numeric data. It shows the levels of the quartiles, a horizontal line depicting the level of the median, and ‘whiskers’ at the top and bottom beyond which points are flagged as possible outliers. It is commonly used to detect outliers.
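The whisker rule can be made concrete with a short NumPy sketch, assuming the common 1.5 × IQR convention for whisker length (which Plotly also uses by default):

```python
import numpy as np

def box_plot_outliers(values):
    """Flag points outside the 1.5 * IQR whiskers (the usual box-plot rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    values = np.asarray(values)
    return values[(values < lower) | (values > upper)]

data = [12, 14, 15, 15, 16, 17, 18, 95]   # 95 is an obvious outlier
print(box_plot_outliers(data))            # [95]
```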

Figure 3.3: Parallel box plots



Figure 3.3 shows what this visualization looks like in PyWash. Hovering over a feature with your cursor displays the quartile levels and the minimum and maximum values, as shown in the figure. It has to be noted that, because of the distinct ranges of the features, it is hard to see any outliers in the features with a smaller range in figure 3.3. There is an interactive solution to this problem: in the legend to the right of the visualization, users can hide features by clicking on a feature’s name. The visualization then automatically re-scales its axes to the new subset of the data (figure 3.4).

Figure 3.4: Automatic rescaling after hiding features

3.2.2 Stacked Bar Chart

The stacked bar chart is primarily used to visualize the distribution of values in categorical features.

Figure 3.5 below shows the stacked bar chart generated on a subset of the Titanic dataset. As the ‘Cabin’ feature demonstrates, the stacked bar chart can also be used to highlight missing values. Hovering over a feature with your cursor shows the labels and frequencies, as seen in the ‘Sex’ feature in figure 3.5.
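The quantities a stacked bar chart encodes are simply per-feature category frequencies, with missing values counted as a category of their own. A minimal pandas sketch (the Titanic-style columns and values are illustrative):

```python
import pandas as pd

# Illustrative Titanic-style subset; None marks missing cabin values.
df = pd.DataFrame({
    "Sex":   ["male", "female", "female", "male", "male"],
    "Cabin": ["C85", None, None, "E46", None],
})

# Count labels per feature; fillna keeps missing values visible
# as their own bar segment.
for col in df.columns:
    counts = df[col].fillna("missing").value_counts()
    print(col, counts.to_dict())
```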

Figure 3.5: Stacked bar chart

3.2.3 Distribution Plot

In this visualization, shown in figure 3.6, a single numeric feature can be explored. The plot combines a histogram, a kernel density estimate, and a rug plot (a one-dimensional scatter plot) into a single interactive plot. This plot can be helpful in determining the distribution of a feature. Sturges’ formula [24] is used to determine the number of bins of the histogram.
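Sturges’ rule is simple enough to state in code, and NumPy’s histogram helpers implement the same rule:

```python
import math
import numpy as np

def sturges_bins(n):
    """Sturges' rule: k = ceil(log2(n)) + 1 bins for n observations."""
    return int(math.ceil(math.log2(n))) + 1

print(sturges_bins(1000))   # 11

# NumPy applies the same rule when asked for 'sturges' binning.
edges = np.histogram_bin_edges(np.arange(1000), bins="sturges")
print(len(edges) - 1)       # 11
```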

Figure 3.6: Distribution plot


3.2.4 Parallel Coordinates

The parallel coordinates plot (or PCP) is a popular technique to visualize multivariate data [11]. In a parallel coordinates plot, each feature has its own axis, and these axes are placed parallel to each other. Records are then plotted on these axes, creating a polygonal chain across the features. This way, relationships can be explored: when most of the lines between two axes are somewhat parallel to each other, this might indicate a positive relationship between the two examined features. However, when most lines cross (creating X-shapes), this indicates a negative relationship [12].

Figure 3.7: Parallel coordinates plot

Figure 3.7 is an example of a parallel coordinates plot in PyWash. The plot is richly interactive in order to explore relationships and outliers. Users can select a feature to be color-coded, reorder the axes, and drag along the axes to filter regions. Figure 3.8 has reordered axes and is filtered (pink lines along the axes), showing a clear relationship between age and annual income.

Figure 3.8: Parallel coordinates plot interactivity


Chapter 4

PyWash: design process and decisions

The framework described in chapter 2 is the theoretical foundation PyWash is built on. Implementing a solution for every deficiency in figure 2.1, in combination with presenting the data quality score in an effective manner, is the end goal of PyWash. However, reaching this goal within the limited time frame would be ambitious, so choices had to be made. PyWash currently handles column types and missing values from band B, and both outlier detection and feature scaling from band A. A case can be made that some of the other deficiencies can also be (partly) solved by utilizing the visualizations and the interactive DataTable of PyWash, as seen in the previous chapter. In order to offer solutions for the described deficiencies, I was asked to leverage existing packages and improve them where needed. Some of the packages I used are staples of a data scientist’s toolkit, such as scikit-learn [4] and pandas [19]. Others are lesser-known packages, such as PyOD [28] and datacleanbot, a package that is part of a recent MSc thesis by Ji Zhang [25]. This chapter explains the design process and justifies the decisions that have been made.

4.1 The SharedDataFrame

As mentioned in chapter 2, before deficiencies can be solved, the data has to be ingested: in other words, introduced to PyWash in a way that allows operations to be performed on the dataset. To facilitate this, we created a ‘SharedDataFrame’, an abstract data type that wraps around a pandas DataFrame. This enables PyWash to maintain a data quality score and to call functions on the SharedDataFrame. A dataset has passed band C when it is loaded successfully into a SharedDataFrame. The SharedDataFrame is designed as an API, meaning that the SharedDataFrame, together with its properties and functions, can be used independently of the web interface.
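A minimal sketch of such a wrapper is shown below; the attribute and method names are hypothetical, not the real SharedDataFrame API:

```python
import pandas as pd

class SharedDataFrame:
    """Wraps a pandas DataFrame and tracks per-band quality scores.
    Names and behavior here are illustrative, not PyWash's actual API."""

    BANDS = ("C", "B", "A", "AA", "AAA")

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.scores = {band: 0.0 for band in self.BANDS}
        self.scores["C"] = 1.0  # successful ingestion completes band C

    def quality_score(self):
        return dict(self.scores)

    def drop_missing(self):
        """Example cleaning operation exposed through the wrapper."""
        self.df = self.df.dropna()
        return self

sdf = SharedDataFrame(pd.DataFrame({"x": [1.0, None, 3.0]}))
print(sdf.quality_score()["C"])     # 1.0
print(len(sdf.drop_missing().df))   # 2
```

Because the wrapper owns the DataFrame, cleaning operations and score updates can live in one place, independent of the web interface.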

4.2 Band A: Outlier Detection

An outlier is a data point that differs significantly from other observations. Outliers can have a significant influence on the reliability of models because not every machine learning model is robust to outliers. In this section, I focus on unsupervised outlier detection in multivariate data. I took the implementation of automatic outlier detection from datacleanbot [26] as a starting point.

In this implementation, landmarking meta-features are used to recommend an approach. Pairs of F1 scores from outlier detection methods and meta-features are used to set up a regression learner. The regression learner then predicts the accuracy of each algorithm based on the meta-features of a new dataset. After using the package for the first time, I mostly encountered user experience problems, such as the lack of clear visual feedback on which records are marked as outliers in the displayed table. I also did not get the option to keep a subset of the detected outliers; it was either drop them all or keep them all. The provided visualizations are also hard to work with, especially with high-dimensional data, since there is no filter or zoom functionality available. Furthermore, only three outlier detection algorithms are considered: iForest, LOF, and OCSVM. This is acknowledged by Zhang in the future work section, where it is also stated that a further project should search for more meaningful meta-features for outlier detection. The training set on which the recommender is trained is also too small. Zhang also mentioned that you can let users choose to run multiple algorithms and then present the results in an appropriate manner [25]. I started searching for more outlier detectors and came across the relatively new package PyOD, which offers a wide variety of models to detect outliers [28]. After trying all of them out, I selected the ten algorithms that were the least prone to throwing errors. The algorithms can be separated into four groups; the following sections shortly describe the groups and the algorithms that belong to them.
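Letting users run several detectors and combine the results can be sketched as a simple vote. The sketch below uses scikit-learn’s IsolationForest and LOF as stand-ins for the PyOD models, on synthetic data with one planted outlier:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic 2-D cluster plus one planted outlier at (8, 8).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

# Two detectors vote on each record; PyWash draws its detectors from
# PyOD, but the scikit-learn equivalents shown here follow the same API.
detectors = {
    "iForest": IsolationForest(contamination=0.01, random_state=0),
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.01),
}
votes = np.zeros(len(X), dtype=int)
for name, det in detectors.items():
    labels = det.fit_predict(X)          # -1 marks an outlier
    votes += (labels == -1).astype(int)

print(np.where(votes == len(detectors))[0])   # index 100: flagged by both
```

Presenting the per-record vote counts (rather than a single drop/keep decision) would also address the subset-selection problem mentioned above.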

4.2.1 Linear Models

Features in data are generally correlated. This correlation provides the capacity to make predictions of one feature from another. The concepts of prediction and outlier detection are closely linked, because outliers are values that, based on a specific model, deviate from anticipated (or predicted) values. Linear models concentrate on using correlations (or the lack thereof) to spot possible outliers [1].

One-Class Support Vector Machines (OCSVM)

OCSVM learns the decision boundary (a hyperplane) that separates the majority of the data from the origin, such that the distance from the origin to the hyperplane is maximal. Only a user-defined fraction of the data points may lie outside the hyperplane; these points are regarded as outliers. Compared to inliers, outliers should contribute less to the decision boundary [2].
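A minimal scikit-learn sketch of OCSVM on synthetic data; here `nu` is the user-defined bound on the fraction of points allowed outside the boundary:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic cluster plus one planted outlier at (6, 6).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[6.0, 6.0]]])

# nu = 0.05 allows at most ~5% of points outside the learned boundary.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
labels = ocsvm.fit_predict(X)   # +1 inlier, -1 outlier
print(labels[-1])               # -1: the planted point is flagged
```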

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that uses singular value decomposition to project the data into a lower-dimensional space. The covariance matrix of the data is decomposed into eigenvectors, which capture most of the variance of the data; outliers can therefore be deduced from them [23, 1].
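One common way to turn this intuition into a score (a sketch, not PyOD’s exact PCA detector) is reconstruction error: project onto the leading principal component and rank points by how poorly they are reconstructed. The data below are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

# Strongly correlated synthetic data along the line y = 2x...
rng = np.random.default_rng(2)
t = rng.normal(size=300)
X = np.column_stack([t, 2 * t + rng.normal(0, 0.1, size=300)])
# ...plus one point that breaks the correlation: an outlier.
X = np.vstack([X, [[3.0, -6.0]]])

# Reconstruct from the top principal component; points far from the
# principal subspace get large reconstruction errors.
pca = PCA(n_components=1).fit(X)
recon = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - recon, axis=1)
print(int(np.argmax(errors)))   # 300: the planted outlier
```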

Minimum Covariance Determinant (MCD)

MCD detects outliers by making use of the Mahalanobis distance. The Mahalanobis distance is
