

5.1 Future Work

There is still a long way to go before PyWash is finished and can accurately pre-process any dataframe in an automated fashion. Although most basic pre-processing techniques have been implemented, there are a variety of additions to be made.

For example, most of our algorithms are unable to deal with time data. Furthermore, the user interface could be made more intuitive. Besides these general remarks, I will present some possibilities for future work for each of my three sub-tasks.

Data Type prediction At the moment, the data type prediction can accurately predict a variety of basic data types. However, a method to predict statistical data types could also be implemented. I would like to refer to an article by Valera & Ghahramani (2017) [7] for this purpose. The proposed method uses a Bayesian approach to predict column types such as count, interval, or real-valued numerical data. Combining this method with the basic data type prediction could improve the results of subsequent algorithms such as missing value handling.
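To sketch where such a statistical type predictor could slot into the pipeline, the snippet below uses a crude heuristic (non-negative integers taken as count data) as a stand-in for the Bayesian model of Valera & Ghahramani; the function name and its rules are purely illustrative and not part of the current PyWash codebase.

```python
import pandas as pd

def predict_statistical_type(column: pd.Series) -> str:
    """Heuristic stand-in for a Bayesian statistical type model.

    Returns 'count', 'real-valued' or 'categorical'. A real
    implementation would infer a posterior over types instead.
    """
    numeric = pd.to_numeric(column.dropna(), errors="coerce").dropna()
    if numeric.empty:
        return "categorical"
    if (numeric % 1 == 0).all() and (numeric >= 0).all():
        return "count"
    return "real-valued"

df = pd.DataFrame({"visits": [3, 0, 7, 2], "temp": [21.5, 19.8, 20.1, 22.3]})
print({col: predict_statistical_type(df[col]) for col in df.columns})
# {'visits': 'count', 'temp': 'real-valued'}
```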

Outlier Detection There are two directions in which the outlier detection could be improved. First, the current implementation could be revised and improved; for example, it is quite slow and could be sped up by using vectorized algorithms. Secondly, after an email conversation with the authors of the advanced ensemble method [25], a different, supposedly even better, outlier detection method designed by them was recommended to me. There was not enough time to switch to this method [29]; however, in the future it could be implemented and compared to the current method.
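To illustrate the kind of speed-up vectorization can offer, the sketch below computes a simple z-score outlier criterion both with a Python-level loop and with whole-array NumPy operations; this toy score is my own illustration, not the ensemble method [25] used in PyWash.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=100_000)

def zscores_loop(x):
    # Naive version: one Python-level operation per value.
    mean = sum(x) / len(x)
    std = (sum((v - mean) ** 2 for v in x) / len(x)) ** 0.5
    return [abs(v - mean) / std for v in x]

def zscores_vectorized(x):
    # Same computation expressed as whole-array NumPy operations,
    # which run in compiled code rather than the Python interpreter.
    return np.abs(x - x.mean()) / x.std()

outliers = zscores_vectorized(data) > 3  # boolean mask of candidate outliers
print(outliers.sum(), "points flagged")
```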

Event logger The event logger could be improved by adding additional information. For example, in between subroutines, datasets could be stored in CSV files to allow the user to quickly go back a step. I have chosen not to implement this because of the extensive storage space it requires; however, allowing the user to opt into storing intermediate datasets could improve the user experience.
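A possible opt-in variant is sketched below: a helper that, when enabled, writes the dataframe to a CSV checkpoint after each subroutine so that an earlier state can be restored. The checkpoint/restore functions and the file-naming scheme are hypothetical, not part of the current PyWash code.

```python
import os
import pandas as pd

CHECKPOINT_DIR = "checkpoints"  # hypothetical location, configurable in practice

def checkpoint(df: pd.DataFrame, step_name: str, enabled: bool = False) -> None:
    """Optionally persist the dataframe after a pipeline step."""
    if not enabled:  # opt-in: skipped by default to save storage space
        return
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    df.to_csv(os.path.join(CHECKPOINT_DIR, f"{step_name}.csv"), index=False)

def restore(step_name: str) -> pd.DataFrame:
    """Reload the dataframe as it was after the given step."""
    return pd.read_csv(os.path.join(CHECKPOINT_DIR, f"{step_name}.csv"))

# Usage: checkpoint(df, "after_outlier_detection", enabled=True)
```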

References

[1] S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak and F. Herrera, 'A survey on data preprocessing for data stream mining: Current status and future directions', Neurocomputing, vol. 239, pp. 39–57, 2017, issn: 18728286. doi: 10.1016/j.neucom.2017.01.078.

[2] S. García, J. Luengo and F. Herrera, 'Data Preprocessing in Data Mining', Intelligent Systems Reference Library, vol. 72, 2015, issn: 18684408. doi: 10.1007/978-3-642-04898-2_51.

[3] S. Zhang, C. Zhang and Q. Yang, Data preparation for data mining, vol. 17, no. 5-6, 2003, isbn: 4159822665. doi: 10.1080/713827180.

[4] Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. [Online]. Available: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#22d8cd606f63.

[5] How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read. [Online]. Available: https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#2d5f8aff60ba.

[6] August Workforce Report 2018. [Online]. Available: https://economicgraph.linkedin.com/resources/linkedin-workforce-report-august-2018.

[7] I. Valera and Z. Ghahramani, 'Automatic discovery of the statistical types of variables in a dataset', 34th International Conference on Machine Learning, ICML 2017, vol. 7, pp. 5380–5388, 2017.

[8] Types of Machine Learning Algorithms You Should Know. [Online]. Available: https://towardsdatascience.com/types-of-machine-learning-algorithms-you-should-know-953a08248861.

[9] T. Ceritli, C. K. I. Williams and J. Geddes, 'ptype: Probabilistic Type Inference', pp. 1–27, 2019. [Online]. Available: http://arxiv.org/abs/1911.10081.

[10] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta and R. C. Carrasco, 'Probabilistic Finite-State Machines — Part I', vol. 27, no. 7, pp. 1013–1025, 2005.

[11] L. Castelijns, 'PyWash: a Data Cleaning Assistant for Machine Learning', Jul. 2019.

[12] I. Ben-Gal, 'Outlier Detection', in Data Mining and Knowledge Discovery Handbook, Springer-Verlag, May 2006, pp. 131–146. doi: 10.1007/0-387-25465-X_7.

[13] V. Barnett and T. Lewis, 'Outliers in Statistical Data, 2nd ed.', Biometrical Journal, vol. 30, no. 7, pp. 866–867, Jan. 1988, issn: 03233847. doi: 10.1002/bimj.4710300725. [Online]. Available: http://doi.wiley.com/10.1002/bimj.4710300725.

[14] A. M. Canuto, M. C. Abreu, L. de Melo Oliveira, J. C. Xavier and A. d. M. Santos, 'Investigating the influence of the choice of the ensemble members in accuracy and diversity of selection-based and fusion-based methods for ensembles', Pattern Recognition Letters, vol. 28, no. 4, pp. 472–486, 2007, issn: 01678655. doi: 10.1016/j.patrec.2006.09.001. [Online]. Available: www.elsevier.com/locate/patrec.

[15] M. Amer, M. Goldstein and S. Abdennadher, 'Enhancing one-class Support Vector Machines for unsupervised anomaly detection', in Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, ODD 2013, 2013, isbn: 9781450323352. doi: 10.1145/2500853.2500857.

[16] M. L. Shyu, S. C. Chen, K. Sarinnapakorn and L. Chang, 'A Novel Anomaly Detection Scheme Based on Principal Component Classifier', 3rd IEEE International Conference on Data Mining, 2003, issn: 1860949X. doi: 10.1007/11539827-18.

[17] J. Hardin and D. M. Rocke, 'Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator', Computational Statistics and Data Analysis, 2004, issn: 01679473. doi: 10.1016/S0167-9473(02)00280-3.

[18] M. M. Breunig, H. P. Kriegel, R. T. Ng and J. Sander, 'LOF: Identifying density-based local outliers', SIGMOD Record (ACM Special Interest Group on Management of Data), 2000, issn: 01635808. doi: 10.1145/335191.335388.

[19] M. Goldstein and A. Dengel, 'Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm', 2012.

[20] S. Ramaswamy, R. Rastogi and K. Shim, 'Efficient algorithms for mining outliers from large data sets', SIGMOD Record (ACM Special Interest Group on Management of Data), 2000, issn: 01635808. doi: 10.1145/335191.335437.

[21] H. P. Kriegel, M. Schubert and A. Zimek, 'Angle-based outlier detection in high-dimensional data', in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, isbn: 9781605581934. doi: 10.1145/1401890.1401946.

[22] F. T. Liu, K. M. Ting and Z. H. Zhou, 'Isolation forest', in Proceedings - IEEE International Conference on Data Mining, ICDM, 2008, isbn: 9780769535029. doi: 10.1109/ICDM.2008.17.

[23] A. Lazarevic and V. Kumar, 'Feature bagging for outlier detection', in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005. doi: 10.1145/1081870.1081891.

[24] Z. He, X. Xu and S. Deng, 'Discovering cluster-based local outliers', Pattern Recognition Letters, 2003, issn: 01678655. doi: 10.1016/S0167-8655(03)00003-5.

[25] J. R. Pasillas-Díaz and S. Ratté, 'An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures', Electronic Notes in Theoretical Computer Science, vol. 329, pp. 61–77, 2016, issn: 15710661. doi: 10.1016/j.entcs.2016.12.005.

[26] J. Yang, 'Outlier Detection: How to Threshold Outlier Scores?', Dec. 2019. doi: 10.1145/3371425.3371427.

[27] T. Fawcett, 'An introduction to ROC analysis', Pattern Recognition Letters, 2006, issn: 01678655. doi: 10.1016/j.patrec.2005.10.010.

[28] Pywash2/Pywash2. [Online]. Available: https://github.com/Pywash2/Pywash2.

[29] J. R. Pasillas-Díaz and S. Ratté, 'Bagged Subspaces for Unsupervised Outlier Detection', Computational Intelligence, vol. 33, no. 3, pp. 507–523, Aug. 2017, issn: 08247935. doi: 10.1111/coin.12097. [Online]. Available: http://doi.wiley.com/10.1111/coin.12097.
