
Continuous Integration in Machine

Learning

Towards fully automated continuous learning

Andras Herczeg

h.herczeg.andras@gmail.com

July 11, 2018, 50 pages

Supervisor: Dr. Ana Oprescu

Host organisation: Media Distillery, https://www.mediadistillery.com/

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

Abstract
1 Introduction
1.1 Research Question
1.2 Research Method
1.3 Outline
2 Motivating Examples
2.1 Climbing the “Stairway to Heaven”
2.2 Challenges When Adopting Continuous Integration
2.3 Continuous Integration Applied to Software-Intensive Embedded Systems: Problems and Experiences
2.4 The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
3 Background
3.1 Continuous Integration
3.1.1 Common Elements of Continuous Integration
3.1.2 Benefits of Continuous Integration
3.2 Machine Learning
3.2.1 Deep Learning
4 Barriers
4.1 Challenges with the Training Data
4.1.1 Challenges in Production Machine Learning
4.2 Challenges with Version Control
4.3 Challenges with Model Testing
4.3.1 Testing
4.3.2 Model Evaluation
4.4 Challenges with Automating Build
4.5 Challenges Within the Industry
4.5.1 Software Engineers
5 Proposed Solutions
5.1 Data
5.1.1 Noise Elimination
5.1.2 Version control
5.1.3 Data infrastructure
5.2 Model
5.2.1 Metamorphic Testing
5.2.2 Improved Model Testing
5.2.3 Transfer Learning
6 Case Study
6.1 Selection Criteria
6.2.1 Experiment Setup
6.2.2 Confirmatory Results
6.2.3 Real-world Data Results
6.3 DVC
6.3.1 Experiment Setup
6.3.2 Architecture and Results
7 Discussion
7.1 Analysis
7.1.1 CFAUD
7.1.2 DVC
7.2 Future work
8 Conclusion
Bibliography
Appendices


Abstract

The use of Continuous Integration promises many benefits, but its implementation is not trivial. In the increasingly popular area of machine learning, a number of challenges prevent the use of Continuous Integration, yet there are currently no scientific works outlining them. This paper aims to highlight the barriers in the way of implementing Continuous Integration for machine learning and proposes solutions to overcome them.


Chapter 1

Introduction

Continuous Integration is a widely established development practice, as it promises higher software quality and shorter release cycles with a significantly decreased chance of integration errors at the end of development. Despite these benefits, some fields, such as the increasingly popular machine learning, have yet to adopt it. While machine learning is quickly gaining ground thanks to its ability to tackle previously unsolvable tasks with relative ease, the technology is not without its challenges; besides the code complexity issues common in traditional software development, machine learning systems face higher-level difficulties as well, one of which is the challenge of implementing software engineering practices, including Continuous Integration. The frequent updates of Continuous Integration fit well with the iterative nature of training machine learning models and with the evolving data, but despite these advantages, implementing this software process is not trivial.

There are a number of barriers that make the implementation of Continuous Integration difficult, some more obvious than others. Among the clear issues are the long average build time and the lack of automated test frameworks, both of which are common traits of machine learning systems, but other challenges stand in the way of standard Continuous Integration practices in ways not obvious at first glance. That being said, there is currently no scientific research concerning these difficulties or their possible remedies. The motivation for this paper was to fill this gap by creating a concise list of barriers commonly associated with machine learning development that prevent the implementation of Continuous Integration, and to suggest already available techniques to overcome them.

1.1

Research Question

This research aims to answer the following research question:

What are the barriers preventing machine learning systems from adopting Continuous Integration and what methods can be used to overcome these barriers?

1.2

Research Method

This research was conducted by employing a sequential exploratory strategy, a strategy that makes use of quantitative results in order to explain or interpret the findings of a qualitative study [1]. First, a qualitative literature survey was conducted to identify the root causes of the problem at hand: exploring the barriers to the implementation of Continuous Integration for machine learning systems and finding methods that can help development teams overcome these challenges.

Once the barriers and their workarounds were identified, a confirmatory case study was conducted to determine the efficiency of some of the suggested solutions.


1.3

Outline

In this section the structure of this thesis is outlined.

Chapter 2 summarizes the papers whose findings motivated this research.

Chapter 3 introduces the necessary background knowledge on Continuous Integration and machine learning.

Chapter 4 details the barriers associated with the implementation of Continuous Integration.

Chapter 5 discusses the potential solutions for the challenges mentioned in Chapter 4.

Chapter 6 is concerned with the evaluation of two techniques suggested in Chapter 5 implemented at the host company.

Chapter 7 contains the analysis of the results from Chapter 6 and expands on possible future work based on this thesis.


Chapter 2

Motivating Examples

This thesis was motivated by a number of previous studies on the challenges experienced with the implementation of Continuous Integration and with the introduction of software engineering standards to machine learning development; in this chapter, some of these works are briefly summarized, focusing on the findings most relevant to this thesis.

2.1

Climbing the “Stairway to Heaven”

In their frequently cited work, Olsson & Bosch [2] explore the difficulties software companies face on the so-called “Stairway to Heaven”, the evolution path from traditional software development, through Agile development, Continuous Integration, and Continuous Deployment, to finally reaching R&D as an innovation system in which the development organization responds to instant customer feedback and actual deployment of software functionality is seen as a way of validating it. From the perspective of this thesis, the most important findings are those concerned with the difficulties of transitioning from Agile development to Continuous Integration. Some of the barriers identified by this study, such as the lack of automated tests and the challenges of modularization, are also relevant to the domain of machine learning.

2.2

Challenges When Adopting Continuous Integration

Debbiche et al. [3] took a closer look at the transition from Agile development to Continuous Integration and the adoption challenges associated with it. Closely inspecting the work of a Swedish telecommunication service provider, the study provides an in-depth analysis of 23 challenges that can be viewed as barriers to the adoption of Continuous Integration. Many of the findings of this study are also applicable to the domain of machine learning: besides the barriers mentioned in “Climbing the Stairway to Heaven”, some of the most relevant challenges identified are those associated with the developers’ mindset, problems originating from the maturity of the processes, and the question of domain applicability.

Figure 2.2: Continuous Integration challenges [3]

2.3

Continuous Integration Applied to Software-Intensive Embedded Systems: Problems and Experiences

Maartensson et al. [4] further narrowed the study of Continuous Integration by focusing their research on embedded systems. The challenges faced by embedded systems developers when adopting various software processes are somewhat similar to those faced by machine learning experts: just like in machine learning, automated testing is difficult in embedded systems development; similarly, both machine learning and embedded systems are often tightly coupled, making frequent changes hard to achieve.

2.4

The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction

Finally, Breck et al. [5] present a study concerning the production readiness of machine learning systems. Although this paper is not directly related to Continuous Integration, many of the best practices suggested by the team could potentially help alleviate the problems posed by the adoption of the software process. For example, many of the tests described in the paper can be partially reused as the automated tests much needed by machine learning developers.


Chapter 3

Background

This chapter provides an overview of Continuous Integration and machine learning, as these two are the most important topics from the perspective of this thesis. First, the practice of Continuous Integration is explained along with its most common elements and benefits, followed by an outline of machine learning and its popular branch, deep learning. While this chapter aims to offer the background knowledge necessary to understand this thesis, the full extent of these research areas cannot be covered here.

3.1

Continuous Integration

For many years, the integration of software projects was risky and full of uncertainties. In most cases, components that were tested and approved on their own produced unforeseen errors when put together with all the other pieces. Over the past two decades, however, this problem has slowly disappeared thanks to the appearance of Continuous Integration (CI).

CI is a widely used development process originating from Kent Beck’s twelve Extreme Programming practices [7]. It is characterized by the frequent integration of code changes, often several times a day [3, 8, 6]. Each of these integrations is checked by automated tests; if all of them pass, the integration results in a new working build of the software.

3.1.1

Common Elements of Continuous Integration

While CI is becoming a standard in software development, there is still no consensus on how these systems should be implemented [9, 3]. Similarly, while various tools are available to facilitate the use of CI, such as Jenkins or TeamCity, the practice does not require any particular tools to be adopted [8]. That being said, there are a handful of common practices that are widely associated with CI and can be viewed as a kind of guideline. In his commonly cited article [8], Martin Fowler highlights the importance of good version control, daily integrations, the use of integration machines or servers, and automating the build. In his book, Paul M. Duvall [6] mentions several common rules, such as developers running private builds on their own systems before committing to the shared version control repository, integrating at least once a day, and not accepting any build that does not pass all the tests.

ID   Continuous Integration cornerstone
C1   All developers run private builds on their own workstations before committing their code to the version control repository, to ensure that their changes don’t break the integration build
C2   Developers commit their code to a version control repository at least once a day
C3   Integration builds occur several times a day on a separate build machine
C4   100% of tests must pass for every build
C5   A product is generated that can be functionally tested
C6   Fixing broken builds is of the highest priority
C7   Some developers review reports generated by the build, such as coding standards and dependency analysis reports, to seek areas for improvement

Figure 3.1: The components of a CI system [6]

3.1.2

Benefits of Continuous Integration

Just like the specific implementations, the exact benefits CI provides are still debated. The most commonly mentioned benefits are an increased frequency of software releases, a shortened feedback cycle [3], less frequent regressions whose sources are easier to locate [8, 10], and increased productivity [11, 4]. While these benefits (among others) are frequently mentioned in the literature, only a few research papers have explored the validity of these claims. The two studies explicitly focused on reviewing these benefits [12, 13] both show that all of these claims are at least partially true and that projects implementing CI do perform better than those that do not. That being said, the increasing popularity of CI warrants more research in this area.

3.2

Machine Learning

In recent years, machine learning has become widely adopted, bringing breakthroughs in many fields ranging from agriculture [14], through bioinformatics [15], healthcare [16], marketing [17], natural language processing [18], to neuroscience [19] and software engineering [20], to mention just a few examples. Using historical data, machine learning algorithms can learn from past experience by building models through statistical methods; these models can then be used to make predictions about previously unseen data. The main difference between machine learning and traditional programming is that this model is not explicitly programmed, but rather incrementally improved by the algorithms.

One way to classify machine learning approaches is by their learning signal, specifically into supervised, unsupervised, and reinforcement learning. Supervised learning is the most common form of machine learning [21]; it aims to find patterns in a given set of training data that contains pairs of input vectors and their appropriate output values or labels. The result of this learning is a model that attempts to correctly reproduce this input-output mapping for new data points. A subclass of supervised learning is semi-supervised learning, where the output values are only given for a subset of the training examples. Unsupervised learning uses a set of training data as well, but it only contains input vectors without any output values; it is the task of the learning algorithm to identify common structures within the input data and categorize it accordingly. Because the examples are unlabeled, it is often difficult to evaluate the performance of these algorithms. The third group is reinforcement learning, where instead of learning from a fixed input data set, the algorithm carries out various actions (such as playing video games [22]); these actions are rewarded or in some cases penalized according to the learning task, and the algorithm learns to prioritize certain actions in order to maximize the received rewards.

Figure 3.2: The outputs of each layer (horizontally) of a typical convolutional network architecture applied to the image of a dog [21]

Machine learning algorithms can also be categorized by their desired outputs. Classification is one of the most common applications; its output is a model that can select the most fitting class (among a number of predefined classes) for unseen data. Another popular application is regression, which produces a model that can be used for prediction and forecasting. A third common task is clustering, which resembles classification with the difference that the classes are not predefined; instead, the data points are grouped so that they are more similar to the points in their own group than to those in other groups.
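
To make the supervised setting above concrete, the following minimal sketch (the dataset and model choice are arbitrary illustrations, not part of this thesis’ case study) fits a classifier on labeled examples and then predicts labels for unseen inputs:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled training data: input vectors X paired with their class labels y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The model is not explicitly programmed; it is fitted to the input-output pairs.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The learned model then predicts labels for previously unseen data points.
print(model.predict(X_test[:5]))
print(model.score(X_test, y_test))  # fraction of correct predictions (accuracy)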

3.2.1

Deep Learning

For many years, designing a suitable feature extractor, the component that transforms the raw data into the desired features used by the machine learning algorithms, was a considerable challenge [21]. Recently, a new branch of machine learning has gained popularity by dramatically improving speech and object recognition without the need for tedious feature engineering. Deep learning uses representation methods that can automatically discover suitable representations from raw data. By composing multiple layers of representations, very complex functions can be learnt, tackling challenges that even the best efforts of machine learning experts could not solve before. Like other machine learning techniques, deep learning can be based on supervised, unsupervised, or reinforcement learning. Deep learning has proved to be better than conventional machine learning systems not only in image [23, 24, 25] and speech recognition [26, 27, 28], but also in playing games [29, 22], malware detection [30], drug research [31], controlling self-driving cars [32], and many other tasks.

That being said, while deep learning has many benefits, it is not without drawbacks. To perform better than other machine learning algorithms and produce the expected results, deep learning requires a vast amount of training data. Computational time is also considerably longer than for classic algorithms: more complex models can take days of training time on hundreds of machines, even using state-of-the-art equipment [33, 34, 35, 36].


Chapter 4

Barriers

This chapter outlines the various challenges that in one way or another make the adoption of CI difficult. The challenges listed here are only the ones that are specific to machine learning and that are in some way related to CI.

4.1

Challenges with the Training Data

The quality and quantity of the training and testing data is critical for effective machine learning, making it at least as important as the algorithms, software, and infrastructure used [37, 38, 39, 40, 41, 42, 43, 44, 45]. In the case of deep learning systems this is even more crucial, as they rely on large quantities of relatively clean data more than conventional machine learning algorithms do [46]. Gathering this data is challenging even in the era of Big Data; collecting samples is often hard, expensive, and time consuming [41, 47, 48]. Reusing data classified by an algorithm to gain more training samples for building better algorithms may seem like an obvious solution [41], but misclassified samples can easily lead to hidden feedback loops [49]. With the increasing popularity of machine learning systems, problems in these data sets can have far-reaching consequences [42]. It is well known that noise in the data (most importantly class noise, which is shown to be more harmful than feature noise [50, 51, 52, 53]) can significantly affect the quality of trained models [54, 55, 56, 44, 43]. Therefore, an essential but often overlooked stage of machine learning pipelines is handling dirty data [54].

The process of cleaning dirty data usually consists of two main steps: first the errors are detected, after which the data can be repaired [45]. Error detection is considered the easier of the two, as it can be fulfilled with simple integrity rules on the input data. Error repair, on the other hand, is substantially more difficult and often requires tedious manual work [54]. Besides noise detection, another way to handle dirty data is to build robust, noise-tolerant systems [43], but it is argued that this approach is less effective than noise elimination [57].

4.1.1

Challenges in Production Machine Learning

Polyzotis et al. [41] highlight several issues that commonly occur in production machine learning with regard to data. First, the quality of the model is highly dependent on the validity of the data; invalid data can also cause outages. As stated earlier, data noise can have significant negative effects on model performance, but data validity is more than just the cleanness of the data: ensuring that the data has the expected features, that the values of these features fall within an expected range, and that these features correlate as expected is equally important. When accessing the data, one needs to consider the size of the training set (models are often trained on millions of examples), the size of each data point, and the overall throughput of the system. Correct data preparation is also crucial; the validity of the data is irrelevant if it is not processed correctly before training. Tasks that may be considered data preparation include feature engineering, adding new attributes or examples to the training data, and transcoding these values for training. While one of the main advantages of deep learning is that it removes the need for manual feature engineering, in traditional machine learning this task is still hard and time consuming.

Solutions used in database systems might alleviate some of these issues, but machine learning also introduces new challenges. For example, if the schema of the data constraints is a property of the pipeline but not part of the process generating the data, the two may evolve separately even though they are tightly coupled. Another challenge specific to machine learning is its unique constraints that need checking, such as bounds on the drift of feature values or an embedding for some input features: training-serving skews and time traveling are results of suboptimal data management and they almost always lead to incorrect predictions by the model [41]. A well-implemented data infrastructure that addresses these problems can speed up development, resulting in more frequent integrations, and facilitate automated testing.

Summary of the challenges concerning the training data:

• Gathering training data is slow, expensive, and mostly done manually
• Noisy data needs to be handled lest it degrade the model
• Suboptimal data infrastructures pose a number of risks, such as training-serving skews

4.2

Challenges with Version Control

Version control should be a standard for any project whether it uses CI or not, but for projects using CI it is simply an essential requirement [6]. Version control repositories manage changes in the source code and other software assets, allowing all members of the team to access them from a single source. They also keep track of previous versions of the code, so that it can be rolled back to a previous working version if needed. While the use of these repositories is common practice nowadays, according to Fowler [8] a frequent mistake when using CI is not saving everything needed for the build, e.g. test scripts, database schemas, install scripts, and third-party libraries.

When it comes to version control in machine learning, new problems emerge. The first challenge comes from the size of machine learning models, which can easily reach several GB. Whenever a change is put into place, the new version needs to be saved in the repository, making it grow by the size of the file; not only does this result in a large repository to maintain, but current version control systems such as Git are simply not designed to handle this task. While Git Large File Storage (LFS) [58] is specifically designed to handle large files, it has yet to be supported by most CI tools and it does not fully solve the problem of model and data management.

Another issue with machine learning is the fact that the build is highly dependent on the training data [37]. If one wants to adhere to the practice of storing everything needed for a build in the version control repository, they also need to save the training and testing data, the hyperparameters, and other artifacts such as the feature sets and predictions [59]. Storing all of these poses a challenge on its own, and on top of this, extracting some of the artifacts might not be a straightforward process. Schemas storing the metadata can also become large, as many machine learning pipelines have thousands of features [37].

Summary of the challenges concerning version control:

• Traditional version control systems are not optimal for handling large files such as machine learning models
• Not only do changes to the model need to be tracked, but also changes to the training data and the hyperparameters, which poses additional challenges

4.3

Challenges with Model Testing

Software testing is an essential part of software development projects; it is a standard approach to validate and verify that the code works as expected. Typically, testing is carried out by executing a program (or parts of it) with a given test input; to detect faults, a test oracle is needed that determines whether the output of the program was correct, often by comparing it with an expected output [60]. In CI, testing plays an important role, as any change to the main codebase must first pass a number of automated tests designed to assure that the change has not introduced any regressions [7, 8]. Research into the challenges of adopting CI in various areas of software development shows that the lack of adequate automated tests is one of the most common issues faced by teams attempting the transition [3, 2, 4].

4.3.1

Testing

When it comes to machine learning, software testing can be difficult. The first thing to consider is that while a test oracle is necessary to carry out conventional testing, in some cases such an oracle may not exist [61]. Due to various reasons, such as complex input-output relations, determining the expected result for a given input may be impossible without a theoretically perfect version of the program, which is impossible to create in practice. A system faces the oracle problem if, for a given input, predicting the correct output is not obvious or is error-prone [60]; programs that face the oracle problem, a class that includes supervised classifiers [62], are generally referred to as non-testable [63].

Another aspect of machine learning that makes testing difficult is its complexity; machine learning functions are often treated as black boxes [64, 65, 66, 67], which makes comprehensive testing challenging. While many advancements have been made in making these models more understandable [68, 69, 67], testing is still more trial-and-error than systematic methodology [61, 66].

4.3.2

Model Evaluation

While testing might be challenging, evaluating the performance of models is easier and more commonly carried out. Assessing a learned model’s prediction ability on independent data is an essential step in development, making correct evaluation extremely important [70, 71]. Evaluation relies on the metric used, the most popular single performance metric being accuracy; it can be defined as the fraction of correctly classified instances or the degree of right predictions made by the model [71, 72, 73, 74]. Accuracy is popular due to its intuitive nature and ease of use, yet in some cases a model with lower accuracy might have greater predictive power than one with higher accuracy; this problem is referred to as the accuracy paradox [73, 75, 72]. When the training data is highly imbalanced or skewed, which is often the case in real-world machine learning applications, the capability of models to predict the behavior of a phenomenon may be compromised and accuracy may fail as a representative metric [73]. Random chance can also play a role in measuring accuracy, hindering its reliability as an effective metric of performance even further [75].

Summary of the challenges concerning testing:
• Machine learning suffers from the oracle problem
• Machine learning functions are often treated as black boxes

• Model evaluation often relies on accuracy, but this metric is not without its drawbacks

4.4

Challenges with Automating Build

One of the most important characteristics of CI is the short and frequent build integration cycle, emphasising multiple check-ins a day. Each time new code is checked in, the build needs to be rebuilt from scratch and pass all automated tests to be accepted into the main repository. Since the developers are blocked until their code is accepted or refused, this regression time needs to be short in order to maximize productivity. Humble & Farley [76] believe that a build should not take longer than 10 minutes, and Brooks suggests that a 2-minute build time is optimal [77], but in reality this is not always the case; one study showed that regression feedback times could take up to days or even weeks (which might still be preferred over big bang integrations) [3]. In fact, long feedback times are a common barrier encountered when trying to adopt CI [3, 2, 4]. Issues resulting in long feedback times can originate from having a large number of components, using multiple programming languages, or having a tightly coupled system [78, 4]. Sometimes the sheer number of commits can cause problems; at Google, for example, there is on average a code commit every second, rendering individual regression tests infeasible [79]. One solution to this problem is to break a large build down into smaller components [80, 2].

As can be seen, adhering to short and frequent code commits can pose challenges even in traditional software development, but when it comes to machine learning, these challenges become far more severe. As mentioned earlier, training complex models demands considerable computing resources and time [33, 34, 35, 36]; rebuilding the model and testing it in under 10 minutes after every small change is therefore virtually impossible. Because data is a crucial element of machine learning, when it changes considerably most statistical models need to be retrained from scratch [81]. It is also difficult to break model training down into smaller tasks, as it is a tightly coupled process where each layer depends on the previous one [82].

Summary of the challenges concerning automating build:

• Rebuilding the model from scratch and testing it in under 10 minutes after every small change is virtually impossible
• Model training is a tightly coupled process, making small, independent changes difficult

4.5

Challenges Within the Industry

In recent years, the field of machine learning has progressed at an immense pace. With breakthroughs such as image classifiers dropping error rates by an order of magnitude [83], deep learning systems driving cars with only a single camera [32], and systems defeating the best human players in Go [29], it is easy to forget that machine learning is still in its infancy, with the industry facing various challenges linked to the maturity of its tools and practices. Research has shown that the mindset of the developers and the maturity of the tools and frameworks used by them play an important role in adopting CI [3]; these are therefore important factors to consider for machine learning as well.

While machine learning has the potential to tackle previously unsolvable problems, these systems not only have all the code complexity issues of traditional software, they also carry their own hidden technical debt [49, 41]. More often than not, traditional programming is discrete, deterministic, and focused on modules and lines of code; in contrast, machine learning is stochastic, and instead of explicitly describing the desired program behavior, that behavior must be learnt from data pipelines [84]. On top of this, machine learning algorithms are to this day mostly treated as black boxes due to their complexity [65, 82, 49]; development typically relies on time-consuming prototyping and trial-and-error exploration [65, 85, 41] that produces systems with unintended tight coupling, large masses of glue code, and unnecessary general purpose solutions [49]. Although this experimentation can lead to excellent solutions, due to challenges with record-keeping the alternative strategies explored may get lost as developers move on [82].

4.5.1

Software Engineers

Another thing to consider is that these systems are seldom developed by ordinary software engineers with explicit knowledge of these challenges [84], but rather by machine learning experts and data scientists who often have other aspects of the system to focus on first. Even when these algorithms are implemented by engineers, testing may be difficult to carry out due to the difference in working culture between the two groups. Research also shows that a lack of background knowledge in software engineering can result in testing being conducted in an unsystematic way [61]; this lack of standardized testing can lead to trouble when it comes to reproducing results or comparing them with other models [59]. In some cases, even when the importance of systematic testing is understood, the implementation might still be missing, as research at Google has shown [5]. Paying off these technical debts is crucial for long-term system health and may enable cutting-edge improvements [49], but the pace of progress in the industry is argued to push researchers and developers to neglect empirical rigor in the pursuit of scientific breakthroughs [83]. This pressure could also explain the phenomenon of practitioners often overestimating the performance of their models [82, 67].

The tools and frameworks used in machine learning also show symptoms of immaturity. While most machine learning algorithms have a strong scientific background, there is currently no systematic way to test whether they are implemented correctly in practice [62]. Another problem is that while debugging traditional software is an organic part of development, in machine learning developers require a disruptive cognitive switch from model building to analyzing the model [85], which can slow down the overall development process [86].

Summary of the challenges concerning the industry:

• On top of traditional code complexities, machine learning carries additional hidden debts
• Machine learning development still heavily relies on time-consuming prototyping
• Sound software engineering practices are seldom applied in machine learning
• The tools and frameworks used in machine learning also suffer from immaturity


Chapter 5

Proposed Solutions

This chapter outlines the various solutions proposed to overcome the challenges listed in Chapter 4. The solutions are split into two subcategories: those working on the data level and those concerned with the trained models.

5.1

Data

5.1.1

Noise Elimination

It is a well-known fact that the quality of machine learning models is highly dependent on the data they were trained on; it has been shown that even a small amount of noisy data can have a big impact on the performance of classifiers [87, 88, 89, 44, 45]. The two most important characteristics of this data are its features and labels, both of which can suffer from noise [39, 90, 40, 91, 44, 43]; noise in this sense can be defined as something that obscures the relationship between the features and their classes [92], or simply as errors in feature values [50] and mislabeled data [43].

There are two main branches of techniques aimed at alleviating the negative effects of noisy data: noise tolerance and noise elimination. Noise-tolerant approaches aim to design robust algorithms that are insensitive to noise and therefore do not need to remove noisy instances [43, 44, 93, 38, 94, 95, 96]; two common methods to achieve this are rule truncation [97] and tree pruning [98]. Studies [50, 52, 53] have shown that cleaning feature noise may reduce the predictive power of a model if the same noise is expected to be present when the model is put to use; on the contrary, cleaning label noise results in a model with better accuracy than one trained on the noisy data. While noise tolerance might work well, Gamberger et al. [57] demonstrated that noise elimination is more effective; on top of that, elimination techniques are also easier to integrate into existing systems as they are independent of the underlying algorithms. Noise elimination techniques work by identifying and removing noisy instances from the dataset prior to feeding it to the learning algorithms [43, 44, 99]. These techniques can be further categorized by the specific strategy they use to approach the problem of filtering the data.

One common strategy is to use classification algorithms such as K-nearest neighbors (KNN) that compare the label of each data point with the labels of its neighbors; if most of them are different, then the data point is treated as mislabeled [44, 100, 101]. While this concept tends to work quite well most of the time, there exist data distributions where neighboring samples legitimately have different labels. A more widely used approach is based on ensemble training, where ensembles of multiple classification algorithms are used to produce a more reliable result [44].
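
As a small, hedged sketch of the neighborhood-based strategy (the data, the value of k, and the injected noise below are invented for illustration), a sample is flagged when the majority of its nearest neighbors carry a different label:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # synthetic feature vectors
y = (X[:, 0] > 0).astype(int)            # synthetic labels
y[:5] = 1 - y[:5]                        # inject label noise for illustration

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)                # idx[:, 0] is the sample itself

suspects = [i for i, neighbors in enumerate(idx[:, 1:])
            if (y[neighbors] != y[i]).sum() > k // 2]
print("potentially mislabeled indices:", suspects)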

Consensus/Majority Filtering

In their works, Brodley & Friedl [102, 103] proposed two filtering methods, majority filtering (MF) and consensus filtering (CF), that make use of multiple classifiers trained on various subsets of the noisy data. After the training, each sample is classified by each classifiers. The steps of MF/CF are the following: First the noisy data set E is partitioned into n equal sized disjoint subsets and an empty output set A that will hold the detected noisy samples. For each subsets of training data Ei a

(19)

complement set Et is created where Et = E/Ei and proceed to train each classifier on Et. After the

training is done, each element e of Eiis classified by the algorithms; if at least half of the classifiers (in

MF) or all of them (in CF) disagree on the class of a given sample, it is considered as mislabeled. This process is then repeated for each subset Ei. This method requires two conditions to be fulfilled: the

predictions made by the classifiers must be independent of each other and each classifier’s probability of a correct classification needs to be greater than 0.5. CF is more conservative than MF at disposing samples, but it carries an increased risk of leaving mislabeled data in the training set.

Algorithm 1 Consensus Filtering (CF)
Input: E (training set)
Parameters: n (number of subsets), y (number of learning algorithms), A1, A2, ..., Ay (the y learning algorithms)
Output: A (detected noisy subset of E)

form n disjoint, almost equally sized subsets Ei, where ∪i Ei = E
A ← ∅
for i = 1, ..., n do
    form Et ← E \ Ei
    for j = 1, ..., y do
        induce Hj based on the examples in Et and algorithm Aj
    end for
    for every e ∈ Ei do
        ErrorCounter ← 0
        for j = 1, ..., y do
            if Hj incorrectly classifies e then
                ErrorCounter ← ErrorCounter + 1
            end if
        end for
        if ErrorCounter = y then
            A ← A ∪ {e}
        end if
    end for
end for
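
The following Python sketch mirrors the consensus-filtering procedure above; the choice of classifiers, the number of subsets, and the helper name are illustrative assumptions rather than the authors’ reference implementation:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def consensus_filter(X, y, n_subsets=4):
    # Return indices of samples that every classifier, trained on the
    # complementary data Et = E \ Ei, classifies differently from their given label.
    learners = [GaussianNB(),
                DecisionTreeClassifier(random_state=0),
                KNeighborsClassifier(n_neighbors=3)]
    noisy = []
    kf = KFold(n_splits=n_subsets, shuffle=True, random_state=0)
    for train_idx, eval_idx in kf.split(X):
        disagreements = np.zeros(len(eval_idx), dtype=int)
        for clf in learners:
            clf.fit(X[train_idx], y[train_idx])
            disagreements += clf.predict(X[eval_idx]) != y[eval_idx]
        # Consensus filtering keeps only unanimous disagreement;
        # use >= len(learners) / 2 instead for majority filtering.
        noisy.extend(eval_idx[disagreements == len(learners)].tolist())
    return noisy

Calling consensus_filter(X, y) on a labeled training set returns candidate indices that can be dropped before the final model is trained.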

Consensus/Majority Filtering with the Aid of Unlabeled Data

Guan et al. [43] designed a new variant of MF and CF that makes use of unlabeled data to improve the performance of the filtering algorithms. Their proposed method, called consensus/majority filtering with the aid of unlabeled data (CFAUD/MFAUD), is based on the fact that while labeled data is usually difficult and expensive to gather, unlabeled data is generally abundant. CFAUD/MFAUD works in a similar way to CF/MF, with an added en-co-training step in which the unlabeled data samples receive a predicted label; this prediction is carried out by an ensemble of classifiers in an unsupervised manner. The research results show that MFAUD/CFAUD perform better at identifying noisy labels than the regular CF/MF.

Cost-sensitive Elimination of Mislabeled Training Data

Another approach extends CF/MF by taking into consideration the cost of misclassifying samples as false positives or false negatives [44]. When it comes to noise filtering, there are two possible errors: identifying correct labels as incorrect (type 1) or incorrect labels as correct (type 2). Type 1 mistakes reduce the size of the training set, reducing the robustness of the learnt model, while type 2 mistakes leave noisy labels behind that can degrade its accuracy. The cost of these two types of errors might be vastly different depending on the data and the application of the model, but existing data filtering methods do not address these costs.


Algorithm 2 Ensemble-based Co-training (En-co-training)
Input: E (training set), U (unlabeled set)
Parameters: k (number of iterations), u (number of initially selected unlabeled instances), y (number of learning algorithms), A1, A2, ..., Ay (the y learning algorithms)
Output: TU (selected unlabeled instances from U with predicted labels)

create U' by choosing u instances at random from U
TU ← ∅
for i = 1, ..., k do
    U ← U \ U', num_before ← |TU|
    for j = 1, ..., y do
        induce Hj based on the instances in E and algorithm Aj
    end for
    for every t ∈ U' do
        for j = 1, ..., y do
            plj(t) ← Hj(t)    (predicted label of Hj on t)
        end for
        if pl1(t) = pl2(t) = ... = ply(t) then
            TU ← TU ∪ {(t, pl1(t))}, U' ← U' \ {t}
        end if
    end for
    E ← E ∪ TU
    num_after ← |TU|, ∆num ← num_after − num_before
    if |U| > ∆num then
        randomly choose ∆num instances from U to replenish U'
    end if
    if 0 < |U| < ∆num then
        choose all instances of U to replenish U'
    end if
    if |U| = 0 then
        exit
    end if
end for


Algorithm 3 Consensus Filtering with the Aid of Unlabeled Data (CFAUD)
Input: E (training set), U (unlabeled set)
Parameters: n (number of subsets), y (number of learning algorithms), k and u (see Algorithm 2, En-co-training), A1, A2, ..., Ay (the y learning algorithms)
Output: A (detected noisy subset of E)

form n disjoint, almost equally sized subsets Ei, where ∪i Ei = E
A ← ∅
for i = 1, ..., n do
    form Et ← E \ Ei
    TU ← En-co-training(Et, U, k, u, A1, A2, ..., Ay)
    Et ← Et ∪ TU
    for j = 1, ..., y do
        induce Hj based on the examples in Et and algorithm Aj
    end for
    for every e ∈ Ei do
        ErrorCounter ← 0
        for j = 1, ..., y do
            if Hj incorrectly classifies e then
                ErrorCounter ← ErrorCounter + 1
            end if
        end for
        if ErrorCounter = y then
            A ← A ∪ {e}
        end if
    end for
end for


The proposed cost-sensitive repeated majority/consensus filtering algorithms (CSRMF/CSRCF) make use of a cost matrix to produce a lower total cost than cost-blind mislabeling filters.
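
A tiny illustrative sketch of the cost-sensitive idea (the error counts and cost values below are invented): a cost matrix converts a filter's two error counts into a single comparable number, so filters can be ranked by expected cost rather than by raw error counts:

# Invented costs: discarding a clean sample (type 1) costs 1 unit,
# keeping a mislabeled sample (type 2) costs 5 units.
COST_TYPE1 = 1.0
COST_TYPE2 = 5.0

def filtering_cost(n_type1: int, n_type2: int) -> float:
    # Total cost of a filter's mistakes under the given cost matrix.
    return n_type1 * COST_TYPE1 + n_type2 * COST_TYPE2

# Hypothetical behaviour of two filters on the same data set.
print(filtering_cost(n_type1=40, n_type2=2))   # aggressive filter: 50.0
print(filtering_cost(n_type1=5, n_type2=15))   # conservative filter: 80.0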

Human-in-the-loop

A relatively common practice in the industry is having a human in the loop manually correcting labels, often crowdsourced through a framework such as Amazon Mechanical Turk [104, 105, 106, 107, 108, 109, 110, 99]. While crowdsourcing is believed to be cheap and feasible [111], this process somewhat contradicts the automated nature of CI, so manual noise elimination will not be considered as a solution for noisy data.

5.1.2

Version control

One of the biggest deficiencies in machine learning with regard to CI is the lack of adequate version control. Machine learning is an iterative process in which practitioners collect data, create features, train a model, and inspect its performance to determine the next iteration [85]. This process often involves a lot of prototyping, resulting in a number of different strategies. Unfortunately, once one of these strategies is selected, the others often get forgotten as development moves on [82]. Version control for machine learning would enable not only the record-keeping of these different approaches, but also the comparison of models over time and an easy roll-back in case of integration issues. It is also crucial for CI and automated builds. As mentioned in the previous chapter, a number of challenges arise when one tries to version control a machine learning pipeline.

Git and Git LFS

In traditional software development, one of the most common tools for version control is Git, thanks to its speed, data integrity, and support for distributed workflows. While Git excels at these tasks, it is not without drawbacks. First, Git was not designed with large files in mind, which is an issue in machine learning where models can easily grow to several GB in size. Another issue is the number of files; Git tends to handle repositories as a whole, even if only a small slice is required. This can be limiting if one wants to keep track of metadata or of training sets that can contain millions or even billions of examples [41]. While Git LFS can handle large files, it still does not provide extra support for large dataset management or for the automatic handling of artifacts and metadata that is crucial for reproducibility.

Data Version Control

One promising candidate for version control is Data Version Control (DVC) [112], a Git extension specifically designed for the data science community. Released in 2017, DVC is a state-of-the-art version management system that emphasizes the handling of large files and reproducibility. Instead of storing binaries in a Git repository, DVC streamlines large files and models into a single Git environment. It can handle files of hundreds of gigabytes by using hardlinks instead of copying files from the cache to the workspace; this way the same version of a model can belong to several experiments without the need for data duplication. DVC also supports the automatic tracking of various metrics as well as failures, which can in some cases be as valuable as successful experiments. To further facilitate reproducibility, DVC has built-in functions that let it track dependencies in a dependency graph. As machine learning is an iterative process, DVC was designed to keep track of all steps, the dependencies between the steps, the dependencies between code and data files, and all code-running arguments [112].
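
As a hedged sketch of how such versioned data might be consumed from Python (the repository URL, file path, and tag below are placeholders, not artifacts of this thesis), DVC's Python API can read a specific, pinned revision of a tracked file so that a training run is tied to exactly one version of its data:

import dvc.api

# Placeholder repository and path; "v1.0" is a Git tag pinning code and data together.
data = dvc.api.read(
    "data/train.csv",
    repo="https://github.com/example/ml-pipeline",
    rev="v1.0",
)

# A model trained from this revision can later be reproduced by checking out
# the same tag and re-running the pipeline.
print(len(data))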

Sacred

Another useful tool that aims to tackle the challenge of reproducibility is Sacred [113]. Sacred was released in 2014 by the Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA) with the aim of encouraging modularity and configurability of computational experiments. Sacred automatically captures local variables and tracks them as configuration entries; these entries can then be injected into captured functions. It also logs all information about the experiment. Another useful feature of Sacred is the seeding of randomness, which not only facilitates reproducibility but also alleviates the evaluation challenges connected to the random nature of machine learning [113]. Together, these features may help data scientists experiment in a systematic manner.
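
A minimal sketch of a Sacred experiment (the experiment name and configuration values are made up for illustration): the configuration function is captured automatically and a seed is injected, so the run can be reproduced later:

from sacred import Experiment

ex = Experiment("noise_filtering_demo")  # hypothetical experiment name

@ex.config
def config():
    # Local variables here are captured as configuration entries.
    n_classifiers = 3
    noise_threshold = 0.5

@ex.automain
def run(n_classifiers, noise_threshold, _seed):
    # _seed is supplied by Sacred, making random behaviour reproducible.
    print(n_classifiers, noise_threshold, _seed)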

ModelDB

Figure 5.1: ModelDB overview [114]: (a) Models view, (b) Pipelines view

In their work, Vartak et al. [114] describe ModelDB, a novel end-to-end system for the management of machine learning models. ModelDB allows the tracking and analysis of machine learning models and their metadata, and even the management of multi-stage pipelines, including pre-processing, training, and testing steps. To achieve this, ModelDB uses three key components: native client libraries for different machine learning environments, a backend that handles keys and access to the storage, and a visual user interface. This architecture allows experimenters to work in their preferred machine learning environment while still being able to track their progress. ModelDB also allows the analysis of the tracked models through a visual interface or, if preferred, through SQL querying [114].

Automatically Tracking Metadata

Finally, Schelter et al. [59] propose a system that automatically extracts and manages the metadata of machine learning experiments, such as the hyperparameters, feature transformations, and training datasets. Tracking these parameters may enable data scientists to automatically compare models in development with older models, the same way traditional software developers carry out regression tests. The two most important aspects of this system are declarativity and immutability: only the metadata of the artifacts is stored, not the code itself, and it can be written only once [59].

5.1.3

Data infrastructure

How data is managed in a machine learning pipeline is just as important as the quality of the data itself; if the data is incorrectly processed, the information derived from it will be incorrect as well. Proper data management is a challenging task: the data is usually large and may arrive continuously and in chunks, and it can contain errors that propagate through the system if left unhandled. Besides the overall benefit of fewer errors caused by faulty data, a data infrastructure is important for implementing CI for one main reason: automation is a cornerstone of CI, and a data infrastructure could automatically handle new data and possibly train and serve new models without the need for human intervention.

Generally, data management entails four high-level data activities: understanding, preparing, validating, and fixing [41]. First, data understanding can consist of simply checking the features’ maximum, minimum, and most common values, checking whether they appear in enough examples, and whether they have the right number of values. Next, data preparation can cover all tasks related to feature engineering and adding new attributes or examples to the training data. Data validation has several facets, such as ensuring that features are correlated as expected and that the serving data does not deviate from the training data. Finally, data fixing consists of the three sequential tasks of understanding where the error occurred, understanding its impact, and fixing the error [41].
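
A minimal illustration of the understanding and validation activities (the column names, value ranges, and label set below are invented), expressed as simple schema- and range-style checks over a training set before it is handed to the learner:

import pandas as pd

# Hypothetical training data with an invented schema.
df = pd.DataFrame({
    "duration_sec": [12.0, 35.5, -1.0],   # -1.0 violates the expected range
    "label": ["ad", "show", "ad"],
})

expected_columns = {"duration_sec", "label"}
expected_labels = {"ad", "show"}

# Schema check: all expected features are present.
assert expected_columns.issubset(df.columns), "missing feature columns"

# Range check: durations must be non-negative.
bad_rows = df[df["duration_sec"] < 0]
if not bad_rows.empty:
    print(f"{len(bad_rows)} rows violate the duration range check")

# Domain check: labels must come from the known set of classes.
assert set(df["label"]).issubset(expected_labels), "unexpected label value"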

Google Data Infrastructure

In their work, Breck et al. [37] describe the data infrastructure they have built at Google. Their system allows the continuous checking of errors in both the training and serving data, the detection of drift between instances, and the analysis of data errors and their effect on the correctness of the models. The infrastructure is based on a data schema that describes the expectations for correct data. This schema represents a logical model of the data that contains constraints and captures the semantics necessary for validation and testing. The system was also built with evolving data in mind; it can suggest changes to the schema for errors that occur due to data evolution, such as the appearance of new domain values. On top of validating the data, the system can also be used to generate synthetic data for testing purposes, to check the code’s ability to run, process data, and call machine learning APIs. The data infrastructure has already been deployed and can analyze and validate more than one petabyte of data per day.

KeystoneML

When practitioners build complex pipelines, they typically mix domain-specific libraries for feature extraction with general purpose packages for supervised learning, but this is a tedious and error-prone process. On top of this, when the training data or the features grow significantly, these pipelines need to be completely re-engineered. Sparks et al. [115] present a solution to these problems in the form of KeystoneML, a framework for machine learning pipelines designed to let developers specify an end-to-end machine learning pipeline using high-level logical operators, scale dynamically as the data and the problem complexity grow, and automatically optimise these pipelines.

Data Linter

Catching problems such as numerical features on widely differing scales or malformed string values in datasets is commonly performed as part of the cleaning process, but even with automation it can be time consuming and error-prone. Hynes et al. [42] proposed the data linter, a lightweight tool that analyzes a user’s training data and suggests ways features can be transformed to improve model quality. To identify potential issues, it inspects the data’s summary statistics as well as individual examples, and even considers the names developers give to their features; for each potential issue identified, the data linter produces a warning, a recommendation for how to improve the feature’s representation, and a concrete example of the lint taken directly from the data. The potential data lints can be classified into three higher-level categories: miscodings of data, outliers, and packaging errors. Miscoding lints attempt to identify data that should be transformed to improve the likelihood that a model can learn from it. Outlier and scaling lints attempt to identify likely outliers and scaling issues in the data. Finally, packaging error lints identify problems with the organization of the data.


5.2

Model

5.2.1

Metamorphic Testing

Automated unit and integration tests are the backbone of CI; they make frequent integrations possible as developers have to spend less time debugging, and they maintain the quality of the production code. Unfortunately, automatically testing machine learning code is still considered a serious challenge due to the so-called oracle problem. Normally, software is tested with a test oracle that can determine the expected correct output for a given input. When a program’s input-output relation is complex or error-prone, as with machine learning algorithms, it is said to be non-testable or to face the oracle problem. As automated tests typically rely on test oracles, the oracle problem is considered one of the greatest challenges in software testing [61]. One solution proposed for this problem is the use of pseudo-oracles: a pseudo-oracle is a program that is developed independently from the program to be tested but carries out the same functionality. The idea is to run the two programs on identical input sets and compare their results; if the outputs differ, both programs are examined with standard debugging techniques. This process is repeated until all disagreements between the programs are resolved [63]. While the idea behind this concept is sound, it is easy to see why in practice it might not work so well. Producing two programs with the same functionality but different working mechanisms is a challenge on its own, even if only the increased number of man-hours is considered, but when it comes to machine learning, the randomness and the long training times add another layer of difficulty.

A more promising method for alleviating the oracle problem is Metamorphic Testing (MT). MT was first proposed by Chen et al. [116] as a method to generate new test cases based on successful ones, but it quickly became widespread as a technique for testing non-testable programs. MT is based on the idea that even if one cannot directly predict the correct output for a given input, it is often still possible to reason about the relations between inputs and outputs [60]. By using certain properties of a function, called Metamorphic Relations (MR), it is possible to predict some expected changes to the output for particular changes to the input; if the change in the output is not as expected, the function can be deemed faulty even without knowing the correct output for the given input [62]. One typical example of MT is testing an implementation of the sine function: in this case, an MR to use would be the property according to which sin(-x) = -sin(x). Without knowing the exact value of sin(x) for any x, it is still possible to tell that the program is faulty if this property does not hold for a given input value [60]. While the fact that a particular MR is satisfied does not guarantee that the program is error-free, the violation of an MR does indicate the presence of faults [61]. Apart from identifying MRs, MT can also be highly automated [60], making it ideal for CI in any domain suffering from the oracle problem.
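
A minimal sketch of this sine example (tolerance and input range chosen arbitrarily): the test checks the relation sin(-x) = -sin(x) over random inputs without ever needing an oracle for the exact value of sin(x):

import math
import random

def sine_under_test(x: float) -> float:
    # Stand-in for the implementation being tested.
    return math.sin(x)

def test_sine_negation_relation():
    for _ in range(1000):
        x = random.uniform(-10.0, 10.0)              # source test case
        # Metamorphic relation: sin(-x) = -sin(x); a violation reveals a fault
        # even though the exact expected value of sin(x) is never computed.
        assert abs(sine_under_test(-x) + sine_under_test(x)) < 1e-9

test_sine_negation_relation()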

Typically, MT consists of the following steps: first, a number of MRs are identified and their corresponding tests are constructed. Next, source test cases are generated for the program under test using any traditional testing technique, such as random testing. Finally, follow-up test cases are generated using the MRs; if the outputs of a source test case and its follow-up test case violate the metamorphic relation, the program can be considered faulty [60]. While this is the most common execution, Segura et al. [60] collected a number of other approaches, such as Automated Metamorphic Testing [117] or the use of genetic algorithms for the selection of source test cases [118].

MT has been demonstrated to work on both regular machine learning applications as well as deep learning ones: Xie et al. [62] identified previously unknown errors in the popular Weka software; Ding et al. [119] developed MRs that can validate deep learning frameworks on three different levels; namely on system level, data set level, and data item level.

Identifying Metamorphic Relations

While MT is both simple and efficient, one challenge that needs solving is the identification of MRs. The process of creating these relations is not self-evident, it generally requires domain knowledge, and the resulting relations may not apply to other applications [120]. It is also important to state that not all necessary properties of an algorithm are MRs, that not all MRs can be decomposed into separate input-only and output-only subrelations, and that MRs are not necessarily equality relations [121]. That being said, there are already a number of studies aimed at solving the challenge of identifying MRs.


Relation         Change made to the input
Additive         Add or subtract a constant
Multiplicative   Multiply by a constant
Permutative      Randomly permute the elements
Invertive        Take the inverse of each element
Inclusive        Add a new element
Exclusive        Remove an element
Compositional    Combining two or more inputs

Table 5.1: Metamorphic relations common in mathematical functions [61]

Figure 5.2: Detecting Metamorphic Relations using machine learning techniques [61]

Murphy et al. [120] defined six properties of supervised machine learning applications that can be used to form MRs; these properties are additive, multiplicative, permutative, invertive, inclusive, and exclusive. Kanewala & Bieman [61] present an approach to predict possible MRs using machine learning techniques.
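As an illustration of how one of these properties translates into an executable test, the sketch below checks the permutative relation on a naive Bayes classifier: randomly permuting the order of the training instances should not change its predictions. The use of scikit-learn and the iris dataset is an assumption made purely for demonstration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

def test_permutative_mr():
    X, y = load_iris(return_X_y=True)
    rng = np.random.RandomState(42)

    # Source test case: train on the data in its original order.
    source_predictions = GaussianNB().fit(X, y).predict(X)

    # Follow-up test case: train on a randomly permuted copy of the same data.
    permutation = rng.permutation(len(X))
    follow_up_predictions = GaussianNB().fit(X[permutation], y[permutation]).predict(X)

    # Permutative MR: the order of the training examples should not change
    # the predictions; a violation would indicate a fault in the learner.
    assert np.array_equal(source_predictions, follow_up_predictions), "Permutative MR violated"

test_permutative_mr()
```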

5.2.2 Improved Model Testing

In the iterative process of CI, evaluation is crucial in determining the effectiveness of an iteration and in preparing for the following one. This is no different in machine learning, where the model goes through many training iterations, after each of which its predictive performance is evaluated to make sure that it is progressively getting better. Currently the most popular metric for this is accuracy, which is simply the fraction of correctly classified instances [72]; while it is a convenient single figure of merit, its simplicity can mask important information about the true performance of the model. The accuracy paradox, class imbalance, or even random sampling can result in improved accuracy when the predictive power of a model is in fact decreasing [73, 75, 72]. Perhaps the best and most elegant way to address this problem is to combine accuracy with additional metrics that focus on other aspects of the model, in order to gain a more complete understanding of its real performance.
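The following sketch illustrates the problem on an artificially imbalanced example; the class distribution and the use of scikit-learn are assumptions made only for illustration. A classifier that blindly predicts the majority class reaches 95% accuracy, while precision, recall, and F1 immediately reveal that it has no predictive power at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Imbalanced ground truth: 950 negatives and only 50 positives.
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that blindly predicts the majority class for every instance.
y_pred = np.zeros_like(y_true)

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.95, looks impressive
print("precision:", precision_score(y_true, y_pred))  # 0.0 (sklearn warns it is undefined)
print("recall   :", recall_score(y_true, y_pred))     # 0.0
print("F1       :", f1_score(y_true, y_pred))         # 0.0
```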

Comparing Metrics

While there exists a large number of candidate metrics, it is important to ask whether they are sufficiently different from other metrics and therefore whether they reveal any new information about the model. In their work, Ferri et al. [71] attempt to answer this question by comparing the results of 18 unique performance metrics and analyzing their relationships. The metrics compared range from common ones such as accuracy or the F-measure to less often used ones such as the Brier score, Calibration by Bins, and various AUC scores. The comparison was conducted using six well-known machine learning algorithms and 30 small and medium-sized datasets; each algorithm was evaluated using 20x5-fold cross-validation on each of the 30 datasets, resulting in 18,000 results in total. For these results, the Pearson linear and Spearman rank correlations were calculated between all eighteen metrics.
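The sketch below shows how such pairwise correlations can be computed, under the assumption that per-run metric scores are already collected as columns of an array; the values and metric names used here are random placeholders, not the results of the cited experiment.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-run scores: rows are evaluation runs (algorithm, dataset,
# fold), columns are metrics such as accuracy, F-measure and AUC.
rng = np.random.RandomState(0)
scores = rng.rand(100, 3)
metric_names = ["accuracy", "f_measure", "auc"]

# Pairwise Pearson (linear) and Spearman (rank) correlations between metrics.
for i in range(len(metric_names)):
    for j in range(i + 1, len(metric_names)):
        linear, _ = pearsonr(scores[:, i], scores[:, j])
        rank, _ = spearmanr(scores[:, i], scores[:, j])
        print(f"{metric_names[i]} vs {metric_names[j]}: "
              f"Pearson = {linear:.3f}, Spearman = {rank:.3f}")
```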


Figure 5.3: Dendrograms of standard correlations (left) and rank correlations (right) between the metrics for all datasets [71]

The final results of this analysis show that while there are important similarities between these metrics, they are different enough to conclude that they do measure different things. A surprising result is that even metrics from the same family can measure vastly different things, with the exception of the various AUC metrics.

Additional Metrics

Additional metrics that can be used besides these 18 were proposed by Sokolova et al. [72]. They analysed the effectiveness of three metrics used in medical diagnosis but not yet tried in statistical machine learning: Youden's index, likelihood, and discriminant power. They argue that, while common metrics are not sufficient for problems where all classes are equally important and multiple algorithms are compared, these three metrics derived from the medical field allow a comprehensive comparison in such cases. All three measures are based on sensitivity and specificity, which are commonly employed in medical applications and research involving visual data. To test their proposal, the metrics were compared with more common ones such as accuracy on e-negotiation data, where both positive and negative results are equally important. The results show that the performance of the classifiers is defined by the metrics used, and that higher accuracy does not necessarily guarantee higher performance.
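A minimal sketch of how these three measures can be derived from the cells of a binary confusion matrix is given below, following the standard definitions of sensitivity and specificity; the confusion-matrix counts are placeholders used only for illustration.

```python
import math

def medical_metrics(tp, fn, tn, fp):
    # Youden's index, positive likelihood, and discriminant power derived
    # from the four cells of a binary confusion matrix.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    youden = sensitivity + specificity - 1
    likelihood = sensitivity / (1 - specificity)
    discriminant_power = (math.sqrt(3) / math.pi) * (
        math.log(sensitivity / (1 - sensitivity))
        + math.log(specificity / (1 - specificity)))
    return youden, likelihood, discriminant_power

# Placeholder confusion-matrix counts, purely for illustration.
print(medical_metrics(tp=80, fn=20, tn=90, fp=10))
```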

Model Analysis

On top of the challenges of selecting the right metric, what makes the evaluation of machine learning models difficult is their sheer complexity; they are often treated as black boxes, and when metrics show that performance is decreasing, it is often hard to tell why that is the case. Without a clear understanding of how the model works, achieving higher performance typically relies on a time-consuming trial-and-error process [65]. While making models more interpretable is a popular research area [68, 69, 67], these attempts are still far from perfect. A better approach is to make existing models easier for humans to interpret without changing anything in the training process; most recent efforts achieve this by visualizing various aspects of the model and the training data. The aim of these techniques is to understand why models behave the way they do, to diagnose training processes that fail to achieve the expected performance, and to guide experts in improving these models [65].


Figure 5.4: Squares allows bi-directional coupling between the visualization and table to locate inter-esting instances found in the table in the visualization [122]

Figure 5.5: An overview of ModelTracker. Boxes represent user labeled examples and color indicates the label given (green for positive and red for negative). Test examples are placed at the top and train examples at the bottom. Examples are laid out horizontally according to the model’s prediction scores, with low scoring examples to the left and high scoring examples to the right. [85]

Squares

Ren et al. [122] proposed Squares, a visualizer for multi-class classification problems. The motivation behind the development of Squares was the observation that commonly used summary statistics often obscure details that might be important during evaluation, and that they dissociate performance from the data, which is a fundamental part of any model. While confusion matrices might help with the flattened results produced by summary statistics, they tend to lose efficiency significantly as the number of classes increases. Squares, on the other hand, does not seem to impose any significant overhead as the number of classes grows. It was designed with the goals of showing performance at multiple levels and connecting the measured performance to the data, while being agnostic to common performance metrics. By achieving these goals, Squares may help practitioners prioritize efforts in debugging performance problems while simultaneously providing direct access to the data the model was derived from. The system can be used both to improve a single model and to compare multiple different ones.

ModelTracker

In their work, Amershi et al. [85] present ModelTracker, an interactive visualization based on summary metrics for iterative model building. The two main benefits of their system are that it aims to eliminate the disruptive cognitive switch between model building and debugging, and that it can subsume the information contained in numerous traditional summary statistics and graphs and display it in a single, compact visualization. ModelTracker is also algorithm- and data-agnostic and can therefore be applied to virtually any classification problem.


Figure 5.6: CNNVis, a visual analytics system that helps experts understand, diagnose, and refine deep CNNs [66]

A six-month-long case study and a controlled experiment show that ModelTracker is indeed superior to manual debugging and that it is fairly easy for practitioners to use.

CNNVis

When it comes to the evaluation of deep learning networks, the problems of traditional machine learning only become more severe; not only do these networks consist of possibly hundreds of layers with thousands of neurons in each, they also contain many functional components whose roles are generally not well understood [123]. Liu et al. [66] designed CNNVis, a visual analytics system that is capable of disclosing the multiple facets of each neuron and the interactions between them. CNNVis turns a convolutional neural network (CNN) into a directed acyclic graph (DAG) and allows users to visualize the neurons or to interact with the model in a number of ways, such as interactive modification of clustering results. CNNVis also provides debugging information on demand.

5.2.3 Transfer Learning

Building a machine learning model is a tedious task from start to finish; from collecting a large, labeled dataset to training the model for long hours or days, machine learning can be extremely time-consuming. This long waiting time between iterations clearly goes against the CI concepts of frequent integrations and build times of less than ten minutes. Transfer learning may be a suitable solution for both the lack of training data and the long build times.

Transfer learning is an effective method to leverage rich labeled data from one domain to build an accurate classifier for another domain. The idea behind this is the assumption that these domains may share certain knowledge structures that can be encoded and extracted by preserving the important statistical properties of the data [48]. It is a rewarding research area, as solving problems in transfer learning can have far-reaching practical benefits: better performing models can be trained faster and at lower cost [124]. Transfer learning can help alleviate the challenge posed by the lack of labeled data by extracting information from domains rich in data, or it can significantly reduce the build time of models by retraining only the top n layers.


Figure 5.7: An overview of different settings of transfer learning [81]


The study of transfer learning is characterized by three main research issues: what to transfer, how to transfer, and when to transfer [81]. The first question asks which parts of the knowledge are common enough between domains to be transferred and which are too specific to the source domain. The how question asks how the knowledge should be transferred between learning algorithms once it is clear what should be transferred. Finally, the last question is concerned with the situations in which knowledge should or should not be transferred, as in some cases transfer learning may harm performance. Contemporary research mainly focuses on the first two questions, but the last one is also important to consider in order to avoid negative transfer. Transfer learning itself can be categorised by the relation between the source and target domains and tasks. First, in inductive transfer learning, the domains can be the same or different, but the target task is different from the source task. In contrast, in transductive transfer learning, the source and target domains are different, but the tasks are the same. Finally, unsupervised transfer learning is similar to inductive transfer learning, with the difference that it focuses on unsupervised learning tasks such as clustering or dimensionality reduction and that no labeled data is available in either the source or the target domain [81].

Transfer Learning in Deep Learning

In deep neural networks, transfer learning has an additional application, namely the fine-tuning of layers. To avoid long training times, one can take a pretrained model and retrain only the top n layers to fit the task at hand. This is motivated by the fact that the earlier layers of a network contain more generic features, such as edge or color-blob detectors, that are useful for many tasks. An example use case is to take a pretrained face recogniser and retrain only the last layer to identify new faces, saving a significant amount of time compared to training the model from scratch. Yosinski et al. [125] studied the generality and specificity of neurons in each layer of a deep convolutional neural network to see how this affects transferability. Their results show that transferability is negatively affected by splitting networks in the middle of fragilely co-adapted layers, but also by the specialization of higher-layer features to the original task. That being said, their research also shows that initializing with transferred features can improve generalization performance even after substantial fine-tuning on a new task.
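A minimal PyTorch sketch of this fine-tuning pattern is given below; the choice of torchvision's ResNet-18 and the number of target classes are assumptions made only for illustration. All pretrained layers are frozen and only a newly attached final layer is trained for the task at hand.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet; its early layers already encode
# generic features such as edge and color-blob detectors.
model = models.resnet18(pretrained=True)

# Freeze every pretrained parameter so that only the new head is updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer to match the new task
# (10 target classes is an arbitrary, illustrative choice).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the parameters of the new layer are passed to the optimizer, so a
# training step updates a tiny fraction of the network's weights.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Training loop sketch (data loading omitted): only model.fc is updated.
# for inputs, labels in dataloader:
#     optimizer.zero_grad()
#     loss = criterion(model(inputs), labels)
#     loss.backward()
#     optimizer.step()
```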


Figure 5.8: Demonstration of BadNets backdoors [126]

Transfer learning, however, also comes with risks. Negative transfer might be harmful to model performance, but it is fairly easy to catch. On the other hand, backdoors hidden in pretrained models are far more difficult to spot. In their work, Gu et al. [126] warn about the potential dangers of using model libraries such as Model Zoo [127]. They demonstrate that a model can contain a hidden backdoor that causes targeted misclassification while the model still produces high accuracy on most inputs. These backdoors can also survive transfer learning, which makes them especially dangerous and highlights the importance of acquiring pretrained models from trusted sources.


Chapter 6

Case Study

This chapter outlines the experiments that were conducted to test the effectiveness of two of the methods identified in chapter 5. The chapter is structured as follows: first, a short explanation is given for selecting these two methods specifically, followed by the experiment setup and results for each method respectively.

6.1 Selection Criteria

For the case study, two of the proposed solutions were selected to be evaluated, namely noise elimination with the use of unlabeled data (MFAUD/CFAUD) and the Data Version Control system (DVC). The reason behind this decision was that, among all the suggested methods, these two offer the highest impact while being the least difficult to add to existing projects. Some of the suggestions, such as the additional evaluation metrics, cannot tackle the problems alone but are rather aimed at working well together with the other solutions. Others, like transfer learning, might promise a more significant impact but require a more intrusive adoption process.

6.2 Noise Elimination with MFAUD/CFAUD

The first experiment was carried out to measure the effectiveness of MFAUD/CFAUD at detecting noisy labels. This experiment can be viewed as a replication of the work reported by Guan et al. [43], with the addition of using real-world data.

6.2.1 Experiment Setup

The experiment was set up in the following way: First, both the MFAUD and the CFAUD algorithms were implemented in Python based on the pseudocode provided in the paper. To confirm that the algorithms were implemented correctly, they were tested on three toy datasets used in the original experiment, namely the iris, breast, and wine datasets, to see if the implementation could achieve similar results.
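For illustration, the sketch below shows the underlying consensus-filtering idea in a simplified form: instances that every classifier in a cross-validated ensemble misclassifies are flagged as noise candidates. It deliberately omits the unlabeled-data augmentation step that distinguishes CFAUD from plain consensus filtering, and the scikit-learn classifiers are illustrative choices rather than the exact configuration used in the experiment.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def consensus_filter(X, y, n_splits=5):
    # Flag an instance as potentially mislabeled when every classifier
    # trained on the remaining folds misclassifies it (consensus voting).
    classifiers = [KNeighborsClassifier(n_neighbors=1),
                   GaussianNB(),
                   DecisionTreeClassifier(random_state=0)]
    suspects = np.zeros(len(y), dtype=bool)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X):
        misclassified_by_all = np.ones(len(test_idx), dtype=bool)
        for clf in classifiers:
            predictions = clf.fit(X[train_idx], y[train_idx]).predict(X[test_idx])
            misclassified_by_all &= (predictions != y[test_idx])
        suspects[test_idx] = misclassified_by_all
    return suspects
```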

Noise ratio    1-NN              Naive Bayes       Decision Trees
               None     CFAUD    None     CFAUD    None     CFAUD
10%            0.128    0.031    0.057    0.035    0.175    0.035
20%            0.231    0.075    0.191    0.048    0.224    0.088
30%            0.346    0.106    0.151    0.071    0.375    0.124
40%            0.422    0.088    0.248    0.062    0.477    0.088
