UNIVERSITY OF AMSTERDAM

MASTER THESIS

Active Learning for Workload Reduction of Automated Screening in Systematic Reviews

Author: Shuxin Zhang

Supervisors: Sílvia Delgado Olabarriaga, Allard van Altena

Active Learning for Workload Reduction of Automated Screening in Systematic Reviews

Student
Name: Shuxin Zhang
Student number: 11598123
E-mail: s.x.zhang@amc.uva.nl

Location
Department of Clinical Epidemiology, Biostatistics, and Bioinformatics, Amsterdam UMC, University of Amsterdam
Address: Meibergdreef 9, Amsterdam, the Netherlands

Contact information of tutor
Sílvia Delgado Olabarriaga, assistant professor
E-mail: s.d.olabarriaga@amc.uva.nl
Room: J1B-210-1
Phone: +31 (0) 20 566 4660

Contact information of mentor
Allard van Altena, PhD candidate
E-mail: a.j.vanaltena@amc.ua.nl
Room: J1B-207-1

Time period


Acknowledgements

I would like to acknowledge my tutor Sílvia Delgado Olabarriaga for her enthusiasm, responsible supervision, and constructive feedback throughout the process of researching and writing this thesis. I would also like to thank my mentor Allard van Altena for his daily supervision, patient instructions and technical support on my work.

Finally, I must express my very profound gratitude to my family for their unconditional support and continuous encouragement during my studies in the Netherlands. Thank you.


Abstract

Introduction: In systematic reviews, the growing number of published studies imposes a significant screening workload on reviewers. To mitigate this problem, previous studies have used supervised learning to automate the screening task. These approaches have proved helpful, but they require a large amount of labelled data. Active learning, which requires less labelled data, is therefore a promising approach to support automatic screening.

Objectives: (1) To design the best active learning model among various alternatives; (2) to investigate the effects of active learning on screening workload reduction for systematic reviews.

Methods: Firstly, we designed the optimal baseline model (using supervised learning) and the optimal active learning model. A Support Vector Machine (SVM) classifier was used to perform text classification and Bag-of-Words (BOW) was used to represent features. Additional aspects, including the term-weighting scheme of BOW, the class imbalance technique, and the query strategy, were determined empirically. Work Saved over Sampling (WSS) was used to determine the baseline models' performance and to compare the baseline models with the active learning models. Secondly, these optimal models were applied to a dataset with 20 Diagnostic Test Accuracy (DTA) systematic reviews to compare their performance in workload reduction during screening. This dataset consisted of 149,405 documents with labels indicating their relevance to the review they belong to.

Results: The difference between active learning models and baseline models (added value) for the 20 DTA reviews ranged from 0.01 to 0.61 in WSS@95 (the WSS value when recall was 95%) and from 0.03 to 0.46 in WSS@100. All added values were positive, although they varied considerably among reviews.

Conclusion: Active learning can further improve the performance of supervised learning in the task of screening Diagnostic Test Accuracy reviews, although this added performance varied among reviews.

Keywords Active Learning, Text Mining, Systematic Reviews, Systematic Reviews Screening Automation, SVM


Contents

Acknowledgements
Abstract

1 Introduction

2 Preliminaries
  2.1 Active Learning
  2.2 Feature Representation
  2.3 Performance Evaluation

3 Data Preparation
  3.1 Data Acquisition
  3.2 Text Preprocessing
  3.3 Feature Extraction and Reduction
  3.4 Feature Standardization
  3.5 Final dataset

4 Baseline Model Design
  4.1 Hyperparameters of Classifier
  4.2 Term-Weighting Schemes of BOW
  4.3 Class Imbalance Techniques
  4.4 Final Baseline Model

5 Active Learning Model Design
  5.1 Query Strategy
  5.2 Document Weight
  5.3 Final Active Learning Model

6 Evaluation of Active Learning Model
  6.1 Comparison of Baseline and Active Learning Models
  6.2 Added Value of Active Learning
  6.3 Conclusion

7 Discussion and Conclusions

Bibliography


Chapter 1

Introduction

Today, evidence-based medicine plays a vital role in measuring progress and innovation in healthcare. At the heart of evidence-based medicine, systematic reviews are a widely used method that brings together the findings from multiple studies on a certain clinical topic of interest in a reliable way to inform clinicians and patients [1].

Developing systematic reviews is a time-consuming and resource-intensive process that can take more than a year, with up to half of this time being spent on searching and screening relevant documents. This situation is becoming more challenging with the growing number of published studies, which results in a larger number of documents to be analyzed in a systematic review. For example, four out of the 31 Cochrane Collaboration systematic reviews published in March 2014 involved screening of more than 10,000 items [2–5].

A systematic review follows a workflow with various steps [6]. Below we emphasize the following three steps because they are the most relevant in the context of this work:

1. searching: search a database (e.g., PubMed) with a carefully constructed query tailored to the research question being investigated and retrieve potentially relevant documents.

2. screening: categorize documents as either relevant or irrelevant in two steps. The first step is based on reviewing title and abstract, and the second on reading the full text;

3. summarizing: summarize the relevant documents via meta-analysis or other review methods [7].

Within the workflow, the screening phase in particular is resource-intensive, requiring researchers to go through all the retrieved documents, the majority of which turns out to be irrelevant for the review. For example, one Cochrane review [8] had to screen 16,923 documents before 142 relevant documents (0.8%) were identified. An experienced reviewer requires thirty seconds on average to decide whether a single document is relevant to the review, although this can extend to several minutes for complex topics [9, 10]. This amounts to a considerable human workload, given that a typical screening task involves manually screening tens of thousands of documents [9, 11].

There are several possible ways to reduce screening workload [12], for example by increasing the rate of screening and by improving the workflow. But in general, the focus has been put on reducing the number of documents that need to be screened manually. Many studies have investigated the use of text mining for this purpose [10, 12–19], and reported a reduction in manual screening workload from 10% [18] up to more than 90% [19]. For example, Cohen et al. [20] used a modified version of the voted perceptron algorithm [21] to perform automatic text classification in fifteen systematic reviews relating to drug class efficacy for disease treatment. Their results demonstrated a significant reduction in the screening workload in eleven out of the fifteen reviews.

In the screening task supported by automatic text classification, the process starts with a subset of documents manually annotated with labels. The labels denote whether the document is relevant or irrelevant to the topic. These documents paired with the labels serve as the training examples for the automatic classifier. In a supervised learning manner, the classifier is then trained to learn how to discriminate between relevant and irrelevant documents. As a final step, the trained classifier is applied to automatically screen the remaining unlabelled documents. Here, an important assumption should hold: some information from one systematic review can be carried over into another related one, so that the model trained on the former review is capable of predicting the relevance of documents in the other review.

Supervised learning relies on large training datasets, but this can be problematic in the context of a new systematic review, where training data is rarely available. Potentially a keyword search can return many thousands of documents, and labeling these documents to create a sufficiently large training dataset is difficult, laborious and time-consuming. On the other hand, a classifier trained with a small-sized dataset often leads to an overly simple prediction function with poor performance.

Motivated by the observation that unlabelled data is plentiful but labelled data is limited or expensive, previous work (e.g., [10, 13, 22]) proposed active learning methods. Active learning leverages both labelled and unlabelled documents to enhance the classification performance during cycles of interaction with (human) reviewers. As such, it potentially decreases the screening workload without missing any relevant documents of the review. Therefore, we believe that active learning approaches can be useful for the automation of the screening task in systematic reviews.

This thesis addresses the following research question:

What is the added value of active learning compared to supervised learning in workload reduction of the screening task for systematic reviews?

The thesis is organized as follows: Chapter 2 provides an overview of active learning and related work for its application in systematic reviews. Chapter 3 describes the preparation of data (systematic reviews) used in the experiments. Chapter 4 and Chapter 5 present the design of the optimal supervised (baseline) model and active learning model respectively. Next, the comparison of performance between these two models for the full dataset is reported in Chapter 6. Chapter 7 concludes this study highlighting contributions, limitations, and directions for future research.


Chapter 2

Preliminaries

In this chapter we introduce the key concepts of active learning and present references to relevant related work. We furthermore present in more detail the studied alternatives for classification feature representation, and the metrics for performance evaluation.

2.1 Active Learning

Active learning [10] is an iterative process whereby the performance of a model is improved through interaction with reviewers. In this project we focus on pool-based active learning because we want to identify and label the relevant documents from specific datasets of systematic reviews as early as possible during the active learning process. Figure 2.1 presents an overview of the process. The model that is progressively trained is also called the active learner. The human reviewer, or oracle, indicates whether or not an unlabelled document is relevant to a given systematic review. The selection of this unlabelled document is defined as a query and the document is coined as a query document.

Figure 2.1: Diagram of the pool-based active learning.

A pool-based active learner (Figure 2.1) receives a collection of unlabelled documents – the unlabelled pool. A set of labelled documents – the labelled pool – is used to train an active learner. This learner is used to rank the unlabelled pool from most to least relevant for the systematic review at hand. Then a query strategy determines the informativeness of each unlabelled document. Informativeness represents the ability of a document to reduce the generalization error of the adopted classification model, and ensures less uncertainty of the classification model in the next active learning iteration [23]. Given a query strategy, the active learner presents the top-most informative documents from the unlabelled pool to be labelled by the oracle. After that, the newly labelled documents (query documents) are removed from the unlabelled pool and added to the labelled pool, and the learner is updated. As a result, the informativeness associated with each unlabelled document in the unlabelled pool is updated. The whole process repeats as long as the oracle continues to provide labels, or until some other stopping criterion is reached – for example, when labelling further documents is not deemed sufficiently informative. In this project, all experiments on active learning were implemented in Python [24]. Besides pool-based active learning, there are also other scenarios, such as query synthesis [25] and stream-based active learning [26].
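The following minimal Python sketch illustrates one iteration of this loop. It is not the implementation used in this thesis; the classifier, the informativeness scoring function, and the oracle_label callback are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def active_learning_step(clf, X_labelled, y_labelled, X_unlabelled,
                         informativeness, oracle_label, batch_size=1):
    """One pool-based iteration: train, score the pool, query, and return new labels."""
    clf.fit(X_labelled, y_labelled)                  # (re)train on the labelled pool
    scores = informativeness(clf, X_unlabelled)      # query strategy: higher = more informative
    query_idx = np.argsort(scores)[-batch_size:]     # pick the top-most informative documents
    new_labels = np.array([oracle_label(i) for i in query_idx])  # ask the oracle
    # The caller moves the queried documents from the unlabelled to the labelled pool.
    return query_idx, new_labels
```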

Two major components of a pool-based active learning model have a large influence on the overall performance: the classifier and the query strategy.

Classifier. Various classifiers have been described in the literature on active learning for text classification, such as the linear logistic regression classifier [27], the Naïve Bayes classifier [28–31], and the support vector machine (SVM) classifier [10, 22, 32–35]. Among them, the state-of-the-art approach for text classification is the SVM classifier.

Support vector machines (SVM) [36] are binary classifiers that find an optimal linear hyperplane separating given documents into two specific classes (e.g., 'relevant' or 'irrelevant' in text classification). Given a training set in which each document is a vector x_i = ⟨f_1, f_2, ..., f_n⟩ of features and is labelled by y_i ∈ {−1, +1}, the SVM attempts to specify a linear hyperplane with the maximal margin, defined by the maximal (perpendicular) distance between the documents of the two classes [37]. The documents that form the border of the margin on each of the +1 and −1 sides are called the support vectors. Figure 2.2 illustrates a two-dimensional space where the documents are positioned according to their features. The hyperplane splits them based on their labels.

Figure 2.2: Example of the decision boundary of an SVM classifier. Two classes (+1 and −1) are separated by a hyperplane (depicted as a 2-D line). Support vectors (blue circles) are data points that define the margin around the hyperplane. This figure is adapted from [38].

As Joachims [39] demonstrated, the SVM is widely known for its ability to handle a large number of features, a capability which is useful in the textual domain. Moreover, Krishna et al. [40] found that a simple linear SVM outperformed other kernels in text mining. Therefore we chose to use a linear SVM classifier in this project.

Query Strategy. The main challenge in the design of an active learning model is to determine good queries from the unlabelled pool that can provide the best information for improving classification performance. The query strategy therefore involves assigning each document of the unlabelled pool a value indicating how informative a label for that document would be.

Many variations of query strategies exist, as detailed in an active learning survey [41]. Some of the algorithms are computationally expensive and not practical, such as expected gradient length [42], Fisher information [43], and estimated error reduction [30]. Query-by-committee (QBC) [44] is sensitive to the type of classification models selected. In this work we focused on two straightforward strategies: certainty sampling [45] and uncertainty sampling [22].

Figure 2.3: Example documents from a systematic review are separated by SVM hyperplane and divided into three groups. Documents in Group A are those closest to the hyperplane; documents in Groups B and C are respectively the negative and positive points located farthest from the hyperplane. The figure is adapted from [46].

Certainty sampling regards the document with the highest probability of being relevant or irrelevant to the review as the most informative – see circles B and C in Figure 2.3. This strategy aims to ensure that relevant documents are presented for labelling as early as possible, and is suitable for the purpose of reducing the screening workload. This strategy, however, has a potential drawback in that it may produce a hastily generalized classifier that is biased to a limited set of relevant documents and misses other relevant documents [22].

Uncertainty sampling, in contrast, regards as most informative the document that is closest to the separating hyperplane of the SVM classifier. In other words, it gives priority to documents about which the classifier is most uncertain whether they represent relevant or irrelevant documents – see circle A in Figure 2.3. Presenting such uncertain documents to be labelled aims to improve the ability of the classifier to find the best separating hyperplane, and thus to improve its accuracy in classifying new documents. Note that some uncertain documents may require more careful consideration in manual screening and thus might take a reviewer more time to label. This effort is defined as the labelling cost, which is considered constant in the scope of this work.

2.2 Feature Representation

In text classification for systematic reviews, features are extracted from the text of documents. It is argued that the feature representation of biomedical documents can have a large effect on text classification performance [47–49], being even more important than the choice of the classification algorithm itself [50].

In the literature on active learning for text classification, various feature representations have been described, such as Bag-of-Words (BOW) [10, 22, 27, 35, 46, 51], LDA (Latent Dirichlet Allocation)-based features [22], and paragraph vectors [52]. Among them, BOW is the most popular approach to represent features.

BOW represents words by their frequencies in the text [53]. There are two term-weighting schemes in the BOW representation: Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). TF is the frequency of a word within one document, and document frequency (DF) is the number of documents that contain this word. TF-IDF for a document can be calculated as [54]:

\[
\text{TF-IDF} = \text{TF} \cdot \text{IDF} \qquad (2.1)
\]

where IDF is defined as:

\[
\text{IDF} = \lg\!\left(\frac{N}{\text{DF}}\right) \qquad (2.2)
\]

where N denotes the total number of documents.

Since TF-IDF decreases the value of frequent words, it can be used to filter stopwords that occur too frequently in documents but have negligible effects on classification.
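As a worked numeric check of Equations (2.1) and (2.2), the following minimal sketch computes the weight directly from the definition; the counts are made up for illustration, and library implementations (e.g., Scikit-learn) use a smoothed variant of this formula.

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """TF-IDF as defined in Equations (2.1)-(2.2): TF * lg(N / DF)."""
    return tf * math.log10(n_docs / df)

# A word occurring 3 times in a document and appearing in 10 of 1,000 documents:
print(tf_idf(tf=3, df=10, n_docs=1000))  # 3 * lg(100) = 6.0
```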

Although BOW is often used in text classification [10, 22], it does not capture position in text, semantics, or co-occurrence across documents, and is only useful at the lexical level. We nevertheless chose Bag-of-Words (BOW) to represent features in this project, mainly because it was the simplest starting strategy. These features were extracted from the titles and abstracts of documents. It is true that other, relatively more sophisticated strategies for feature representation may improve the performance [52]. However, since our goal is to compare supervised learning and active learning, we considered that BOW could provide a good baseline feature representation.

2.3 Performance Evaluation

Generally, classifiers are evaluated by considering their predictive performance over a hold-out dataset. However, in this project we are interested in how much screening workload can be saved when a model identifies all relevant documents in the unlabelled pool.

Two groups of researchers have introduced useful metrics for assessing classifier performance in this regard. Cohen et al. introduced Work Saved over Sampling (WSS) [20] and Wallace et al. introduced Yield [10] and Burden [10]. Although these studies also introduced other metrics for active learning, such as Coverage [55] and Utility [56], those are not primarily aimed at measuring the saved work and thus are not considered here. All these metrics are based on the assumption that all of the documents in a systematic review require the same labelling cost. Note that in these studies there was no real human (reviewer) participating in the interaction with the active learning model, because that would be more expensive and complicated. Instead, they simulated the oracle by hiding all labels of documents in the unlabelled pool and revealing the real label of each document queried by the active learner.


Work Saved over Sampling (WSS). WSS defines, for a specific level of recall, the percentage reduction in effort achieved by a ranking method as compared to a random ordering of the papers [20]. Effort here corresponds to the number of papers to consider during the screening task.

WSS is defined as follows:

\[
\text{WSS} = \frac{TN + FN}{N} - (1 - R) \qquad (2.3)
\]

where TN denotes the number of true negatives, FN the number of false negatives, and N the total number of documents. R is the desired level of recall, that is, the proportion of correctly identified relevant documents among all relevant documents:

\[
R = \frac{TP}{TP + FN} \qquad (2.4)
\]

where TP denotes the number of true positives.

It is usual to use WSS at recall 95% to evaluate the saved work (denoted as WSS@95) because it can show larger performance differences among models [20]. In practice, however, reviewers require identifying all relevant documents, i.e., 100% recall.

Yield and Burden. Yield and Burden are metrics introduced for the evaluation of active learning approaches. Yield is the fraction of relevant documents identified by a given automatic screening model. Burden is the fraction of the total number of documents that a human reviewer needs to manually screen. They are defined as follows (see also [10]):

\[
\text{Yield} = \frac{TP^{L} + TP^{U}}{TP^{L} + TP^{U} + FN^{U}} \qquad (2.5)
\]

\[
\text{Burden} = \frac{N^{L} + TP^{U} + FP^{U}}{N} \qquad (2.6)
\]

where FP denotes the number of false positives. The superscript L denotes documents manually labelled by a human; the superscript U denotes unlabelled documents to be labelled by the active learning model's automatic prediction.

100% Yield means that all relevant documents are either manually screened by the reviewer or automatically identified by the model. So when all active learning models achieve 100% Yield, the one with the lowest Burden is regarded as the best.
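The following minimal Python sketch implements these metrics directly from Equations (2.3), (2.5), and (2.6); the example numbers are hypothetical.

```python
def wss(tn: int, fn: int, n: int, recall: float) -> float:
    """Work Saved over Sampling (Equation 2.3) at the given recall level."""
    return (tn + fn) / n - (1.0 - recall)

def yield_(tp_l: int, tp_u: int, fn_u: int) -> float:
    """Yield (Equation 2.5); trailing underscore avoids the Python keyword."""
    return (tp_l + tp_u) / (tp_l + tp_u + fn_u)

def burden(n_l: int, tp_u: int, fp_u: int, n: int) -> float:
    """Burden (Equation 2.6): fraction of documents a reviewer still screens."""
    return (n_l + tp_u + fp_u) / n

# Example: 10,000 documents ranked so that 6,000 irrelevant and 5 relevant
# documents fall below the 95%-recall cut-off.
print(wss(tn=6000, fn=5, n=10000, recall=0.95))  # 0.5505
```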


Chapter 3

Data Preparation

This chapter introduces the process of preparing the data from systematic reviews. As Figure 3.1 presents, this process involves data acquisition, text preprocessing, feature extraction and reduction, and standardization. Details of each step are described in the following sections. The prepared data will be used in the experiments of the following chapters.

Figure 3.1: Data preparation steps, from raw data to classification feature set.

3.1 Data Acquisition

The raw dataset [57] was provided by the CLEF eHealth Lab [58]. It consists of PubMed unique identifiers (PubMed IDs) of 149,405 documents that belong to 20 Diagnostic Test Accuracy (DTA) systematic reviews. Each document has a label to indicate whether it is relevant for the review (labelled as '1') or irrelevant ('0'). See the characteristics of the reviews in Table 3.1.

All documents were fetched through the PubMed Entrez API [59] on April 7th 2018. All 149,405 documents had titles, but 30,585 documents (259 relevant and 30,326 irrelevant) did not have abstracts.


Review ID   #Docs     #Relevant Docs   % Relevant Docs
CD011549    12,705    2                0.02%
CD008643    15,083    11               0.07%
CD008686    3,966     7                0.18%
CD010409    43,363    76               0.18%
CD009593    14,922    78               0.52%
CD011548    12,708    113              0.89%
CD010438    3,250     39               1.20%
CD009591    7,991     144              1.80%
CD010632    1,504     32               2.13%
CD009323    3,881     122              3.14%
CD007394    2,545     95               3.73%
CD011984    8,192     454              5.54%
CD008691    1,316     73               5.55%
CD011975    8,201     619              7.55%
CD007427    1,521     123              8.09%
CD008054    3,217     274              8.52%
CD009944    1,181     117              9.91%
CD009020    1,584     162              10.23%
CD011134    1,953     215              11.01%
CD010771    322       48               14.91%
Total       149,405   2,804            1.88%

Table 3.1: Main characteristics of the 20 reviews: review identifier, total number of documents, number of relevant documents, and percentage of relevant documents with regard to the total number of documents. Highlighted reviews were included in the Partial Dataset used for model development and optimization in Chapters 4 and 5.


3.2 Text Preprocessing

We used the Natural Language Toolkit (nltk) Python package [60] to implement the preprocessing of the plain text of titles and abstracts from all documents (a minimal sketch is given at the end of this section). The Python code is available in [24]. The preprocessing involved the following steps:

• Break the stream of text into words, phrases and symbols;

• Replace dashes ('—') with spaces, and replace hyphens ('-') with underscores ('_') so that, for example, 'pre-thrombotic' is converted to 'pre_thrombotic';

• Eliminate unwanted symbols by keeping only alphanumerical characters, the underscore ('_'), and the percent sign ('%'), in order to keep percentages;

• Lowercase words, i.e., convert capital letters into small letters;

• Remove so-called stopwords that are not useful for the classification of documents (e.g., 'and', 'are'). For this we used the default English list in the nltk package;

• Perform stemming to reduce each word to its word stem or root. nltk provides several well-known stemming algorithms. We tested three of them (Porter [61], Lemmatize [60], and Snowball [62]) and chose the Porter stemmer because we found that it could reduce more words than the others.

The preprocessed texts were stored in a cleaned corpus with 149,405 documents. In each document, the text is represented as a list of cleaned words; e.g., the original text 'Correction of the haemostatic defects' becomes the cleaned text 'correct haemostatic defect'.

Each document presents the unique words within itself, and different documents can still overlap. We found that the first 10,000 documents already had 995,939 cleaned words in total, which is huge. The unique words among all documents, also called features, are extracted in the next step.
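The sketch below illustrates the preprocessing steps listed above; it is an approximation of the project code [24], and the helper name and exact regular expression are assumptions.

```python
# May require: nltk.download('punkt'); nltk.download('stopwords')
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    text = text.replace("—", " ").replace("-", "_")   # dashes to spaces, hyphens to underscores
    text = re.sub(r"[^0-9A-Za-z_%\s]", " ", text)      # keep alphanumerics, '_' and '%'
    tokens = word_tokenize(text.lower())               # tokenize and lowercase
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Correction of the haemostatic defects"))
# e.g. ['correct', 'haemostat', 'defect'] (exact stems depend on the stemmer)
```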

3.3 Feature Extraction and Reduction

We used CountVectorizer [63] from Scikit-learn, a free machine learning library [64], to vectorize the cleaned corpus into a document-term matrix. The matrix was based on the term frequency (TF), i.e., the frequency of a word within one document. This newly obtained matrix contained 163,662 features (unique words).

Next we performed feature reduction on the basis of document frequency (DF) – see the definition in Section 2.2. We set the minimum threshold of DF to 28 (1% of the number of relevant documents) and the maximum threshold to 134,464 (90% of the total number of documents), and removed the words whose DF fell outside this range. We considered that if a word appeared in very few documents (fewer than the minimum), it would not be useful for predicting the relevance. Likewise, if a word occurred in very many documents (more than the maximum), it would not be valuable for prediction either. A few documents that were not written in English could be removed in this way.

The 163,662 features were thereby reduced to 12,691 features – see the part of the blue line between the maximum threshold (red line) and the minimum threshold (orange line) in Figure 3.2.


Figure 3.2: Graph showing the base-10 logarithm of document frequency (DF) versus words, sorted by DF in descending order. The red dashed curve indicates the maximum threshold (log(134,464)). The orange dashed line indicates the minimum threshold of document frequency (log(28)).
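A minimal sketch of this extraction and reduction with Scikit-learn is shown below; the cleaned_corpus variable and the whitespace token pattern are assumptions, and the reported matrix sizes are those from the text above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# cleaned_corpus: list of token lists produced by the preprocessing step (assumed).
documents = [" ".join(tokens) for tokens in cleaned_corpus]

# Full vocabulary: TF document-term matrix (149,405 x 163,662 in this thesis).
vectorizer = CountVectorizer(token_pattern=r"\S+")
tf_full = vectorizer.fit_transform(documents)

# DF-based reduction: keep words occurring in at least 28 documents and in at
# most 90% of all documents, yielding 12,691 features in this thesis.
reduced = CountVectorizer(token_pattern=r"\S+", min_df=28, max_df=0.9)
tf_reduced = reduced.fit_transform(documents)
```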

3.4 Feature Standardization

Support Vector Machine algorithms are not scale-invariant, so it is important to scale the feature values. For instance, if a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected [65].

We used MaxAbsScaler [65] from Scikit-learn to scale the features to values ranging from 0 to 1. We used this method because it was specifically designed for scaling sparse data. It translates each feature individually such that the maximal absolute value of each feature in the dataset will be 1.0, and it does not shift/center the data, and thus does not destroy any sparsity.
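A minimal sketch of this step, assuming tf_reduced is the sparse count matrix from the previous section:

```python
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X = scaler.fit_transform(tf_reduced)  # sparse matrix; non-negative counts end up in [0, 1]
```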

3.5 Final dataset

The full dataset resulting from the preparation process described above is composed of:

• a standardized document-term matrix with 149,405 documents (rows) and 12,691 features (columns);

• a list of 149,405 labels indicating the relevance of the documents;

• a list of review IDs indicating the review to which each document belongs.

This dataset is used in all following experiments.


Chapter 4

Baseline Model Design

In this chapter, we first introduce the baseline model and the main aspects of its design. Next, we describe how experiments were conducted for model selection. Finally, the results of the optimal baseline model are presented. In this study, the baseline model was a supervised learning model that could be further supported by an active learning model.

As introduced in Chapter 2, we chose the linear Support Vector Machine (SVM) to perform text classification and Bag-of-Words (BOW) for feature representation in this study. Furthermore, the following aspects had to be determined in the design of the baseline model:

• the values of the hyperparameters of the classifier;

• the term-weighting scheme of BOW;

• the technique to deal with class imbalance.

The following sections describe these aspects in detail and the experiments carried out to take decisions about them. All experiments in this chapter, implemented in Python [24], used the same dataset and evaluation metric, as described below.

Dataset. Coined the Partial Dataset, it consisted of five reviews randomly chosen from the available twenty reviews. The selected reviews are highlighted in Table 3.1. This dataset was used in a leave-one-out fashion, where the documents of four reviews were used as a training set and the documents of the remaining review were used as a validation set.

Evaluation. WSS@95 (see Section 2.3) was used to evaluate the performance of a baseline model in workload reduction of screening for systematic reviews.

4.1 Hyperparameters of Classifier

We used the SGDClassifier provided by Scikit-learn [64] to construct a linear SVM classifier with stochastic gradient descent (SGD) learning. This means that the gradient of the loss function is estimated one document at a time, and that the model is updated along the way. We used SGDClassifier because it is efficient and allows online learning, which is needed for the implementation of active learning [66].

We used an L2-regularization term and the hinge loss function to create linear SVM classifiers. Two hyperparameters of SGDClassifier had to be considered: α, which controls the regularization term, and tolerance, which defines the stopping criterion [64]. We kept the other parameters at their default values. We tested the values {1e−1, 1e−2, 1e−3, 1e−4, 1e−5, 1e−6} for α, and {1e−2, 1e−3, 1e−4, 1e−5, 1e−6, 1e−7} for tolerance.

Grid search was performed to explore every combination of hyperparameters, and leave-one-out experiments were conducted to determine the optimal one in terms of WSS@95. So for each combination, five models were trained and tested with different splits of the training and test sets.

The heatmap in Figure 4.1 shows the average WSS@95 of each hyperparameter combination. We found that the model achieving the highest WSS@95 used α = 1×10−5 and tolerance = 1×10−3.

Figure 4.1: Heat map representing the average grid search results (N=5) for hyperparameters α and tolerance for the SGDClassifier method.
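A minimal sketch of this grid search is given below; leave_one_review_out, wss_at_95, and partial_dataset are hypothetical helpers standing in for the project code [24].

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

alphas = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
tolerances = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]

results = {}
for alpha in alphas:
    for tol in tolerances:
        scores = []
        for X_train, y_train, X_val, y_val in leave_one_review_out(partial_dataset):
            clf = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha, tol=tol)
            clf.fit(X_train, y_train)
            # Rank validation documents by signed distance to the hyperplane
            # and compute Work Saved over Sampling at 95% recall.
            scores.append(wss_at_95(y_val, clf.decision_function(X_val)))
        results[(alpha, tol)] = np.mean(scores)

best = max(results, key=results.get)  # (1e-5, 1e-3) in the thesis experiments
```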

4.2 Term-Weighting Schemes of BOW

As described in Section 2.2, there are two schemes in the BOW representation: Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). A dataset in the TF scheme can be transformed to a normalized TF-IDF representation via Scikit-learn's TfidfTransformer [64].

We carried out leave-one-out experiments in which models were trained with the Partial Dataset but used different BOW schemes. We compared their performance by WSS@95.

The results are presented in Figure 4.2. Note that TF-IDF models had a higher median WSS@95 than TF models. Moreover, TF-IDF is also expected to perform better than TF on theoretical grounds [67]. Therefore, we chose to use the TF-IDF term-weighting scheme in BOW for the baseline model.

Figure 4.2: Results of leave-one-out experiments (N=5) comparing two BOW schemes (TF and TF-IDF) given WSS@95.
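The transformation itself is a single call; a minimal sketch, assuming X_tf is the count matrix from Chapter 3. Note that TfidfTransformer uses a smoothed IDF and L2 normalization by default, which differs slightly from Equation (2.2).

```python
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()           # smoothed IDF, L2-normalized rows by default
X_tfidf = tfidf.fit_transform(X_tf)  # X_tf: sparse TF document-term matrix (assumed)
```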

4.3 Class Imbalance Techniques

All DTA reviews (see Table 3.1) displayed class imbalance, with fewer than 15% relevant documents. Such a high level of class imbalance poses challenges for training the classification model [68].

Three class imbalance techniques available in the imbalanced-learn Python package [69] were considered in our study (a minimal usage sketch follows the list):

• random undersampling randomly selects a subset of the irrelevant documents and removes the other irrelevant documents, so that an equal number of relevant and irrelevant documents is retained;

• random oversampling randomly samples additional relevant documents with replacement until both classes reach an equal number;

• weighting assigns larger weights to relevant documents and smaller weights to irrelevant documents.
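The sketch below shows how each option can be applied; X and y stand for the labelled training features and labels and are assumptions, not the project code.

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import SGDClassifier

X_under, y_under = RandomUnderSampler().fit_resample(X, y)  # random undersampling
X_over, y_over = RandomOverSampler().fit_resample(X, y)     # random oversampling

# Weighting: give the minority (relevant) class a larger weight during training.
clf = SGDClassifier(loss="hinge", penalty="l2", class_weight="balanced")
```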

We conducted leave-one-out experiments for the Partial Dataset using the three different class imbalance techniques and compared their performance based on WSS@95. Experiments using random undersampling and random oversampling were repeated ten times due to the stochastic characteristics.

Figure 4.3 presents the results showing that all three class imbalance techniques had higher median WSS@95 than the model that did not use any technique (none). Moreover, random undersampling achieved the highest value, indicating that it was the best technique to address class imbalance in this study and thus was selected.

4.4

Final Baseline Model

Figure 4.3: Performance (WSS@95) obtained without (none) and with three techniques to deal with class imbalance: random undersampling, random oversampling, weighting. n denotes the number of models. Points using the same color represent the models trained with the same dataset.

Based on the results of the experiments above, we defined the optimal settings of the baseline model as:

• Hyperparameters for SGDClassifier: α = 1×10−5 and tolerance = 1×10−3

• BOW scheme: TF-IDF

• Class imbalance technique: random undersampling

These optimal settings are also used in the active learning design.

We chose the optimal settings for the BOW feature representation and the class imbalance technique based on the median WSS@95 as presented in Figure 4.2 and Figure 4.3. Note, however, that no significant differences could be identified between the techniques, so our choice was based on heuristics and results from previous works.


Chapter 5

Active Learning Model Design

In this chapter, we first introduce different aspects to be determined in the design of an active learning model. Next, we will describe how experiments were conducted for model selection. Finally the results of the optimal active learning model will be presented.

Here we build a pool-based active learning model on top of the optimal baseline model using the linear SVM, the TF-IDF term-weighting scheme, and random undersampling – see Section 4.4. Two additional aspects of the active learning model need to be determined:

• query strategy: determines the next document to be labelled by the oracle;

• document weight: indicates the importance of a document in building an active learning model.

The following sections further describe these aspects and the experiments carried out to take decisions about them (see the implementation in Python code [24]). All experiments in this chapter used the same experimental settings for active learning, as described below.

Active Learning Model. Experiments over the Partial Dataset (Chapter 4) were conducted in a leave-one-out fashion. We used the documents of four reviews as the initial labelled pool and the remaining review as the unlabelled pool.

Firstly, we applied random undersampling to the initial labelled pool to address class imbalance. Then the initial model was trained on this balanced labelled pool. After that we simulated active learning by allowing the learning algorithm to pick a batch of documents to label at each active learning iteration. The batch size varied according to the considered query strategy. It has been suggested, however, that a smaller batch size leads to a sharper increase in performance [70].

Simulation of an oracle was implemented by assuming that the oracle's labelling is 100% correct and then revealing the real labels of the queried documents to the active learner. These newly labelled documents were added to the labelled pool and used in training, and subsequently were removed from the unlabelled pool. At each iteration, we evaluated the current classifier over the remaining documents in the unlabelled pool, calculating Yield and Burden as defined in Section 2.3. We continued active learning until the unlabelled pool was exhausted, i.e., until all documents had been labelled. This whole process was repeated five times due to the stochastic characteristics of the random undersampling approach.
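A minimal sketch of this simulated experiment, using uncertainty sampling as the query strategy, is shown below; the pool matrices, the batch size, and the evaluate() helper that records Yield and Burden are illustrative assumptions rather than the project implementation [24].

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import SGDClassifier

def simulate_active_learning(X_init, y_init, X_pool, y_pool_hidden, batch_size=2):
    X_bal, y_bal = RandomUnderSampler().fit_resample(X_init, y_init)
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-5, tol=1e-3)
    clf.partial_fit(X_bal, y_bal, classes=np.unique(y_bal))

    queried = np.zeros(X_pool.shape[0], dtype=bool)
    history = []
    while not queried.all():
        # Uncertainty sampling: query the documents closest to the hyperplane.
        distance = np.abs(clf.decision_function(X_pool))
        distance[queried] = np.inf
        batch = np.argsort(distance)[:batch_size]

        queried[batch] = True                                # simulated oracle reveals true labels
        clf.partial_fit(X_pool[batch], y_pool_hidden[batch])  # update the learner

        # Evaluate the current classifier on the still-unlabelled documents.
        history.append(evaluate(clf, X_pool[~queried], y_pool_hidden[~queried]))
    return history
```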

Evaluation. Previous studies [10, 34] used Burden and Yield (see Section 2.3) to compare the performance among active learning models. They selected the model with the lowest Burden when achieving 100% Yield.

But we introduced another metric, one minus Burden (denoted as OMB), for evaluation, because we prefer using a metric that directly indicates how much workload can be saved. It is defined as follows:

\[
\text{OMB} = \frac{TN^{U} + FN^{U}}{N} \qquad (5.1)
\]

where TN denotes true negatives, FN denotes false negatives, and N denotes the total number of documents. The superscript U denotes unlabelled documents to be labelled by the active learning model's automatic prediction.

We used OMB@100 (the value of OMB when the model reached 100% Yield) to compare the performance among active learning models, because 100% Yield is always required by reviewers.

5.1 Query Strategy

This study focused on two query strategies: uncertainty sampling and certainty sampling. Uncertainty sampling queries the documents whose relevance is most uncertain, whereas certainty sampling queries the documents that are most certain to be relevant or irrelevant (see Section 2.1). There are two main approaches to certainty sampling: querying the one most relevant together with the one most irrelevant document, or querying the two most relevant documents at each active learning iteration. To be consistent among all strategies, we set the batch size to two for the experiments in this section.

We conducted active learning experiments using the different query strategies. OMB was recorded at each iteration.
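The sketch below illustrates how a batch can be selected under each strategy, given the signed distances of the unlabelled documents to the SVM hyperplane (e.g., distances = clf.decision_function(X_unlabelled)); the function names are illustrative.

```python
import numpy as np

def uncertainty_batch(distances: np.ndarray, k: int = 2) -> np.ndarray:
    """Indices of the k documents closest to the hyperplane (most uncertain)."""
    return np.argsort(np.abs(distances))[:k]

def certainty_positive_batch(distances: np.ndarray, k: int = 2) -> np.ndarray:
    """Indices of the k documents most confidently predicted relevant."""
    return np.argsort(distances)[::-1][:k]

def certainty_mixed_batch(distances: np.ndarray) -> np.ndarray:
    """One most confidently relevant and one most confidently irrelevant document."""
    return np.array([int(np.argmax(distances)), int(np.argmin(distances))])
```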

Figure 5.1: Performance (OMB@100) (N=5) obtained using uncertainty sampling and two sub-types of certainty sampling: querying the two most relevant documents, or the one most relevant plus the one most irrelevant. Points using the same color represent the models trained with the same dataset.

Figure 5.1 presents three boxplots with five OMB@100 values for each query strategy. The figure shows that uncertainty sampling achieved a much higher median OMB@100 than the two certainty strategies, so uncertainty sampling was selected as the best query strategy for active learning in this work.

5.2 Document Weight

The previous section showed that uncertainty sampling was the best query strategy for active learning models. Here we further investigate the effect of the weights given to query documents during the iterative training in the active learning process.

The weights of documents indicate how much they contribute to building a model. Since the documents queried by the active learner are considered more informative than those in the initial labelled pool, assigning more weight to these query documents might increase the performance of active learning models.

We conducted experiments constructing active learning models with different weights given to query documents, i.e., {1, 2, 4, 6, 8, 10, 20, 30, 40}. All models used the uncertainty sampling strategy with a batch size of one. The weights of documents in the initial labelled pool were kept at one.
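The weighting can be applied through the sample_weight argument when updating the learner; a minimal sketch building on the simulation sketch in the previous section, where clf, X_pool, y_pool_hidden and batch are assumed from that sketch and query_weight is the value varied here.

```python
import numpy as np

query_weight = 10  # tested values: 1, 2, 4, 6, 8, 10, 20, 30, 40
clf.partial_fit(
    X_pool[batch],
    y_pool_hidden[batch],
    sample_weight=np.full(len(batch), float(query_weight)),  # initial pool keeps weight 1
)
```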

Figure 5.2 presents the performance of active learning models using different weights for query documents. We found that the default models, which set the weight to one, had a higher median OMB@100 than the models that used higher weights for query documents. This means that assigning more weight to query documents did not improve the performance of the active learning models.

Figure 5.2: Performance (OMB@100) of models (N=25) assigning different document weights (from 1 to 40) to query documents. Points using the same color represent the models trained with the same dataset.

We then hypothesized that assigning more weight to relevant query documents than to irrelevant ones has the potential to improve the performance of the active learning model, because in systematic reviews relevant documents are fewer but more important than irrelevant ones.

We set up four groups with different weights assigned to the query documents of the two classes – see Table 5.1. Group 1 and Group 4 act as control groups, while Group 2 and Group 3 act as experiment groups. If the hypothesis holds, Group 2 models should perform better than Group 1 models. For each group, we conducted an experiment with all experimental settings the same as in the previous experiment, except the weight assignment.

             Group 1   Group 2   Group 3   Group 4
Relevant     1         10        1         10
Irrelevant   1         1         10        10

Table 5.1: Different sample weight assignments to query documents in the four groups.

Figure 5.3 presents the performance of active learning models from the four groups. We found that the OMB@100 of Group 2 was lower than that of Group 1, but Group 2 was much better than Group 3, as expected. This finding may indicate that putting more weight on the relevant query documents could improve the performance of the active learning model to some degree. The crux is to find the appropriate document weight for the relevant query documents, which can be one of the future directions.

Figure 5.3: Performance (OMB@100) (N=25) of active learning models using four weighting schemes (left to right: group 1 to 4): assigning sample weight 1 to both classes, 10 to relevant documents, 10 to irrelevant documents, and 10 to both classes. Points using the same color represent the models trained with the same dataset.

5.3 Final Active Learning Model

In this chapter we found that uncertainty sampling was the best query strategy. Also, we found that putting more weight on query documents was not useful.

Note that this is a simplified design of the active learning model and it can be extended in future work. For example, the stopping criterion in this study was that the unlabelled pool was exhausted. This stopping point is meaningless in practice because there will be no workload reduction if all unlabelled documents are queried. So a better stopping criterion should be developed. Some researchers have proposed confidence-based stopping criteria which suggest stopping the active learning process based on measuring the classifier confidence [71–73].

It may be a concern that uncertainty sampling sometimes fails by selecting outliers [30], because these have high uncertainty but cannot provide much help to the active learner.


Chapter 6

Evaluation of Active Learning Model

In this chapter we describe the experiment performed to compare the performance of supervised learning (the baseline model, Chapter 4) with active learning (the active learning model, Chapter 5) in workload reduction of the screening task for all twenty DTA systematic reviews. Then we present the results and discuss the added value of active learning to answer the research question of this project.

6.1 Comparison of Baseline and Active Learning Models

We applied the optimal baseline and active learning models to all twenty DTA reviews (see Table 3.1), and then compared their performance.

Leave-one-out experiments were conducted, in which the documents of nineteen reviews were used as the labelled pool and the documents of the remaining one as the unlabelled pool. In each experiment, the initial labelled pool was randomly undersampled to address class imbalance, and then the linear SVM classifier was trained as a baseline model on the balanced labelled pool. The model ranked all documents in the unlabelled pool by their distance to the hyperplane and the initial WSS (see Section 2.3) was calculated. This initial WSS indicated the performance of the baseline model for all DTA reviews, so we denote this WSS as Baseline WSS. After that, the active learning model was applied using uncertainty sampling.

The WSS recorded during the active learning process indicated the performance of the active learning model after each training iteration. Since we were interested in the total workload saved, we used the maximum WSS value (denoted as Max WSS) achieved during the process as the comparative performance of the active learning model. Furthermore, we defined the added value of an active learning model as the difference between Max WSS and Baseline WSS. WSS at both 95% and 100% recall was used. WSS@95 can show larger differences among models, whereas WSS@100 is more realistic because systematic reviews always require identifying all relevant documents.

Table 6.1 presents the results of the baseline models and active learning models for the twenty DTA reviews. We found that the baseline models did not reduce the workload for the review CD008054 (-0.02) at recall 95% and for the review CD011984 (0) at recall 100%. But the baseline models could reduce the workload for the other nineteen reviews by 0.15 to 0.91 at recall 95%, and by 0.01 to 0.96 at recall 100%. Active learning models, however, could reduce the workload for all twenty reviews, by 0.14 to 0.93 at recall 95%, and by 0.03 to 0.98 at recall 100%. We also found that the added values for each systematic review at both recall 95% (from 0.01 to 0.61) and recall 100% (from 0.03 to 0.46) were positive. This means that active learning further improved the performance of supervised learning in workload reduction for systematic reviews.

                WSS@95                              WSS@100
UP          SL     AL     AV     # QD (%)        SL     AL     AV     # QD (%)
CD011549    0.18   0.47   0.29   319 (19.6%)     0.19   0.28   0.09   366 (22.5%)
CD008643    0.73   0.74   0.01   47 (1.1%)       0.67   0.72   0.05   234 (5.7%)
CD008686    0.70   0.81   0.10   992 (13.2%)     0.61   0.86   0.25   1,093 (14.5%)
CD010409    0.26   0.49   0.22   92 (15.6%)      0.03   0.43   0.41   155 (26.3%)
CD009593    0.17   0.33   0.16   263 (34.6%)     0.01   0.11   0.10   29 (3.8%)
CD011548    0.22   0.65   0.43   1,021 (13.7%)   0.04   0.25   0.21   334 (4.5%)
CD010438    0.91   0.93   0.03   100 (1.6%)      0.96   0.98   0.03   100 (1.6%)
CD009591    0.16   0.35   0.19   206 (21.1%)     0.03   0.08   0.04   745 (76.3%)
CD010632    0.31   0.39   0.08   443 (22.4%)     0.36   0.44   0.08   443 (22.4%)
CD009323    0.54   0.60   0.06   202 (4.9%)      0.12   0.39   0.27   1,402 (34.2%)
CD007394    0.22   0.69   0.47   354 (18.2%)     0.01   0.03   0.02   1,851 (95.4%)
CD011984    0.15   0.30   0.16   238 (30.1%)     0.00   0.03   0.03   521 (65.9%)
CD008691    0.33   0.73   0.40   552 (8.7%)      0.11   0.54   0.43   2,904 (45.7%)
CD011975    0.22   0.82   0.61   2,707 (12.5%)   0.07   0.53   0.46   2,518 (11.6%)
CD007427    0.30   0.47   0.17   151 (9.4%)      0.01   0.05   0.04   431 (26.8%)
CD008054    -0.02  0.14   0.16   107 (66.9%)     0.01   0.07   0.06   127 (79.4%)
CD009944    0.40   0.74   0.33   540 (13.5%)     0.05   0.15   0.10   29 (0.7%)
CD009020    0.22   0.59   0.37   63 (9.6%)       0.15   0.49   0.34   179 (27.2%)
CD011134    0.41   0.60   0.19   61 (8.1%)       0.18   0.50   0.31   370 (49.3%)
CD010771    0.48   0.73   0.26   137 (10.8%)     0.35   0.40   0.05   10 (0.8%)

Table 6.1: Results of the experiments for the twenty DTA systematic reviews: WSS of supervised learning models (SL) and active learning models (AL) at recall 95% and 100%, their difference (i.e., added value, AV), and the number of query documents (QD), with its percentage of the unlabelled pool (UP, the held-out review), when the active learning models achieved this value.

We also evaluated the trends of the WSS of the active learning models at recalls 95% and 100% during the iterative process for all twenty DTA systematic reviews. The graphs in Appendix A present how WSS@100 changed during the active learning process for each review. For all these graphs, we normalized the X-axis (the number of documents queried) to a percentage and subtracted the Baseline WSS from the WSS during active learning. In this way, all trends for the twenty reviews can be presented in one graph – see Figure 6.1.

We found that the WSS of many active learning models fluctuated (and often even dropped below zero) at the beginning. This should be attributed to the poor performance of the model when few documents (from the unlabelled review) were used for training. All active learning models experienced a linear decrease of WSS at the end, because sufficient relevant documents had already been queried, labelled and added to the labelled pool, so any further query was meaningless.

The WSS of most active learning models increased to a maximum and then decreased below zero, except for a few models that experienced a decrease of WSS at an early stage before the increase. We also found that Max WSS@95 was reached when at most 67% of all unlabelled documents were queried, and Max WSS@100 was reached when at most 96% of all unlabelled documents were queried. This means that, for recall 95%, the active learning process should be stopped once 67% of all unlabelled documents have been queried; otherwise it would be a waste of resources.


Figure 6.1: Trends of difference of WSS at recalls 95% (left) and 100% (right) between baseline models (red dashed line) and active learning models during the active learning process for twenty systematic reviews. For each trend (blue line), there is one Max WSS (blue point) and the percentage of query documents when the model reached this Max WSS. The max of these 20 percentages is denoted as the blue dashed line.

The same applied to WSS@100, although the stopping point when 96% of all unlabelled documents were queried would not save much workload.

6.2 Added Value of Active Learning

In the previous experiments, we found a high heterogeneity in the added performance of active learning in workload reduction for DTA reviews. In this section we investigate the (potential) association between the added value of active learning and other metadata (see Table 6.1) available about the DTA reviews: Baseline WSS@95, Max WSS@95 of the active learning model, and the percentage of documents queried at Max WSS@95.

We calculated Pearson’s correlation coefficients [74] over these variables. A cor-relation coefficient of−1 or+1 indicates a perfect linear positive or negative rela-tionship, whereas a coefficient of zero indicates no association. We found that the coefficients calculated for WSS@100 were close to those for WSS@95, so we only pre-sented the results for WSS@95.

The Pearson coefficients between the five variables are presented in Table 6.2. We found that the coefficient between the added value and the WSS@95 of supervised learning (SL) was negative (-0.49). This indicates that a review already achieving high performance with supervised learning gains little additional benefit from active learning; for example, CD008643 and CD010438 in Figure 6.1 represent such cases. Moreover, the negative association with the inclusion rate (-0.31) may imply that active learning can provide more benefit in a review with a relatively small percentage of relevant documents. However, the coefficient between the added value and the percentage of query documents was close to zero (-0.03), indicating a weak or absent association.

6.3 Conclusion

The results of our experiments indicate that active learning can further improve the performance of supervised learning in the task of screening DTA reviews, although this added performance varied considerably among reviews.

              % Inclusion   SL      AL      % QD
SL            -0.46         1.00
AL            -0.76         0.74    1.00
% QD          0.59          -0.68   -0.78   1.00
Added Value   -0.31         -0.49   0.23    -0.03

Table 6.2: Pearson coefficients between five variables: added value, percentage of relevant documents within a review (% Inclusion), WSS@95 of the baseline model (SL), maximum WSS@95 of the active learning model (AL), and the percentage of documents queried when the active learning model achieved that value (% QD, as shown in Figure 6.1). Highlighted values were used for interpretation.


Chapter 7

Discussion and Conclusions

In this thesis, we carried out research to answer the question about the added value of active learning compared to supervised learning in workload reduction of the screening task for Diagnostic Test Accuracy (DTA) reviews. Firstly, we extracted features from the titles and abstracts of documents in systematic reviews and represented them as Bag-of-Words (BOW). Then we designed the optimal baseline model, using the linear SVM, the TF-IDF scheme of BOW and random undersampling, based on experiments using part of the dataset. Next, we designed the optimal active learning model, using uncertainty sampling as the query strategy. Finally, we applied both optimal models to the full dataset and compared their performance in workload reduction for the twenty DTA reviews.

Our results showed that the baseline model reduced the workload by 0.01 to 0.96 in nineteen out of twenty reviews while identifying all relevant documents. This means that a reviewer does not need to read 1% to 96% of all documents in a review, while not missing any relevant documents.

Our results also showed that using active learning could further reduce the workload by 0.02 to 0.46 beyond what supervised learning achieved. For example, in the review CD011975, which has 8,201 documents, the baseline model could only save 7% of the documents (i.e., 574 documents) while identifying all relevant documents. This saved workload increased to 53% when active learning was applied, leading to 4,346 documents that a reviewer did not need to read.

The strength of this study is that it compares active learning against supervised learning in their performance of workload reduction, and that all reviews are related to one topic – Diagnostic Test Accuracy. Previous studies [10, 22, 34, 46, 51, 52] mainly focused on finding the best-performing active learning model among alternatives for feature representation, class imbalance technique, and query strategy. Wallace et al. [10] developed an active learning model with a novel undersampling technique and an adapted uncertainty sampling strategy. This model performed best among all tested alternatives and could reduce the number of documents that must be screened manually by 40% to 50% over three real-world biomedical systematic reviews. Singh et al. [34] found that active learning models using BOW to represent features reduced the workload by 40% to 53% for public health reviews. They also found that using paragraph vectors improved the performance of active learning models to around 65%. Kontonatsios et al. [51] compared the Utility (i.e., a relative measure of Burden and Yield that takes into account reviewer preferences for weighting these two concepts [22]) of active learning models using different query strategies and feature representations across both clinical and public health reviews, but they did not interpret the findings in terms of workload reduction. Moreover, these studies [10, 22, 34, 51] did not focus on a set of reviews related to a single topic, like the DTA reviews in our project. Instead, they conducted studies over systematic reviews on different topics, e.g., clinical reviews about COPD and Proton Beam, and social science reviews about Sanitation and Cooking Skills.

This study also has several limitations discussed below.

First, some decisions in the model design were not based on experiments. For example, BOW and SVM were chosen for the baseline because they are straightforward. An alternative feature representation is the paragraph vector, which has been shown to help reduce the screening workload for clinical reviews [51]. Concerning the choice of classifier, Singh et al. [34] observed that a linear SVM and a logistic regression classifier perform similarly, so they used the logistic regression classifier, which is computationally cheaper.
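As an illustration of the paragraph vector alternative mentioned above, the sketch below trains document embeddings with gensim's Doc2Vec; the hyperparameters and example texts are illustrative assumptions, not the configuration used in [51].

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "ultrasound versus liver function tests for diagnosis of common bile duct stones",
    "gloves for preventing percutaneous exposure injuries in healthcare personnel",
]
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

# Train paragraph vectors (Doc2Vec); the resulting dense vectors could replace
# the sparse BOW features as classifier input.
model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

doc_vectors = [model.infer_vector(t.split()) for t in texts]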

Second, it remains unknown when active learning has a positive added value. We found that many active learning models performed poorly when they queried too few or too many documents during the iterations (see Figure 6.1). When the model for review CD008054 reached its maximum WSS@95 with around 67% of all unlabelled documents queried, most other models had a negative WSS@95. This asynchronous behaviour makes it difficult to provide intuition for the stopping point of the active learning process.

Third, this study still lacks a large-scale validation on a more comprehensive dataset with more systematic reviews to evaluate the generalizability of the developed active learning model.

Fourth, one potential bias should be considered. Both the optimal baseline model and the optimal active learning model were selected using a partial dataset of five reviews, but these five reviews were not excluded from the full dataset and were re-used when assessing the added performance. We checked this bias by re-running the Pearson correlation analysis while withholding these five reviews, and found that the coefficients changed only slightly. This suggests that the bias has a negligible effect on the added value of active learning.

As for future work, we foresee several possible directions to follow.

First, it is possible to further optimize the design of the active learning model. One option is to test more query strategies; for example, Singh et al. [34] proposed querying documents by taking both uncertainty and novelty into account, where novelty is defined as the probability that a query document is novel with respect to the labelled pool. Even if a document is the most uncertain, it may not rank first in overall informativeness if it has a low novelty score (a sketch of such a combined score is given below). Establishing an appropriate stopping criterion should also be considered, because querying too many documents can decrease the performance of the model. Some existing criteria are computationally expensive [71–73], but adapting them to be more efficient is possible and could be another direction for future work.
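The sketch below illustrates one possible way to combine uncertainty (distance to the SVM decision boundary) with a simple novelty term (mean cosine distance to the already labelled documents); the weighting and the novelty measure are assumptions for illustration and do not reproduce the exact scheme of Singh et al. [34].

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def select_query(clf, X_unlabelled, X_labelled, novelty_weight=0.5):
    # Uncertainty: documents closest to the SVM decision boundary are the most uncertain.
    margins = np.abs(clf.decision_function(X_unlabelled))
    uncertainty = 1.0 / (1.0 + margins)

    # Novelty: documents that are, on average, far from everything already labelled.
    novelty = cosine_distances(X_unlabelled, X_labelled).mean(axis=1)

    # Weighted combination; the document with the highest score is queried next.
    informativeness = (1 - novelty_weight) * uncertainty + novelty_weight * novelty
    return int(np.argmax(informativeness))

Here clf would be the classifier trained on the current labelled pool; after the selected document is labelled, the model is retrained and the selection step repeats.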

Second, it is possible to investigate the association between the choices made in the model design and the domains of the reviews. Singh et al. [34] found that active learning models using BOW performed better in social science reviews, whereas paragraph vectors performed better in clinical reviews. This association may provide an effective way to choose the appropriate feature extraction model for active learning given the topic of a systematic review.

Third, it would be possible to incorporate more realistic factors into the active learning model design. In many settings, labelling costs vary considerably [75], so cost-sensitive active learning [76] might be promising; in this case, the cost of acquiring a label can vary from one document to another. Moreover, oracles might not label correctly all the time, so the error rate of labelling could also be taken into account.
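As a minimal sketch of the cost-sensitive idea, assuming a hypothetical per-document cost estimate (e.g., expected reading time), documents could be ranked by informativeness per unit cost instead of by informativeness alone:

import numpy as np

def select_query_cost_sensitive(informativeness, costs):
    # Pick the document with the best informativeness-to-cost ratio, so that
    # cheap-but-informative documents are preferred over expensive ones.
    informativeness = np.asarray(informativeness, dtype=float)
    costs = np.asarray(costs, dtype=float)
    return int(np.argmax(informativeness / costs))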



Bibliography

[1] D. Gough, S. Oliver, and J. Thomas. An introduction to systematic reviews. Sage, 2017.

[2] M.-C. Lavoie, J. H. Verbeek, and M. Pahwa. “Devices for preventing percutaneous exposure injuries caused by needles in healthcare personnel”. In: Cochrane Database of Systematic Reviews 3 (2014).

[3] C. Mischke et al. “Gloves, extra gloves or special types of gloves for preventing percutaneous exposure injuries in healthcare personnel”. In: Cochrane Database of Systematic Reviews 3 (2014).

[4] A. Martin, D. H. Saunders, S. D. Shenkin, and J. Sproule. “Lifestyle intervention for improving school achievement in overweight or obese children and adolescents”. In: The Cochrane Library (2014).

[5] S. Fletcher-Watson, F. McConnell, E. Manola, and H. McConachie. “Interventions based on the Theory of Mind cognitive model for autism spectrum disorder (ASD)”. In: Cochrane Database of Systematic Reviews 3 (2014), p. CD008785.

[6] J. Higgins and S. Green. Cochrane handbook for systematic reviews of interventions. Version 5.1.0. The Cochrane Collaboration, 2011.

[7] R. Whittemore and K. Knafl. “The integrative review: updated methodology”. In: Journal of advanced nursing 52.5 (2005), pp. 546–553.

[8] K. S. Gurusamy et al. “Ultrasound versus liver function tests for diagnosis of common bile duct stones”. In: Cochrane Database of Systematic Reviews 2 (2015).

[9] I. E. Allen and I. Olkin. “Estimating time to conduct a meta-analysis from number of citations retrieved”. In: JAMA 282.7 (1999), pp. 634–635.

[10] B. C. Wallace, T. A. Trikalinos, J. Lau, C. Brodley, and C. H. Schmid. “Semi-automated screening of biomedical citations for systematic reviews”. In: BMC bioinformatics 11.1 (2010), p. 55.

[11] J. McGowan and M. Sampson. “Systematic reviews need systematic searchers”. In: Journal of the Medical Library Association 93.1 (2005), p. 74.

[12] A. O’Mara-Eves, J. Thomas, J. McNaught, M. Miwa, and S. Ananiadou. “Using text mining for study identification in systematic reviews: a systematic review of current approaches”. In: Systematic reviews 4.1 (2015), p. 5.

[13] A. M. Cohen. “Performance of support-vector-machine-based classification on 15 systematic review topics evaluated with the WSS@95 measure”. In: Journal of the American Medical Informatics Association: JAMIA 18.1 (2011), p. 104.

[14] S. Jonnalagadda and D. Petitti. “A new iterative method to reduce workload in the systematic review process”. In: International journal of computational biology and drug design 6 (2013), p. 5.

[15] Y. Ma. “Text classification on imbalanced data: Application to Systematic Reviews Automation”. PhD thesis. University of Ottawa (Canada), 2007.


[16] S. Matwin, A. Kouznetsov, D. Inkpen, O. Frunza, and P. O’Blenis. “Performance of SVM and Bayesian classifiers on the systematic review classification task”. In: Journal of the American Medical Informatics Association 18.1 (2011), pp. 104–105.

[17] R. Paynter et al. “EPC methods: an exploration of the use of text-mining software in systematic reviews”. In: (2016).

[18] S. Matwin, A. Kouznetsov, D. Inkpen, O. Frunza, and P. O’Blenis. “A new algorithm for reducing the workload of experts in performing systematic reviews”. In: Journal of the American Medical Informatics Association 17.4 (2010), pp. 446–453.

[19] B. C. Wallace et al. “Toward modernizing the systematic review pipeline in genetics: efficient updating via data mining”. In: Genetics in medicine 14.7 (2012), p. 663.

[20] A. M. Cohen, W. R. Hersh, K. Peterson, and P.-Y. Yen. “Reducing workload in systematic review preparation using automated citation classification”. In: Journal of the American Medical Informatics Association 13.2 (2006), pp. 206–219.

[21] Y. Freund and R. E. Schapire. “Large margin classification using the perceptron algorithm”. In: Machine learning 37.3 (1999), pp. 277–296.

[22] M. Miwa, J. Thomas, A. O’Mara-Eves, and S. Ananiadou. “Reducing systematic review workload through certainty-based screening”. In: Journal of biomedical informatics 51 (2014), pp. 242–253.

[23] B. Du et al. “Exploring representativeness and informativeness for active learning”. In: IEEE transactions on cybernetics 47.1 (2017), pp. 14–26.

[24] Python Code for Active Learning Project. https://github.com/AMCeScience/python-miner/archive/v3.0.zip. Version 3.0. 2018.

[25] D. Angluin. “Queries and concept learning”. In: Machine learning 2.4 (1988), pp. 319–342.

[26] L. E. Atlas, D. A. Cohn, and R. E. Ladner. “Training connectionist networks with queries and selective sampling”. In: Advances in neural information processing systems. 1990, pp. 566–573.

[27] Y. Chen, S. Mani, and H. Xu. “Applying active learning to assertion classification of concepts in clinical text”. In: Journal of biomedical informatics 45.2 (2012), pp. 265–272.

[28] A. McCallum, K. Nigam, et al. “A comparison of event models for naive bayes text classification”. In: AAAI-98 workshop on learning for text categorization. Vol. 752. 1. Citeseer. 1998, pp. 41–48.

[29] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. “Text classification from labeled and unlabeled documents using EM”. In: Machine learning 39.2-3 (2000), pp. 103–134.

[30] N. Roy and A. McCallum. “Toward optimal active learning through Monte Carlo estimation of error reduction”. In: ICML, Williamstown (2001), pp. 441–448.

[31] X. Zhu, J. Lafferty, and Z. Ghahramani. “Combining active learning and semi-supervised learning using gaussian fields and harmonic functions”. In: ICML 2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining. Vol. 3. 2003.


[32] G. V. Cormack and M. R. Grossman. “Evaluation of machine-learning protocols for technology-assisted review in electronic discovery”. In: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM. 2014, pp. 153–162.

[33] N. Nissim et al. “Improving condition severity classification with an efficient active learning based framework”. In: Journal of biomedical informatics 61 (2016), pp. 44–54.

[34] G. Singh, J. Thomas, and J. Shawe-Taylor. “Improving Active Learning in Systematic Reviews”. In: arXiv preprint arXiv:1801.09496 (2018).

[35] Z. Yu, N. A. Kraft, and T. Menzies. “Finding better active learners for faster literature reviews”. In: Empirical Software Engineering (2018), pp. 1–26.

[36] E. P. Pednault. Statistical learning theory. Citeseer, 1997.

[37] V. Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.

[38] B. McDermott, M. O’Halloran, E. Porter, and A. Santorelli. “Brain haemorrhage detection using a SVM classifier with electrical impedance tomography measurement frames”. In: PloS one 13.7 (2018), e0200469.

[39] T. Joachims. Making large-scale SVM learning practical. Tech. rep. Technical Report, SFB 475: Komplexitätsreduktion in Multivariaten Datenstrukturen, Universität Dortmund, 1998.

[40] R. Krishna, Z. Yu, A. Agrawal, M. Dominguez, and D. Wolf. “The BigSE project: lessons learned from validating industrial text mining”. In: Proceedings of the 2nd International Workshop on BIG Data Software Engineering. ACM. 2016, pp. 65–71.

[41] B. Settles. “Active learning”. In: Synthesis Lectures on Artificial Intelligence and Machine Learning 6.1 (2012), pp. 1–114.

[42] B. Settles, M. Craven, and S. Ray. “Multiple-instance active learning”. In: Advances in neural information processing systems. 2008, pp. 1289–1296.

[43] K. Chaloner and I. Verdinelli. “Bayesian experimental design: A review”. In: Statistical Science (1995), pp. 273–304.

[44] H. S. Seung, M. Opper, and H. Sompolinsky. “Query by committee”. In: Proceedings of the fifth annual workshop on Computational learning theory. ACM. 1992, pp. 287–294.

[45] D. D. Lewis and J. Catlett. “Heterogeneous uncertainty sampling for supervised learning”. In: Machine Learning Proceedings 1994. Elsevier, 1994, pp. 148–156.

[46] Z. Yu, N. A. Kraft, and T. Menzies. “How to read less: Better machine assisted reading methods for systematic literature reviews”. In: CoRR, abs/1612.03224 (2016).

[47] H. Kilicoglu, D. Demner-Fushman, T. C. Rindflesch, N. L. Wilczynski, and R. B. Haynes. “Towards automatic recognition of scientifically rigorous clinical research evidence”. In: Journal of the American Medical Informatics Association 16.1 (2009), pp. 25–31.

[48] C. Blake and W. Pratt. “Better rules, fewer features: a semantic approach to selecting features from text”. In: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on. IEEE. 2001, pp. 59–66.
