
ACTIVE LEARNING IN SYSTEMATIC LITERATURE REVIEW

submitted in partial fulfillment for

the degree of master of science

HUYEN LE

164413989

master information studies

data science

faculty of science

university of amsterdam

2018-06-28

Internal Supervisor: Dr Evangelos Kanoulas, University of Amsterdam

Second reader: Dr Ilya Markov, University of Amsterdam


ACTIVE LEARNING IN SYSTEMATIC LITERATURE REVIEW

Huyen Le

University of Amsterdam Amsterdam, the Netherlands

huyen.le@student.uva.nl

ABSTRACT

A Systematic Literature Review ("SLR") is the process of searching for and screening a large number of articles on a topic in order to locate the relevant ones among the articles returned by running a search query on bibliographic databases. This process requires considerable effort given the large number of articles involved. By applying text mining, the selection process of an SLR can be partially automated, thereby reducing the time spent by reviewers.

Two main tasks during the SLR are discussed in this thesis: prioritizing documents in the active learning process and classifying documents. On two datasets, provided by Deep Dynamics and by task 2 of the CLEF eHealth lab1, different text representations, choices of "seed" documents to start the active learning process, and degrees of classifier regularization are tried. Finally, an effort-effectiveness analysis is carried out on four variants of the SLR process.

Given the same proportion of reviewed documents, the TFIDF vector space model yields the highest recall on the Deep Dynamics dataset and on the part of the CLEF dataset with 5% or less relevant documents, whereas the LDA text representation outperforms TFIDF and LSI on the CLEF topics with more than 5% relevant documents. A moderate level of regularization gives the best performance for the ranking task, while allowing for overfitting can increase recall and precision in the classification task.

Up to 100% recall and a 26% saving in time can be achieved by using active learning during the abstract screening stage. The thesis therefore demonstrates the feasibility and the potential of applying active learning and text mining in SLR.

1 INTRODUCTION

To support reimbursement claims of pharmaceutical products in Europe, SLRs are carried out to find relevant scientific information, which is presented at a later stage to health authorities. The process involves sifting through an extensive collection of literature, followed by a manual selection of articles by one or two reviewers and extraction of data from the selected articles. However, the manual part is time-consuming and can take up to seven days for a medium-sized search of 5000 articles. Therefore, the search is limited to a practical size, which limits the thoroughness of the search and potentially misses relevant articles.

In more detail, the SLR is a highly structured process for reviewing the literature on a specific topic or group of topics to distill a targeted subset of knowledge or data [2]. In practice, during an SLR, the reviewers first search a database with a query tailored to the question or the product [17]. Then, there is a two-phase screening process. The first phase of systematic reviews is "broad screening" [2], where the reviewers read only titles and abstracts

1 http://sites.google.com/view/clef-ehealth-2018/task-2-technologically-assisted-reviews-in-empirical-medicine

Figure 1: Two setups for the Active learning process (A: prioritization for documents to be reviewed; B: classification for all documents)

of the articles to select the documents. The goal of this phase is to maximize the number of relevant articles and minimize the number of non-relevant articles passed on to the next phase. The second phase is "strict screening". In this phase, the reviewers need to read through the whole body of the articles included from the first phase to find every article that contains the sought-after evidence [2].

The thesis aims to improve the workflow of the SLR process by applying the active learning method to the selection of relevant articles in the broad screening phase. Two setups for active learning are discussed:

• Setup A: The prioritization setup, where the aim is to prioritize the most relevant studies to be screened first. In this setup, the review process stops only when all the relevant documents are believed to have been returned to the reviewers already.

• Setup B: The classification setup, where the aim is to review a certain number of documents and then classify the rest of the dataset. In this setup, the uncertainty-based query strategy is used to select the next instances to be reviewed.

As no labels are available for reference at the beginning of the active learning process, it is also essential to study how to create "seed" positive and negative examples. One option is to select random documents from the unlabeled dataset, as applied by Cormack and Grossman [3]. However, as there are metrics which can be used to rank documents according to their relevance, it is worth investigating whether the lowest-ranked documents can be used as the "synthetic" non-relevant documents instead of random ones.

Various specifications of the active learning process are examined to evaluate the following research questions:

• RQ1: What are the most suitable text representations for the articles used in the systematic literature review?


• RQ2: How can the performance of the active learning process be improved by:

– applying different regularization on the classifier,

– using certainty-based or uncertainty-based sampling for the classification task in setup B,

– randomly choosing documents or using a ranked list of documents to create the "seed" training samples.

Finally, a cost-effectiveness analysis is conducted to examine the effort reduced by applying active learning in the SLR.

The thesis is organized as follows. Section 2 discusses the related literature. Section 3 introduces the studied datasets. Sections 4 and 5 describe in detail how the experiment is implemented and evaluated. Section 6 describes the final results. Finally, section 7 evaluates both the effort and the effectiveness of four different approaches to automating the SLR process.

2 RELATED WORK

2.1 Support Vector Machine

The Support Vector Machine ("SVM") or Support Vector Classifier ("SVC") [5] is a supervised learning method. Specifically, given a set of training examples belonging to two categories, it generates a hyperplane that separates the examples so that the two categories are divided by the maximum possible gap.

Given examples x_i ∈ R^n belonging to two classes y_i ∈ {+1, −1}, mapped to points in space by a function f(x), and their dividing hyperplane characterized by w, the objective of SVM is to minimize:

\frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \lambda\,\Omega(w)    (1)

where L is the loss function and Ω is the penalty function of the model parameters. To set the amount of regularization for the SVM, the parameter λ trades off between correct classification and the complexity of the decision hyperplane.

As applied in the sklearn package, equation 1 can be rewritten as:

C\sum_{i=1}^{n} L(f(x_i), y_i) + \Omega(w)    (2)

Decreasing C puts more weight on Ω and thus helps avoid overfitting. Vice versa, increasing C puts more weight on L and thus aims at correctly classifying more training examples [12].

Regarding text classification, SVM performs well because the features generated from text data typically lie in a high-dimensional space with only a small number of relevant features, and SVM can handle large feature spaces with little overfitting [7].
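To make the role of C in equation 2 concrete, the minimal sketch below (assuming scikit-learn; the toy documents and labels are made up for illustration) fits linear SVMs with the three regularization settings used later in the experiments:

```python
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for abstracts; labels 1 = relevant, 0 = non-relevant.
docs = ["rotavirus vaccine efficacy trial",
        "copd exacerbation aetiology study",
        "unrelated cardiology case report",
        "rotavirus vaccination impact cohort"]
labels = [1, 0, 0, 1]

X = TfidfVectorizer().fit_transform(docs)

# Small C -> heavier weight on the penalty term, a simpler decision boundary;
# large C -> more weight on the loss term, fitting the training data more closely.
for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C).fit(X, labels)
    print(C, clf.score(X, labels))
```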

2.2 Active learning

Active learning in SLR is the process of improving the machine's predictions through iterative feedback from the reviewers [11]. Initially, the machine learns from the first examples given by the reviewers. It then returns a list of articles to the reviewers and asks them to provide labels. Subsequently, it updates its decision rules according to the new examples and generates another list of instances for the reviewers to screen [11]. The process continues until the specified stopping criterion is met.

To select the list of examples to be labeled in the classification setup B, the uncertainty-based sampling strategy is used. Specifically, the machine queries the instances it is least sure how to label. For an SVM classifier under the uncertainty-based strategy, the next examples to be queried are the ones with the smallest distances to the decision boundary [8], i.e. the smallest margins. In setup A, as the goal is to review the most relevant articles first, the examples with the largest margins on the included side are returned instead.
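Both selection rules can be expressed directly with the SVM decision function. The following minimal sketch (illustrative only; it assumes a fitted LinearSVC `clf` and a feature matrix `X_unlabeled` of the not-yet-reviewed documents) returns the indices of the next batch for each setup:

```python
import numpy as np

def select_next(clf, X_unlabeled, batch_size, setup="A"):
    """Pick indices of the next documents to review.

    Setup A: largest (most positive) margins, i.e. the documents the
    classifier is most confident are relevant.
    Setup B: smallest absolute margins, i.e. the documents closest to
    the decision boundary (most uncertain).
    """
    margins = clf.decision_function(X_unlabeled)
    if setup == "A":
        order = np.argsort(-margins)          # most relevant first
    else:
        order = np.argsort(np.abs(margins))   # most uncertain first
    return order[:batch_size]
```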

2.3 Text representation

2.3.1 TF-IDF vector space model. TF-IDF stands for "term frequency-inverse document frequency". As a vector space model, TFIDF represents documents as vectors in which each dimension corresponds to a unique word. The number of dimensions can be restricted so that only the k most frequent words are taken into consideration. In a collection C of N documents, the TFIDF score for term t in document d is calculated as follows:

\mathrm{TFIDF}(t, d) = tf(t, d) \cdot idf(t, C)    (3)

The term frequency tf(t, d) is measured by the raw count of term t in document d. The inverse document frequency is the logarithm of the inverse proportion of the documents in the collection that contain the term: idf(t, C) = \log\frac{N}{df(t) + 1}. The number 1 is added to df(t) to avoid division by zero.
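A minimal numerical sketch of equation 3 with the smoothed idf defined above (the toy collection and term are made up for illustration):

```python
import math

# Toy collection of N = 4 "documents" (bags of words).
collection = [
    ["rotavirus", "vaccine", "trial"],
    ["copd", "exacerbation", "study"],
    ["vaccine", "efficacy"],
    ["cardiology", "report"],
]

def tfidf(term, doc, collection):
    tf = doc.count(term)                         # raw count of the term in the document
    df = sum(term in d for d in collection)      # number of documents containing the term
    idf = math.log(len(collection) / (df + 1))   # idf(t, C) = log(N / (df(t) + 1))
    return tf * idf

print(tfidf("vaccine", collection[0], collection))   # tf = 1, df = 2, idf = log(4/3)
```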

2.3.2 Latent Semantic Indexing. Latent Semantic Indexing or LSI is a mathematical method to model the relationships among documents from their composing words, and among words from their frequencies in the documents [16].

Specifically, given a term-document matrix X of size n × m, by Singular Value Decomposition ("SVD") the matrix X can be decomposed into the product of three matrices:

X_{n \times m} = U_{n \times r}\,\Sigma_{r \times r}\,V_{m \times r}^{T}    (4)

where Σ is a square diagonal matrix (with non-zero values on the diagonal only). These values are called singular values and appear in decreasing order. By keeping the k largest singular values and setting the others to 0, X can be approximated with k parameters [16] while the similarity between its columns is preserved. Accordingly, LSI outputs a set of concepts that relate words and documents while reducing the dimensionality of the text representations.
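A minimal sketch of equation 4 with NumPy, on a made-up term-document count matrix; in the experiments the LSI vectors are instead produced with scikit-learn's TruncatedSVD (table A2):

```python
import numpy as np

# Made-up term-document matrix X (n terms x m documents).
X = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 3., 0., 1.],
              [0., 0., 2., 2.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                           # keep only the k largest singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation of X

# Document vectors in the k-dimensional latent ("concept") space:
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]
```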

2.3.3 Latent Dirichlet Allocation. Latent Dirichlet Allocation or LDA is a generative statistical model for a collection of documents. The fundamental idea of LDA is that documents are represented as mixtures of latent topics, where each topic is a mixture of words [1]. In LDA, the probabilities of topics over documents and of words over topics are modeled using Dirichlet distributions. LDA outputs the set of topics that is most likely to have generated the collection of documents.

In detail, as described by Blei et al., LDA assumes that each topic is a multinomial distribution over the words in the vocabulary. In the case of K topics, the word distribution of each topic φ_j (j = 1, ..., K) is drawn from a Dirichlet distribution Dir(β).

Given this set of K topics with their corresponding word distributions, to generate a document i of length N, for each word we then:


• Draw θ_i from the Dirichlet distribution Dir(α) as the distribution of topics in document i.

• Draw a topic j from Multinomial(θ_i).

• Draw a word w_ij in topic j from Multinomial(φ_j) [6].
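In practice, the three representations are built with the scikit-learn components listed in table A2 (TfidfVectorizer, LatentDirichletAllocation, and TruncatedSVD for LSI). The sketch below illustrates the pattern on a toy corpus; whether LSI is computed over TF-IDF weights or raw counts is an assumption here, and the real dimensionalities (10,000 for TFIDF, 200 for LSI and LDA) are noted in the comments:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# Toy stand-in for the titles/abstracts/keywords of one topic.
abstracts = ["rotavirus vaccine efficacy in infants",
             "aetiology of acute exacerbations of copd",
             "impact of rotavirus vaccination programmes",
             "copd exacerbation frequency and causes"]

# TF-IDF vectors (max_features=10000 in the experiments, per table A2).
X_tfidf = TfidfVectorizer(max_features=10000).fit_transform(abstracts)

# LSI: truncated SVD, here of the TF-IDF matrix (n_components=200 in the experiments).
X_lsi = TruncatedSVD(n_components=2).fit_transform(X_tfidf)

# LDA: latent topics fitted on raw term counts (n_components=200 in the experiments).
X_counts = CountVectorizer().fit_transform(abstracts)
X_lda = LatentDirichletAllocation(n_components=2).fit_transform(X_counts)
```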

3 DATA

Regarding the first stage of the SLR process, when only abstracts, titles and keywords are available, two main data sources are used. The first consists of two datasets provided by Deep Dynamics, containing documents returned by PubMed and Cochrane database searches using certain query strings. The SLR objective for each topic in the Deep Dynamics dataset is given in table 1.

Dataset name   Objective
AECOPD         The range and frequency of the aetiology of acute exacerbations of COPD (AECOPD)
Rotavirus      The effectiveness and the impact of rotavirus vaccines (i.e. Rotarix and RotaTeq)

Table 1: Deep Dynamics dataset objectives

The second dataset is from the CLEF eHealth evaluation lab 2018 2. The dataset contains systematic reviews on Diagnostic Test Accuracy conducted by Cochrane. It consists of 72 topics corresponding to 72 systematic reviews. However, within the scope of this research, only the 40 topics with fewer than 5000 documents are studied.

From both Deep Dynamics and CLEF datasets, the following components are used:

• The query strings which were used for the Boolean search.

• The abstracts, titles, and keywords attached to the articles.

• The labels of "relevant" or "non-relevant" for the references in the broad screening stage.

Documents without abstracts, titles or keywords are excluded from the study. Table A1 in the Appendix provides more details of the datasets.

4 EXPERIMENTAL SETUP

4.1 Active learning process

The active learning method follows the "auto-TAR" process by Cormack and Grossman [3] with modified parameters. Specifically, the applied procedure is described in Algorithm 1; the step that requires human feedback in practice is step 8 (labeling the documents).

To answer the two research questions in section 1, the following specifications of Algorithm 1 are applied:

• Regarding the text representations (step 1), TFIDF, LDA, and LSI are used. More details can be found in table A2 in the Appendix.

• As the synthetic non-relevant documents (step 3), either randomly selected documents or BM25 bottom-ranked documents are used.

2 http://sites.google.com/view/clef-ehealth-2018/task-2-technologically-assisted-reviews-in-empirical-medicine

Data: Query and documents without any labels
Result: The set of documents included

(1) Create text representations for the documents and the query.
Initialization of the train dataset:
  (2) Construct a synthetic relevant document from the query.
  (3) Add 50 synthetic non-relevant documents to the train dataset.
Active learning:
  (4) Set the initial batch size B to 1.
  while the stopping criterion is not met do
    (5) Train a linear SVM classifier using the train dataset.
    (6) Use the SVM margins to rank the unlabeled documents again.
    (7) Select the next B documents to review.
    (8) Label the documents (the human-feedback step).
    (9) Add the new labels to the train dataset.
    (10) Remove the synthetic non-relevant documents from the train dataset.
    (11) Add 50 synthetic non-relevant documents from the unlabeled documents to the train dataset.
    (12) Increase the batch size B by B/100.
  end

Algorithm 1: The Active Learning Algorithm

• Three different values, 0.01, 1 and 100, of the parameter C (as in formula 2) are used as the regularization parameter for the linear SVM classifier (step 5).

• To decide which documents are to be labeled next (step 7), the most certain examples are chosen for setup A, and the least certain examples are chosen for setup B. A simplified sketch of the resulting loop is given below.
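Putting Algorithm 1 and the specifications above together, the loop can be sketched as follows. This is a simplified illustration only: X is assumed to be a dense feature matrix, `ask_reviewer` and `should_stop` are hypothetical callables standing in for the human-feedback step and for a stopping criterion from section 4.2, and the synthetic negatives are simply drawn at random (the BM25-ranked variant would replace the sampling step).

```python
import numpy as np
from sklearn.svm import LinearSVC

def active_learning_run(X, query_vector, ask_reviewer, should_stop, C=1.0, setup="A"):
    """Simplified sketch of Algorithm 1 (not the exact implementation used in the thesis).

    X            : dense feature matrix of all retrieved documents (n_docs x n_features)
    query_vector : representation of the synthetic relevant document built from the query
    ask_reviewer : hypothetical callable returning a 1/0 label for a document index (step 8)
    should_stop  : hypothetical callable implementing a stopping criterion from section 4.2
    """
    n = X.shape[0]
    labeled = {}            # document index -> label provided by the reviewer
    batch_size = 1          # (4) initial batch size

    while not should_stop(labeled):
        unlabeled = [i for i in range(n) if i not in labeled]
        if not unlabeled:
            break

        # (3)/(10)/(11) resample 50 synthetic non-relevant documents from the unlabeled
        # pool; previous synthetic negatives are implicitly discarded because only
        # reviewer-given labels are kept in `labeled`.
        synthetic_neg = list(np.random.choice(unlabeled,
                                              size=min(50, len(unlabeled)),
                                              replace=False))

        # (2) the query-based synthetic relevant document is always part of the train set.
        train_X = np.vstack([query_vector] + [X[i] for i in list(labeled) + synthetic_neg])
        train_y = [1] + [labeled[i] for i in labeled] + [0] * len(synthetic_neg)

        clf = LinearSVC(C=C).fit(train_X, train_y)       # (5) train the classifier

        margins = clf.decision_function(X[unlabeled])    # (6) rank the unlabeled documents
        order = np.argsort(-margins) if setup == "A" else np.argsort(np.abs(margins))

        for pos in order[:batch_size]:                   # (7)/(8)/(9) review the next B documents
            idx = unlabeled[pos]
            labeled[idx] = ask_reviewer(idx)

        batch_size += max(1, batch_size // 100)          # (12) grow the batch size
    return labeled
```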

4.2 Stopping criteria

4.2.1 Knee detection method. The knee-finding method proposed by Cormack and Grossman is employed to decide the stopping point for the process in setup A (prioritization). Specifically, at the end of each iteration of Algorithm 1, the gain curve of the total number of relevant documents found versus the total number of documents reviewed so far (as in figure 2) is updated.

Consider each point on that curve as a candidate knee point. Let slope_before equal the slope of the line from the origin to the candidate knee point and slope_after equal the slope of the line from the candidate knee point to the last point [18]. The process stops if there is a knee point such that (i) at least 200 documents have been reviewed, and (ii) the slope ratio slope_before / slope_after at that point exceeds a certain threshold. In this research, the slope ratio for each rank i between 0 and the number s of reviewed documents is:

\rho_i = \frac{(\text{number of relevant documents up to rank } i)\,/\,i}{(\text{number of relevant documents between rank } i \text{ and rank } s)\,/\,(s - i)}    (5)

The stopping criterion for ρ after reviewing s documents can be defined as follows:


Figure 2: Knee Detection [14]

SC = \begin{cases} 1 & \text{if } s \geq 200 \text{ and } \exists\, i \leq s \text{ such that } \rho_i \geq 156 - \min(\mathit{relret}, 150) \\ 0 & \text{otherwise} \end{cases}

where relret is the number of relevant documents found so far. Specifically, to avoid stopping early because only a small number of relevant documents has been found, and following the method by Cormack and Grossman, the threshold for ρ is 156 when no relevant document has been found yet and decreases to six once at least 150 relevant documents have been found.
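A minimal sketch of this stopping rule (rel_flags is a hypothetical list marking, in review order, whether each reviewed document was relevant; the thresholds follow equation 5 and the criterion above):

```python
def knee_stop(rel_flags):
    """Return True if the knee-based stopping criterion is met.

    rel_flags: list of 0/1 flags, one per reviewed document, in review order.
    """
    s = len(rel_flags)
    if s < 200:
        return False
    relret = sum(rel_flags)
    threshold = 156 - min(relret, 150)

    cumulative, cum = [], 0            # relevant documents found up to each rank
    for flag in rel_flags:
        cum += flag
        cumulative.append(cum)

    for i in range(1, s):              # candidate knee points
        before = cumulative[i - 1] / i
        after = (cumulative[-1] - cumulative[i - 1]) / (s - i)
        rho = before / after if after > 0 else float("inf")
        if rho >= threshold:
            return True
    return False
```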

4.2.2 Maximum uncertainty. For setup B (classification), as the aim is to classify the unlabeled data as accurately as possible and to stop only when the classifier is certain enough about the unlabeled data, the maximum uncertainty method is applied. In detail, the stopping criterion is met if the maximum uncertainty value over all unlabeled examples is less than a predefined threshold [19]. The entropy-based uncertainty measure is applied. Following Zhu et al., the uncertainty measure for each unlabeled example is defined as:

UM(x) = -\sum_{y \in Y} P(y|x)\,\log P(y|x)    (6)

where P(y|x) is the a posteriori probability and Y is the set of classification classes. A higher value of UM means that the classifier is more uncertain about the unlabeled article x. The estimate of P(y|x) is calculated with Platt calibration [13], using the CalibratedClassifierCV function in the sklearn package.

The stopping criterion is defined as follows:

SC = \begin{cases} 1 & \text{if } UM(x) \leq 0.2 \ \forall x \text{ unlabeled} \\ 0 & \text{otherwise} \end{cases}

5 EVALUATION

5.1 Evaluation of the classification task

For the classification task in setup B, the output is the set of labels for the whole dataset, obtained after the reviewers have labeled a certain number of documents. Only the documents classified as relevant by the machine are retrieved. The metrics applied to the labels provided

by the classifier are recall and precision. Given the contingency table in table 2, Recall = tp/(tp + fn) and Precision = tp/(tp + fp) [9].

                 Relevant              Non-relevant
Retrieved        True positive (tp)    False positive (fp)
Not retrieved    False negative (fn)   True negative (tn)

Table 2: Relevant-Retrieved contingency table

The recall and precision metrics in setup B are denoted as

M^{evaluated dataset}_{trained samples}

where M is the corresponding metric. For example, recall^all_stop denotes the recall obtained on the whole dataset by the linear SVM trained on the samples returned until the stopping criterion is met.

5.2 Evaluation of the prioritization task

In the prioritization setup, the output is a ranked list of documents returned to the reviewers. The evaluation metrics for this setup are calculated over this ranked list. The metrics used are recall at the knee, Mean Average Precision (MAP) at the knee, Normalized Discounted Cumulative Gain (NDCG), and NDCG at the knee. The details of these metrics are explained in this section.

In a ranked or prioritized retrieval context, the top k returned documents are considered the set of retrieved documents, with k decided by the knee detection method described in section 4.2.1. The recall at the knee is the proportion of relevant documents returned to the reviewer before the knee point is reached.

For a single topic or query string, the average precision (AP) is the average of the precision values at the positions of all the relevant documents in the returned ranked list. Mean average precision (MAP) is the average of the AP values across topics.

Discounted Cumulative Gain is the gain accumulated up to a particular rank k of the ordered document list:

DCG@k = \sum_{r=1}^{k} \frac{2^{rel_r} - 1}{\log_2(1 + r)}

where rel_r = 1 if the document at rank r is relevant and 0 otherwise. Normalized Discounted Cumulative Gain is the DCG normalized by the best possible DCG value, i.e. the case in which all the relevant documents are retrieved at the top of the list:

NDCG = \frac{DCG}{DCG_{ideal}}

Here, NDCG is calculated both for the complete set of documents and at the threshold decided by the knee method.
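A minimal sketch of these two definitions (rel_flags is the list of relevance values of the returned documents in ranked order; the ideal DCG places all relevant documents at the top):

```python
import math

def dcg(rel_flags, k=None):
    """DCG@k = sum_{r=1..k} (2^rel_r - 1) / log2(1 + r)."""
    if k is None:
        k = len(rel_flags)
    return sum((2 ** rel - 1) / math.log2(1 + r)
               for r, rel in enumerate(rel_flags[:k], start=1))

def ndcg(rel_flags, k=None):
    """NDCG = DCG / DCG of the ideal ranking (all relevant documents first)."""
    ideal_dcg = dcg(sorted(rel_flags, reverse=True), k)
    return dcg(rel_flags, k) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg([1, 0, 1, 0, 0]))  # example ranked list with 2 relevant documents
```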

6 RESULTS

This section discusses the results obtained with the different specifications and answers the research questions from the introduction.

6.1 Prioritization task - Setup A

6.1.1 Deep Dynamics Dataset. The results for the two datasets provided by Deep Dynamics are summarized in table 3. The table shows that among the three text models, TFIDF performs best


with a total recall of 1.0. Moreover, using TFIDF vectors, the active learning process requires less feedback while finding more relevant documents before the knee point is detected, compared to LSI and LDA. The NDCG and MAP metrics are also higher with TFIDF. Thus, the TFIDF-based model is better at the task of prioritizing the relevant documents in SLR.

Regarding the regularization setting for the SVM classifier, the default value C = 1.0 improves the ranking accuracy, as the NDCG and MAP metrics all increase compared with the value 0.01. Allowing for more overfitting by increasing C to 100 does not yield better performance.

The way the synthetic non-relevant documents are selected plays only a small role in the performance of the active learning process here, as using BM25 bottom-ranked documents improves none of the metrics compared to random documents.

Figure 3 illustrates the recall achieved when a certain proportion of documents has been reviewed. The insights from the figure are consistent with table 3.

6.1.2 CLEF dataset. The results of the prioritization task for the topics in the CLEF dataset are displayed in table 4. It can be seen from the table that the TFIDF features perform best and require the least feedback to achieve 100% recall. Comparing NDCG and MAP also reveals that the ranking of documents with TFIDF is better than with LSI and LDA.

The results for the choices of the regularization parameter on the CLEF dataset corroborate the findings on the Deep Dynamics dataset, with higher recall when C is equal to 1.0 or 100.0.

As averaging the recalls across all the CLEF topics shows only small differences, it is worth investigating the performance on different subsets of topics. Figures 4 and 5 illustrate the recalls for the 21 topics with 5% or more relevant documents and the 19 topics with less than 5% relevant documents, respectively. As can be seen from those figures, in topics with a small proportion of relevant documents, the TFIDF text representation outperforms the other models, while the LDA recalls are higher in the other case.

Regarding the regularization choices, there is no difference in performance in figure 5b, for the topics with less than 5% relevant documents. However, applying C = 0.01 (i.e. increasing the penalty on the model parameters) significantly improves recall for topics with a higher proportion of included studies.

Similar to the Deep Dynamics dataset, using BM25 bottom-ranked documents as synthetic negative examples does not improve recall significantly.

6.2 Classification task - Setup B

6.2.1 Deep Dynamics Dataset. In the classification task, the evaluation metrics are calculated over the labels returned by the linear SVM classifier after a certain number of documents has been reviewed. Table 5 presents the amount of feedback, recall, and precision on the Deep Dynamics dataset for this task. The LSI text representation outperforms the others with an average recall of 90%, despite its smaller number of dimensions.

In contrast to setup A, where more regularization means better performance, in setup B setting the parameter C to 100, i.e. allowing for more overfitting, can increase recall.

6.2.2 CLEF Dataset. Similar patterns to the Deep Dynamics dataset can also be observed for the CLEF dataset in table 6. However, the classification performance is not as good on the CLEF topics as on the Deep Dynamics data. The best performing text vectors are again LSI, but the highest average recall stays at 31%. A higher C allows the linear SVM to predict better in this case, as can be seen from the recall and precision values.

In general, for both datasets, there is little difference in performance between using random documents and using BM25 bottom-ranked documents as seed non-relevant examples.

7 EFFORT-EFFECTIVENESS ANALYSIS

To examine whether applying text mining in the SLR process can reduce the workload while maintaining high recall, compared to manually screening in both the first and second stages of the SLR process, this section considers the performance and the effort of four variants of the SLR approach.

The approaches are displayed in figure 6. The first one is "Single screening", where all the abstracts and full-texts are manually screened. In the first stage, the reviewer reads all the abstracts returned by the Boolean search and selects only the ones which meet the predefined inclusion criteria. In the second stage, the full-texts of the selected documents are retrieved and read by the reviewer, and only the full-text documents that are eligible are eventually included. It is called "single screening" because only one reviewer is involved in the process. In practice, as described by Shemilt et al. [15], there are two other SLR variants, namely "Safety first" and "Double screening", where the opinions of two reviewers are taken into consideration.

In "Scenario 1", the active learning method is applied during the first stage of abstract screening. In this stage, the active learn-ing process is set up to prioritize the most relevant documents. The process only stops when the knee point is reached. The re-maining documents which are not yet reviewed will be discarded. Among the reviewed documents, those that meet all the criteria are retrieved as full-texts. Regarding the second stage of full-text screening, the reviewer in this scenario applies the same approach as in "Single screening", they will read every document and select the final eligible ones.

In "Scenario 2", the active learning method is applied to both stages of the SLR. In the first stage, the approach is similar to "Scenario 1" where the highest ranked documents are chosen to be reviewed first until the knee point is reached. Similarly, during the second stage, the most relevant documents are prioritized. However, the knee method is not applied here in the second stage due to the small numbers of documents. Instead, a fix proportion of selected full-text is considered.

In "Scenario 3", only the full-texts are directly used from the beginning. The same approach as the first stage of Scenario 1 and 2 is applied, except this time the output is the final list of eligible studies.


                 #feedback   last rel. found   #rel. found @knee   NDCG @knee   NDCG   recall @knee   MAP @knee
Text representations (results for C = 1.0 only)
TFIDF_10K        1000.00     798.50            160.0               0.81         0.81   1.00           0.52
LSI200           1399.50     1162.00           156.0               0.75         0.75   0.98           0.39
LDA200           1449.75     1235.75           156.0               0.75         0.74   0.98           0.38
Synthetic non-relevant documents
BM25 ranked      1280.83     1088.33           156.78              0.74         0.74   0.98           0.39
random           1292.67     1096.72           156.67              0.75         0.74   0.98           0.39
Regularization parameter C
0.01             1265.08     1092.33           155.92              0.69         0.69   0.97           0.30
1.0              1283.08     1065.42           157.33              0.77         0.77   0.98           0.43
100.0            1312.08     1119.83           156.92              0.77         0.77   0.98           0.43

Table 3: Setup A - Deep Dynamics datasets results

Figure 3: Setup A - Deep Dynamics datasets, review effort versus recall. (a) Text representations (TFIDF_10K, LDA200, LSI200); (b) regularization parameter values (0.01, 1.0, 100.0); (c) synthetic non-relevant documents (BM25 ranked vs. random). Each panel plots recall against the percentage of documents reviewed.

                 #feedback   last rel. found   #rel. found @knee   NDCG @knee   NDCG   recall @knee   MAP @knee
Text representations (results for C = 1.0 only)
TFIDF_10K        1514.05     1452.79           61.48               0.44         0.44   0.99           0.07
LSI200           1504.88     1460.50           61.20               0.45         0.45   1.00           0.08
LDA200           1514.05     1443.54           61.81               0.44         0.44   1.00           0.07
Synthetic non-relevant documents
BM25 ranked      1508.62     1450.25           61.29               0.45         0.45   1.00           0.07
random           1510.78     1453.21           61.43               0.44         0.44   0.99           0.07
Regularization parameter C
0.01             1509.10     1449.47           61.35               0.45         0.45   1.00           0.07
1.0              1510.99     1452.28           61.50               0.44         0.44   0.99           0.07
100.0            1509.00     1453.45           61.24               0.44         0.44   0.99           0.07

Table 4: Setup A - CLEF datasets results


Figure 4: Setup A - CLEF datasets, review effort versus recall for topics with 5% or more relevant documents. (a) Text representations; (b) regularization parameter values; (c) synthetic non-relevant documents. Each panel plots recall against the percentage of documents reviewed.

Figure 5: Setup A - CLEF datasets, review effort versus recall for topics with less than 5% relevant documents. (a) Text representations; (b) regularization parameter values; (c) synthetic non-relevant documents. Each panel plots recall against the percentage of documents reviewed.

                 #feedback   recall^all_stop   precision^all_stop
Text representations (results for C = 100.0 only)
TFIDF_10K        1274.50     1.00              1.00
LSI200           1900.50     0.94              0.89
LDA200           1598.25     0.36              0.79
Synthetic non-relevant documents
BM25 ranked      1564.33     0.58              0.71
random           1452.83     0.59              0.70
Regularization parameter C
0.01             1472.50     0.27              0.30
1.00             1462.17     0.72              0.92
100.00           1591.08     0.76              0.89

Table 5: Setup B - Deep Dynamics datasets results

                 #feedback   recall^all_stop   precision^all_stop
Text representations (results for C = 100.0 only)
TFIDF_10K        486.48      0.43              0.98
LSI200           630.41      0.39              0.83
LDA200           462.19      0.03              0.44
Synthetic non-relevant documents
BM25 ranked      564.17      0.20              0.56
random           534.05      0.20              0.57
Regularization parameter C
0.01             587.23      0.06              0.29
1.00             533.73      0.25              0.65
100.00           526.36      0.28              0.75

Table 6: Setup B - CLEF datasets results


Figure 6: Four compared SLR approaches (Single screening, Scenario 1, Scenario 2, Scenario 3), showing for each approach whether the abstract screening and full-text screening stages are carried out manually or by prioritizing the most relevant documents with active learning.

7.1 Studied data and assumptions

To examine the four approaches, the labels for full-text screening from the CLEF dataset are used. However, only part of the full-texts are available in the Open Access Subset of PubMed Central (PMC)3.

After filtering for the documents with available full-texts, and selecting only the topics with more included abstracts than included full-texts and at least ten abstracts selected in the first stage, 15 topics remain, with 13,632 documents in total. More details about these topics can be found in table A5 in the Appendix.

To compare the amount of time saved by the text mining techniques, it is assumed that the time needed by the research staff per unit of work is the same for all approaches. In more detail, the time to read abstracts or full-texts and the time to obtain documents are displayed in table 7.

Item                                       Estimated time use per unit (min)
Time to screen a title-abstract record     1
Time to retrieve a full-text record        4
Time to screen a full-text record          5

(Source: Shemilt et al. [15])
Table 7: Estimated time use per unit
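For example, with these unit times the Scenario 1 total reported later in table 8 follows directly: 9,163 abstracts screened × 1 min + 412 full-texts retrieved × 4 min + 412 full-texts read × 5 min = 9,163 + 1,648 + 2,060 = 12,871 minutes.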

7.2 Analysis

Figure 7 displays the recall in scenario 2 (how many eligible documents are retrieved among all the eligible full-texts in the dataset) versus the proportion of the selected full-texts screened. The thick blue line represents the average across the 15 studied topics.

As can be seen from the figure, the recall can reach 100% for specific topics, while on average up to 75% of all the relevant full-texts can be obtained.

The numbers of documents selected and read in each stage, summed over all 15 topics, are also presented in figure 8 for each SLR variant, using the PRISMA flow diagram [10]. The estimated effort (in terms of screening and retrieval time) is displayed in table 8.

3 https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

Figure 7: Full-text recall in scenario 2, plotted against the percentage of selected full-texts being screened; thin lines show individual topic recalls and the thick line the average recall.

As can be seen from the flow diagram in figure 8 and from table 8, applying active learning in SLR can reduce the effort in the first stage by approximately 26% while retaining almost all of the relevant abstracts. In scenario 1, where text mining is applied in the first stage only, the same number of eligible studies is eventually identified in the second stage.

For "Scenario 2", it is assumed that only 90% of the selected full-texts for each topic are reviewed. With the effort of 71% compared to single screening, 17 valid articles are omitted. Regarding Scenario 3 when the active learning process starts from the full-text data instead, it can be seen that the by reading approximately 75% of all articles, 100 over the total of 108 eligible studies can be retrieved. However, as the time spent on scanning full-text is much longer, this approach costs even more effort.

8 CONCLUSIONS AND FUTURE WORK

The study applies active learning to improve the workflow of the Systematic Literature Review ("SLR") process. The results are analyzed for two main setups: setup A, where the most relevant documents are prioritized and the process stops when it is believed that there are no more relevant documents, and setup B, where the documents the classifier is most uncertain about are reviewed and the process stops when the classifier is certain enough about the unlabeled samples.

By trying different choices of text representations, of the seed documents to start active learning with, and of the degree of regularization, on the Deep Dynamics and CLEF datasets, it is found that:

• More than 90% of relevant abstracts can be retrieved and reviewed in both datasets for the prioritization task, using TFIDF text features and the knee-detection method proposed by Cormack and Grossman.

• In the classification task, by applying the stopping criterion based on maximum uncertainty [19], the linear SVM classifier with TFIDF features is able to reach up to 100% recall on the Deep Dynamics dataset.


Figure 8: The flows of records through the screening. [PRISMA flow diagrams for the four approaches over the 15 topics, each starting from 13,632 records entering abstract screening: Single screening screens all 13,632 abstracts, selects and screens 418 full-texts and identifies 108 eligible studies; Scenarios 1 and 2 screen 9,163 abstracts (4,469 records are not screened) and select 412 full-texts, of which Scenario 1 screens all 412 and identifies 108 eligible studies while Scenario 2 screens 371 and identifies 91; Scenario 3 screens 10,135 full-texts directly and identifies 100 eligible studies.]

                              Single screening     Scenario 1          Scenario 2          Scenario 3
                              #(unit)  Time(min)   #(unit)  Time(min)  #(unit)  Time(min)  #(unit)  Time(min)
Abstracts read                13,632   13,632      9,163    9,163      9,163    9,163      0        0
Full-texts retrieved          418      1,672       412      1,648      412      1,648      10,135   40,540
Full-texts read               418      2,090       412      2,060      317      1,585      10,135   50,675
Total time (min)                       17,394               12,871              12,396              91,215
Results
Recall                        100%                 100%                84%                 93%
Eligible studies identified   108                  108                 91                  100
Time spent (vs. single
screening)                    100%                 74%                 71%                 524%

Table 8: Effort-effectiveness analysis

However, for the CLEF dataset, the average recall across the topics peaks at 43%.

• Stricter regularization of the classifier's parameters yields better performance in setup A for the CLEF topics with more than 5% relevant documents, while weaker regularization yields better results for the topics with 5% or less relevant documents.

• The choice of the "seed" documents to start the active learn-ing process with does not play an important role in improv-ing the performance.

Through an effort-effectiveness analysis on the part of the CLEF dataset for which full-texts are available via the PMC Open Access subset4, it is shown that by applying the active learning method, the effort in the first stage of title-abstract screening can be reduced by 26%, compared with the case where one reviewer conducts the SLR manually. As using full-texts, whether starting from the abstract-filtered documents or from all documents, does not retrieve all the eligible articles, the way to apply active learning to full-text data is worth studying further.

4 https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

The difference between the performance of the applied method on the Deep Dynamics dataset (2 topics) and on the CLEF dataset (40 topics) calls for a better approach. Specifically, the selection of the synthetic documents, the feature extraction, and the classifier can be improved. Nevertheless, the study has shed light on the performance and effectiveness of applying active learning to assist the SLR process.

8.1 Future work

There are multiple ways to develop this research further. Firstly, better feature engineering for the documents can improve the performance in ranking and classifying articles. In particular, it is worth considering combining all the text representations or combining predictions from the different feature spaces. Secondly, the thesis uses the knee-detection and maximum-uncertainty methods as the stopping criteria for setups A and B respectively; however, to find a better halting point where recall is maximized, other strategies for deciding when to stop learning need to be examined in the context of SLR. Finally, the model used to classify and rank documents is a linear SVM. For future work, the choice of SVM kernel function


and the application of machine learning to document ranking are worth studying.

9 ACKNOWLEDGEMENT

I would first like to thank my supervisor Dr. Evangelos Kanoulas for giving me advice and direction over the past four months. I appreciate your time and effort.

I would also like to express my sincere gratitude to Michiel van Vliet and Oscar Untied at Deep Dynamics. Thank you for your valuable advice regarding the practical application and your support.

Finally, I am grateful for the encouragement to complete this thesis that I received from my parents and my friends - Mai, Trang, and Ha.

REFERENCES

[1] David Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.

[2] Aaron M. Cohen. 2008. Optimizing feature representation for automated systematic review work prioritization. AMIA Annual Symposium Proceedings (2008), 121–125. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2656096/

[3] Gordon V. Cormack and Maura R. Grossman. 2015. Autonomy and reliability of continuous active learning for technology-assisted review. arXiv:1504.06868 (2015).

[4] Gordon V. Cormack and Maura R. Grossman. 2016. Engineering quality and reliability in technology-assisted review. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16), 75–84. https://doi.org/10.1145/2911451.2911510

[5] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273–297. https://doi.org/10.1023/A:1022627411411

[6] Matthew Hoffman, Francis R. Bach, and David M. Blei. 2010. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, 856–864.

[7] Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning. Springer, 137–142.

[8] Jan Kremer, Kim Steenstrup Pedersen, and Christian Igel. 2014. Active learning with support vector machines. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4, 4 (2014), 313–326.

[9] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

[10] David Moher, Alessandro Liberati, Jennifer Tetzlaff, Douglas G. Altman, and the PRISMA Group. 2009. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009), e1000097.

[11] Alison O'Mara-Eves, James Thomas, John McNaught, Makoto Miwa, and Sophia Ananiadou. 2015. Using text mining for study identification in systematic reviews: A systematic review of current approaches. Systematic Reviews 4, 1 (2015). https://doi.org/10.1186/2046-4053-4-5

[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[13] John Platt et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10, 3 (1999), 61–74.

[14] Ville Satopää, Jeannie Albrecht, David Irwin, and Barath Raghavan. 2011. Finding a "kneedle" in a haystack: Detecting knee points in system behavior. In Proceedings of the International Conference on Distributed Computing Systems Workshops, 166–171. https://doi.org/10.1109/ICDCSW.2011.20

[15] Ian Shemilt, Nada Khan, Sophie Park, and James Thomas. 2016. Use of cost-effectiveness analysis to compare the efficiency of study identification methods in systematic reviews. Systematic Reviews 5 (2016). https://doi.org/10.1186/s13643-016-0315-4

[16] Susan T. Dumais. 2004. Latent semantic analysis. Annual Review of Information Science and Technology 38, 1 (2004), 188–230. https://doi.org/10.1002/aris.1440380105

[17] Byron C. Wallace, Thomas A. Trikalinos, Joseph Lau, Carla Brodley, and Christopher H. Schmid. 2010. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics 11, 1 (2010), 55. https://doi.org/10.1186/1471-2105-11-55

[18] Haotian Zhang, Jimmy Lin, Gordon V. Cormack, and Mark D. Smucker. 2016. Sampling strategies and active learning for volume estimation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16), 981–984. https://doi.org/10.1145/2911451.2914685

[19] Jingbo Zhu, Huizhen Wang, Eduard Hovy, and Matthew Ma. 2010. Confidence-based stopping criteria for active learning for data annotation. ACM Transactions on Speech and Language Processing 6, 3 (2010), 1–24. https://doi.org/10.1145/1753783.1753784


APPENDIX

Topic        #doc.   #doc. used   #doc. relevant used   %rel. among doc. used
DD dataset
AECOPD       2072    1989         201                   10.10
Rotavirus    2485    2055         121                   5.89
CLEF dataset
CD007394     2542    2299         92                    4.00
CD007427     1457    1439         59                    4.10
CD008054     3149    2905         206                   7.09
CD008081     970     902          26                    2.88
CD008122     1911    1761         272                   15.45
CD008686     3964    3724         5                     0.13
CD008691     1310    1248         67                    5.37
CD008759     932     751          60                    7.99
CD008760     64      56           12                    21.43
CD008892     1499    1276         69                    5.41
CD009020     1576    1442         154                   10.68
CD009135     791     702          77                    10.97
CD009185     1615    1278         92                    7.20
CD009323     3857    3520         98                    2.78
CD009372     2248    2054         25                    1.22
CD009551     1911    1879         46                    2.45
CD009647     2785    2642         56                    2.12
CD009694     161     154          16                    10.39
CD009786     2065    1813         10                    0.55
CD009944     1162    982          98                    9.98
CD010023     981     859          52                    6.05
CD010296     4602    4491         53                    1.18
CD010386     625     579          2                     0.35
CD010438     3241    2356         30                    1.27
CD010502     2985    2601         229                   8.80
CD010542     348     338          20                    5.92
CD010632     1499    1481         27                    1.82
CD010633     1573    1464         4                     0.27
CD010657     1859    1683         139                   8.26
CD010705     114     112          23                    20.54
CD010864     2505    2204         44                    2.00
CD011053     2235    1945         12                    0.62
CD011134     1938    1792         200                   11.16
CD011420     251     241          42                    17.43
CD011431     1182    1122         297                   26.47
CD011912     1406    1289         36                    2.79
CD011926     4050    3861         40                    1.04
CD012009     536     521          37                    7.10
CD012083     322     293          11                    3.75
CD012216     217     192          11                    5.73

Table A1: CLEF dataset

Text rep.   Nr. of dimensions   Sklearn function
TFIDF       10000               TfidfVectorizer
LDA         200                 LatentDirichletAllocation
LSI         200                 TruncatedSVD

Table A2: Text representation

                                 #feedback   recall^all_stop   precision^all_stop
TFIDF_10K_BM25_ranked_0.01       1112.5      0.00              0.00
TFIDF_10K_BM25_ranked_1.0        1099.5      0.99              0.99
TFIDF_10K_BM25_ranked_100.0      1295.0      1.00              1.00
TFIDF_10K_random_0.01            535.5       0.00              0.00
TFIDF_10K_random_1.0             1138.0      1.00              0.99
TFIDF_10K_random_100.0           1254.0      1.00              1.00
LSI200_BM25_ranked_0.01          1952.0      0.81              0.91
LSI200_BM25_ranked_1.0           1952.0      0.94              0.93
LSI200_BM25_ranked_100.0         1890.5      0.89              0.92
LSI200_random_0.01               1868.5      0.81              0.91
LSI200_random_1.0                1900.5      0.97              0.95
LSI200_random_100.0              1910.5      0.99              0.86
LDA200_BM25_ranked_0.01          1719.0      0.00              0.00
LDA200_BM25_ranked_1.0           1363.0      0.20              0.83
LDA200_BM25_ranked_100.0         1695.5      0.36              0.78
LDA200_random_0.01               1647.5      0.00              0.00
LDA200_random_1.0                1320.0      0.20              0.83
LDA200_random_100.0              1501.0      0.35              0.79

Table A3: Setup B - Deep Dynamics detailed results

                                 #feedback   recall^all_stop   precision^all_stop
TFIDF_10K_BM25_ranked_0.01       677.68      0.00              0.00
TFIDF_10K_BM25_ranked_1.0        494.72      0.42              0.98
TFIDF_10K_BM25_ranked_100.0      492.85      0.43              0.98
TFIDF_10K_random_0.01            555.38      0.00              0.00
TFIDF_10K_random_1.0             506.20      0.43              0.98
TFIDF_10K_random_100.0           480.10      0.42              0.98
LSI200_BM25_ranked_0.01          514.10      0.19              0.85
LSI200_BM25_ranked_1.0           647.80      0.33              0.92
LSI200_BM25_ranked_100.0         599.42      0.38              0.86
LSI200_random_0.01               586.33      0.19              0.88
LSI200_random_1.0                634.33      0.34              0.94
LSI200_random_100.0              661.40      0.41              0.81
LDA200_BM25_ranked_0.01          623.88      0.00              0.00
LDA200_BM25_ranked_1.0           514.80      0.00              0.05
LDA200_BM25_ranked_100.0         512.25      0.03              0.42
LDA200_random_0.01               566.02      0.00              0.00
LDA200_random_1.0                404.55      0.00              0.05
LDA200_random_100.0              412.12      0.03              0.47

Table A4: Setup B - CLEF detailed results


Topic      #abstracts   #rel. abs.   #rel. full-text   %rel. abs.   %rel. full-text
CD008122   388          47           6                 12.11        1.55
CD012599   556          14           2                 2.52         0.36
CD008587   1872         23           11                1.23         0.59
CD010502   733          33           12                4.50         1.64
CD010213   1354         51           2                 3.77         0.15
CD008054   389          24           2                 6.17         0.51
CD012010   772          15           1                 1.94         0.13
CD011431   487          106          12                21.77        2.46
CD008892   289          22           12                7.61         4.15
CD009579   792          23           8                 2.90         1.01
CD011926   958          11           8                 1.15         0.84
CD009551   428          15           4                 3.50         0.93
CD007394   454          15           7                 3.30         1.54
CD009593   3716         35           13                0.94         0.35
CD008782   1535         12           10                0.78         0.65

Table A5: Available full-text dataset


                                 #feedback   last rel. found   #rel. found @knee   NDCG @knee   NDCG   recall @knee   MAP @knee
TFIDF_10K_BM25_ranked_0.01       1048.0      915.5             160.5               0.69         0.69   1.00           0.32
TFIDF_10K_BM25_ranked_1.0        1000.0      797.0             160.0               0.81         0.81   1.00           0.52
TFIDF_10K_BM25_ranked_100.0      1051.5      912.0             160.0               0.81         0.81   1.00           0.52
TFIDF_10K_random_0.01            1052.5      906.0             160.0               0.68         0.68   1.00           0.29
TFIDF_10K_random_1.0             1000.0      800.0             160.0               0.81         0.81   1.00           0.52
TFIDF_10K_random_100.0           1051.5      911.5             160.0               0.80         0.80   1.00           0.52
LSI200_BM25_ranked_0.01          1238.0      1052.5            155.5               0.72         0.72   0.97           0.36
LSI200_BM25_ranked_1.0           1407.0      1163.0            156.0               0.75         0.75   0.98           0.39
LSI200_BM25_ranked_100.0         1316.5      1105.0            156.0               0.75         0.75   0.98           0.39
LSI200_random_0.01               1238.0      1053.5            155.5               0.70         0.70   0.97           0.33
LSI200_random_1.0                1392.0      1161.0            156.0               0.75         0.75   0.98           0.40
LSI200_random_100.0              1447.0      1232.5            156.5               0.78         0.77   0.98           0.42
LDA200_BM25_ranked_0.01          1511.0      1328.5            152.0               0.68         0.68   0.96           0.26
LDA200_BM25_ranked_1.0           1452.5      1238.0            156.0               0.74         0.74   0.98           0.37
LDA200_BM25_ranked_100.0         1503.0      1283.5            155.0               0.74         0.74   0.97           0.36
LDA200_random_0.01               1503.0      1298.0            152.0               0.68         0.68   0.96           0.25
LDA200_random_1.0                1447.0      1233.5            156.0               0.75         0.75   0.98           0.39
LDA200_random_100.0              1503.0      1274.5            154.0               0.75         0.75   0.97           0.37

Table A6: Setup A - Deep Dynamics detailed results

                                 #feedback   last rel. found   #rel. found @knee   NDCG @knee   NDCG   recall @knee   MAP @knee
TFIDF_10K_BM25_ranked_0.01       1504.00     1459.50           60.35               0.45         0.45   0.99           0.07
TFIDF_10K_BM25_ranked_1.0        1514.05     1453.18           61.58               0.45         0.45   1.00           0.07
TFIDF_10K_BM25_ranked_100.0      1514.05     1453.90           61.55               0.44         0.44   1.00           0.07
TFIDF_10K_random_0.01            1514.05     1468.35           62.05               0.46         0.46   1.00           0.08
TFIDF_10K_random_1.0             1514.05     1452.40           61.38               0.43         0.43   0.98           0.06
TFIDF_10K_random_100.0           1514.05     1453.98           61.48               0.44         0.44   1.00           0.07
LSI200_BM25_ranked_0.01          1514.05     1456.82           61.78               0.45         0.45   1.00           0.07
LSI200_BM25_ranked_1.0           1495.70     1451.60           60.60               0.45         0.45   0.99           0.08
LSI200_BM25_ranked_100.0         1493.55     1451.10           60.58               0.46         0.46   0.99           0.08
LSI200_random_0.01               1494.42     1439.52           60.50               0.44         0.45   0.99           0.07
LSI200_random_1.0                1514.05     1469.40           61.80               0.45         0.45   1.00           0.08
LSI200_random_100.0              1514.05     1475.70           61.88               0.45         0.45   1.00           0.07
LDA200_BM25_ranked_0.01          1514.05     1435.68           61.70               0.45         0.45   1.00           0.07
LDA200_BM25_ranked_1.0           1514.05     1442.72           61.80               0.44         0.44   1.00           0.07
LDA200_BM25_ranked_100.0         1514.05     1447.78           61.68               0.44         0.44   1.00           0.06
LDA200_random_0.01               1514.05     1436.95           61.70               0.45         0.45   1.00           0.07
LDA200_random_1.0                1514.05     1444.35           61.82               0.44         0.44   1.00           0.07
LDA200_random_100.0              1504.25     1438.28           60.30               0.43         0.43   0.97           0.06

Table A7: Setup A - CLEF detailed results
