
Active Learning for Systematic Reviews

Balancing Exploitation and Exploration

Mart van der Marel 10752919

Bachelor thesis
Credits: 18 EC
BSc Future Planet Studies, Major Artificial Intelligence

University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Dr. Evangelos Kanoulas
Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam

June 30th, 2017


Abstract

This thesis investigates the use of Active Learning systems to accelerate the manual identification of relevant literature for systematic medical reviews. This predominantly manual process is continuously becoming more resource-consuming, wasting the researcher's time and funds. Recent research has introduced novel approaches focused on combining two opposing strategies, exploitation and exploration, to maximize the workload reduction. Four variations of such an approach are elaborated in this thesis and compared with two baseline Active Learning approaches on a labeled dataset of 117'927 documents. Experiments show that, compared to the baseline, no statistically significant differences are achieved; all considered approaches including the baselines reach a reduction in workload of around 35%. Optimization of the parameters indicates that the optimal scores are achieved with the parameters set to resemble the baseline approaches; only in a few cases do different parameters achieve more valuable savings. The findings indicate that too much emphasis on exploration decreases the performance, and that considering exploration is rarely of advantage. Moreover, the optimal balance between exploitation and exploration cannot be determined in general but depends on the spatial distribution of relevant and irrelevant documents in a specific dataset. Future research could investigate how the spatial features of the dataset can be exploited to determine in advance the worth of pursuing exploration.

Keywords: Active Learning, Multi Armed Bandit, Systematic review, Exploitation and Exploration

Contents

1 Introduction
2 Theoretical foundation
  2.1 Approaches
  2.2 Candidate selection strategies
  2.3 Exploitation and Exploration
  2.4 Evaluation criteria
3 Research Method
  3.1 Data description
  3.2 Software information
  3.3 Approach
    3.3.1 Step 1: Addressing the subquestions
    3.3.2 Step 2: Addressing the main research question
4 Results and Evaluation
  4.1 Overall Results
  4.2 Cases
    4.2.1 Uncertainty Sampling
    4.2.2 Probabilistic Ranking
    4.2.3 Balanced Naive MAB
    4.2.4 Unbalanced Naive MAB
    4.2.5 Balanced Forgetful MAB
    4.2.6 Unbalanced Forgetful MAB
  4.3 Optimizing parameters
5 Conclusion and discussion
  5.1 Conclusion
  5.2 Discussion
6 References
7 Appendices
  7.1 Appendix A: Dataset composition
  7.2 Appendix B: Additional Results figures


1 Introduction

Systematic reviews provide an exhaustive overview of a topic, offering a basis for prospective research. Conducting a systematic review of the scientific literature available on a topic can be very time-consuming. In order to retrieve and facilitate the identification of relevant literature for such a review, a search engine is used: a query is constructed and posed to the search engine, whereupon a list of potentially relevant documents is returned to the researcher. This list can be long and include a large number of irrelevant documents. For the researcher this leads to a waste of time and high costs associated with filtering out the relevant documents by means of manual screening. It is understandable that there is a high demand for tools that facilitate this process.

Two important factors, however, have limited the success of such tools in decreasing the workload: firstly, the skewed ratio of relevant to irrelevant documents burdens classification algorithms; secondly, high recall is needed to reduce false negatives to a minimum. In the medical domain, for example, false negatives can lead to harmful or even fatal consequences. The second factor also introduces the recall-cost tradeoff; high recall comes at the cost of having to review more documents. Accordingly, a method is required that avoids the omission of relevant documents while keeping the number of irrelevant results low.

A new approach to tackling this problem, based on the Multi Armed Bandit problem, has shown promising results deserving further investigation. In this thesis a comparative analysis of the performance of existing Active Learning and novel Multi Armed Bandit approaches is performed. The former mainly focus on exploiting the available data, while the latter add a dimension of exploration, which is hoped to improve the performance of the system once the point is reached where further exploitation of the data is unlikely to improve recall. The balance between exploitation and exploration is deemed an important factor and is also investigated. This leads to the following research question: Can a Multi Armed Bandit approach outperform baseline Active Learning in reducing the manual work required for retrieving relevant literature for systematic reviews, and what is the optimal balance between exploitation and exploration? The following subquestions will be addressed in order to answer the main research question: a. How well do baseline Active Learning approaches perform on the given dataset? b. How well do Multi Armed Bandit approaches perform on the given dataset? c. What is the optimal balance between exploitation and exploration in a Multi Armed Bandit approach?

The thesis is structured as follows: firstly, the theoretical foundation is elaborated. Secondly, the methods and approaches are described. Thirdly, the results are presented and evaluated. Lastly, the thesis is concluded and future implications are discussed.

2 Theoretical foundation

This section provides the theoretical framework for this thesis; it gives an overview of the approaches, methods and results of a selection of previous work on the topic. First, the existing approaches and methods are discussed; the similarities and differences between the approaches are noted and the applied algorithms are presented in pseudocode. Then the evaluation metrics are discussed and elaborated; their mathematical definitions are given and their usefulness in evaluating the results of this thesis is noted.

2.1 Approaches

Shemilt et al. (2016) indicate a lack of methods that facilitate the workflow of systematic reviews. According to van Rozendaal (2016), the creation of such review articles is essential in the field of evidence-based medicine; they provide a concise overview of all scientific evidence on a subject. However, due to the ever-growing amount of literature, collecting all relevant articles can be a very expensive and time-consuming task. This makes the process of identifying relevant articles increasingly resource-consuming. Methods able to facilitate the workflow of this process are in high demand, and further research on the topic is highly desirable since it is an issue that urgently needs to be addressed (van Rozendaal, 2016; Shemilt et al., 2016). Reducing the workload is in all likelihood also relevant for disciplines beyond the field of medicine and for processes other than systematic reviews.

Thus far, to gather documents for a review article, a query is constructed and posed to a search engine. A list of possibly relevant documents is returned, and the many irrelevant articles included have to be filtered out by means of manual literature screening methods. To shorten the screening process, automated Active Learning (AL) methods are emerging. The aim of these, as evaluated by Shemilt et al. (2016) and Settles (2010), is to reduce the mentioned workload on experts while keeping the value of the review high; a proportionally small number of relevant articles has to be distinguished from a large number of irrelevant ones. It is important that few to no relevant articles are discarded, whereas the number of irrelevant articles left for manual screening is reduced as much as possible. AL methods were originally conceived for classification tasks where large quantities of unlabeled data are available but labeling of the data is resource-intensive (i.e. expensive and/or time-consuming); an example is the annotation of biomedical information, a time-consuming process requiring the work of experts with domain knowledge. The aim of an active learner is to find the optimal recall-cost tradeoff. Because the algorithm iteratively interacts with an oracle (a human expert) to maximize the improvement in classification at each iteration, it is assumed to be able to optimize its classification hypothesis while requiring a significantly smaller number of labeled datapoints. Shemilt et al. (2016) apply a cost-effectiveness analysis to compare the efficiency of four literature screening methods: three manual methods (double screening, safety first, and single screening) and one method that applies AL to classify the literature.

In AL, a subdiscipline of machine learning in artificial intelligence, a classifier is trained on a training set. This set can be generated in different ways, provided that it contains at least one document for each class (irrelevant and relevant): a. documents can be posed to the oracle for their label until the condition is met; due to the skewed relation between the two classes, this method can result in a very large initial training set. b. the documents queried to the oracle for a label can be selected at random until the condition is met (Cormack & Grossman, 2014); similarly to a., this method can result in a very large initial training set. c. the documents can be ranked according to their similarity to the original query before applying method a. This last method has been shown to reduce the training set generation effort and is used for the experiments in this thesis. It is noted by Cormack and Grossman (2014) that using a non-random training set generation method significantly improves the performance of the algorithm. After the training set has been prepared, an algorithm selects a candidate datapoint from the remaining dataset to be queried to the oracle. This selection can be based on the strategy of choosing the particular unlabeled document in the dataspace from which, if its classification was known, the algorithm expects to learn the most; however, there are also other strategies. The following candidate selection strategies will be discussed later in this section: Uncertainty Sampling, Probabilistic Ranking and Kernel Farthest-first. The oracle evaluates the selected document and provides a label, 1 for relevant, 0 for irrelevant, and the document is subsequently added to the training set with its label. The classifier is then trained anew on the updated training set, allowing it to refine its classification hypothesis over the dataset (Settles, 2010; Cormack & Grossman, 2014; Lewis & Gale, 1994). This iterative, interactive process continues until all relevant documents have been found.

The dataset used in this thesis provides labels for all documents, allowing the algorithm to infer when this point is reached. In practice, however, it cannot be known whether or not all relevant documents have been found. As suggested by Shemilt et al. (2016), the process could simply be stopped at a point where the percentage of unidentified relevant documents is expected to be below a certain threshold. At this point the cost of continuing the process is assumed to be higher than the expected benefits to recall (Cormack & Grossman, 2014; Shemilt et al., 2016). In their experiments, Shemilt et al. (2016) note that the AL method is not able to correctly classify all the relevant literature, compared to the manual screening methods; it is indicated that the method reaches on average 95% recall at the point where 36% of the literature has been manually classified. Recall is the percentage of relevant articles in the dataset that have successfully been identified as such. Inversely, this signifies that 64% fewer articles have to be reviewed and thus considerably less time is spent on manual screening. The manual and AL methods represent two sides of the tradeoff between maximizing recall and minimizing workload. Reducing the workload further is the goal of new Active Learning methods.

Shemilt et al. (2016) subsequently analyze the incremental cost-effectiveness ratio of the three manual methods in comparison to the performance of the AL method as a baseline; this measure indicates the incremental cost of a method for correctly classifying an article that was wrongly classified as irrelevant, i.e. correcting a false negative, by the AL method. It amounted to between £832 and £6709 per article for the three manual methods. It is argued that the 5% of relevant articles that was not correctly recalled by the AL method was of insignificant importance to the overall review that was conducted as a case study for the analysis. Automatic screening methods, it is concluded, are therefore able to reduce the time spent on literature classification. The authors encourage future research aimed at improving AL algorithms to further reduce the workload and substantiate their findings (Shemilt et al., 2016).

In a study by Cohen et al. (2006), an automated screening method is also reported to successfully reduce the number of documents requiring manual review. The authors argue that this demonstrates the usefulness of automated classification. However, it is also pointed out that further research into the topic is required, especially with regard to the refinement of the classification system and the integration of automatic classification into the systematic review process.

Algorithm 1: Uncertainty Sampling

1 Create training set;
2 Train initial classifier;
3 while Threshold not reached do
4   Apply classifier to test data;
5   Query datapoint d closest to the decision boundary to the oracle;
6   Add labeled d to training set;
7   Train classifier on new training set;
8 end

2.2 Candidate selection strategies

As stated previously, this subsection discusses the three candidate selection strategies. A first strategy for finding the most informative unlabeled datapoint is Uncertainty Sampling (US), as coined by Lewis and Gale (1994) and elaborated by Settles (2010). It queries the datapoint closest to the current classification hypothesis boundary; this is the document about which the classifier is most uncertain (see Algorithm 1). US is a purely exploitative strategy, basing its decision on the information provided by the training set, and is the first baseline used in this thesis. The author observes that in most examined cases, US is able to reduce the amount of manual screening required (Settles, 2010).
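The selection step of US can be sketched in a few lines of Python with sklearn. The snippet below is an illustrative sketch under assumptions, not the thesis code; the choice of classifier and all variable names are hypothetical.

# Illustrative sketch of US candidate selection (hypothetical names).
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_train, y_train, X_pool):
    """Return the index of the pool document closest to the decision boundary."""
    clf = LogisticRegression().fit(X_train, y_train)
    # decision_function gives signed distances to the separating hyperplane;
    # the least certain document has the smallest absolute margin.
    margins = np.abs(clf.decision_function(X_pool))
    return int(np.argmin(margins))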

Algorithm 2: Probabilistic Ranking

1 Create training set;
2 Train initial classifier;
3 while Threshold not reached do
4   Apply classifier to test data;
5   Query datapoint d with highest probability of being relevant to the oracle;
6   Add labeled d to training set;
7   Train classifier on new training set;
8 end

Cormack and Grossman (2014) elaborate Probabilistic Ranking (PR) as a second candidate selection strategy. This approach for identifying the most useful document to query to the oracle ranks the unlabeled documents by their probability of being relevant; the top scorer is then posed to the oracle (see Algorithm 2). PR is an exploitative strategy as well, relying solely on the information provided by the training set. It is used as the second baseline algorithm in this thesis. In their experiments, Cormack and Grossman (2014) show that US is generally outperformed by PR.

Baram et al. (2004) introduce a third candidate selection strategy. In Kernel Farthest-first (KF), the selection of a document for querying to the oracle is based on the distance between an unlabeled document and the set of labeled documents. The unlabeled document which is farthest from the least distant labeled point is presented to the oracle for querying (see Algorithm 3).

Algorithm 3: Kernel Farthest-first

1 Create training set;
2 Train initial classifier;
3 while Threshold not reached do
4   Apply classifier to test data;
5   Query datapoint d with furthest minimum distance to all labeled datapoints to the oracle;
6   Add labeled d to training set;
7   Train classifier on new training set;
8 end

Contrary to US and PR, this strategy does not rely on exploitation of the current classifier but aims at exploring the dataset by selecting candidates at the maximum spatial distance from all currently labeled datapoints. Osugi et al. (2005) note that in US and PR, the exploration of samples in regions at a distance from the boundary is not undertaken. In this thesis, KF is used to carry out exploration in the Multi Armed Bandit approaches introduced in the following section.
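For completeness, the following sketch shows the corresponding selection rules for PR and KF, which later serve as the exploitation and exploration arms in this thesis. It is again only an illustration under assumed names; clf is any fitted sklearn classifier exposing predict_proba, and class 1 is taken to be "relevant".

# Illustrative sketches of PR (exploitation) and KF (exploration) selection.
import numpy as np
from sklearn.metrics import pairwise_distances

def probabilistic_ranking(clf, X_pool):
    """Index of the unlabeled document with the highest estimated probability of relevance."""
    p_relevant = clf.predict_proba(X_pool)[:, 1]   # assumes class 1 = relevant
    return int(np.argmax(p_relevant))

def kernel_farthest_first(X_pool, X_labeled):
    """Index of the unlabeled document whose nearest labeled document is farthest away."""
    d = pairwise_distances(X_pool, X_labeled)      # shape (n_pool, n_labeled)
    return int(np.argmax(d.min(axis=1)))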

2.3 Exploitation and Exploration

The three AL candidate selection strategies presented above focus on either exploitation or exploration of the data. Focusing solely on exploring the data would not use the information provided by the training set (Bouneffouf et al., 2014), while concentrating solely on exploitation could result in relevant samples far from the hypothesis boundary being found only after a very large percentage of irrelevant documents has been screened (Baram et al., 2004). Because of this, applying only one of these two strategies is assumed to generalize poorly to complex and extensive data (Bouneffouf et al., 2014).

To address this exploitation/exploration dilemma in AL algorithms, Bouneffouf et al. (2014), Hofmann et al. (2011) and Osugi et al. (2005) propose a dynamic balance between exploitation and exploration. This is motivated by the assumption that a combination of the strategies might outperform individual strategies across different datasets, as the weaknesses of individual algorithms are avoided.

Algorithm 4: Biased Coin

1 Create training set;
2 Train initial classifier;
3 while Threshold not reached do
4   Apply classifier to test data;
5   Flip a biased coin to select exploitation or exploration;
6   Query datapoint d selected by the chosen strategy to the oracle;
7   Add labeled d to training set;
8   Train classifier on new training set;
9   Adapt coin bias according to change in hypothesis;
10 end

A first such approach is proposed by Osugi et al. (2005): at each round, a random but biased choice between exploitation and exploration is made. The change of the hypothesis boundary is then used to adapt the probability of the two options in the subsequent round. When exploitation is chosen, a candidate sample is selected by PR; when exploration is chosen, a sample selected by KF is queried (see Algorithm 4). For the experiments in this thesis, modified versions of this algorithm will be used; different initial bias values and reward functions for changing this value will be studied in order to investigate the importance of balancing exploitation and exploration.
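A single iteration of this biased-coin scheme can be sketched as follows; the helper is hypothetical, and the two callables stand for any exploitation and exploration selection rule (for instance, wrappers around the PR and KF sketches above).

# Sketch of one biased-coin iteration (hypothetical helper).
import random

def biased_coin_select(p, exploit_fn, explore_fn):
    """Flip a coin with exploration probability p and delegate candidate selection."""
    explore = random.random() < p
    idx = explore_fn() if explore else exploit_fn()
    return idx, explore   # the chosen arm is needed afterwards to adapt the bias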

Osugi et al. (2005) highlight that this approach can be useful in cases where there are multiple scattered clusters of relevant documents that are easily misclassified with the conventional approach, such as exclusive OR (XOR) data (see figure 1). It is expected that complex datasets contain multiple scattered clusters and should thus be more easily classified with this approach.

Figure 1: An exclusive OR (XOR) problem. Taken from Osugi et al. (2005), page 2.

Algorithm 5: Strategies as Arms

1 Create training set;
2 Set the three strategies as arms;
3 Create reward variable for each arm;
4 Train initial classifier;
5 while Threshold not reached do
6   Apply classifier to test data;
7   Select arm based on previous rewards and preference for exploration;
8   Query datapoint d selected by the chosen strategy to the oracle;
9   Add labeled d to training set;
10  Train classifier on new training set;
11  Adapt reward for the chosen arm according to change in hypothesis;
12 end

Another approach, by Baram et al. (2004), frames the exploitation/exploration dilemma as a Multi Armed Bandit (MAB) problem. In a MAB problem the algorithm has to select an "arm" of the MAB at each iteration; the goal is to select the arm which will give the greatest reward. With each iteration, the algorithm can improve its estimates of the reward of an arm; the approximation of the reward given by a specific arm is obtained by averaging the rewards over the times the arm was selected. This way the algorithm learns which arm to select and can increasingly better predict which arm to choose next in order to maximize the reward (see Algorithm 5).

In their experiments, the authors use the three candidate selection strategies discussed above as arms: US, PR and KF. The performance of this ensemble of strategies is shown to be capable of competing with the best individual AL algorithm across different datasets (Baram et al., 2004).
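The reward bookkeeping behind Algorithm 5 can be illustrated as follows. The sketch keeps a running mean reward per arm and uses a simple epsilon-greedy rule to express the "preference for exploration"; Baram et al. (2004) use a more involved selection rule, so this is a simplified, hypothetical stand-in.

# Simplified sketch of per-arm reward bookkeeping with an epsilon-greedy choice.
import random

class Arm:
    def __init__(self, name):
        self.name, self.total_reward, self.pulls = name, 0.0, 0

    def mean_reward(self):
        # running average of the rewards observed when this arm was selected
        return self.total_reward / self.pulls if self.pulls else 0.0

    def update(self, reward):
        self.total_reward += reward
        self.pulls += 1

def select_arm(arms, epsilon=0.1):
    if random.random() < epsilon:                    # preference for exploration
        return random.choice(arms)
    return max(arms, key=lambda a: a.mean_reward())  # exploit the best estimate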

Similarly to Baram et al. (2004), Bouneffouf et al. (2014) frame the exploitation/exploration dilemma as a MAB problem. The authors note a deficiency in previous experiments where this was attempted: in the previous algorithms the features of the dataspace were not taken into account. Bouneffouf et al. (2014) argue that this is a drawback, as these features, such as the number or density of datapoints in an area and the ratio of classes, could provide information on the class of unlabeled datapoints. In order to overcome this limitation, Bouneffouf et al. (2014) suggest a Contextual Bandit Model. In this model, clusters of datapoints represent the arms of the MAB. For each arm, the spatial features of the datapoints it contains are considered its context. The Contextual Bandit can access the context of each arm from the first round, allowing it to learn how the context and reward of an arm are linked, and possibly better predict which arm to choose in order to maximize the reward (see Algorithm 6).

Algorithm 6: Clusters as Arms

1 Create training set;
2 Set data-clusters as arms;
3 Create reward variable for each arm;
4 Train initial classifier;
5 while Threshold not reached do
6   Apply classifier to test data;
7   Select arm based on previous rewards, spatial context, and preference for exploration;
8   Query datapoint d in the selected cluster to the oracle;
9   Add labeled d to training set;
10  Train classifier on new training set;
11  Adapt reward for the chosen arm according to change in hypothesis;
12 end

The Contextual Bandit is applied to classifying vocal utterances and is evaluated against state-of-the-art methods, outperforming US. Bouneffouf et al. (2014) conclude that the superiority to US illustrates the usefulness of addressing exploration as well as exploitation. Finally, their algorithm, ATS, also outperforms a MAB algorithm which does not consider context. The Contextual Bandit Model, it is concluded, can greatly improve the results (Bouneffouf et al., 2014).

Another approach is taken by Hofmann et al. (2011): the authors implement an algorithm that attempts to balance the two sides of the dilemma by introducing a weight factor w attributed to each feature in the dataspace. The ranking of the documents is based on the score S, calculated as S = W · X, where W is a set of weights and X a feature vector containing a value for each document describing its relation to the query q. Consequently, a dueling MAB algorithm performs two classifications: first with the original weights W, and second with a perturbed set of weights W′. If the latter improves the performance, then W′ is adopted as the new weight set W; if not, the original weight set is kept (see Algorithm 7). Hofmann et al. (2011) indicate that this approach has been shown to greatly improve performance.

Algorithm 7: Weighted Features as arms

1 Create training set;
2 Set features in the data as arms;
3 Create reward variable for each arm;
4 Create weight variable for each arm;
5 Train initial classifier;
6 while Threshold not reached do
7   Apply classifier to test data;
8   Create a perturbed weight variable for each arm;
9   Run the dueling MAB, each MAB using different weights;
10  For each MAB, select arm based on previous rewards, weights, and preference for exploration;
11  For each MAB, query datapoint d with highest score S from the selected arm to the oracle;
12  For each MAB, add labeled d to a temporary training set;
13  For each MAB, train classifier on the temporary training set;
14  Compare the performance of the two classifiers;
15  Adapt best classifier, and related weight and training set;
16 end
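The weight-perturbation duel at the core of this approach can be sketched as follows; the names, the perturbation scheme and the evaluation callback are assumptions made for illustration, not the exact procedure of Hofmann et al. (2011).

# Sketch of a single weight-perturbation duel; scores are S = X @ W and the
# better-performing weight vector is kept.
import numpy as np

def perturb(W, delta=0.1):
    direction = np.random.normal(size=W.shape)
    return W + delta * direction / np.linalg.norm(direction)

def duel(W, X, evaluate):
    """evaluate(scores) -> performance estimate; return the winning weight vector."""
    W_prime = perturb(W)
    return W_prime if evaluate(X @ W_prime) > evaluate(X @ W) else W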

2.4 Evaluation criteria

From the literature review emerge a number of evaluation measures that are also utilized for the evaluation of the experiments in this thesis. van Rozendaal (2016) investigates how a search engine can retrieve all relevant literature while keeping the amount of irrelevant literature low. As discussed above, preference is given to recalling an irrelevant document over discarding a relevant document. Hence, the loss of relevant literature is weighed more heavily than the erroneous recall of irrelevant literature, at a rate of 10:1. To reflect this in the evaluation, the metric Fβ, the weighted harmonic mean of recall and precision, is used with β = 10. It is defined as follows:

F_\beta = \frac{(1+\beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}, \quad \text{with } \beta = 10

where precision indicates what proportion of the recovered documents is relevant and recall indicates what proportion of all relevant documents in the dataset has been found. Precision and recall are defined as follows (O'Mara-Eves et al., 2015):

\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. Consequently, Fβ can also be defined in terms of TP, FP and FN (O'Mara-Eves et al., 2015):

F_\beta = \frac{(1+\beta^2) \cdot TP}{(1+\beta^2) \cdot TP + \beta^2 \cdot FN + FP}, \quad \text{with } \beta = 10

Another metric used is Average Precision (AP); it is the average value of precision (p) as a function of recall (r), and is used to evaluate the performance of a ranked list of documents as produced by an AL system. It is defined as follows:

AP = \int_0^1 p(r)\, dr

For the analysis of their experiments, Cohen et al. (2006) introduce a third evaluation metric that expresses the amount of Work Saved over Sampling (WSS). The value of this metric corresponds to the percentage of documents that is saved from manual review and indicates the percentage of unqueried documents that is correctly classified by the algorithm as irrelevant (Cohen et al., 2006). It is calculated as follows:

WSS100 = \frac{TN + FN}{N} - 1 + \frac{TP}{TP + FN}

where N is the total number of documents and the number 100 indicates that WSS is measured at 100% recall. In their analysis, Cohen et al. (2006) report the results of the algorithm at a 95% recall threshold; if the documents to be queried were selected by random sampling instead of by means of the algorithm, 95% of the documents would have to be reviewed manually to reach 95% recall. Hence, for the algorithm to be useful, it should outperform the 5% of work saved by random sampling at this threshold. The adapted WSS metric represents the fourth evaluation metric and is defined as follows (O'Mara-Eves et al., 2015):

WSS95 = \frac{TN + FN}{N} - 0.05

When evaluating the performance of a system over multiple datasets, the Mean scores, indicated by a capital M, are calculated. The Mean of a metric x over Q runs is defined as follows:

M_x = \frac{1}{Q} \sum_{q=1}^{Q} x(q)

These metrics are used in this thesis to evaluate the results.
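As an illustration, the metrics above can be computed from a ranked list of labels (in the order a system queried the documents) roughly as follows; the helper names are hypothetical and the sketch ignores edge cases such as empty classes.

# Sketch of the evaluation metrics for a ranked list of labels
# (1 = relevant, 0 = irrelevant) in the order the documents were queried.
import numpy as np
from sklearn.metrics import average_precision_score

def f_beta(tp, fp, fn, beta=10):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def wss(y_ranked, recall_level=1.0):
    """Work Saved over Sampling at the given recall level (Cohen et al., 2006)."""
    y = np.asarray(y_ranked)
    needed = int(np.ceil(recall_level * y.sum()))
    n_reviewed = int(np.argmax(np.cumsum(y) >= needed)) + 1  # documents screened
    return (len(y) - n_reviewed) / len(y) - (1.0 - recall_level)

def average_precision(y_ranked):
    """AP of the ranked list; earlier positions receive higher scores."""
    return average_precision_score(y_ranked, -np.arange(len(y_ranked)))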

3 Research Method

This section elaborates on the research method. After introducing the dataset and software utilized, the approach is summarized and then discussed in detail.

3.1 Data description

A labeled dataset on 39 medical topics concerning drug-to-drug synergies to fight tumors is available for the experiments in this thesis. The dataset contains a large number of article titles and abstracts for each of the 39 topics. The list of articles for each topic is the result of a query constructed by an expert and posed to the PubMed Central search engine. The articles have been labeled as relevant or irrelevant by a reviewer, and a table containing the number of relevant and irrelevant documents and the percentage of relevant documents for each of the 39 topics can be found in Appendix A. The total number of documents amounts to 117'927, of which 115'610 are irrelevant and 2'317 relevant (1.96%).

3.2 Software information

Jupyter Notebook running Python 3.5.2 is used for the methodological and visualization elements of this thesis. Essential libraries are sklearn for data processing, machine learning, and evaluation, numpy for array representation and operations, and matplotlib for visualizations. Additional libraries are scipy for scientific computing, os for file management, random for random number generation, and BeautifulSoup for XML parsing.

3.3 Approach

As suggested in the research question, the purpose of this thesis is to investigate the differences in performance between the baseline AL and the MAB algorithms with regard to facilitating the reviewing process, and to investigate the balance between exploitation and exploration in a MAB approach. Based on the methods, suggestions and shortcomings noted in the above literature review, the following primary research question has been formulated: Can a Multi Armed Bandit approach outperform baseline AL in reducing the manual work required for retrieving relevant literature for systematic reviews, and what is the optimal balance between exploitation and exploration? The following subquestions are addressed in order to answer this main research question: a. How well do baseline Active Learning approaches perform on the given dataset? b. How well do Multi Armed Bandit approaches perform on the given dataset? c. What is the optimal balance between exploitation and exploration in a Multi Armed Bandit approach?

In short, the following two steps portray the approach taken to answer the sub- and main questions; both steps are discussed in more detail hereafter.

Step 1) To answer subquestions a and b, for each of the 6 algorithms a system is created to perform the following operations for all topics in the dataset:

i Read in and pre-process the title, abstract and label of each article in the topic,

ii Encode the pre-processed data so that it can be used by the algorithms,

iii Run the algorithm on the data,

iv Evaluate the list of ranked documents as returned by the algorithm.

The evaluation is augmented with an optimization of the parameters in a MAB system in order to be able to address subquestion c.

Step 2) After the first step, the evaluation results and answers to the subquestions are available, and the main research question is answered by carrying out a comparative analysis of the results.

3.3.1 Step 1: Addressing the subquestions

In order to observe the performance of the different algorithms and answer subquestions a and b, the operations introduced above are carried out as follows.

(i) Firstly, for each algorithm, the data on all topics is passed through a preprocessor: the data is read in, the title and abstract of each document are coupled into a single line of text, the unique ID of the document is noted, and the corresponding classification label is documented.

(ii) Secondly, the text line containing title and abstract is encoded by means of a term frequency-inverse document frequency (tf-idf) method. Tf-idf transforms the text lines of all documents simultaneously into an array of numerical values expressing, for each unique textual element in a given text line, its importance over the complete set of text lines. Considering the size of the dataset, the dimension of this array is reduced to one hundred, considerably reducing the computation time of the classification operation. The accompanying loss of information is considered insignificant.
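A minimal sketch of this encoding step is given below. Truncated SVD is used here as one plausible way to reduce the tf-idf matrix to one hundred dimensions; the thesis does not state which reduction method was applied, and the documents variable is a hypothetical list of (title, abstract) pairs.

# Sketch of the encoding step: combine title and abstract, tf-idf encode, and
# reduce to 100 dimensions (truncated SVD is an assumed choice of reduction).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# `documents` is a hypothetical list of (title, abstract) string pairs for one topic
texts = [title + " " + abstract for title, abstract in documents]
X_tfidf = TfidfVectorizer().fit_transform(texts)            # sparse term-document matrix
X = TruncatedSVD(n_components=100).fit_transform(X_tfidf)   # dense 100-dimensional encoding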

(iii) Thirdly, the algorithm is run on the data. In all six systems the training set is created as follows: the encoded documents are ranked according to their cosine distance to the original query; starting from the most similar, they are queried to the oracle and added to the training set until it contains at least one document for each class. Next, an initial classifier is trained on the training set. From this point on, the process differs for the baseline AL and MAB algorithms. For the baseline AL algorithms it proceeds as follows: first, the classifier is applied to the remaining datapoints, classifying documents as relevant or irrelevant; second, the corresponding candidate selection strategy selects a datapoint that is queried for a label to the oracle; third, this datapoint is removed from the unlabeled dataset and added with its label to the training set; and fourth, the classifier is retrained on the new training set. These four steps are repeated until all relevant documents have been found. See figure 2 for an illustrative visualization of the classification process.
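The baseline loop can be sketched as follows for the PR strategy. This is an illustrative reconstruction under assumptions: a logistic regression classifier is used (the thesis does not name its classifier), labels are a 0/1 numpy array, and all helper names are hypothetical.

# Sketch of the baseline AL loop with a cosine-seeded training set and PR selection.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

def seed_training_set(X, y, query_vec):
    """Query documents in order of similarity to the query until both classes occur."""
    order = np.argsort(-cosine_similarity(X, query_vec.reshape(1, -1)).ravel())
    for i in range(len(order)):
        labeled = list(order[:i + 1])
        if len(set(y[labeled])) == 2:      # at least one relevant and one irrelevant
            return labeled
    return list(order)

def baseline_al(X, y, query_vec):
    labeled = seed_training_set(X, y, query_vec)
    pool = [i for i in range(len(y)) if i not in labeled]
    while y[labeled].sum() < y.sum():      # stop once every relevant document is found
        clf = LogisticRegression().fit(X[labeled], y[labeled])
        idx = pool[int(np.argmax(clf.predict_proba(X[pool])[:, 1]))]  # PR selection
        labeled.append(idx)                # the oracle reveals the label y[idx]
        pool.remove(idx)
    return labeled                         # documents in the order they were queried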

For the MAB approaches, the following steps are repeated until all relevant documents have been found. First, it is decided whether to exploit or explore: if a randomly generated number is greater than threshold p, the exploitation arm is chosen; if it is smaller, the exploration arm is chosen. Second, depending on the arm chosen, the algorithm applies either PR (exploitation) or KF (exploration) to select a datapoint to be queried for its label to the oracle; PR was selected instead of US as it was shown to be superior in the research discussed in the theoretical framework. Third, this datapoint is removed from the unlabeled dataset and added with its label to the training set. Fourth, the classifier is retrained on the new training set. Fifth, and differently from the baseline AL process, the change in recall is noted. Sixth, if the change is positive, the threshold p is modified according to a set rate of change r so that selecting the successful arm becomes more probable in the next iteration. If there is no change in recall, p is modified in one of the following two manners, each giving form to an alternative MAB system utilized in this thesis:

1. In a Balanced MAB algorithm, the rate of change function is symmetric: if successful, the arm is rewarded; if not successful, the arm is punished, both according to r. In addition, the initial probability p of selecting either arm is equal.

2. In an Unbalanced MAB algorithm, the rate of change function is not symmetric: if exploitation is successful, the arm is rewarded according to r, but it is not punished if it fails to improve recall. If exploration is successful, it is also rewarded; however, it is punished if it fails, again according to r. In addition, the initial value of p is set at 0.01. This asymmetry is an attempt to favor exploitation over exploration in the initial iterations with a small p, as well as in later iterations by not punishing exploitation.

This method for the MAB algorithm is inspired by Osugi et al. (2005): at each iteration, a random but biased choice between exploration and exploitation is made, emulating the two arms in a 2-armed Bandit.
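A minimal sketch of the two reward schemes just described is given below, where p is the probability of choosing the exploration arm, r the rate of change, and explored records which arm produced the observed change in recall; the function names are hypothetical.

# Sketch of the Balanced and Unbalanced update rules for the exploration probability p.
def update_balanced(p, r, explored, improved):
    """Symmetric: the chosen arm is rewarded on success and punished on failure."""
    if explored:
        return min(1.0, p + r) if improved else max(0.0, p - r)
    return max(0.0, p - r) if improved else min(1.0, p + r)

def update_unbalanced(p, r, explored, improved):
    """Asymmetric: exploitation is never punished; exploration is punished on failure."""
    if explored:
        return min(1.0, p + r) if improved else max(0.0, p - r)
    return max(0.0, p - r) if improved else p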

Figure 2: Visualization of the classification process; snapshots before, during and at the end of the process displaying True Positives, False Positives, True Negatives and False Negatives. The 100-dimensional data is reduced to two dimensions in order to draw the plot.

In their experiments, Osugi et al. (2005) make use of the Balanced MAB approach; the Unbalanced approach is introduced in order to provide a way to investigate subquestion c. Another variation introduced to the MAB systems in this thesis is Forgetfulness: in a Forgetful approach, contrary to the Naive approach sketched above, the classifier forgets most of the training set whenever exploration is successful; the training set is forgotten except for the 10 nearest previously labeled documents. This way the classification of the data is distorted towards the newly explored relevant document. It is assumed that once a cluster of relevant documents has been exploited and a different relevant document is found by means of exploration, it is located in a fresh cluster of relevant documents. Forgetting most of the previously labeled data ideally accelerates the process of discovering the other relevant documents in the new cluster. The introduction of Forgetfulness leads to two additional MAB approaches, resulting in a total of four different MAB approaches that will be evaluated in this thesis (see table 1).

Algorithm  Reward       Memory
BN         Balanced     Naive
BF         Balanced     Forgetful
UN         Unbalanced   Naive
UF         Unbalanced   Forgetful
US         (baseline AL)
PR         (baseline AL)

Table 1: Four possible MAB systems emerge from the combination of the different Reward and Memory approaches. Included for completeness are the two baseline AL systems.
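The Forgetful reset can be sketched as follows; the helper is hypothetical and simply keeps the newly explored document plus its 10 nearest previously labeled neighbours.

# Sketch of the Forgetful reset after a successful exploration step.
import numpy as np
from sklearn.metrics import pairwise_distances

def forget(labeled, X, new_idx, keep=10):
    others = [i for i in labeled if i != new_idx]
    d = pairwise_distances(X[[new_idx]], X[others]).ravel()
    nearest = [others[j] for j in np.argsort(d)[:keep]]
    return [new_idx] + nearest             # the reduced training set indices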

(iv) Finally, the fourth operation is to evaluate the list of ranked documents as returned by the algorithm. For each of the six systems the following evaluation metrics, discussed in the theoretical framework, are calculated:

• WSS100,
• WSS95,
• AP,
• F10.

To obtain an indication of the mean WSS100, WSS95, AP and F10 over all topics, their means MWSS100, MWSS95, MAP and MF10 are displayed where appropriate. These measures provide the information necessary to answer subquestions a and b.

In order to provide more detailed insight into the balancing of exploitation and exploration required for producing an answer to subquestion c, an optimization of the two parameters in the Reward approach, the initial probability of exploration (p0) and the rate of change (r), is performed. For p0, the values 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1 are tested. For r, the values tested are 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75 and 0.99. This optimization of p0 and r will be discussed to extract insights into the balancing of exploitation and exploration.

3.3.2 Step 2: Addressing the main research question

Based on the answers to the subquestions and the evaluation results of the AL baseline and MAB approaches, the performance of the six systems can be compared. The results of these comparisons will determine the answer to the main research question of whether a Multi Armed Bandit approach can outperform baseline AL in reducing the manual work required for retrieving relevant literature for systematic reviews, and how exploitation and exploration should optimally be balanced.

The main research question will be answered as follows: the overall results of the two baseline AL systems are compared with each other to validate the superiority of PR over US, and the overall results of the former are compared to the results of the four MAB systems. The performances of all systems on all 39 topics will be analyzed and the differences in AP will be visualized over the entire range of topics. ∆AP is defined as follows:

\Delta AP_{AB} = AP_A - AP_B

where A and B are two different systems. The p-value produced by a statistical t-test is added to enhance the value of the results and clarify the significance of differences in the performance of two systems.
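The comparison can be sketched as follows; a paired t-test over the per-topic AP scores is assumed here, and the helper name is hypothetical.

# Sketch of the per-topic comparison: AP differences plus a paired t-test.
import numpy as np
from scipy import stats

def compare_systems(ap_a, ap_b):
    """ap_a, ap_b: per-topic AP scores of systems A and B."""
    delta_ap = np.asarray(ap_a) - np.asarray(ap_b)
    t_stat, p_value = stats.ttest_rel(ap_a, ap_b)
    return delta_ap, p_value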

Moreover, to elaborate on each system and investigate the effectiveness of the MAB algorithms compared to the AL baseline algorithms, a topic on which each system excels will be presented. For this purpose, a two-dimensional spatial distribution of the documents comprising the topic will be visualized, supported where appropriate by a graph plotting the recall of all six systems at each iteration.

4 Results and Evaluation

In this section, the overall results are presented and evaluated before a number of selected cases is examined in detail. Afterwards, the process of finding the optimal parameter values for a MAB system is discussed. The findings allow the formulation of answers to the subquestions, which are used to answer the main research question.

4.1 Overall Results

Running each of the six Active Learning systems on all 39 topics produces the results in table 2 and figure 3.


Figure 3: Boxplots depicting the scores over all 39 topics including training set. Red squares indicate the mean values, red lines indicate the median values, boxes include lower and upper quartiles, whiskers show the range of the data, and outliers are marked by crosses.

          US     PR     BN     BF     UN     UF
MWSS100   34.72  35.1   33.54  35.22  35.2   34.64
MWSS95    58.42  56.51  52.85  50.3   56.41  56.42
MAP       14.11  10.48  7.68   7.93   10.43  10.21
MF10      66.96  66.96  66.14  66.45  67.02  66.88

Table 2: Mean scores over all 39 topics including training set.

The ranked lists of documents used to calculate these measures include the training sets. Overall, the systems performed very similarly with regard to MWSS100; the BF MAB slightly outperformed UN and PR, and the MWSS100 of US, UF and BN were a little lower. On MWSS95 and MAP, US outperformed PR, UF and UN, while BN and BF had lower MWSS95. Lastly, on MF10, UN outperformed its competitors, however by a very small lead.

          US     PR     BN     BF     UN     UF
MWSS100   38.31  40.26  39.38  31.18  37.31  37.27
MWSS95    62.82  61.44  60.06  50.94  60.63  61.28
MAP       30.56  24.35  22.08  18.34  24.07  24.0
MF10      72.48  72.79  71.98  69.26  72.11  71.97

Table 3: Mean scores over all 39 topics starting at the first relevant document.

When the training set is not taken into consideration during the evaluation, meaning that the list of ranked documents used to evaluate an approach excludes the documents in the training set and starts only at the first relevant document found, the results are higher (see table 3). In this situation, PR on average outperformed BN and US on MWSS100; the Unbalanced MABs score slightly lower, while BF scores somewhat lower still. On MWSS95 and MAP, US again outperformed PR and all four MABs. BF achieves a much lower MWSS95 than the other MAB systems. PR scored highest on MF10, however by a very small lead over most of its competitors. The boxplots containing WSS100, WSS95, AP and F10 scores for all topics, starting from the first relevant document found, can be found in Appendix B.

Figure 4:∆ Average Precision per Topic with p-value from a t-test between the two algorithms. US plotted against PR.

To allow a more detailed comparison of the different algorithms, figures 4 and 5 visualize the ∆ Average Precision between the chosen algorithms on each individual topic and include the p-value of a statistical t-test between the results of the considered systems.


Figure 5: ∆ Average Precision per Topic with p-value from a t-test between the two algorithms. PR plotted against BN, UN, BF and UF.

As can be seen in the first figure, US consistently though not significantly outperforms PR. The following figure plots PR against the two Naive and the two Forgetful MABs; while BN is consistently but not significantly outperformed, UN performs very similarly. Likewise, PR consistently but not significantly outperforms BF, while it performs similarly to UF.

4.2 Cases

In this subsection, for each approach a case is discussed in which it outperforms its competitors on a specific topic. The scores for these topics are provided in table 4. A spatial distribution of the documents reduced to two dimensions and the recall of all six algorithms at each iteration during the classification process are given where appropriate to improve understanding of the MAB systems.

Topic 1   US     PR     BN     BF     UN     UF
WSS100    0.0    1.82   16.05  37.88  2.08   2.66
WSS95     1.06   1.6    32.56  38.36  3.29   3.1
AP        0.95   0.96   0.91   1.03   0.98   0.96
F10       41.91  42.38  46.38  53.88  42.44  42.6

Topic 2   US     PR     BN     BF     UN     UF
WSS100    84.64  85.52  76.07  76.9   85.14  85.32
WSS95     79.96  82.03  73.01  73.79  81.89  81.86
AP        6.32   6.95   3.8    4.1    6.75   6.79
F10       89.17  89.77  83.76  84.25  89.51  89.64

Topic 17  US     PR     BN     BF     UN     UF
WSS100    16.5   16.12  30.37  23.42  16.52  18.14
WSS95     58.19  46.84  61.55  48.41  52.34  53.08
AP        42.48  37.56  34.03  30.66  37.24  37.95
F10       92.09  92.05  93.5   92.78  92.09  92.25

Topic 39  US     PR     BN     BF     UN     UF
WSS100    58.99  54.49  31.35  31.69  51.85  53.43
WSS95     68.03  61.4   47.3   48.26  61.01  60.51
AP        6.49   5.12   3.61   3.65   5.06   4.96
F10       89.91  88.86  83.8   83.87  88.26  88.62

Topic 49  US     PR     BN     BF     UN     UF
WSS100    62.39  62.33  54.97  40.11  62.33  64.78
WSS95     71.77  66.04  71.67  38.42  66.04  72.47
AP        2.7    2.01   1.32   1.28   2.01   2.03
F10       44.6   44.56  40.18  33.52  44.56  46.84

Table 4: Scores on topics 1, 2, 17, 39 & 49 (including training set).

4.2.1 Uncertainty Sampling

In most cases, the US algorithm outperforms its competitors. Figure 6 shows the two-dimensional spatial distribution of the documents for one such case. As can be seen in table 4, US scores highest on all measures for topic 39. On WSS100, WSS95 and AP the margin is considerable, while the F10 score is similar to that of the other systems. It can be noted that the Balanced MABs perform substantially worse on this dataset, an indication that exploration is not useful in discovering the relevant documents.


Figure 6: Two dimensional spatial distribution of the documents in topic 39

4.2.2 Probabilistic Ranking

Contrary to what emerged from the literature review, on average PR neither consistently nor significantly outperforms US. This can be related to differences in data preprocessing, distinctions in the utilized algorithms or the dissimilarity of the datasets: this thesis works on an extensive dataset comprising 39 medical topics, whereas the study conducted by Cormack and Grossman (2014) focuses on 8 collections of data comprising hundreds of thousands of reviews of legal proceedings.

Figure 7: Two dimensional spatial distribution of the documents in topic 2.

A case where PR outperforms the other algorithms including US is topic 2 (see topic 2 in table 4), although the margins are very small. Again, the Balanced MABs perform somewhat worse in all respects, highlighting that exploration is unnecessary and even counterproductive on a dataset with this type of spatial distribution. The proximity of the relevant documents to each other in figure 7 is surprising considering the loss of information invariably caused by the dimensionality reduction required to produce the two-dimensional visualization. This underlines the suitability of Active Learning approaches in classifying the relevance of documents for systematic reviews.

4.2.3 Balanced Naive MAB

Figure 8: Two dimensional spatial distribution of the documents in topic 17.

The Balanced Naive MAB differs from the baseline AL systems due to its high initial probability of exploration, but rarely outperforms the other systems. BN scores lowest on average over all topics when the training set is included in the evaluation. When the evaluation is based on the ranked documents starting from the point where the first relevant document is found, its performance improves, especially compared to UF. On topic 17, it achieves the highest scores except for AP, where it is outperformed by all systems except BF (see topic 17 in table 4). The margins on WSS100 and WSS95 are considerable, and still notable on F10. As can be seen in figure 8, the relevant datapoints are much more scattered than in the previously consulted cases, an indication that a MAB approach is of advantage on this topic.

Figure 9: Recall of the six systems at each iteration during the classification process on topic 17.

Figure 9 illustrates the percentage of recall at each step of the iterative process for all six systems; despite the initial prevalence of the Unbalanced MABs and US, BN is able to achieve total recall earlier due to its ability to explore the dataspace once exploitation fails to improve recall. In this case it is also worth noticing how BF falls considerably behind early on due to its Forgetfulness characteristic, but is later able to regain recall faster because of this same characteristic. Presumably it finds an isolated relevant document, causing exploitation to fail for the following iterations, before restarting to explore the dataspace for new and hopefully more abundant clusters of relevant documents.

4.2.4 Unbalanced Naive MAB

The Unbalanced Naive MAB comes closest to matching baseline PR, in all likelihood due to its similarity to it: it is very unlikely to explore and relies on the complete training set to train its classifier. This can clearly be seen in the ∆AP values in figure 5 and in tables 2, 3 and 4. In the few cases where UN outperforms the baseline algorithms, it is itself outperformed by its Balanced or Forgetful counterparts. It does achieve the highest F10 score on average including the training set; the difference to its pursuers, however, is minimal.

4.2.5 Balanced Forgetful MAB

Figure 10: Two dimensional spatial distribution of the documents in topic 1.

Of the four Bandits, BF differs most from the baseline AL systems; it is much likelier to explore than the Unbalanced MABs and, due to its Forgetfulness, is able to quickly establish itself in new areas of the dataspace once it has found a relevant document through exploration. However, except for an insignificantly higher average WSS100 score, it is usually outperformed. On topic 1, BF achieved much better results than the other algorithms (topic 1 in table 4). BN also performed well compared to the other four systems, while the Unbalanced MABs achieved similarly low scores as the AL baseline approaches, underlining the importance of a Balanced approach on this particular dataset. Figure 10 illustrates the spatial distribution of the documents comprising this topic. The relevant documents are considerably more scattered than in topic 2, but less than in topic 17, and are located in moderately better defined clusters.

Figure 11 illustrates recall at each iteration for the six systems. As can be observed in this figure, BF is able to take advantage of its balanced approach and forgetfulness on this topic; especially in the later phases it is able to identify the last relevant documents faster than its competitors.

Figure 11: Recall of the six systems at each iteration during the classification process on topic 1.

4.2.6 Unbalanced Forgetful MAB

Figure 12: Two dimensional spatial distribution of the documents in topic 49.

The UF MAB achieves slightly better results than the other algorithms on topic 49, except for AP (see topic 49 in table 4 and figure 13). On this topic, the Unbalanced MABs perform better than the Balanced MABs; this might be caused by the somewhat scattered and mostly unclustered relevant documents. Overall, most systems succeed in avoiding the dense area of irrelevant documents that can be seen in figure 12. In general, UF is not able to outperform its competitors, indicating that the Forgetfulness characteristic is unable to improve the performance of a system that initially relies only to a limited extent on exploration.

Figure 13: Recall of the six systems at each iteration during the classification process on topic 49.

Based on the above findings, subquestions a and b can be answered. Both baseline AL systems generally perform well, with MWSS100 around 35% and MWSS95 over 55%.

When the training set is excluded from the evaluation, these measures reach close to 40% and over 60% respectively. At the same time, MAP nearly doubles, from 10% for PR and 14% for US to 24% and 30% respectively. MF10 increases from more than 65% to more than 70% for both PR and US. Considering the achievements of the MAB systems, the following can be said: the four systems perform similarly on MWSS100 and MF10, reaching around 35% and more than 65% when including the training set. On MWSS95 and MAP some differences emerge between the Balanced (BN and BF) and Unbalanced (UN and UF) MABs; the former achieve over 50% on MWSS95 and just under 8% on MAP, while the latter reach over 55% and just above 10% respectively. When excluding the training set from the evaluation, BN, UN and UF perform similarly, while BF scores 2 to 10 percentage points lower on the four measures.

4.3 Optimizing parameters

Table 5 shows the values for WSS100, WSS95, AP, and F10 for different sets of parameters on topic 17. As can be observed, a MAB with the initial probability of exploration p0 set at 0.3 and a very small rate of change of this probability, r set at 0.01, achieves the best results.

WSS100:
r \ p0  0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
0.01    25.14  12.51  28.68  57.41  33.56  53.37  46.41  38.16  27.05  21.1   5.33
0.025   29.57  28.79  28.62  9.37   34.46  32.21  52.24  56.62  22.78  46.18  25.93
0.05    48.32  29.12  32.66  30.02  32.6   33.39  34.68  42.14  50.22  47.31  45.96
0.075   49.05  49.33  49.1   48.82  53.09  50.73  51.29  57.41  55.44  56.23  51.18
0.1     49.16  50.9   54.43  54.43  56.12  52.36  52.58  49.89  53.42  52.69  52.64
0.25    50.79  52.47  49.55  53.7   52.97  45.45  50.95  52.24  53.03  54.32  53.42
0.5     53.09  50.9   52.24  52.24  52.19  30.7   49.05  32.6   55.27  49.16  53.42
0.75    48.71  49.78  33     47.7   50.17  54.94  51.96  53.93  48.71  53.03  52.69
0.99    51.91  51.85  52.81  53.09  54.21  50.11  55.16  33.73  35.13  50.17  54.55

WSS95:
r \ p0  0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
0.01    47.97  46.96  62.17  64.53  60.1   59.42  57.46  49.55  41.07  28.5   8.24
0.025   57.46  50.5   58.3   55.94  58.19  60.6   57.85  62.85  49.66  51     33.33
0.05    52.74  56.78  59.09  52.91  51.9   60.21  61.95  49.26  60.6   56.5   51.17
0.075   58.02  51.85  55.83  53.92  56.84  59.81  58.24  63.69  62.73  61.67  57.96
0.1     55.89  58.13  61.33  58.52  62.68  62.51  56.34  56.05  62.96  61.27  60.32
0.25    58.3   56.56  58.02  58.13  57.74  58.97  61.44  60.15  61.72  60.43  57.07
0.5     61.39  56.62  61.78  59.37  58.13  58.19  59.59  58.13  59.25  63.07  59.59
0.75    58.24  56.17  62.85  58.47  59.98  60.6   60.49  59.42  61.33  57.57  59.25
0.99    60.32  58.52  61.89  62.28  59.7   59.25  60.82  60.99  62.45  60.54  60.82

AP:
r \ p0  0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
0.01    37.38  35.68  35.61  30.69  31.52  30.66  27.19  20.88  19.71  18.37  13.14
0.025   37.23  36.94  39.14  38.27  34.87  31.1   24.76  30.13  24.67  24.42  21.75
0.05    36.86  37.2   35.11  35.13  35.34  30.49  36.16  23.63  24.17  21.87  18.29
0.075   36.33  34.98  35.78  35.63  37.48  26.09  33.49  32.12  31.01  28.37  23.26
0.1     38.76  39.12  35.85  35.88  38.45  28.44  36.6   33.93  23.27  25.38  23.5
0.25    39.69  36.56  38.12  32.96  31.58  20.68  36.03  36.23  34.81  34.31  33.15
0.5     33.81  34.43  34.33  32.99  34.1   36.04  34.51  32.24  35.05  36.29  36.6
0.75    33.24  34.84  35.76  34.68  35.14  33.75  35.01  35.56  36.34  30.48  36.37
0.99    34.46  35.12  33.96  36.18  34.44  35.14  35.47  36.11  31.33  35.74  35.47

F10:
r \ p0  0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
0.01    92.96  91.7   93.31  96.24  93.84  95.82  95.09  94.24  93.13  92.54  91.02
0.025   93.42  93.34  93.31  91.4   93.92  93.7   95.7   96.16  92.71  95.07  93.04
0.05    95.29  93.37  93.74  93.45  93.72  93.82  93.94  94.67  95.49  95.19  95.04
0.075   95.36  95.4   95.37  95.34  95.79  95.54  95.6   96.24  96.04  96.12  95.59
0.1     95.38  95.56  95.93  95.93  96.11  95.71  95.73  95.45  95.83  95.75  95.74
0.25    95.55  95.72  95.42  95.85  95.78  94.99  95.56  95.7   95.78  95.92  95.83
0.5     95.79  95.56  95.7   95.7   95.69  93.52  95.36  93.73  96.02  95.38  95.82
0.75    95.33  95.44  93.76  95.23  95.48  95.98  95.67  95.88  95.33  95.78  95.75
0.99    95.66  95.66  95.76  95.79  95.91  95.48  96.01  93.84  93.98  95.48  95.94

Table 5: Scores on WSS100, WSS95, AP, and F10 on topic 17 with different values for the rate of change r and initialization probability of exploration p0.


These parameters result in a slowly adapting MAB with a value of p0 in between that of the Balanced and Unbalanced approaches.

Table 6 shows the optimal values of the two parameters for the 5 topics considered in the above cases. Topic 1 appears to benefit the most from a MAB approach: p has a relatively large initial value of 0.3, and the reward for success, or punishment for failure, is considerably high at 0.075, large enough to change the preference for exploitation or exploration within a few iterations.

Topic  p0   r
1      0.3  0.075
2      0    0.05
17     0.3  0.01
39     0.1  0.05
49     0.1  0.075

Table 6: Best values for r and p0 in a MAB system to achieve the highest scores on WSS100, WSS95, AP, and F10 on the selected topics.

Appendix C contains the optimization tables for topics 1, 2, 39 and 49. The results of this optimization process on a number of topics allow the formulation of an answer to subquestion c: the optimal balance between exploitation and exploration, as represented by p0 and r, cannot be determined in general. Since the spatial distance between the documents is the only feature considered by the system, the optimal values are believed to depend on the spatial distribution of relevant and irrelevant documents in the dataset, as suggested in the case evaluations.

In many cases these optimal parameter values are close to zero, an indication that a MAB approach might not be worthwhile and a baseline AL approach is sufficient to achieve the highest scores. In a few cases the documents are distributed in such a way that higher parameter values, and thus a MAB approach, are more valuable.

Overall, the results strongly indicate that the relation between the parameters and the spatial distribution is an important factor to take into account when deciding whether to apply a baseline AL or a MAB approach.

5 Conclusion and discussion

This final section concludes the thesis by summarizing and reflecting upon the main findings before discussing the broader implications and unanswered aspects of this thesis, and recommending future work.

5.1 Conclusion

This thesis has shown that, contrary to the indications in the consulted literature, US regularly though not significantly outperforms PR. The four proposed MAB systems do not categorically outperform the baseline AL systems but generally achieve lower results instead. The differences in performance between the six systems are not statistically significant over the complete dataset, but some important differences exist on the topic level; in a few specific cases MAB systems are able to take advantage of their ability to explore the dataspace or forget part of the training set to increase the work saved by the oracle in reviewing the documents. In most other cases, however, any inclination towards exploration or forgetfulness hinders the MAB system in achieving the results reached by the baseline AL systems. Based on the results, their analysis, and the answers to the subquestions, the following answer to the main research question has been formulated: the MAB approaches considered in this thesis and tested on the given dataset, regardless of their exploitation/exploration balance, mostly cannot outperform the considered baseline AL approaches in reducing the manual work required by an expert for retrieving relevant literature in systematic reviews.

It should be noted that this answer is based on unrevised work by an individual novice researcher and should not be taken without the corresponding reservations on its validity. The results are determined by the given dataset, the correctness of the programming, and the functioning of the utilized software packages. A few shortcomings should be noted. Five topics exceeding the somewhat arbitrary limit of around five thousand documents were curbed, and only the documents up to this limit were considered in the successive classification and evaluation steps. This was done in order to reduce computing time, which was most heavily impacted by the effectuation of the KF candidate selection strategy. The computation time of the two Balanced MAB systems BN and BF was especially affected by this, since these two approaches begin with a high p0 value of 0.5 and thus have to apply KF regularly, at least during the first iterations. For the same reason, the systems were not run multiple times on some of the larger topics to reduce the noise. A set seed was used for the random number generator to ensure the same starting conditions for each system. The baseline AL systems produce the same results every time due to their construction. However, this is not true for the MAB systems; because of the random choice between exploitation and exploration, their results are inevitably biased by a random factor.

Considering the finding that US outperforms PR, in contradiction to the reviewed literature, it would be interesting to review and validate the research conducted in this thesis. In addition, different MAB approaches should be examined; the four variations tested in this thesis did not succeed in surpassing the baseline AL systems' achievements by a significant margin. Of these four systems, UN, which most closely resembles PR, usually performed better while achieving results very similar to PR; too much emphasis on exploration generally decreases the performance. Only in a few cases is it advantageous to consider exploration at all, and in these cases additional optimization of the parameters, while having some effect, was not substantial or significant enough to justify the optimization process.

5.2 Discussion

In the broader context of applying AL to reduce the manual work by experts in identifying relevant literature for systematic reviews, the experiments in this thesis indicate that the considered MAB approaches fail to improve on the results of baseline US and PR over the utilized dataset comprising 39 medical topics. For systematic medical reviews this means that by applying an AL baseline system such as US or PR, the work saved by the expert can amount to around 35% to 40% when 100% of the relevant documents need to be found, and up to 55% to 60% when a recall of 95% is sufficient. This should allow for a significant reduction of time and costs, and decidedly accelerate the literature selection process for systematic reviews. The savings in time and costs can probably be improved further; the results in this thesis indicate that most likely there exists a relation between the spatial distribution of the dataset and the success of MAB approaches. A Contextual MAB approach, as discussed in the theoretical framework, might be able to take advantage of information extracted from the spatial features of the dataspace. A variation on the Contextual Bandit could be to examine the spatial distribution of the data in advance in order to determine the optimal parameters p0 and r to use.
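For reference, the savings percentages quoted above correspond to the Work Saved over Sampling measure. Assuming the standard formulation introduced by Cohen et al. (2006) is used, the measure at a required recall level R is

\mathrm{WSS@}R = \frac{TN + FN}{N} - (1 - R)

where TN and FN are the numbers of true-negative and false-negative classifications, N is the total number of documents, and R is the target recall (1.0 for WSS100, 0.95 for WSS95).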

Another possibly worthwhile approach could be to further investigate and elaborate the Forgetfulness characteristic introduced in this thesis. A dynamic rate of Forgetfulness could be employed that takes measures such as recall or spatial features into account and adapts accordingly. Future research can also explore the other MAB approaches presented in the theoretical framework and test their performance on a larger dataset such as the one used in this thesis.
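One possible reading of such a dynamic Forgetfulness rate is sketched below; it is purely illustrative, and all names and thresholds are hypothetical rather than taken from the systems tested in this thesis.

def dynamic_training_window(labelled_history, estimated_recall,
                            min_keep=200, max_keep=2000):
    """Keep more of the labelling history while the estimated recall is
    still low, and forget older labels more aggressively once most
    relevant documents appear to have been found (illustrative only)."""
    keep = int(min_keep + (1.0 - estimated_recall) * (max_keep - min_keep))
    return labelled_history[-keep:]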

Acknowledgements

I wish to thank Evangelos Kanoulas and Dan Li for the pleasant collaboration, their stimulating insights and reliable support during the execution of this thesis.

6 References

Baram, Y., Yaniv, R. E., & Luz, K. (2004). Online choice of active learning algorithms. Journal of Machine Learning Research, 5(Mar), 255–291.

Bouneffouf, D., Laroche, R., Urvoy, T., Féraud, R., & Allesiardo, R. (2014). Contextual bandit for active learning: Active Thompson sampling. In International conference on neural information processing (pp. 405–412).

Cohen, A. M., Hersh, W. R., Peterson, K., & Yen, P.-Y. (2006). Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association, 13(2), 206–219.

Cormack, G. V., & Grossman, M. R. (2014). Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval (pp. 153–162).

Hofmann, K., Whiteson, S., & de Rijke, M. (2011). Balancing exploration and exploitation in learning to rank online. In European conference on information retrieval (pp. 251–263).

Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (pp. 3–12).

Osugi, T., Kim, D., & Scott, S. (2005). Balancing exploration and exploitation: A new algorithm for active machine learning. In Data mining, fifth IEEE international conference on (8 pp.).

O'Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., & Ananiadou, S. (2015). Using text mining for study identification in systematic reviews: a systematic review of current approaches. Systematic Reviews, 4(1), 5.

Settles, B. (2010). Active learning literature survey. University of Wisconsin, Madison, 52(55-66), 11.

Shemilt, I., Khan, N., Park, S., & Thomas, J. (2016). Use of cost-effectiveness analysis to compare the efficiency of study identification methods in systematic reviews. Systematic Reviews, 5(1), 140.

van Rozendaal, T. (2016). Technologically assisted systematic reviews in empirical medicine. University of Amsterdam.


7 Appendices

7.1 Appendix A: Dataset composition

Topic  # Irrelevant  # Relevant  % Relevant
1      4257          30          0.70%
2      981           11          1.11%
4      4665          335         6.70%
5      2524          10          0.39%
7      5296          91          1.69%
8      4658          342         6.84%
9      1254          98          7.25%
11     4462          59          1.31%
12     5198          71          1.35%
14     4999          1           0.02%
15     4964          36          0.72%
16     3465          56          1.59%
17     823           77          8.56%
21     5033          22          0.44%
22     2592          25          0.96%
23     479           21          4.20%
25     5057          40          0.78%
26     2026          46          2.22%
27     4999          1           0.02%
29     1430          26          1.79%
31     1789          92          4.89%
33     4658          342         6.84%
34     4968          32          0.64%
36     739           20          2.64%
39     174           6           3.33%
40     1256          52          3.98%
41     339           47          12.18%
42     4940          60          1.20%
43     4997          3           0.06%
45     344           42          10.88%
47     183           23          11.17%
48     1852          4           0.22%
49     5082          15          0.29%
50     5030          65          1.28%
51     852           2           0.23%
53     2123          67          3.06%
54     1996          27          1.33%
56     4987          13          0.26%
57     139           7           4.79%
All    115610        2317        1.96%

Number of irrelevant and relevant documents and the percentage of relevant documents for each of the 39 topics.

7.2 Appendix B: Additional Results figures

Boxplots depicting the scores over all 39 topics starting from the first relevant document found. Red squares indicate the mean values, red lines indicate the median values, boxes include lower and upper quartiles, whiskers show the range of the data, and outliers are marked by crosses.


7.3 Appendix C: Optimization Tables

WSS100:
r\p0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.01 16.37 28.11 4.94 5.87 29.95 30.64 16.06 13.34 8.94 22.59 15.35
0.025 28.51 8.26 31.37 32.65 19.54 17.08 9.11 3.76 20.49 19.49 9.27
0.05 28.6 13.89 11.17 32.46 15.14 12.33 7.22 15.35 21.13 0.17 8.97
0.075 7.9 14.05 16.21 34.92 34.82 17.39 30.57 16.87 6.13 4.02 2.63
0.1 9.65 14.08 18.19 14.17 2.77 21.13 18.76 12.33 14.74 32.6 13.65
0.25 9.08 15.99 20.09 9.94 24.58 18.29 17.32 15.8 13.34 7.36 13.67
0.5 18.29 13.32 13.41 9.2 12.14 17.2 26.12 7.22 17.18 20.98 23.47
0.75 7.45 15.02 15.99 2.34 8.4 13.15 17.89 9.01 17.01 21.03 12.92
0.99 6.81 15.54 7.12 13.41 21.46 15.33 19.45 15.33 12.33 16.25 15.33

WSS95:
r\p0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.01 21.12 30.25 31.6 32 29.94 35.48 31.41 33.18 33.23 28.85 32.12
0.025 27.41 30.6 31.15 31.86 29.68 40.35 38.51 27.62 30.94 26.16 28.38
0.05 34.34 25.33 36.61 36.61 33.59 26.32 34.03 36.76 27.55 35.71 34.3
0.075 27.01 35.36 28.36 37.16 36.83 29.8 32.83 32.64 22.49 27.86 40.35
0.1 29.56 28.76 23.18 33.75 33.7 31.22 37.3 25.14 36.33 34.98 31.46
0.25 34.48 34.32 35.34 34.79 31.65 30.04 32.17 32.66 36.92 23.6 30.58
0.5 35.9 32.83 32.9 31.74 32.64 34.08 34.56 36.31 34.79 36.92 37.82
0.75 31.6 37.61 36.61 29.59 35.79 39.62 39.19 26.58 35.69 31.65 45.13
0.99 34.7 35.76 30.23 34.63 29.8 33.59 33.47 33.59 39.81 41.46 33.59

AP:
r\p0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.01 0.96 1 1.01 0.88 0.91 0.88 0.9 0.85 0.81 0.74 0.8
0.025 0.96 0.98 0.92 0.93 0.87 0.9 0.88 0.87 0.88 0.83 0.8
0.05 1.03 0.92 1.07 1.03 0.98 0.89 0.79 0.9 0.76 0.89 0.8
0.075 1 0.83 1 1.07 1.06 1 0.95 0.76 0.77 0.79 0.9
0.1 0.94 1.03 1 0.96 1.03 1 1 0.84 0.81 0.91 0.77
0.25 0.94 0.88 1.03 1 0.82 0.96 0.79 0.91 1.07 0.84 0.86
0.5 1.02 0.95 0.99 0.91 0.92 0.86 0.92 0.88 0.89 0.91 0.82
0.75 0.98 0.95 0.91 0.74 0.86 0.87 0.96 0.81 0.97 0.83 1.04
0.99 1.01 1.02 0.87 0.99 0.83 0.97 0.88 0.97 0.93 0.94 0.97

F10:
r\p0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.01 46.36 50.17 43.17 43.41 50.82 51.07 46.27 45.47 44.23 48.3 46.06
0.025 50.31 44.05 51.34 51.81 47.33 46.57 44.28 42.86 47.63 47.31 44.32
0.05 50.34 45.63 44.85 51.74 45.99 45.18 43.77 46.06 47.83 41.96 44.24
0.075 43.95 45.67 46.31 52.68 52.64 46.67 51.04 46.51 43.48 42.93 42.57
0.1 44.43 45.68 46.91 45.71 42.61 47.83 47.09 45.18 45.87 51.79 45.56
0.25 44.27 46.25 47.5 44.51 48.96 46.94 46.64 46.19 45.47 43.81 45.56
0.5 46.94 45.46 45.49 44.3 45.12 46.61 49.48 43.77 46.6 47.78 48.59
0.75 43.83 45.96 46.25 42.5 44.09 45.41 46.82 44.25 46.55 47.8 45.35
0.99 43.66 46.11 43.74 45.49 47.94 46.05 47.3 46.05 45.18 46.32 46.05

Scores on WSS100, WSS95, AP, and F10 on topic 1 with different values for the rate of change r and initialization probability of exploration p0.

WSS100:
r\p0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.01 85.38 82.68 81.89 81.62 75.74 74.68 68.96 69.01 61.07 57.2 53.44
0.025 85.17 83.37 81.51 78.34 78.23 74.42 73.46 71.08 64.99 68.11 61.6
0.05 86.44 83.85 82.15 82.89 80.46 78.55 75.64 78.28 73.15 70.55 71.13
0.075 83.74 82.52 82.89 83 78.92 78.44 75.42 79.4 73.73 73.83 73.62
0.1 83.58 81.62 85.17 80.88 78.92 82.15 76.64 77.97 76.17 76.75 75.16
0.25 75.95 80.14 82.2 78.44 79.29 76.54 78.44 76.43 77.49 75.21 79.18
0.5 78.07 80.51 81.57 79.03 76.59 75.95 78.6 78.07 79.56 79.34 76.54
0.75 75.95 73.2 76.8 73.99 79.13 75 80.67 78.28 73.25 75.53 76.85
0.99 76.06 75.16 75.21 76.06 75.16 74.95 76.96 75.42 75.74 76.27 75.85

WSS95:
r\p0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.01 82.71 80.06 80.75 77.42 75.08 69.95 65.6 68.09 58.29 52.94 53.53
0.025 80.38 80.49 77.47 74.77 74.61 71.27 68.62 67.09 63.64 64.92 59.51
0.05 82.82 81.12 79.75 81.18 77.73 73.65 72.86 73.71 70 68.68 67.72
0.075 81.55 81.81 80.12 78.32 75.03 75.99 73.71 75.77 71.32 72.28 70
0.1 80.38 80.43 81.33 78.05 76.46 78.1 72.28 73.65 73.6 73.39 71.48
0.25 77.84 76.41 78.47 74.87 75.08 73.65 74.82 72.75 73.76 71.48 76.09
0.5 74.98 76.78 80.54 75.51 74.08 73.97 75.72 74.71 77.36 75.08 74.24
0.75 73.23 72.49 74.29 73.76 76.99 70.16 76.09 74.71 71.96 72.91 72.44
0.99 74.98 72.7 72.65 74.98 72.7 73.44 75.77 70.95 72.97 73.39 75.14

AP:
r\p0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.01 6.95 5.81 6.2 5.57 4.46 3.75 4.1 3.6 3.28 2.2 2.1
0.025 6.43 6.04 4.91 5.03 4.58 3.97 3.31 3.3 2.87 2.59 2.62
0.05 7.12 6.51 6.25 6.43 4.97 3.8 3.7 4.47 3.25 3.02 2.9
0.075 7.06 6.38 5.75 6.13 3.99 4.42 3.79 4.45 3.55 3.58 3.13
0.1 6.87 6.28 6.61 5.45 4.57 5.27 4.03 4.37 4.01 3.74 3.38
0.25 5.9 4.56 6.13 5.33 4.85 4.59 4.22 3.87 4.21 3.68 4.91
0.5 4.92 5.63 5.99 4.58 4.19 4.16 5.21 4.88 4.96 4.42 4.35
0.75 4.15 3.9 5.51 4.3 4.98 4.02 4.97 4.46 4.06 4.38 4.04
0.99 5.28 4.58 4.66 5.28 4.58 4.56 5.03 4.08 4.68 4.36 4.77

F10:
r\p0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.01 89.67 87.88 87.38 87.17 83.54 82.91 79.73 79.73 75.66 73.83 72.21
0.025 89.52 88.32 87.14 85.11 85.04 82.76 82.21 80.87 77.61 79.27 75.92
0.05 90.4 88.67 87.56 88 86.43 85.25 83.47 85.12 82.02 80.58 80.89
0.075 88.57 87.77 88.02 88.07 85.46 85.18 83.35 85.83 82.36 82.44 82.32
0.1 88.46 87.17 89.53 86.84 85.47 87.52 84.07 84.87 83.79 84.14 83.22
0.25 83.67 86.23 87.55 85.17 85.71 84.02 85.19 83.95 84.59 83.22 85.65
0.5 84.94 86.47 87.16 85.53 84.05 83.68 85.27 84.94 85.87 85.73 84.02
0.75 83.68 82.05 84.17 82.51 85.64 83.11 86.56 85.07 82.1 83.41 84.2
0.99 83.72 83.19 83.22 83.72 83.19 83.07 84.27 83.35 83.53 83.87 83.62

Scores on WSS100, WSS95, AP, and F10 on topic 2 with different values for the rate of change r and initialization probability of exploration p0.
