Automatic machine-learning based identification of diagnostic test accuracy studies for systematic reviews

(1)

Automatic machine-learning based

identification of diagnostic test

accuracy studies for systematic

reviews

Amir Alnomani 10437797

Bachelor thesis Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam Faculty of Science Science Park 904 1098 XH Amsterdam Supervisor Evangelos Kanoulas

Institute for Language and Logic Faculty of Science University of Amsterdam

Science Park 904 1098 XH Amsterdam

(2)

1 Abstract

This study evaluates several machine-learning based approaches for identifying diagnostic test accuracy studies within a large set of scientific articles for the pur-pose of reducing the workload of researchers when manually screening abstracts to be utilized in systematic reviews. The relative performance was measured be-tween three documents representation models, with different degrees of random undersampling of the data set and with two different algorithms. This includes document representations in the form of TF-IDF bag-of-words, word2vec and Latent Dirichlet Allocation, and the Support Vector Machine algorithm as well as a Logistic Model Tree algorithm. The results seem to indicate that under-sampling does not lead to increased performance of this particular data set, that the word2vec representation outperforms the others in a ranking approach, and while the most frequently used TF-IDF model is more robust in a classification approach.

(3)

2 Introduction

The ability to diagnose whether a patient has a certain medical condition or not is of significance both for the patients and medical professionals, that perform these diagnoses. Medicine often has many side effects, thus treating healthy patients is undesirable. Consequently, it is necessary to assess diagnostic tests to determine their quality, which is usually compared to some golden standard. According to the authors in [7], a diagnostic test accuracy indicates the quality of a test for discerning patients that have a certain target condition from those who do not. However, this accuracy is not a fixed property of the test, due to the different circumstances and settings in which the tests are performed, such as, for instance, in which other samples sizes of patients are taken. Thus, to determine the quality of such test it is necessary to obtain and analyze the complete spectrum of relevant research in which this test is performed. This process of comparing and analyzing these different studies is usually accomplished in systematic reviews.

The same paper also mentions that the process of identifying diagnostic test accuracy studies is not trivial as those studies most often do not have concrete, unambiguous characteristics which discriminate them from other, related to the topic corresponding articles. Consequently, there are no input characteristics that can be utilized to query a database of medical articles, which will return exactly the relevant Diagnostic test accuracy studies. Nonetheless, the approach of using complex Boolean queries is currently still prevalent as the first phase of finding relevant articles [12] and is also the method that was employed in this paper for obtaining the dataset. As not to miss any relevant studies these queries yield a broad set of articles of which there are many irrelevant, and these articles then subsequently need to be screened manually by the researchers, which is a time-consuming task, given the uneven proportion of relevant to nonrelevant articles.

Several studies have been conducted on reducing the workload of researchers by automating the screening process. The authors in [10] provide an overview of a portion these and distinguish mainly two approaches. The first approach employs machine learning techniques for classifying the articles as relevant or nonrelevant and these can subsequently be divided into fully automatic classi-fication and semi-automatic classiclassi-fication such as active learning. The second approach utilizes a ranking system that removes low ranking articles based on certain characteristics.

The work of Almeida et al. [1] evaluated three classification algorithms Na¨ıve Bayes (NB), Logistic Model Trees (LMT) and Support Vector Machine (SVM) in the task of classifying biomedical articles as relevant or nonrelevant. The feature sets included mycoMINE bio entity annotations, Enzyme Commission (EC) numbers and Bag-of-words representations of the title and abstract. The minority class corresponding to the relevant labeled articles represented 9.88% of the total dataset. As a solution to this imbalanced data several factors of undersampling were utilized, which entailed training on a multitude of different dataset proportions, obtained by decreasing instances of the majority class.

Independent of the feature sets was the increase in performance for all the algorithms corresponding to higher levels of undersampling. The bag of words representation outperformed the other individual feature sets, but the combina-tion of all the feature sets yielded the best results. The Na¨ıve Bayes algorithm

(5)

was used as a baseline and consequently had the worst performance out of all algorithms. Finally, where each feature set was trained on individually both LMT and the SVM had comparable results, but when the feature sets were combined LMT showed a significantly higher F-measure.

2.1 Problem Definition

The goal is to reduce the workload of researchers having to gather and manually screen thousands of articles when they conduct systematic reviews. This can be accomplished by at least partially automating the processes involved. The identification of relevant articles , in this case specifically diagnostic test accu-racy studies, is viewed as either a classification problem or ranking problem for which machine learning algorithms can be utilized. Furthermore, this is task can be divided into four parts, with the first being the decision regarding how to represent the contents of an article as distinctly as possible to find significant patterns that aid the separation of relevant data from nonrelevant data. The second part is reducing the data imbalances among the different classes which is prevalently present in this specific data set, even though there are more nonrele-vant articles they are much less important. The third and fourth step being first choosing a classification algorithm and then the metrics by which to evaluate the whole process.

In this paper we investigate how the different text representations such as TF-IDF, LDA, and word2vec, further elaborated upon in their respective sections, levels of under-sampling and classification algorithms compare in terms of per-formance on the task of identifying diagnostic test accuracy studies given the specific spectrum of topics that are available in this data set. In terms of text representation, previous work has mainly utilized bag-of-words feature vectors specifically for this task. However, other state-of-art document classification studies demonstrate that word2vec based techniques outperform the previously mentioned text representations [15]. For this reason, it is significant to explore the possibilities regarding these representations to see how they compare when utilized with unbalanced data and in conjunction with complex medical terms in data set text. In addition, to investigate whether a ranking approach, where probabilities are utilized instead of hard classification thresholds, using the same techniques yield comparable results and with that a greater understanding of the task.

3 Theoretical foundation

3.1 Support Vector Machines

A support vector machine is a supervised machine learning algorithm that can classify data [11]. When labeled data is provided as input, the algorithm at-tempts to find the optimal hyperplane which separates positively labeled data points from negatively labeled data points, this hyperplane can subsequently be utilized to classify new, yet unseen data. Finding the optimal hyperplane requires the maximization of the margin of the training data, which is defined as twice the distance between the hyperplane and its closest data point, and is calculated using the following formula:

(6)

m = 2

kwk (1)

Where m is the margin and w the normal vector, perpendicular to the primary hyperplane with its origin at one of the reference hyperplanes that are parallel and equidistant to the primary hyperplane. Thus, the area between these two reference hyperplanes does not contain any data points and those which are closest to the primary hyperplane are called support vectors. Given the for-mula, the maximization of the margin can then be computed by minimizing the magnitude of the vector w, after which the optimal hyperplane can be deter-mined. Finally, to classify new instances the output of the decision function(2) is calculated.

f (x) =XyiwiK(x, xi) (2)

Where x is the new instance to be classified, xithe support vectors, yithe class

prediction, withe weight vector and K the Kernel function. The kernel function

depends on whether the data is linearly separable or not.

3.2 Logistic Model Trees

The idea behind Logistic Model Trees [6] is to combine the mechanisms of lo-gistic regression and tree induction. Lolo-gistic regression has the benefit of being stable, this entails that fitting the model yields a low variance but potentially a high bias, whereas tree induction is usually accomplished with low bias cou-pled with high variance. These techniques have complementary advantages and disadvantages, and by combining these the creation of a more robust system is the goal.

The decision tree output created by the means of tree induction has an easily interpretable form. It is constructed out of a conjunction of Boolean expressions as nodes, and these expressions essentially form a set of rules about the features that can subsequently classify the data. The algorithm that constructs these decision trees attempts to separate the instance space into parts that capture in-formation about the classes in the specific domain. However, this usually yields more splits than their classes due to noisy patterns, therefore, after obtaining this larger tree pruning is applied to that result to obtain the final tree. Decision trees, as well as Logistic model trees, are both built out of two types of nodes, non-terminal nodes that vertically connect to child nodes, and terminal nodes also denoted as leaf nodes or leaves and do not have any child nodes. The leaves represent the subdivisions that are created in the data space and the main difference between both models is that logistic model trees perform logistic regression at the terminal node level.

3.3 Bag-of-words representation

The most widely used method for representing text within the domain of natural language processing and information retrieval for the purpose of classification is as a bag-of-words. A vector is created where each index corresponds to a differ-ent word from a certain context-specific vocabulary or corpus, usually limited

(7)

by the textual content of the available data. The values of this feature vector contain the frequencies of the words corresponding to their indices, therefore discarding the information regarding the order in which the words appeared as well as the grammar.

For the purpose of accomplishing accurate classification, it is convenient for each feature vector to be as distinctive as possible. However, the drawback of the representation in the described form is that a great number of specific words appear very frequently in each piece of text. An example of these would be pronouns such as after, in, to, on, and the, these types of words trivialize the weights of the words that do distinguish a passage of text, as a result creating less distinctive feature vectors.

A solution to this problem is a term-weighting scheme such as TF-IDF, short for term frequencyinverse document frequency and the preprocessing of data such as the removal stop words or words that do not contribute to the distinctiveness of each document. The first step in transforming the bag-of-words representation to TF-IDF is the normalization of the frequencies of each word by the length of each document. The resulting term frequencies are subsequently weighed by the inverse document frequency which is computed by first dividing the total num-ber of documents in the data set by the total numnum-ber of documents with term t and then taking the logarithm of this result. This accomplishes that words that appear in many documents and therefore have a much lower significance for discerning those documents from each other will have relatively lower values than other words in the same documents.

3.4 Word2vec

The many methods for representing text in natural language processing such as bag-of-words representations and N-gram models have the advantage of being simple and robust for the applications within this domain. However, due to this atomic representation of words, much information is lost. For instance, infor-mation about relationships that may exist between words could potentially be utilized to improve the performance of algorithms that use textual input, is not captured my these methods. Another disadvantage of these techniques is that higher number of instances increase the sparsity of the data, resulting in higher dimensionality of the feature vectors, and thus requiring more computational resources.

The means for capturing relational information, and with that solving the earlier mentioned drawback, are based on the distributional hypothesis in linguistics, which states that words that appear in the same context share the same seman-tic meaning. One of the most recent methods based on this concept is called word2vec, this type of model proposed by Mikolov et all in [8] estimates con-tinuous representations of words and attempts to capture these based on the co-occurrence of words. The word2vec architecture can be divided into two mod-els the Skip-Gram model(SG) and Continuous Bag-of-Words model (CBOW). The former model attempts to predict the context of each center word in a text based on some predefined window of words, whereas the latter model works the other way around by predicting the center word given the surrounding context. The goal for both models is to maximize the corresponding probability distribu-tions. For the SG model, this can be accomplished by minimizing the negative log-likelihood objective function.

(8)

J (θ) = −1 T T X t=1 X −m≤j≤m log(P (wt+j|wt; θ)) (3)

Where θ represents the variables to be optimized in the form of word vectors, m is the word window around the center word wtand T the number of words

in the text. However, using the nave softmax function to compute the prob-abilities in equation 3 is computationally expensive due to the normalization factor, where it is necessary to compute the probabilities for each word in the vocabulary. Thus, more efficient methods have been proposed for subsequently training the model these include the optimization of the hierarchical softmax or negative sampling using gradient descent, the modification of parameters using partial derivatives of the loss function. Negative sampling is based on binary logistic regression and following objective function is utilized instead:

J (θ) = 1 T T X t=1 Jt(θ) (4) Jt(θ) = log(σ(uTovc)) + k X i=1 Ej∼P (w)[log(σ(uTovc))] (5) σ(x) = 1 1 + e−x (6)

In the first term of equation 5 the symbol σ indicates a sigmoid function as in equation 6, the uorepresents the outside or context words and vcrepresents the

center vector. The second term of that equation attempts to compute expected value over k negative random samples, based on unigram distributions, which is computationally much more efficient.

3.5 Latent Dirichlet Allocation

While there are many methods for representing documents, some of which are used in this paper, their focus mostly lies on modeling the individual words based on collective cooccurrences of those or sets of individual frequencies. Al-though at its core Latent Dirichlet Allocation(LDA)[2] is also dependent on the cooccurrence of words, it is a statistical model that discovers latent topics in a data set and thus represents each document therein as a mixture of those topics. In addition to that, each topic is a distribution over the words. LDA functions as a generative model, therefore, given some parameters it randomly generates data and then compares it against the original data set to determine the quality of the model. The particular parameters utilized for the generation include the number of topics and the number of words in a document. It is assumed that to generate a document, a topic is initially selected based on a certain topic distribution, then for that topic, a word is picked from the data set vocabulary given the to the topic corresponding word distribution, finally, this process is repeated until the predefined amount of words is reached.

(9)

The process of learning topic distribution starts by randomly assigning each word in each document to one of the K topics. This initialization creates an initial topic distribution as well as word distribution and is improved upon by computing for each topic t the probability of t given the document d multiplied by the probability of word w given topic t.

3.6 Evaluation Metrics

Metrics for evaluating the classification algorithms include precision, recall, and the F1-measure. For this problem due to the data being highly unbalanced, it is not significant to rate the algorithms based on the accuracy. Even in the case where all test instances are predicted to be nonrelevant, the accuracy would still be high, despite this result being ineffective for obtaining relevant articles.The metrics are defined as follows:

P recision = T P

T P + F P (7)

Recall = T P

T P + F N (8)

F 1 − measure = 2 · P recision · Recall

P recision + Recall (9)

Where TP stands for True Positives, the number of articles that are correctly predicted to be relevant, FP for False Positives, the number of articles that are incorrectly predicted to be relevant, and FN for False Negatives, which are the articles that were incorrectly predicted to being nonrelevant. Finally, the F1 measure indicates the proportion between precision and recall.

For evaluating the rankings obtained from the decision boundaries of the classifiers we utilize the tar eval evaluation script made for the purpose of the CLEF eHealth competition, which will be described in more detail in the data section. On their task description page [9], they provide the following evaluation measurements:

• Area under the recall-precision curve, i.e. Average Precision (tar eval: ap) • Minimum number of documents returned to retrieve all R relevant

docu-ments (tar eval: last rel) a measure for optimistic thresholding

• Work Saved over Sampling @ Recall (tar eval: wss 100, and wss 95) W SS@Recall = (T N + F N )/N (1 − Recall)

• Area under the cumulative recall curve normalized by the optimal area (tar eval: norm area) optimal area = R ∗ N (R2/2)

(10)

• Normalized cumulative gain @ 0% to 100% of documents shown (tar eval: NCG@0 to NCG@100) For the simple case that judgments are binary, normalized cumulative gain @ % is simply Recall @ % of shown documents

• Cost-based measures:

– Total cost (tar eval: total cost) IN T ERACT ION = N S Cost: 0 IN T ERACT ION = N F Cost: CA IN T ERACT ION = AF Cost : CA + CA

– Total cost w/ penalty Cost = Total Cost + Penalty Penalty Cost

Uniform: (m/R) ∗ (N − n) ∗ CP (tar eval: total cost uniform) the assumption behind this is that one needs examine half of the remaining

documents to find the remaining missing documents N is the total number of documents in the collection n is the number of documents shown to the user

(N - n) is the number of documents not shown to the user m is the number of missing relevant documents

CP = 2*CA Weighted: Pm

i=1(1/2

i_{)(N − n) ∗ CP (tar eval: total cost weighted)}

the assumption behind this is that one needs examine half of what is left to find the 1st relevant, then 1/4, then 1/8 etc.

– Reliability

Reliability = lossr + losse (tar eval: loss er) lossr = (1 − recall)2 _{(tar eval: loss r)}

losse = (n/(R + 100) ∗ 100/N )2 _{(tar eval: loss e)}

recall = nr/R(tar eval : r)

nr is the number of relevant document found and R the total number of relevant documents

n is the number of returned documents by the system, and N is the size of the collection

The focus in this paper mainly lies on the Normalized Cumulative Gain(NCG) distribution, Work Saved over Sampling indicated as wss 100 and wss 95 and the average precision denoted as ap. The NCG indicates the recall at the top X percent ranked articles and evaluates the sample ranges at ten percent intervals. The wss variables measure how much work the researchers could save in terms of the fraction of articles that the reviewers do not have to screen manually and these are reported at the fixed levels of recall at 100% and 95% respectively.

(11)

4 Method

4.1 Data

The data has been made available as part of a this year’s competition hosted by CLEF eHealth and originated from the result obtained from Boolean search queries that have been used in systematic reviews of diagnostic test accuracy studies and that have been manually constructed by Cochrane experts. The goals of the task were to efficiently rank the abstracts in the data set to accom-plish a fast retrieval of the relevant documents, or to possibly identify a subset that contains as many of the relevant documents for the least amount of effort given their metrics.

Both a separate training set as well as a development set has been provided for the task, the former containing 12 topics and the latter 28 topics, all of them being from a biomedical domain. For each topic a topic file documents all the corresponding PubMed Document Identifiers (PID’s), as they were extracted from PubMed, and a folder containing several XML files with article contents. Additionally, for each data set, there is a qrel text file containing a list with all the PID’s in the data set paired with their topics and their relevance labels. After aggregating all the abstracts and titles coupled with their labels from the XML files based on the topic files, they were then filtered by removing all the articles with empty abstracts. In the preprocessing phase, punctuation and numbers were removed first after which words with fewer than three characters were deleted and finally words from the PubMed stopwords table. The reason behind these removals is that they do not sufficiently provide information that aid the distinctiveness of each article.

From these results the TF-IDF, word2vec and LDA representations have been constructed using the Python library called Gensim, yielding in total 72180 training set instances with 1690 being relevant and in total 91822 test set in-stances with 1638 relevant inin-stances, see Tables 3 and 4 in the appendix for a more detailed distribution of the proportions for each topic. However, the word2vec word embeddings were not learned with the data sets utilizing Gensim but were extracted from an approximately 4 GB pre-trained word2vec matrix, that was trained using all the biomedical articles from PubMed and PMS at the time of 2013. Constructing all the document representations was subse-quently accomplished by first selecting each word in a document then extract-ing its correspondextract-ing word embeddextract-ing from the word2vec matrix, discardextract-ing the words that are not available. Following this step, all the word embeddings were inserted into a separate matrix, the rows were then weighed by the TF-IDF values and finally summed, resulting in a single document feature vector with a window of 200. Other possible word embedding aggregation methods that were attempted included summation without TF-IDF weighing and utilizing the maximum or minimum vector word embedding as the document representa-tions, but this yielded slightly worse results at the classification stage. Finally, for constructing the TF-IDF and LDA data sets default gensim parameters were passed with the k topic number specification for LDA set to 200.

(12)

4.2 Implementation approach

The support vector machine classifications were implemented in Python by uti-lizing the LinearSVC function from the scikit-learn package, which, internally, is based on the LIBLinear implementation. As the name implies, the computation behind LinearSVC is by default performed with a linear kernel function and thus it is optimized specifically for it. For the conducted experiments the penalty parameter C of the error term was set to be a thousand. The experiments with this parameter specifically, yielded that if C is above a certain threshold it does not influence the results at all and does not significantly influence training or test times, however, below this threshold it simply classifies all the articles as nonrelevant.

The only publicly available implementation of LMT is the one provided within the WEKA framework, based on Java, and was utilized in this paper only to classify the word2vec data set. WEKA is a software environment, developed at the University of Waikato, New Zealand, that contains multiple machine learning algorithms, and provides various means for visualization, evaluation methods as well as filters for modifying the data. However, based on the experi-ments conducted in WEKA in this paper, it was evident that it has significantly higher memory requirements relative to Pythons scikit-learn machine learning algorithms in regard to loading and performing computations on data. Conse-quently, given these memory limitations, logistic model tree classification was only conducted on the word2vec data set. The number of terms selected for TF-IDF representations is equal to the amount of unique words in both the test set as well as train set thus it is formatted as a large sparse matrix in Python, but even utilizing the supported sparse data arff format in WEKA the memory requirements were higher than the available 12 GB RAM. Since the word2vec data set was represented by a dense matrix with 200 values it did not have the same problem in WEKA.

Random under-sampling was employed for generating training data sets with different class distributions, as a possible solution for handling the unbalanced data set. It modifies the data set by removing randomly selected instances of the majority class until a certain proportion between the number of relevant articles to nonrelevant articles, denoted by ratio r, is reached. In this paper, we compare three such class distributions with the ratios of 0.024, which is the data set without under-sampling, 0.1 and 0.5. These lead to the following pro-portions:

Unaltered training data set shape (0: 70490, 1: 1690)

Training data set shape after resampling to r 0.1 Counter(0: 16900, 1: 1690) Training data set shape after resampling to r 0.5 (0: 3380, 1: 1690)

Where the 1’s indicate the relevant articles as the minority class and 0’s the nonrelevant instances. Finally, this yields the following set up for the training data with each undersampling level:

Algorithms:

• Support Vector Machine • Logistic Model Tree

(13)

Representations:

• word2vec

• Latent Dirichlet Allocation

• Bag-of-words with frequencyinverse document frequency

The ranking results are simply obtained by utilizing the decision functions on the test set after training the data as an addition to the classification labels. Since this option is not available for LMT in WEKA, the rankings of the LMT results were not included in the comparisons.

5 Results/discussion

This section presents and evaluates the results by reporting the evaluation scores obtained by the approaches previously described. Table 1 illustrates the Preci-sion, Recall, and F-Measure metrics for the positive class, for all the classifica-tion algorithms, representaclassifica-tions and proporclassifica-tions, evaluated on the test set.

Table 1: Results

Method US Ratio Precision Recall F-measure

SVM TF-IDF 0.024 0.44 0.27 0.34 SVM TF-IDF 0.1 0.24 0.31 0.27 SVM TF-IDF 0.5 0.13 0.41 0.20 LMT word2vec 0.024 0.373 0.195 0.257 LMT word2vec 0.1 0.102 0.339 0.157 LMT word2vec 0.5 0.06 0.554 0.108 SVM word2vec 0.024 0.46 0.0001 0.01 SVM word2vec 0.1 0.19 0.25 0.22 SVM word2vec 0.5 0.07 0.59 0.12 SVM LDA 0.024 0.08 0.01 0.01 SVM LDA 0.1 0.02 0.06 0.03 SVM LDA 0.5 0.02 0.15 0.03

Under-sampling the majority instances in the dataset did not enhance clas-sifier performance when solely observing the F-measure numbers in all of the setups. The only instance that benefited from undersampling was when clas-sifying word2vec with an SVM and at the specific ratio of 0.1, while a further increase in the level of undersampling does not provide any additional benefit for this setup either. Nonetheless, under-sampling did significantly improve overall recall, especially when utilized with the word2vec representations for both SVM classification from 0.0001 on the full data set to 0.59 on 50% as well as for LMT from 0.195 to 0.554. This might indicate that the quality of word2vec classifi-cation is more dependent on sampling methods such as random undersampling than other text representations for this particular problem.

(14)

findings that LMT is more capable of handling unbalanced data [1] as the F-measure is higher at more unbalanced proportions as 0.257 to 0.01 for LMT. On opposing sides of the spectrum, in terms of F-measure performance, are the SVM TF-IDF representations and SVM LDA representations, the former yielding the best results at an F-measure of 0.34 while the latter the worst by a significant margin at 0.03. These findings are in agreement with the perceived robustness that is provided by a bag-of-words representation even at the cost of losing information when simply utilizing word frequencies. Regarding the LDA representation’s bad performance can be remarked that the topic distribution, that was initially learned to create a data set, does not capture the distinctive-ness of the data enough for it to be effective for this classification task. Initially, 100 topics were used to create the LDA data set instead of 200, however, this yielded even worse results where no articles were classified as relevant.

Figure 1: TF-IDF bag of words Figure 2: Latent Dirichlet Allocation

Figure 3: word2vec rankings Figure 4: Representation comparison

(15)

that undersampling does not improve recall for any of the document represen-tations at all as can be seen in figures 1,2 and 3 which is not in accordance with the observations made in the classification approach. Each of the data sets are compared to the random baseline which was created by taking an average over thousand random rankings. At the highest level of undersampling, the recall drops below the random baseline in all of the cases except for LDA. In fig-ure 4 the representations are compared to each other on the rankings obtained utilizing the full data set. Here it is evident that word2vec(w2v) outperforms the other representations in particular at around the 40% top-ranked document mark and that in terms of recall bag of words model performs the worst as it is the only one that dips below the random baseline.

Besides the recall table 2 demonstrates that the word2vec representation also

Table 2: Metrics table

wss 100 wss 95 norm area ap bow 10 0.09 0.101 0.631 0.094 bow 50 0.092 0.087 0.587 0.099 bow full 0.085 0.098 0.626 0.088 lda 10 0.072 0.104 0.575 0.062 lda 50 0.064 0.097 0.551 0.063 lda full 0.11 0.156 0.617 0.088 w2v 10 0.077 0.109 0.667 0.113 w2v 50 0.042 0.033 0.455 0.057 w2v full 0.183 0.26 0.751 0.147

has the highest average precision at 0.147 as well as the highest values for Work Saved over Sampling. With wss 100 on 0.183 and wss 95 on 0.26 being nearly twice as high as the second-best results from the full LDA data set. Even though LDA performs better than the bag of words and terms of WSS it’s average pre-cision is approximately equal.

5.1 Conclusion

In conclusion seems like for the classification approach for identifying DTA studies where the thresholds are more clearly defined the TF-IDF bag-of-words model still has the best performance in terms of F-measure. Whereas in the case of ranking the less frequently used representations like word2vec outperform the robustness of this former model. In addition to this, random under-sampling is not effective for this particular data set at achieving improved performance. Finally, more experimentation is required to determine whether these results generalize the data sets with different topics.

5.2 Future work

Even though undersampling did not show any benefit in this paper for this par-ticular task, it does not mean that other sampling methods will not be effective. Other solutions for imbalanced data within the domain of machine learning in-clude techniques such as Synthetic Minority Over-sampling (SMOTE) [3], where a combination of under- and over-sampling is utilized, and Adaptive synthetic

(16)

sampling(ADASYN) [5] where instances to over-sample are selected based on the difficulty level of learning them. The effectiveness of SMOTE has been demon-strated training on imbalanced data, even in conjunction with cost-sensitive algorithms [4][14].

Due to the limitations of WEKA’s implementation of Logistic model trees, less can be concluded about the performance of this algorithm for the classification task. For this reason, a more extended evaluation is necessary, and this can be accomplished by using an implementation that can handle bigger data sets or-or utilizing a more advanced set of hardware.

Additionally, there are still possibilities for improving the performance of the methods used in this paper such as increasing the quality of the word2vec rep-resentations. Since the document representations were constructed from pre-trained word2vec representations this imposed a few limitations on further ex-perimentation. Firstly, the window that represents the context of each word was set to be 200, consequently, experiments with different values were not per-formed and should be in the future in order to optimize performance. Secondly, the words in the data set that were discarded due to not being available in the pre-trained set of word embeddings could have reduced the distinctiveness of the document representations this might for instance require a solution such as a form of smoothing. Finally, there are different possibilities regarding utilizing doc2vec instead of TF-IDF weighted aggregation of word2vec words to build document representations or to use an attention mechanism based approach such as in [4].

The feature sets created by the different techniques in this paper are also not limited to be utilized individually, as there have been cases where a combina-tion of representacombina-tions leads to increased performance such as the combinacombina-tion of word2vec and LDA [13]. Additionally, it is possible to add annotations to these documents representations as in the work of Almeida et all [1]. Besides adding to the classification or ranking methods a more thorough analysis can be conducted on the bias within the particular sets of topics contained in the current data sets and how well they generalize to a different test set topics.

References

[1] Hayda Almeida, Marie-Jean Meurs, Leila Kosseim, Greg Butler, and Adrian Tsang. Machine learning for biomedical literature triage. PLOS ONE, 9(12):1–21, 12 2015.

[2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allo-cation. Journal of machine Learning research, 3(Jan):993–1022, 2003.

[3] Kevin W. Bowyer, Nitesh V. Chawla, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. CoRR, abs/1106.1813, 2011.

[4] Peng Cao, Dazhe Zhao, and Osmar Zaiane. An optimized cost-sensitive svm for imbalanced data learning. pages 280–292, 2013.

[5] Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. pages 1322–1328, 2008.

(17)

[6] Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 59(1):161–205, May 2005.

[7] M. M. Leeflang, J. J. Deeks, C. Gatsonis, P. M. Bossuyt, B. Aertgeerts, D. Altman, G. Antes, L. Bachmann, P. Bossuyt, H. Buchner, P. Bunting, F. Buntinx, J. Craig, R. D’Amico, R. de Vet, J. Deeks, R. Doust, M. Egger, A. Eisinga, G. Fillipini, Y. Flack-Ytter, C. Gatsonis, A. Glas, P. Glasziou, F. Grossenbacher, R. Harbord, J. Hilden, L. Hooft, A. Horvath, C. Hyde, L. Irwig, M. Kjeldstr?m, P. Macaskill, S. Mallett, R. Mitchell, T. Moore, R. Moustgaard, W. Oosterhuis, M. Pai, P. Paliwal, D. Pewsner, H. Re-itsma, J. Riis, I. Riphagen, A. Rutjes, R. Scholten, N. Smidt, J. Sterne, Y. Takwoingi, D. van der Windt, V. Vlassov, J. Watine, and P. Whit-ing. Systematic reviews of diagnostic test accuracy. Ann. Intern. Med., 149(12):889–897, Dec 2008.

[8] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient es-timation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[9] Evangelos Kanoulas (University of Amsterdam), Dan Li (University of Amsterdam), and AMC Academic Medical Center) nad Leif Az-zopardi (University of Strathclyde) Rene Spijker (The Cochrane Collab-oration. ehealth competition task 2. https://sites.google.com/site/ clefehealth2017/task-2, 2017. [Online; accessed 1-July-2017].

[10] Alison O ´Mara-Eves, James Thomas, John McNaught, Makoto Miwa, and Sophia Ananiadou. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Systematic Reviews, 4(1):5, 2015.

[11] Alex J. Smola and Bernhard Schlkopf. A tutorial on support vector regres-sion, 2004.

[12] Ties van Rozendaal. Technologically assisted systematic reviews in empir-ical medicine. 2016.

[13] Zhibo Wang, Long Ma, and Yanqing Zhang. A hybrid document fea-ture extraction method using latent dirichlet allocation and word2vec. 2016 IEEE First International Conference on Data Science in Cyberspace (DSC), pages 98–103, 2016.

[14] Qiuyan Yan, Shixiong Xia, and Fan-Rong Meng. Optimizing cost-sensitive SVM for imbalanced data : Connecting cluster to classification. CoRR, abs/1702.01504, 2017.

[15] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. Hierarchical attention networks for document classi-fication. 2016.

(18)

6 Appendix

Table 3: Training Data Proportions Topic ID relevant nonrelevant total CD010438 28 2328 2356 CD007427 58 1381 1439 CD009593 55 12616 12671 CD011134 183 1608 1791 CD011975 552 6190 6742 CD011984 403 6334 6737 CD010409 41 29270 29311 CD010771 41 266 307 CD009591 141 6975 7116 CD008691 65 1183 1248 CD010632 27 1453 1480 CD009944 96 886 982 12 topics 1690 total 70490 total 72180 total

Table 4: Test Data Proportions Topic ID relevant nonrelevant total CD008803 99 4591 4690 CD008782 45 9300 9345 CD009647 52 2590 2642 CD009135 71 630 701 CD008760 12 44 56 CD010775 11 210 221 CD009519 104 4695 4799 CD009372 24 2029 2053 CD010276 51 4618 4669 CD009551 46 1832 1878 CD012019 3 8828 8831 CD008081 26 876 902 CD009185 90 1187 1277 CD010653 44 7655 7699 CD010542 19 318 337 CD010896 6 153 159 CD010023 44 815 859 CD010772 46 261 307 CD011145 202 10584 10786 CD010705 23 89 112 CD010633 4 1459 1463 CD010173 20 4649 4669 CD009786 9 1803 1812 CD010386 2 576 578 CD010783 30 10789 10819 CD010860 7 85 92 CD009579 128 4663 4791 CD009925 420 4855 5275 28 topics 1638 total 90184 total 91822 total

(19)

(20)

Table 5: TF-IDF ranking results

bow 10results bow 50results bow fullresults

num docs 92364.0 92364.0 92364.0 num rels 1716.0 1716.0 1716.0 num shown 82991.0 82991.0 82991.0 num feedback 0.0 0.0 0.0 rels found 1635.0 1635.0 1635.0 last rel 2995.889 2975.926 2910.407 wss 100 0.09 0.092 0.085 wss 95 0.10099999999999999 0.087 0.098 NCG@10 0.174 0.17 0.193 NCG@20 0.293 0.27699999999999997 0.308 NCG@30 0.39 0.35 0.402 NCG@40 0.46299999999999997 0.408 0.47600000000000003 NCG@50 0.523 0.45299999999999996 0.53 NCG@60 0.5670000000000001 0.491 0.5820000000000001 NCG@70 0.61 0.5589999999999999 0.628 NCG@80 0.851 0.778 0.847 NCG@90 0.8759999999999999 0.804 0.87 NCG@100 0.879 0.8079999999999999 0.871 total cost 3073.741 3073.741 3073.741

total cost uniform 3108.3590000000004 3108.3590000000004 3108.3590000000004 total cost weighted 3418.7859999999996 3418.7859999999996 3418.7859999999996

norm area 0.631 0.5870000000000001 0.626 ap 0.094 0.099 0.08800000000000001 r 0.9670000000000001 0.9670000000000001 0.9670000000000001 loss e 0.456 0.456 0.456 loss r 0.003 0.003 0.003 loss er 0.45899999999999996 0.45899999999999996 0.45899999999999996

(21)

Table 6: Latent Dirichlet Allocation ranking results

lda 10results lda 50results lda fullresults

(22)

Table 7: word2vec ranking results

w2v 10results w2v 50results w2v fullresults

Automatic machine-learning based identification of diagnostic test accuracy studies for systematic reviews