
Semi-supervised ensemble learning based on weak supervision for biomedical relation extraction

Submitted in partial fulfillment for the degree of Master of Science

Antonios-Minas Krasakis

11849584

Master Information Studies: Data Science

Faculty of Science, University of Amsterdam

2018-09-28

                Academic Supervisor          External Supervisor
Title, Name     Prof. Evangelos Kanoulas     Dr. George Tsatsaronis
Affiliation     UvA, FNWI, IvI               Elsevier


Semi-supervised ensemble learning based on weak supervision for biomedical relation extraction

Antonios-Minas Krasakis

University of Amsterdam
Amsterdam, The Netherlands

amkrasakis@gmail.com

ABSTRACT

During the last few years, a significant portion of research has shifted towards more complex Machine Learning (and Deep Learning) algorithms. Such models often significantly outperform their simpler counterparts. However, their superior performance relies on large amounts of labelled data, which are rarely available. To tackle this problem, we propose a methodology for extending training datasets to arbitrarily big sizes and training complex, data-hungry models using weak supervision.

We apply this methodology to biomedical relationship extraction, a task where training datasets are excessively time-consuming and expensive to create, yet has a major impact on subjects like drug discovery. We demonstrate in a small-scale controlled experiment that our method consistently enhances the performance of an LSTM network, with performance improvements comparable to those obtained with hand-labelled training data. Finally, we discuss the optimal settings for applying weak supervision using this methodology.

KEYWORDS

biomedical relationship extraction, weak supervision, semi-supervised learning, ensemble learning, meta-learning

1 INTRODUCTION

1.1 Task & Motivation

The number of scientific papers in the biomedical field is increasing day by day. Those papers contain important information, encoded in raw, unstructured text, making it difficult for researchers to find relevant information. Extracting this information in a structured format and storing it within a knowledge base can have a remarkable impact on a variety of important tasks. In the last decade, there have been efforts towards the automation of Information Extraction, because manual annotation of documents by domain experts is labor-intensive and impossible to perform on a large scale [10].

In the context of this thesis, we will be focusing on extracting semantic relationships of regulations between chemicals and proteins (or genes) from biomedical abstracts.

More specifically, a chemical regulates a protein when it increases (up-regulates) or decreases (down-regulates) the production of the specific protein. This relation is particularly important in drug discovery and design, as it enables researchers to filter out or select chemical substances with specific properties much faster.

As this is a complex and challenging task, the use of more complex learning algorithms is justified. However, such algorithms usually require very large training datasets, which are not available. To tackle this problem, we propose a new methodology based on weak supervision, which combines and incorporates ideas from semi-supervised and ensemble learning.

1.2 Problem formulation

Information extraction can be categorized into structured (or schema-full), semi-structured and unstructured, based on the type of output [6].

Extracting semantic triples of regulations is a structured Information Extraction problem (subject: CHEMICAL, predicate: REGULATES, object: PROTEIN/GENE) and can be divided into the following sub-tasks:

• Named Entity Recognition: Identifying spans of text containing named entities.

• Entity resolution & linking: Mapping the extracted entities to a unique identifier (optional step).

• Relationship extraction: Identifying semantic relationships between two entities in free text.

In the first step, we identify all entities of interest (subjects & objects) on a sentence level and ignore all cross-sentence relationships, as they account for less than 1% of the total relations [19]. Each entity pair (Chemical-Protein) found in the same sentence constitutes a relationship candidate.

We skip the entity resolution and linking step, as it is beyond the scope of this thesis. Lastly, with all candidates at our disposal, we need to build a classifier recognizing whether the author implies a relationship between each pair or not. In this way, we reduce relationship extraction to a binary classification problem.

Our main focus throughout this thesis will be the sub-task of Relationship Extraction (or classification). However, relation extraction is closely connected to the rest of the sub-tasks and can rarely be approached independently. This is also the case when the proposed methodology of Section 4 is used.

2 RELATED LITERATURE

2.1 Information extraction

Information extraction is typically modelled as a fully-, semi- or unsupervised learning problem.

Unsupervised methods, such as Open Information Extraction, do not make use of any training data, but they can only be used for unstructured Information Extraction and hence are unsuitable for our case [4].

Fully-supervised methods rely solely on labeled examples and require human annotation of documents. Semi-supervised or bootstrapping methods try to leverage both labeled and unlabeled data. One of the first bootstrapping algorithms was DIPRE [7], which starts from some seed (positive) examples, extracts patterns from them and uses the patterns to find new positive examples. Other semi-supervised algorithms include Snowball, TextRunner and more [1].

A similar, more recent approach, characterized as weakly supervised, is distant supervision [17]. It relies on large knowledge bases (KB) containing semantic triples for the relations of interest. It creates training examples by positively labelling each sentence that mentions a pair included in the KB. Although this method also creates many noisy examples, it has been found useful and efficient when working with large-scale (or web-scale) datasets, as no human annotation is required.

2.2 Relationship extraction from biomedical text

Most of the research on biomedical relation extraction has been motivated by the BioCreative competitions and their corresponding annotated documents (datasets).

BioCreative V (CDR task) focused on extracting chemically-induced diseases on a document level [26]. The best performing team implemented an ensemble of Support Vector Machines, an algorithm which is known to perform well with limited amounts of training data [27]. More recent research demonstrated that extending the training set using distant supervision improves performance [20].

A following competition (BioCreative VI - CPR task) focused on identifying relations between Chemicals and Proteins (or Genes) on a sentence level [9]. The best-performing team implemented an ensemble of LSTM, CNN and SVM models, combined using majority voting and stacking [19]. The second highest score was achieved using a Support Vector Machines algorithm with a rich set of features, while other approaches relying solely on Deep Neural Networks demonstrated overfitting problems [16].

2.3 Semi-supervised and ensemble learning methods

Both semi-supervised and ensemble learning techniques aim to improve the performance of Machine Learning algorithms. Ensemble learning typically reduces high variance by combining multiple learners, while semi-supervised learning tries to take advantage of unlabeled data to improve generalization. Although ensembles have been studied and used thoroughly, their combination with semi-supervised learning has been studied only a few times [28]. Zhou et al. argue for the helpfulness of this combination and demonstrate how those methods are beneficial to each other [28]. Specifically, ensembles can enhance semi-supervised learning by providing multiple views while allowing the system to perform better earlier (using less data). On the other hand, adding unlabelled data is likely to increase diversity compared to the original, small labelled set and can be even more beneficial when the available labelled data are limited.

The first system of this kind was co-training [5], a paradigm in which two distinct (independent) learning algorithms take advantage of unlabeled data. To satisfy the assumption of independence, Nigam et al. proposed splitting the features across different classifiers to create different views [18]. Later research indicated that the independence assumption was too strong for real-world settings and relaxed the initial constraints of co-training [3]. Specifically, the authors showed that iterative co-training can succeed as long as there is an expansion on the underlying data distribution.

3 BACKGROUND ON WEAK SUPERVISION & DATA PROGRAMMING

The core idea of weak supervision revolves around creating training examples whose labels might be of lower quality (weak) compared to hand-labeled (gold) ones.

The most obvious case where this could be beneficial is when no ground-truth labels are available for training purposes. In such cases, we can implement heuristics or use lower-quality (and possibly contradicting) labels. For instance, crowd-sourcing produces many labels of questionable quality, in a setting where crowd-workers often disagree with each other. More importantly, those crowd-workers are likely to have substantially different labeling accuracies.

This is the main area of focus of Data Programming [22, 23], a newly-introduced paradigm for utilizing weak supervision sources and programmatically creating large training datasets. It only requires an unlabelled dataset (D_U) relevant to the task T we are trying to solve, while additional labeled datasets might be required only for validation and evaluation purposes.

The core of this thesis is based on this paradigm, which mainly comprises the following steps:

3.1 Providing weak supervision sources

We define a set of K functions, each optionally providing a vote on the label of each example. Those functions are called Labelling Functions (LF) and each of them represents a weak supervision source. This setting is highly flexible, as it allows us to model any type of source. Typical examples of LF include patterns and regular expressions for textual data, distant supervision (for relationship extraction problems) and crowd-sourced labels.
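To make this concrete, the snippet below sketches what such Labelling Functions could look like for chemical-protein regulation. It is a minimal, hypothetical illustration in plain Python; the candidate object, its fields and the knowledge-base entries are assumptions of ours, not the actual Snorkel LF API.

import re

# Hypothetical labelling functions for CHEMICAL-regulates-PROTEIN candidates.
# Vote convention: 1 = positive, -1 = negative, 0 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

REGULATION_PATTERN = re.compile(r"\b(up-?regulat|down-?regulat|inhibit|activat)\w*", re.IGNORECASE)

def lf_regulation_keyword(candidate):
    # Positive vote if a regulation keyword appears between the two entities.
    return POSITIVE if REGULATION_PATTERN.search(candidate.text_between) else ABSTAIN

KNOWN_PAIRS = {("aspirin", "COX-1")}  # hypothetical knowledge-base entries (distant supervision)

def lf_distant_supervision(candidate):
    # Positive vote if the (chemical, protein) pair is present in the knowledge base.
    return POSITIVE if (candidate.chemical.lower(), candidate.protein) in KNOWN_PAIRS else ABSTAIN

def lf_adjacent_entities(candidate):
    # Negative vote if almost no text separates the entities (unlikely to express a regulation).
    return NEGATIVE if len(candidate.text_between.split()) < 2 else ABSTAIN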

3.2 Unifying the weak supervision sources

After applying the Labelling Functions on D_U, we construct a K×M (possibly incomplete) vote matrix. Our intent now is to combine those votes in an optimal way and derive M weak labels, which are as close as possible to the (unknown) true labels.

To produce the weak labels, Data Programming uses a probabilistic Generative Model (GM), based on the disagreements between the LF. The GM models each weak source based on the observed vote matrix. It learns parameters (weights) corresponding to the probabilities of each LF labelling an example or not (coverage) and labelling it correctly (accuracy). To estimate them without gold labels, we maximize the likelihood that the observed votes occur under the Generative Model, summed over all possible ground-truth labels. Additionally, it is possible to model similarity dependencies between weak sources, by introducing a prior belief that the votes of two LF will be the same [21, 23]. The dependencies between the labelling functions can be automatically detected using the technique described in [2]. We control the number of dependencies using a hyperparameter. Additional hyperparameters refer to regularization and the learning step for training the GM. We select all of them using the F1 score of the weak labels on the validation set D_V.

Once we have trained the GM and learned the appropriate weights, we can proceed with producing the weak labels. Specifically, each weak label corresponds to the posterior probability of the true label being positive, given the observed votes L_i: P(y_i = 1 | L_i).
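In compact notation (a sketch following the data programming formulation of [21, 23], where w collects the coverage and accuracy parameters of the K Labelling Functions and L is the observed K×M vote matrix):

\hat{w} = \arg\max_{w} \; \log \sum_{Y \in \{-1,1\}^{M}} P_{w}(L, Y),
\qquad
\tilde{y}_i = P_{\hat{w}}\left(y_i = 1 \mid L_i\right)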

3.3 Training a discriminative model with weak supervision

We now use the weak labels to train a noise-aware discriminative model, which will be our final predictor. This model is noise-aware in the sense that it minimizes the empirical loss with respect to the (marginal) probabilistic labels, rather than the predicted (binary) label. For instance, in the binary classification setting, if the denoiser estimates that the label of an example is True with 80% confidence, we minimize the loss with respect to 0.8 instead of 1.

This technique allows us to cheaply generate weak labels and use them to train complex discriminative models. Typically deep neural networks are used, as they can learn their own features and take advantage of big training sets.
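As a minimal sketch of what noise-aware training looks like in practice (assuming a PyTorch-style model that outputs one logit per candidate; the thesis does not prescribe a specific framework):

import torch
import torch.nn.functional as F

def noise_aware_loss(logits, weak_marginals):
    # Cross-entropy against the probabilistic weak labels, e.g. a target of 0.8 instead of a hard 1.
    return F.binary_cross_entropy_with_logits(logits, weak_marginals)

logits = torch.tensor([1.2, -0.3], requires_grad=True)   # model outputs for two candidates
weak_marginals = torch.tensor([0.80, 0.35])               # denoiser posteriors P(y=1 | votes)
loss = noise_aware_loss(logits, weak_marginals)
loss.backward()                                            # gradients flow as in ordinary supervised training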

4 METHODOLOGY FOR SEMI-SUPERVISED ENSEMBLE LEARNING, BASED ON WEAK SUPERVISION

Based on the core ideas of weak supervision and data programming, we propose a methodology for semi-supervised learning, with the intent to capitalize on the advantages of multiple learners.

In contrast to the typical Data Programming paradigm where no ground-truth labels are available, we focus on the scenario where a gold-labeled training set (D_G) is available, but we consider its size insufficient for training a complex (and therefore more data-hungry) model. We argue that it is beneficial to augment it with additional, lower-quality training data and effortlessly scale the dataset size. Additionally, instead of relying on heuristics, distant supervision or crowd-sourced labels, we use Machine Learning models as weak supervision sources (Labelling Functions).

4.1 Data collection

In terms of data, we need a labeled training set D_G of size m, relevant to the task T we are trying to solve. In addition, we obtain an unlabeled, arbitrarily big dataset D_U of size M >> m, with the requirement that it is drawn from the same distribution as D_G. Lastly, we need a validation set D_V and a held-out test set D_T for evaluation purposes.

It is worth highlighting the fact that collecting unlabeled data for a problem is usually a much easier task compared to obtaining the ground-truth labels for this dataset.

4.2 Constructing diverse Base learners

We use D_G to train K base learners on T. As in a typical ensemble learning scenario, we try to maximize their individual performance while making them capture different 'views' of the data.

To produce multiple learners we rely on changes throughout the relation extraction pipeline and create one base learner for each possible combination.

4.2.1 Text trimming.

Biomedical literature often contains lengthy and complex sentences. Therefore, words appearing within a sentence might be irrelevant to the entities of interest.

For this reason, we can keep only the words between the two entities of interest. Alternatively, we can additionally include words appearing within a window before/after the entities.

We can also resort to a more complex approach, incorporating syntactic information. One way to do so is to construct the dependency-based parse tree and include only the words contained within the shortest path connecting the entities of interest. In summary, we consider the following trimming options:

• No trimming

• Trimming window of 0

• Trimming window of 5

• Shortest dependency path

4.2.2 Sequential features.

In addition to the simple bag-of-words approach, it is also possible to include contiguous sets of tokens (n-grams) as features. In our approach, we use up to tri-grams.

4.2.3 Text representation.

To convert our corpus to a numerical representation, we use the following two methods:

• Token occurrences (binary counts)

• tf-idf

4.2.4 Machine Learning algorithms.

Lastly, when the feature matrix is ready, we employ the following types of Machine Learning algorithms:

• Logistic Regression

• Support Vector Machines (using Gaussian and linear kernels)

• Random Forest classifiers

• Long Short-Term Memory networks

• Convolutional Neural Networks

It is important to note that when the last two models (LSTMs & CNNs) were used, none of the aforementioned feature engineering steps were applied.
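For illustration, the grid of feature-based base learners could be generated along these lines (a hypothetical scikit-learn sketch; the names and parameter values are ours, and trimming is assumed to be applied to the sentences beforehand):

from itertools import product
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

trimmings = ["none", "window_0", "window_5", "shortest_dep_path"]
ngram_ranges = [(1, 1), (1, 3)]                                    # bag-of-words vs. up to tri-grams
vectorizers = [
    lambda ngr: CountVectorizer(ngram_range=ngr, binary=True),     # token occurrences
    lambda ngr: TfidfVectorizer(ngram_range=ngr),                  # tf-idf
]
classifiers = [
    lambda: LogisticRegression(max_iter=1000),
    lambda: SVC(kernel="linear", probability=True),
    lambda: SVC(kernel="rbf", probability=True),
    lambda: RandomForestClassifier(n_estimators=200),
]

base_learners = {}
for trim, ngr, make_vec, make_clf in product(trimmings, ngram_ranges, vectorizers, classifiers):
    vec, clf = make_vec(ngr), make_clf()
    name = f"{trim}|ngrams{ngr}|{type(vec).__name__}|{type(clf).__name__}"
    base_learners[name] = (trim, make_pipeline(vec, clf))
# each pipeline is then fitted on the (trimmed) sentences of D_G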

4.3 Base learner selection

After producing all possible base learners, we select only a subset of them. This is necessary due to computational costs and the fact that we should avoid including many similar classifiers in a disagreement-based method. To illustrate this, consider a case where 10 simple bag-of-words models are used along with a more complex classifier, which also includes sequential features. The complex classifier is likely to disagree more with the rest, without necessarily having an inferior performance.

Hence, our objective is to maximize both the individual performance of the base learners and their diversity. Since this is complex and still an open issue, we resort to a simple method: we discard all classifiers with performance below a certain threshold (evaluated on D_V), while maximizing diversity. Setting a performance threshold is also desirable because the base learners were automatically created with limited hyperparameter tuning. We set this threshold above the random-guess baseline, but low enough to allow less accurate and more diverse classifiers to be part of the ensemble.

To select the most diverse classifiers, we employ a similarity-based clustering method. Using the predictions of the K base learners on D_V, we construct a K×K similarity matrix, in which each row and column refers to a base learner and each cell contains the corresponding pairwise inter-annotator agreement (Cohen's Kappa coefficient). We perform K-means clustering [15] on this matrix and pick the base learners closest to the cluster centroids as most representative of their cluster.

To pick an appropriate number of clusters (and therefore base learners), we refer to the silhouette coefficient [24].
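A sketch of this selection step (an assumed implementation, where preds maps each base learner to its binary predictions on D_V):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import cohen_kappa_score, silhouette_score

def select_diverse_learners(preds, n_clusters):
    names = list(preds)
    K = len(names)
    sim = np.ones((K, K))                      # pairwise Cohen's Kappa agreement matrix
    for i in range(K):
        for j in range(i + 1, K):
            sim[i, j] = sim[j, i] = cohen_kappa_score(preds[names[i]], preds[names[j]])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(sim)
    print("silhouette:", silhouette_score(sim, km.labels_))
    selected = []
    for c in range(n_clusters):                # representative = learner closest to its cluster centroid
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(sim[members] - km.cluster_centers_[c], axis=1)
        selected.append(names[members[np.argmin(dists)]])
    return selected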

4.4 Producing the weak labels

We predict the (unknown) labels of D_U using the selected base learners and obtain a K×M (binary) prediction matrix, containing the "knowledge" our base learners have distilled from D_G. Consequently, we have to use a denoiser to produce the M weak labels based on this matrix.

We use the probabilistic Generative Model and select its hyperparameters as described in Subsection 3.2.

Additionally, we use two simpler denoisers to unify the label matrix. We use Majority Voting to produce binary Majority Vote weak labels. Further, we calculate the unweighted average of votes on each unlabelled example (Average Vote weak labels).
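The two unweighted denoisers are straightforward to express over a K×M vote matrix with entries in {1, -1, 0 (abstain)}; the sketch below is our own illustration, and the Generative Model denoiser itself comes from Snorkel and is not reproduced here.

import numpy as np

def average_vote(votes):
    # Marginal weak labels: fraction of positive votes among the non-abstaining learners.
    cast = (votes != 0).sum(axis=0)
    positive = (votes == 1).sum(axis=0)
    return np.divide(positive, cast, out=np.full(votes.shape[1], 0.5), where=cast > 0)

def majority_vote(votes):
    # Binary weak labels: 1 if strictly more positive than negative votes, else 0.
    return (votes.sum(axis=0) > 0).astype(int)

votes = np.array([[1, -1,  0],
                  [1,  1, -1],
                  [0,  1, -1]])     # 3 base learners (rows) x 3 unlabelled candidates (columns)
print(average_vote(votes))          # [1.0, 0.667, 0.0]
print(majority_vote(votes))         # [1, 1, 0]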

4.5 Training a meta-learner

In the last step, we use a discriminative model (as described in Subsection 3.3) as a meta-learner. We can train the model using only D_U (weak supervision) or D_U + D_G (weak supervision + full supervision).

In practice, when training the meta-learner with weak supervision, we trade off label quality for quantity. This can prove beneficial in cases where the performance of the meta-learner is upper-bounded by the training set size. By using Deep Neural Networks as meta-learners, we allow them to learn their own features and hopefully build a more accurate representation by relying on a much larger (although noisy) training dataset.

In our experiments, we use a simple bidirectional Long Short-Term Memory network, which is one of the most commonly used and best-performing DNN architectures on tasks related to Natural Language. Before training, we perform random under-sampling (keeping an equal class balance). We use different dropout values (0, 0.25 and 0.5) and different numbers of training epochs (up to 30), and select the best hyperparameters based on D_V.
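A minimal sketch of such a bidirectional LSTM meta-learner trained on probabilistic weak labels (PyTorch is an assumption on our part, as are the dimensions and the toy batch):

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=100, hidden=64, dropout=0.25):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        _, (h, _) = self.lstm(x)
        h = torch.cat([h[-2], h[-1]], dim=1)        # final states of both directions
        return self.out(self.drop(h)).squeeze(1)    # one logit per candidate

model = BiLSTMClassifier()
optim = torch.optim.Adam(model.parameters())
token_ids = torch.randint(1, 20000, (8, 40))        # a toy batch of 8 padded sentences
weak_marginals = torch.rand(8)                       # denoiser outputs in [0, 1]
loss = nn.functional.binary_cross_entropy_with_logits(model(token_ids), weak_marginals)
loss.backward(); optim.step()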

5 EXPERIMENTAL SETUP

To perform our experiments we use part of the functionality of Snorkel [21, 22], a framework built for relation extraction with Data Programming and weak supervision.

5.1 Datasets used

For our labelled data, we use the official BioCreative VI CHEMPROT (Chemical-Protein interaction) dataset [9], consisting of 2432 annotated PubMed¹ abstracts (split into a training, development and test set – Table 1).

Our methodology requires three gold-labeled datasets: the base learner training set (D_B), a validation set for selecting the various hyperparameters (D_V) and a held-out test set (D_T) for the final evaluation. Additionally, a relevant dataset D_U is required, which does not need to be labeled.

In all setups, we use half of the original CHEMPROT development set as validation dataset (D_V) and the original CHEMPROT test set as our held-out test set (D_T).

Setup 1 (outgoing citations).

We merge the remaining half of the CHEMPROT development set with the CHEMPROT training set and use it to train our base learners (D_B). We use the PubMed API² to retrieve the list of outgoing citations of the papers contained in the official training and development set. We exclude the abstracts contained in D_T and download the rest to be used as D_U.

Setup 2 (similar documents).

In this setup, we change only the collection method of D_U. Instead of using the outgoing citations, we retrieve the 25 most related documents (as defined by PubMed). PubMed uses PMRA as a similarity metric, which is known to outperform the popular BM25 metric on biomedical content [12].

Setup 3 (CHEMPROT dataset).

In the last setup, we use only documents contained in the original gold set (CHEMPROT). Out of the 1326 documents left for training purposes (excluding D_V), we use 400 (about 1/3) as D_B. We treat the remaining 926 abstracts as if they were unlabeled (D_U).

This setup ensures that all datasets are drawn from the same distribution and also allows us to examine how weak supervision compares to full supervision (when applied to the same dataset).

¹ https://www.ncbi.nlm.nih.gov/pubmed/

² https://www.ncbi.nlm.nih.gov/pmc/tools/developers/

Dataset                        # docs   # entities   # candidates
CHEMPROT (training)              1020        25700           9917
CHEMPROT (development)            612        15500           6227
CHEMPROT (test)                   800        19800           8285
CHEMPROT (all sets)              2432        61000          24429
Outgoing citations              20261       290119          76867
Similar documents (top 25)      38311       714115         130424

Table 1: Basic statistics of the datasets

Model           Precision   Recall     F1
tmChem [11]           78%      69%    73%
LeadMine [13]         78%      72%    75%

Table 2: Chemical NER evaluation

Model            Precision   Recall     F1
GNormPlus [25]         72%      42%    53%
Neji [8]               72%    47.5%    57%

Table 3: Protein/Gene NER evaluation

5.2 Text pre-processing

Most of the steps in our text pre-processing pipeline are performed by SpaCy (v1.0), an open-source library for Natural Language Processing. More specifically, SpaCy performs the following tasks in our pipeline:

• Sentence splitting

• Tokenization

• Dependency parsing

Sentence splitting is required to extract same-sentence relation candidates, while the rest of the tasks are required for training our Machine Learning models (Base Learners & Meta Learner).

5.3 Named Entity recognition

For all documents contained in the CHEMPROT dataset, gold-quality named entity tags were used, as provided by the organizers. However, the proposed methodology requires additional training data. In Setups 1 & 2, where the named entity tags were unavailable, we resort to automatically recognizing them using Named Entity Recognition (NER) models.

We evaluate the performance of four state-of-the-art NER algorithms against the named entity tags in our gold set (Tables 2 & 3). Although LeadMine and Neji demonstrate slightly better performance on our gold set, we proceed with tmChem and GNormPlus due to their usability and less restrictive usage license agreements. We download the named entity tags of each abstract directly using the PubMed API.

It is noteworthy that recall is much lower than precision, especially for Proteins. In an end-to-end Information Extraction system this would greatly affect the performance, as many candidate pairs would not be considered. However, when we evaluate solely the task of relationship extraction (classification), we can easily compensate for lost candidates by adding more unlabelled documents. On this sub-task, precision can be considered more important, because including non-relevant entities is likely to affect the classification results.

5.4 Candidate extraction

After recognizing all entities within the text, we search for relationship candidates. We consider any pair of a Chemical and a Protein contained in the same sentence as a candidate. We use the core functionality of Snorkel [22] both for candidate extraction and for mapping the candidates to their ground-truth labels (when available).
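Conceptually, this step amounts to pairing every chemical mention with every protein mention in the same sentence. The sketch below is a simplified illustration (not Snorkel's own extractor, and the input format is hypothetical):

from itertools import product

def extract_candidates(sentences):
    # Yield one (sentence, chemical_span, protein_span) candidate per co-occurring pair.
    for sent in sentences:
        for chem, prot in product(sent["chemicals"], sent["proteins"]):
            yield {"sentence": sent["text"], "chemical": chem, "protein": prot}

sentences = [{
    "text": "Aspirin inhibits COX-1 and COX-2 activity.",
    "chemicals": [(0, 7)],                  # character spans of chemical mentions
    "proteins": [(17, 22), (27, 32)],       # character spans of protein/gene mentions
}]
print(list(extract_candidates(sentences)))  # two candidates: (Aspirin, COX-1) and (Aspirin, COX-2)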

5.5 Entity replacement

While building a relationship classifier, it is important to train an algorithm which understands the natural language, rather than memorizes which pairs interact with each other. To achieve this, we replace all names of the entities of interest (chemicals & proteins).

We replace the chemical entity with the token 'ENTITY1' and the protein entity with the token 'ENTITY2'. If the sentence contains more entities of either type (other than the ones we try to predict at the given time), we also replace them with the tokens 'CHEMICAL' and 'GENE' accordingly. In this way, our algorithms can also distinguish which is the underlying pair we want to predict.

Additionally, we merge all consecutive identical entity tags into one, to avoid false tokenization within biomedical concepts. This is important, as chemical compounds often include special characters which could be mistakenly considered as the end of a token by general-purpose NLP tools such as SpaCy.
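A simplified sketch of the replacement step for a single candidate, assuming character spans for the candidate pair and for the remaining entity mentions in the sentence:

def replace_entities(text, target_chem, target_prot, other_chems=(), other_prots=()):
    replacements = (
        [(target_chem, "ENTITY1"), (target_prot, "ENTITY2")]
        + [(span, "CHEMICAL") for span in other_chems]
        + [(span, "GENE") for span in other_prots]
    )
    # apply right-to-left so earlier character offsets stay valid
    for (start, end), token in sorted(replacements, reverse=True):
        text = text[:start] + token + text[end:]
    return text

sent = "Aspirin inhibits COX-1 and COX-2 activity."
print(replace_entities(sent, (0, 7), (17, 22), other_prots=[(27, 32)]))
# -> "ENTITY1 inhibits ENTITY2 and GENE activity."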

6 RESEARCH QUESTIONS AND EXPERIMENTAL DESIGN

6.1 Can we enhance biomedical relationship extraction when using Machine Learning classifiers as sources of weak supervision?

Related literature provides theoretical guarantees that, under specific conditions, adding weakly labeled data will improve the performance of the meta-learner. Additionally, as the amount of weakly labeled data increases, the performance of the meta-learner is expected to improve quasi-linearly, almost as fast as in the scenario where the ground-truth labels are provided.

Those conditions include that the weak supervision sources should have an accuracy better than random guessing, and should overlap and disagree with each other enough (so that their accuracies can be estimated) while capturing different 'views' of the problem (diversity). In other words, they should model the problem space sufficiently, so that meaningful weak labels can be produced.

However, to the best of our knowledge, Machine Learning classifiers have not been used so far as weak supervision sources in this setting. Therefore, it is unclear whether a diverse and sufficiently big set of base learners exists and satisfies those conditions. This is a critical question, which directly affects the usability of the described methodology.

We conduct a number of experiments on the different setups described in Section 5.1. To evaluate whether weak supervision helps, we compare the performance of the meta-learner when trained on D_B (full supervision on the base learner dataset) to the performance achieved when trained on D_U (weak supervision) and on D_B + D_U (weak supervision + full supervision).

Additionally, on Setup 3, where all ground-truth labels are known, we evaluate whether weak supervision can achieve results comparable to full supervision (on same-sized datasets).

6.2 Which are the optimal settings for using weak supervision on this task?

Number of base learners.

Selecting the optimal number of classifiers to be used as base learners is not a straightforward task. Naturally, we can construct only a few top-performing learners, and as we add more of them we start sacrificing performance (assuming we maintain diversity at high levels) [28]. To examine this, we gradually increase the number of base learners and benchmark the performance of the weak labels and the meta-learner.

Comparison of various denoising methods.

The denoising component is fundamental to this method, as it dictates the quality of the weak labels the final classifier will be trained on. We use the three denoising methods described in Subsection 4.4 and assess the quality of the produced weak labels.

Meta-learner performance under different weak label distributions.

The denoiser can produce either binary or marginal (non-binary) weak labels. Additionally, marginal weak labels might follow different distributions. We investigate their effect on the training and the final performance of the meta-learner.

7 RESULTS AND ANALYSIS

7.1 Can we enhance biomedical relationship extraction when using Machine Learning classifiers as sources of weak supervision?

7.1.1 Experiments on Setups 1 and 2.

Using dataset Setups 1 and 2, we have been unable to improve the performance of the meta-learner with weak supervision, although the datasets gathered (D_U) were substantially bigger (5x-8x compared to D_B).

Performance decreases both when weak supervision is used alone and when it is combined with full supervision. It is also visible from the learning curve of the meta-learner (Figure 1) that performance constantly decreases as we add weakly labeled data. This indicates that the problem is caused by the quality of the weak labels and not by the training set size (quality-quantity trade-off).

Training dataset   # candidates   # candidates (undersampled)   F1 score of meta-learner
D_B                      12,987                         6,576                     55.28%
D_U                      76,867                        10,700                     39.77%
D_B + D_U                89,854                        17,276                     43.22%

Table 4: Meta-learner performance on dataset Setup 1 (outgoing citations)

Figure 1: LSTM learning curve (Setup 1)

Importantly, we observe that the initial size of D_U is dramatically reduced after undersampling (Table 4). This indicates a much higher (predicted) class imbalance (1:4) for D_U, compared to the original CHEMPROT dataset (1:13). This statistic suggests that the two datasets are not drawn from the same distribution and therefore D_U is not appropriate.

In case this predicted class imbalance is roughly correct, most of the new abstracts did not contain regulation relationships and hence were inappropriate for our task. However, it could also be that our base learners do not generalize well on D_U and cannot be used to predict and denoise it. This would also indicate a difference between D_B and D_U, as the performance of our base learners on the validation set is satisfactory. This difference might be caused by the selection process of the documents (documents from different domains) or by noise introduced during the Named Entity Recognition step. As mentioned in Subsection 5.3, the gold named entity tags were used when available (D_B), while for the unlabeled dataset (D_U) we had to resort to the use of NER algorithms.

To verify that our two datasets are drawn from different distributions, we visualize candidates coming from the different datasets. To do so, we use all of the features of our best base learner and use the t-SNE algorithm [14] to project them into 2D space. We visualize samples from the CHEMPROT training set against samples of the CHEMPROT test set (Figure 2) and against the Outgoing citations dataset (Figure 3). Comparing the two figures, it is evident that the candidates of the acquired dataset lie in specific regions of the 2-dimensional space, while the CHEMPROT candidates are spread much more uniformly. This confirms that the selected datasets are inappropriate for our use case; we therefore continue our experiments using only Setup 3.

Figure 2: t-SNE between training and test set (CHEMPROT)

Figure 3: t-SNE between training and unlabeled set
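For reference, such a check can be reproduced roughly as follows (a sketch assuming X_gold and X_unlabeled are the sparse feature matrices produced by the best base learner's vectorizer; sample sizes and plotting details are our own choices):

import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import vstack
from sklearn.manifold import TSNE

def tsne_compare(X_gold, X_unlabeled, sample=1000, seed=0):
    rng = np.random.default_rng(seed)
    Xg = X_gold[rng.choice(X_gold.shape[0], min(sample, X_gold.shape[0]), replace=False)]
    Xu = X_unlabeled[rng.choice(X_unlabeled.shape[0], min(sample, X_unlabeled.shape[0]), replace=False)]
    coords = TSNE(n_components=2, random_state=seed).fit_transform(vstack([Xg, Xu]).toarray())
    plt.scatter(*coords[: Xg.shape[0]].T, s=5, label="CHEMPROT (gold)")
    plt.scatter(*coords[Xg.shape[0]:].T, s=5, label="acquired D_U")
    plt.legend(); plt.show()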

7.1.2 Experiments on Setup 3.

To select the base learners, we use the strategy described in Subsection 4.3. As this is not a straightforward task, we experiment with different numbers of base learners and benchmark results in intervals of 5 (5, 10, 15 and 20) and at the points where the silhouette scores (Figure 4) are maximized (10, 13, 20).

Figure 4: Silhouette coefficient on similarity-based clustering of base learners

Training dataset   Training set size   Training set size (undersampled)   F1 score
D_B                             3840                                2060      44.7%
D_U                             8221                                4516          —
D_B + D_U                      12061                                6576      55.3%

Table 5: LSTM performance with gold labels (Setup 3)

We compare the results achieved with weak supervision on D_U (Table 6) versus full supervision (Table 5). It is evident that weak supervision with a 3.5x increase in the training dataset size always performs better compared to training on the small gold-labeled dataset (44.7%). This shows that we can successfully augment additional training data using weak supervision, as long as they are drawn from the same distribution.

We can also observe that weak supervision can achieve a performance comparable to full supervision (55.3%) using the same training set. In other words, the meta-learner performs almost as well as if the labels were gold. Interestingly enough, there are even cases where weak supervision achieves slightly better results. However, those improvements are minor and not statistically significant, as there is a high variance in the F1 score of the LSTM. We also note that whenever this happened, the training set size was bigger (due to under-sampling) and the Average Vote marginals were used. We further discuss how Average Vote marginals influence the training of the meta-learner in Subsection 7.2.

It is also noteworthy that a simple Majority Vote would outperform our meta-learner. However, this is an expected result and does not undermine the importance of our results, as we can verify from Table 5 that the LSTM can only achieve an F1 score of around 55% on such a small training dataset.

Lastly, we visualize the learning curves of the meta-learner (starting from the ground-truth labels) to ensure that the weak labels are meaningful and do actually improve performance. Figures 6, 7 and 8 indicate an upward trend, while the outlined confidence intervals indicate that the results are statistically significant. Moreover, we observe that the F1 score on the training set is always considerably higher than the test score. This indicates that our deep neural network would benefit from additional training data.


          F1 score of weak labels                  F1 score of meta-learner (LSTM)
# B.L.    mean B.L.     Majority    Generative     using Maj.     using Average    using Gen. Model
          performance   Vote        Model          Vote labels    marginals        marginals
 5        53.20         58.91       61.07          50.11          53.56            52.40
10        54.30         63.04       62.22          52.66          56.45            52.93
13        54.00         63.09       61.28          54.16          58.03            52.65
15        53.67         63.11       61.48          51.01          56.32            53.40
20        53.95         63.57       61.91          50.05          55.60            54.73

Table 6: Performance achieved with weak supervision

Figure 5: LSTM learning curve with gold training labels

Figure 6: LSTM learning curve with Majority Vote weak labels


7.2 Which are the optimal settings for using weak supervision on this task?

The complexity of the problem, along with the lack of an appropriate, substantially big unlabeled dataset, does not allow us to draw definite answers to some of the following questions. However, we perform an analysis based on our experimental results and discuss our findings.

Figure 7: LSTM learning curve with averaging weak marginals

Figure 8: LSTM learning curve with Generative Model weak marginals



Number of base learners.

We can see from Table 6 that the F1 score of the weak Majority Vote labels is equally good when more than 10 learners are used. When it comes to the Generative Model weak marginals, we cannot observe any significant pattern, as the F1 score always deviates within 1.5 points.

The performance of the meta-learner when trained with Average Vote marginals deviates to a certain extent when more than 10 base learners are used, but is always better than when only 5 base learners are used. Using Generative Model marginals, performance seems to improve slightly as the number of base learners increases.

It is also noteworthy that when unweighted voting labels (MV & AV weak labels) were used, the meta-learner performed best with 13 base learners, which is roughly consistent with the silhouette scores in Figure 4.

Comparison of various denoising methods.

In all cases, the meta-learner achieves the best performance when trained with Average Vote marginals. Generative Model marginals also seem to improve its performance compared to Majority Vote weak labels, with one exception.

However, it is worth highlighting that the GM marginals depend on hyperparameters, which are chosen based on the F1 score on a validation dataset. Later in this section, we argue and demonstrate why this particular measure cannot fully reflect the quality of marginal weak labels. Therefore, it is not certain whether we have achieved the optimal performance in the cases where the Generative Model was used as the denoiser.

Performance under different weak label distributions.

The denoisers can produce weak labels which are either binary or marginal (non-binary). We can conclude that marginal weak labels improve the performance of the meta-learner compared to binary labels. This is a straightforward comparison, as Majority Vote weak labels always perform worse than Average Vote marginals, while sharing the same F1 score and not depending on additional hyperparameters.

Moreover, we observe that the Generative Model tends to create marginals following a U-shaped distribution (close to 0 or 1), in contrast to the Average Vote marginals, which are spread more uniformly. This is evident from the error analysis we perform using the ground-truth labels on the validation set (Figures 9 and 10). In both cases, the amount of misclassified weak labels is the same (and therefore so is their F1 score using a classification boundary of 0.5). However, it is evident that the Average Vote labels are of higher quality, as most of their misclassified labels are relatively closer to 0.5. This is inevitable, as the vast majority of the GM marginals are very close to 0 and 1. Furthermore, these figures demonstrate the unsuitability of the F1 score for the evaluation of marginal weak labels.

We can also see from Figure 11 that when marginal labels are used, the training error remains relatively high. This is especially true with the Average Vote marginals, which are spread more uniformly. On the contrary, it only takes a few epochs for the LSTM to start accurately predicting the binary training labels, despite a small delay on the noisy MV weak labels.

Ultimately, training a classifier using marginal labels can be thought of as a regression problem. In practice, we ask the classifier to predict an exact number (the output of the denoiser) and penalize it every time it fails to do so.

Figure 9: Error analysis on Generative Model weak labels

Figure 10: Error analysis on Average Vote weak labels

Figure 11: LSTM training loss and validation score per training epoch



Figure 12: Histogram of predicted logits of LSTM with Generative Model training marginals

Figure 13: Histogram of predicted logits of LSTM with Average Vote training marginals

Lastly, we can see that the distributions of predicted logits (on unseen examples) become more spread out as the training marginal distributions become more uniform (Figures 12 and 13). This is something we would also expect to see when doing regression instead of classification.

8 CONCLUSIONS & FUTURE WORK

We have shown that weak supervision is a tool which can be used to enhance the performance of complex models, such as deep neural networks, while utilizing both unlabeled data and multiple base learners. Additionally, we have shown that the proposed methodology is practically feasible for the task at hand, as we have succeeded in defining a combination of base learners which model the problem space sufficiently and allow us to take advantage of additional, unlabeled data. This comes under the requirement that the unlabeled data are drawn from the same domain/distribution as our labeled data, so that our base learners can generalize and perform adequately on D_U.

In practice, our methodology shifts the human effort from hand-labelling examples to feature engineering and the construction of diverse learners. More importantly, once a satisfactory set of diverse learners is in place, we can use this method to scale the training datasets to arbitrarily large sizes while consistently improving the performance over the supervised learning paradigm.

Moreover, the same pipeline can be re-used on similar tasks, with the only requirement of providing the appropriate datasets. In our use case, for instance, adding more relationship types would be possible with no major changes to the pipeline. On the contrary, if we only used supervised learning, we would have to hand-label large datasets repeatedly.

Despite demonstrating the usability of our method using a controlled, small-scale dataset, it would be beneficial to conduct another experiment using a large enough D_U, drawn from the same distribution as D_B. Increasing the number of unlabeled examples is likely to allow us to further improve our performance (which is currently upper-bounded by the small dataset size) and draw stronger conclusions on the research questions of Subsection 6.2. Additionally, it would allow us to inspect how performance improves as D_U grows by a different order of magnitude, and whether there is a certain performance threshold which we cannot surpass using weak supervision.

Further, it would be important to settle on a more appropriate metric than the F1 score for the evaluation of marginal weak labels. The absence of an appropriate metric prevents us from drawing conclusions directly from the weak labels, without having to introduce an additional step (training the meta-learner). It would also allow us to select the optimal hyperparameters of the Generative Model and could have a significant impact on the final performance. Metrics used in Information Retrieval (such as Mean Average Precision or Area Under the Curve) might prove more appropriate. However, we should highlight that such metrics usually take into account the order of the results, while our intent would be to measure how close to the classification boundary the confidence of each example is. Additionally, it would be important for the score to be resilient to class imbalance.

Lastly, it would be interesting to examine how this system would behave if the base learners abstained from voting on the examples they are less certain about. A naive approach would be to delete a percentage (e.g. 50%) of the votes which are closest to the classification boundary. A more advanced approach would be to perform probability calibration on each base learner (so that predicted logits correspond to probabilities) and choose a confidence threshold below which they would abstain from voting. This could also provide the Generative Model with a modelling advantage compared to unweighted methods (such as Majority Voting), as described in a relevant analysis found in [21].
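A rough sketch of the naive variant (our own illustration, not something evaluated in this thesis): given a K×M matrix of predicted probabilities, abstain on the votes that lie closest to the 0.5 boundary before denoising.

import numpy as np

def abstain_uncertain_votes(probs, fraction=0.5):
    # probs: K x M matrix of predicted probabilities; returns a vote matrix in {1, -1, 0}.
    votes = np.where(probs >= 0.5, 1, -1)
    confidence = np.abs(probs - 0.5)
    threshold = np.quantile(confidence, fraction)   # matrix-wide cutoff; could also be set per learner
    votes[confidence < threshold] = 0               # abstain on the least confident votes
    return votes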


ACKNOWLEDGMENTS

I would like to express my very great appreciation to Prof. Evangelos Kanoulas for his valuable and constructive contributions throughout this work, as well as his constant support and motivation. Also, I would like to thank Elsevier and especially Dr. George Tsatsaronis for providing the opportunity to work on this interesting problem and guiding me through defining and clarifying this project. I am also grateful to Nikos Voskarides, Mostafa Dehgani, Nikos Kondylidis and David Rau for all our meaningful discussions that enriched this project. Last but certainly not least, I would like to thank my family for supporting me in all my endeavors.

REFERENCES

[1] Nguyen Bach and Sameer Badaskar. 2007. A review of relation extraction. Literature review for Language and Statistics II 2 (2007).
[2] Stephen H. Bach, Bryan He, Alexander Ratner, and Christopher Ré. 2017. Learning the Structure of Generative Models without Labeled Data. arXiv:1703.00854 [cs, stat] (March 2017). http://arxiv.org/abs/1703.00854
[3] Maria-Florina Balcan, Avrim Blum, and Ke Yang. 2005. Co-training and expansion: Towards bridging theory and practice. In Advances in neural information processing systems. 89–96.
[4] Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI, Vol. 7. 2670–2676.
[5] Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory. ACM, 92–100.
[6] Antoine Bordes and Evgeniy Gabrilovich. 2014. Constructing and mining web-scale knowledge graphs. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1967–1967.
[7] Sergey Brin. 1998. Extracting patterns and relations from the world wide web. In International Workshop on The World Wide Web and Databases. Springer, 172–183.
[8] David Campos, Sérgio Matos, and José Luís Oliveira. 2013. A modular framework for biomedical concept recognition. BMC bioinformatics 14, 1 (2013), 281.
[9] Martin Krallinger, Obdulia Rabal, Saber A Akhondi, Martín Pérez Pérez, Jesús Santamaría, Gael Pérez Rodríguez, Georgios Tsatsaronis, Ander Intxaurrondo, José Antonio López, Umesh Nandal, and Erin Van Buel. 2017. Overview of the BioCreative VI chemical-protein interaction Track. In Proceedings of the sixth BioCreative challenge evaluation workshop, Vol. 1. 141–146.
[10] Martin Krallinger, Obdulia Rabal, Anália Lourenço, Julen Oyarzabal, and Alfonso Valencia. 2017. Information Retrieval and Text Mining Technologies for Chemistry. Chemical Reviews 117, 12 (June 2017), 7673–7761. https://doi.org/10.1021/acs.chemrev.6b00851
[11] Robert Leaman, Chih-Hsuan Wei, and Zhiyong Lu. 2015. tmChem: a high performance approach for chemical named entity recognition and normalization. Journal of Cheminformatics 7, Suppl 1: Text mining for chemistry and the CHEMDNER track (2015), S3. https://doi.org/10.1186/1758-2946-7-S1-S3
[12] Jimmy Lin and W John Wilbur. 2007. PubMed related articles: a probabilistic topic-based model for content similarity. BMC bioinformatics 8, 1 (2007), 423.
[13] Daniel M Lowe and Roger A Sayle. 2015. LeadMine: a grammar and dictionary driven approach to entity recognition. Journal of cheminformatics 7, 1 (2015), S5.
[14] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
[15] James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA, 281–297.
[16] Farrokh Mehryary, Jari Björne, Tapio Salakoski, and Filip Ginter. 2018. Combining Support Vector Machines and LSTM Networks for Chemical-Protein Relation Extraction.
[17] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. Association for Computational Linguistics, 1003–1011.
[18] Kamal Nigam and Rayid Ghani. 2000. Analyzing the effectiveness and applicability of co-training. In Proceedings of the ninth international conference on Information and knowledge management. ACM, 86–93.
[19] Yifan Peng, Anthony Rios, Ramakanth Kavuluru, and Zhiyong Lu. 2018. Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models. arXiv preprint arXiv:1802.01255 (2018).
[20] Yifan Peng, Chih-Hsuan Wei, and Zhiyong Lu. 2016. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of Cheminformatics 8, 1 (Dec. 2016). https://doi.org/10.1186/s13321-016-0165-z
[21] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: rapid training data creation with weak supervision. Proceedings of the VLDB Endowment 11, 3 (Nov. 2017), 269–282. https://doi.org/10.14778/3157794.3157797
[22] Alexander J. Ratner, Stephen H. Bach, Henry R. Ehrenberg, and Chris Ré. 2017. Snorkel: Fast Training Set Generation for Information Extraction. ACM Press, 1683–1686. https://doi.org/10.1145/3035918.3056442
[23] Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems. 3567–3575.
[24] Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 (Nov. 1987), 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
[25] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2015. GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. BioMed research international 2015 (2015).
[26] Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Jiao Li, Thomas C. Wiegers, and Zhiyong Lu. 2015. Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the fifth BioCreative challenge evaluation workshop. Sevilla, Spain, 154–166.
[27] Jun Xu, Yonghui Wu, Yaoyun Zhang, Jingqi Wang, Ruiling Liu, Qiang Wei, and Hua Xu. 2015. UTH-CCB@BioCreative V CDR task: identifying chemical-induced disease relations in biomedical text. In Proceedings of the Fifth BioCreative Challenge Evaluation Workshop. 254–259.
[28] Zhi-Hua Zhou. 2011. When semi-supervised learning meets ensemble learning. Frontiers of Electrical and Electronic Engineering in China 6, 1 (2011), 6–16.
